BitTorrent is a protocol used for distributing files. It identifies content through URLs and is designed to seamlessly integrate with the web. Its advantage over simple HTTP downloads is that when multiple users are downloading the same file simultaneously, these users can upload file blocks to each other, allowing multiple users to participate in file transfer instead of relying on a single file server. This allows the file source to support a large number of simultaneous downloads with minimal impact on its own load.
BitTorrent file distribution consists of the following entities:
- A regular web server
- A static metainfo file
- A BitTorrent tracker
- An original downloader
- A web browser
- A BT download client
- A file that can be shared by many end users for downloading in the most ideal situation.
To start the service, the host must perform the following steps:
- Start a tracker (or have one already running).
- Start a regular web server, such as Apache (or have one already running).
- Associate the .torrent extension with the MIME type application/x-bittorrent on their web server (or have already done so).
- Generate a metainfo (.torrent) file using the complete file to be provided and the URL of the tracker.
- Place the metainfo file on the web server.
- Link to the metainfo (.torrent) file from other web pages.
- Start a downloader that already has the complete source file.
To start downloading, the user performs the following actions:
- Install a BitTorrent client.
- Surf the internet.
- Open the link to the .torrent file.
- Choose the location to save the file locally or select which parts to continue downloading.
- Wait for the download to complete.
- Exit the download client (it will continue uploading until it receives an exit command).
Bencode encoding is used to serialize (or "serialize") values of any data type into byte streams for transmission over the network or storage on disk. Bencode encoding supports four basic data types: strings, integers, lists, and dictionaries.
- For string types, they are represented using a length prefix followed by a colon, for example, "spam" is encoded as 4.
- For integer types, they start with the character "i", followed by the decimal representation of the integer, and the character "e", for example, 3 is encoded as i3e, -3 is encoded as i-3e.
- For list types, they start with the character "l", followed by the Bencode encoding of each element in the list, and end with the character "e", for example, ['spam', 'eggs'] is encoded as l4:spam4.
- For dictionary types, they start with the character "d", followed by the Bencode encoding of key-value pairs sorted by the ASCII value of the keys, and end with the character "e", for example, {'cow': 'moo', 'spam': 'eggs'} is encoded as d3:cow3:moo4:spam4. It is important to note that all keys in dictionary types must be strings and sorted according to the original string's sorting order.
Bencode encoding provides a simple, compact, and readable data serialization scheme that is suitable for serialization and deserialization operations on various data types.
Metainfo files (also known as .torrent files) are dictionaries encoded using bencoding, and they contain the following keys:
announce
The URL address of the tracker.
info
This key maps to a dictionary that contains the keys described below.
All strings in the .torrent file containing text must be encoded using UTF-8.
The info dictionary contains metadata information used when downloading a BitTorrent file. It consists of multiple key-value pairs, where each key represents a specific attribute, such as the suggested file name to save, the size of file blocks, and hash values, among others. These attributes help the client download and assemble the file correctly.
Some important attributes include:
- name: The suggested file name to save, purely for suggestion purposes and not mandatory.
- piece length: The file is divided into fixed-size blocks, and this attribute specifies the size of each block (usually a power of 2).
- pieces: The hash values corresponding to each block, used for verifying the integrity of the file.
- length: The length of the file (only applicable for single-file cases) or the total length of all files concatenated.
If downloading a single file, only the name and length attributes are needed. If downloading multiple files, the files list is used to represent the directory structure of the files, where each file is represented by a dictionary containing length and path attributes.
Trackers' GET requests include the following keywords:
info_hash: The 20-byte SHA1 hash value of the info value in the metainfo file, in its bencoded form. This value almost certainly needs to be escaped. Note that this is a substring of the metainfo file. The info-hash must be the encoded form of the hash value in the .torrent file, which is exactly the same as decoding the metainfo file, extracting the info dictionary, and encoding it, if the bencoded encoding fully validates the input (e.g., key sorting, missing leading zeros). Conversely, this means that clients must reject invalid metainfo files or extract the substring directly. Decoding-encoding round trips on invalid data must not be performed.
peer_id: A 20-byte string used as the ID for this downloader. Each downloader generates its own ID randomly when starting a new download. This value almost certainly needs to be escaped.
ip: An optional parameter that specifies the IP address (or DNS name) of this peer. It is typically used for the source address, and if the source is on the same computer as the tracker, this parameter is used.
port: The port number that this peer is listening on. The common behavior is for the downloader to try listening on port 6881 and try ports 6882, 6883, etc., if that port is taken, and give up after 6889.
uploaded: The total amount uploaded so far, encoded in decimal ASCII.
downloaded: The total amount downloaded so far, encoded in decimal ASCII.
left: The number of bytes this peer still has to download, encoded in decimal ASCII. Note that this cannot be calculated from downloaded and the file length since it may be a resumed download, and it is possible that some downloaded data failed an integrity check and needs to be re-downloaded.
event: This is an optional keyword mapped to started, completed, or stopped (or empty, which is the same as not being present). If not present, it is one of the periodic announcements. A started announcement is sent when a download first starts; a completed announcement is sent when a download completes. If the file was complete when started, no completed is sent. A stopped announcement is sent when a downloader stops downloading.
The tracker response is a bencoded dictionary. If the tracker response has a failure reason keyword, it maps to a human-readable string explaining why the query failed and no other keywords are required. Otherwise, it must have two keywords: interval, which maps to the number of seconds the downloader should wait between regular re-requests, and peers, which maps to a dictionary of peer lists, with each dictionary containing the peer id, ip, and port keywords representing the self-selected ID, IP address or DNS name, and port number of the peer, respectively. Note that if a compressed representation of the peer list is returned by the tracker, see Tracker Returns Compact Peer Lists.
If you want to make any extensions to the metainfo file or tracker query, coordinate with Bram Cohen to ensure that all extensions are compatible.
Announcements can also be made using the UDP tracker protocol.
The peer protocol of BitTorrent operates over TCP or the uTorrent transport protocol. In BitTorrent, downloaders and uploaders are peers, and data can flow between them. Data is divided into many small pieces and downloaded from multiple sources simultaneously to improve download speed and transfer reliability. Whenever a downloader completes downloading a piece of the file and verifies that the hash value matches, it broadcasts the information about that piece to all peers.
There are two types of connection states: choked or unchoked; interested or not interested. Choked means no data will be sent until unchoked. Data transfer only happens when one side is interested and the other side is unchoked. To achieve correct data transfer, downloaders must maintain an interested state and update it promptly even when choked. When transferring data, downloaders should keep a queue of requests for multiple blocks to achieve good TCP performance, and requests that cannot be immediately written to the TCP buffer should be queued in memory to be discarded in case of a choke.
The peer protocol consists of a handshake process and endless length-prefixed messages. The handshake starts with a 19 (decimal) character followed by the string "BitTorrent protocol". Before connecting two nodes, they need to perform a handshake to exchange necessary information. At the beginning of the handshake, the sending party sends a fixed handshake message to the receiving party, which includes the protocol version, reserved bytes, metadata hash value, and node ID. The receiving party verifies this information and checks if they match the expected values. If there is any mismatch, the connection is terminated. After a successful handshake, both parties start exchanging data. Data is transmitted through a series of length-prefixed messages. The length prefix indicates the length of the message, followed by the actual message content.
All non-handshake messages start with a single byte, and below are their types.
Possible values are:
- 0 - choke
- 1 - unchoke
- 2 - interested
- 3 - not interested
- 4 - have
- 5 - bitfield
- 6 - request
- 7 - piece
- 8 - cancel
'choke', 'unchoke', 'interested', and 'not interested' have no payload.
bitfield message: This message is only sent as the first message. Its payload is a bitfield where each index sent by the downloader is set to 1 and the rest of the indices are set to 0. Downloaders with no data can skip this "bitfield" message. The first byte of the bitfield corresponds to indices 0-7 (from high bit to low bit), followed by 8-15, and so on. Any extra bits at the end are set to 0.
have message: The payload of this message is a single number representing the index that the downloader has completed and verified the hash value for.
request message: This message contains an index, offset, and length. The last parameter is usually a power of 2 unless it reaches the end of the file. All current implementations use 2^14 (16 kiB) and close connections that request beyond this limit.
cancel message: It has the same payload as the request message. They are usually only sent when downloading is finished, during the so-called "endgame mode". When a download is nearing completion, the last few pieces tend to be downloaded from a single modem line, which takes a long time. To ensure that the last few blocks enter quickly, once all parts of a given downloader's requests for parts that it hasn't received yet are in the pending state, it sends requests for the entire contents to everyone it's downloading from. To prevent this from becoming very inefficient, a cancel is sent for every piece that arrives.
piece message: It contains an index, a start, and a fragment. Note that they are implicitly associated with request messages. Unexpected fragments may occur if choke and unchoke messages are sent rapidly and/or the transfer rate is very slow.
First, downloaders typically download different parts of the file randomly to avoid situations where one downloader has a subset or superset of another downloader. Secondly, to ensure that each downloader gets a consistent download speed, operations to limit upload speed (also known as choking) are needed. This algorithm allows downloaders to use a "tit for tat" strategy among themselves to ensure a stable download speed. Finally, the current choking algorithm has been in use and emphasizes the importance of new algorithms performing well in different network environments.
First, to ensure good TCP performance, the algorithm needs to limit the number of connections that can upload simultaneously. Secondly, to avoid frequent choking and unchoking, the algorithm needs to select upload targets stably. Additionally, there should be some reward for those who let themselves be downloaded by others. Finally, the algorithm should attempt to utilize unused connections to improve download speed, which is known as optimistic unchoking.
The currently deployed choking algorithm takes some measures to achieve these goals. To avoid frequent changes, the algorithm only changes which peers are choked every 10 seconds. To reward others and limit the number of upload connections, the algorithm selects the top four peers with the best download rates and interest to unchoke. If there are other peers with better upload rates but no interest, they may also be unchoked if they suddenly show interest, and the peer with the worst download rate will be choked. If a downloader has completed the file, it will use the upload rate instead of the download rate to decide which peers to unchoke.
As for optimistic unchoking, the algorithm selects one peer to unchoke every 30 seconds, regardless of its upload rate (as long as it is a downloader with interest). In the rotation, new connections are three times more likely to be placed in the current optimistic unchoke position, giving them enough opportunities to upload complete data blocks.
Reference link:
http://bittorrent.org/beps/bep_0003.html