Bittorrent uTorrent transfer protocol

uTorrent Transmission Protocol (uTP) was designed by Ludvig Strigeus, Greg Hazel, Stanislav Shalunov, Arvid Norberg, and Bram Cohen.

Design Motivation#

The motivation behind the uTP protocol is to allow BitTorrent clients to avoid interrupting internet connections while still fully utilizing unused bandwidth.

BitTorrent traffic is typically background transmission, which should have a lower priority than checking emails, office work, and browsing the web. However, when using regular TCP connections, BitTorrent can quickly fill the sending buffer, adding several seconds of delay to all interactive traffic. The fact that BitTorrent uses multiple TCP connections gives it an unfair advantage when competing for bandwidth with other services, exaggerating the effect of BitTorrent occupying upload bandwidth. This occurs because TCP evenly distributes available bandwidth among connections, and the more connections an application uses, the larger its share of bandwidth.

The traditional solution to this problem is to limit the upload rate of the BitTorrent client to 80% of the upstream bandwidth capacity, leaving some room for the remaining upstream and downstream traffic. The main drawbacks of this solution are:

Users need to configure their BitTorrent client, which is not plug-and-play.
Users need to know the upper capacity of their internet connection. This capacity may change, especially on laptops or phones that may connect to many different networks.
The 20% margin is somewhat arbitrary and can waste bandwidth. Whenever there is no interactive traffic competing with BitTorrent, the extra 20% is wasted. Whenever there is competing interactive traffic, it cannot simply use 20% of the capacity.

uTP addresses this issue by using the modem queue size as its rate controller. When the queue becomes too large, it throttles the traffic. This allows it to utilize the full upload capacity when there is no competition and allows it to reduce to nearly zero when there is a lot of interactive traffic.

Overview#

This document assumes that the reader has some understanding of how TCP and window-based congestion control work. uTP is a transport protocol layered on top of UDP. Therefore, it must (and is capable of) implementing its own network congestion control. The main difference from TCP is the delay-based congestion control.

uTP is a transport protocol built on UDP, so it needs to implement its own congestion control mechanism. The primary distinction of uTP compared to TCP lies in its delay-based congestion control. Specific details can be found in the description of the congestion control section. Similar to TCP, uTP employs window-based congestion control. Each socket has a max_window that determines the maximum number of bytes that can be transmitted simultaneously by the socket at any given time. Any packets sent but not yet acknowledged are considered in transit.

cur_window represents the number of bytes currently in transit. A socket can only send a packet if cur_window + packet_size is less than or equal to min(max_window, wnd_size). packet_size represents the size of the packet, which may have different values.
wnd_size is the window size advertised by the peer port. It sets the upper limit on the number of packets in transit.
If max_window is less than the packet size, and by adjusting the packet transmission rate, the average cur_window is less than or equal to max_window, this may violate the above rules.
Each socket maintains the last delay measurement state (reply_micro) with the other endpoint. Whenever a packet is received, this state is updated by subtracting the timestamp (in microseconds) from the current time of the host.
Each time a packet is sent, the socket's reply_micro value will be placed in the timestamp_difference_microseconds field of the packet header.
Unlike TCP, the sequence numbers and ACKs in uTP are based on packets rather than bytes. This means that uTP cannot repackage them when retransmitting.
Each socket keeps track of the next sequence number (seq_nr) for sending packets and the last received packet's sequence number (ack_nr). The oldest unacknowledged packet's sequence number is seq_nr – cur_window.

Header Format#

Version 1 Header:

0       4       8               16              24              32
+-------+-------+---------------+---------------+---------------+
| type  | ver   | extension     | connection_id                 |
+-------+-------+---------------+---------------+---------------+
| timestamp_microseconds                                        |
+---------------+---------------+---------------+---------------+
| timestamp_difference_microseconds                             |
+---------------+---------------+---------------+---------------+
| wnd_size                                                      |
+---------------+---------------+---------------+---------------+
| seq_nr                        | ack_nr                        |
+---------------+---------------+---------------+---------------+

All fields are arranged in network byte order (big-endian).

Version#

This is the protocol version. The current version is 1.

connection_id#

This is a randomly generated unique number used to identify all packets belonging to the same connection. Each socket has a connection ID for sending packets and a different connection ID for receiving packets. The endpoint initiating the connection decides which ID to use, and the return path has the same ID + 1.

timestamp_microseconds#

This is the "microseconds" part of the timestamp for sending this packet. It is set using gettimeofday() on POSIX and QueryPerformanceTimer() on Windows. The higher the resolution of this timestamp, the better. The closer it is set to the actual transmission time, the better.

timestamp_difference_microseconds#

This is the difference between the local time and the timestamp in the last received packet (when the last packet was received). This is the latest one-way delay measurement from the remote peer to the local computer. When the socket is newly opened and has no delay samples yet, it must be set to 0.

wnd_size#

The advertised receive window. This is specified in bytes and is 32 bits wide. The window size is the number of bytes currently in transit, i.e., the number of bytes sent but not acknowledged. The advertised receive window allows the other end to limit the window size if it cannot receive faster, particularly if its receive buffer is filling up. When sending packets, it should be set to the remaining bytes in the socket's receive buffer.

extension#

The type of the first extension in the linked list of extended headers. 0 indicates no extension.

Currently, there is one extension:

Selective Acknowledgment

Extensions are linked, similar to TCP options. If the extension field is not zero, the two bytes immediately following the uTP header are:

0               8               16
+---------------+---------------+
| extension     | len           |
+---------------+---------------+

Where extension specifies the type of the next extension in the list, and 0 terminates the list. len specifies the number of bytes for this extension. Unknown extensions can be skipped by simply advancing len bytes.

SELECTIVE ACK#

Selective ACK is an extension that can selectively ACK packets out of order. Its payload is a bitmask of at least 32 bits, expressed in multiples of 32 bits. Each bit represents a packet in the sending window. Bits outside the sending window are ignored. Set bits indicate that the packet has been received, while cleared bits indicate that the packet has not been received. The header is as follows:

0               8               16
+---------------+---------------+---------------+---------------+
| extension     | len           | bitmask
+---------------+---------------+---------------+---------------+
                                |
+---------------+---------------+

Note that the len field of the extension refers to bytes, and in this extension, the bytes must be at least 4 and a multiple of 4.

Selective ACK is only sent when at least one sequence number has been skipped in the received stream. Therefore, the first bit in the mask represents ack_nr + 2. ack_nr + 1 is assumed to have been dropped or lost when this packet is sent. Set bits indicate received packets, while cleared bits indicate packets that have not been received.

The byte order of the bitmask is reversed. The first byte represents packets [ack_nr + 2, ack_nr + 2 + 7] in reverse order. The least significant bit in the byte represents ack_nr + 2, and the most significant bit in the byte represents ack_nr + 2 + 7. The next byte in the mask represents [ack_nr + 2 + 8, ack_nr + 2 + 15] in reverse order, and so on. The bitmask is not limited to 32 bits but can be of any size.

Below is the layout of the bitmask representing the acknowledgment of the first 32 packets in the selective ACK bit field:

0               8               16
+---------------+---------------+---------------+---------------+
| 9 8 ...   3 2 | 17   ...   10 | 25   ...   18 | 33   ...   26 |
+---------------+---------------+---------------+---------------+

The numbers in the figure map the bits in the bitmask to the offsets to be added to ack_nr to calculate the sequence numbers being acknowledged.

type#

The type field describes the type of the packet. It can be one of the following:

ST_DATA = 0

Regular data packet. The socket is in a connected state and has data to send. ST_DATA packets always have a data payload.

ST_FIN = 1

Finish connection. This is the last packet. It closes the connection, similar to the TCP FIN flag. The sequence number for this connection will never be greater than the sequence number in this packet. The socket records this sequence number as eof_pkt. This allows the socket to wait for possibly still lost packets that may arrive out of order even after receiving the ST_FIN packet.

ST_STATE = 2

State packet. Used to transmit ACK without data. Packets without any payload do not increase seq_nr.

ST_RESET = 3

Forcefully terminate the connection. Similar to the TCP RST flag. The remote host has no state for this connection. It is obsolete and should be terminated.

ST_SYN = 4

Similar to the TCP SYN flag, this packet initiates the connection. The sequence number is initialized to 1. The connection ID is initialized to a random number. SYN packets are special; all subsequent packets sent on this connection (except for retransmissions of ST_SYN) are sent with connection ID + 1. The connection ID is the ID that the other end should use in its response.

When receiving ST_SYN, a new socket should be initialized using the ID in the packet header. The sending ID of the socket should be initialized to ID + 1. The sequence number for the return channel is initialized to a random number. The other end requires a ST_STATE packet (only ACK) in response.

seq_nr#

This is the sequence number of this packet. Unlike TCP, uTP sequence numbers refer to packets rather than bytes. The sequence number tells the other end in what order the packets should be returned to the application layer.

ack_nr#

This is the last sequence number received by the sender in the other direction.

Connection Setup#

The following diagram illustrates the exchange and state of initiating a connection. c.* represents the state within the socket itself, and pkt.* represents the fields in the packet header.

initiating endpoint                           accepting endpoint

          | c.state = CS_SYN_SENT                         |
          | c.seq_nr = 1                                  |
          | c.conn_id_recv = rand()                       |
          | c.conn_id_send = c.conn_id_recv + 1           |
          |                                               |
          |                                               |
          | ST_SYN                                        |
          |   seq_nr=c.seq_nr++                           |
          |   ack_nr=*                                    |
          |   conn_id=c.rcv_conn_id                       |
          | >-------------------------------------------> |
          |             c.receive_conn_id = pkt.conn_id+1 |
          |             c.send_conn_id = pkt.conn_id      |
          |             c.seq_nr = rand()                 |
          |             c.ack_nr = pkt.seq_nr             |
          |             c.state = CS_SYN_RECV             |
          |                                               |
          |                                               |
          |                                               |
          |                                               |
          |                     ST_STATE                  |
          |                       seq_nr=c.seq_nr++       |
          |                       ack_nr=c.ack_nr         |
          |                       conn_id=c.send_conn_id  |
          | <------------------------------------------<  |
          | c.state = CS_CONNECTED                        |
          | c.ack_nr = pkt.seq_nr                         |
          |                                               |
          |                                               |
          |                                               |
          | ST_DATA                                       |
          |   seq_nr=c.seq_nr++                           |
          |   ack_nr=c.ack_nr                             |
          |   conn_id=c.conn_id_send                      |
          | >-------------------------------------------> |
          |                        c.ack_nr = pkt.seq_nr  |
          |                        c.state = CS_CONNECTED |
          |                                               |
          |                                               | connection established
     .. ..|.. .. .. .. .. .. .. .. .. .. .. .. .. .. .. ..|.. ..
          |                                               |
          |                     ST_DATA                   |
          |                       seq_nr=c.seq_nr++       |
          |                       ack_nr=c.ack_nr         |
          |                       conn_id=c.send_conn_id  |
          | <------------------------------------------<  |
          | c.ack_nr = pkt.seq_nr                         |
          |                                               |
          |                                               |
          V                                               V

Connections are identified by their conn_id header. If the connection ID of the new connection conflicts with an existing connection, the connection attempt will fail because the ST_SYN packet will be unexpected in the existing stream and will be ignored.

Packet Loss#

If the packet with the sequence number (seq_nr – cur_window) has not been acknowledged (this is the earliest packet in the sending buffer, the next packet expected to be acknowledged), but three or more packets have been sent after that packet (via selective ACK), it is assumed that the packet has been lost. Similarly, when three duplicate acknowledgments are received, ack_nr assumed + 1 has been lost (if a packet with that sequence number has been sent).

This also applies to selective acknowledgment. Each packet acknowledged in the selective acknowledgment message counts as a duplicate acknowledgment, and if there are three or more, it should trigger the retransmission of at least three packets.

When a packet is lost, max_window is multiplied by 0.5 to simulate TCP.

Timeout#

Each acknowledged packet, whether it falls within the range (last_ack_nr, ack_nr] or is explicitly acknowledged by a selective ACK message, is applied to update (round-trip time) and rtt rtt_var (rtt variance) measurements. last_ack_nr here is the last ack_nr received on the socket before the current packet, and ack_nr is the field in the currently received packet.

Only packets sent once for rtt are updated and rtt_var. This avoids the issue of determining which packet was acknowledged, the first or the second.

rtt is calculated using the following formula each time a packet is acknowledged:

delta = rtt - packet_rtt
rtt_var += (abs(delta) - rtt_var) / 4;
rtt += (packet_rtt - rtt) / 8;

The default timeout associated with packets in the socket is also updated each time rtt and rtt_var are updated. It is set to:

timeout = max(rtt + rtt_var * 4, 500);

Where the timeout is specified in milliseconds. That is, the minimum timeout for a packet is 1/2 second.

Each time a socket sends or receives a packet, it updates its timeout counter. If no packets arrive within the milliseconds timeout since the last timeout counter reset, the socket will trigger a timeout. It will set its packet_size and max_window to the minimum packet size (150 bytes). This allows it to send another packet, and if the window size drops to zero, this is how the socket restarts again.

The initial timeout is set to 1000 milliseconds and is updated later according to the above formula. For each consecutive packet that times out, the timeout will double.

Packet Size#

To minimize the impact on slow congested links, uTP adjusts its packet size to 150 bytes per packet. The benefit of using such small packets is that they do not block slow upstream links and have longer serialization delays. The cost of using such small packets is that the overhead of the packet header becomes significant. At high rates, larger packet sizes are used, while at slower rates, smaller packet sizes are used.

Congestion Control#

The overall goal of uTP congestion control is to use one-way buffer delay as the primary congestion measurement, along with packet loss (as in TCP). The key is to avoid running with a full sending buffer while sending data. This is particularly an issue for DSL/cable modems, where the sending buffer in the modem often has space to accommodate several seconds of data. The ideal buffer utilization for uTP (or any background traffic protocol) is to operate with a 0-byte buffer utilization. That is, any other traffic can be sent at any time without being obstructed by background traffic blocking the sending buffer. In practice, the target delay for uTP is set to 100 milliseconds. Each socket's goal is to never see more than 100 milliseconds of delay on the sending link. If it does, it will throttle back.

This effectively makes uTP yield to any TCP traffic.

This is achieved by including high-resolution timestamps in each packet sent via uTP, allowing the receiving end to calculate the difference between its own high-resolution timer and the timestamp in the received packet. This difference is then fed back to the original sender of the packet (timestamp_difference_microseconds). This value is meaningless as an absolute value. Clocks in machines are likely to be unsynchronized, especially without microsecond resolution, and the transmission time of packets is also included in the differences of these timestamps. However, the value is useful compared to previous values.

Each socket maintains a sliding minimum of the lowest value over the last two minutes. This value is called base_delay and serves as a benchmark for the minimum delay between hosts. By subtracting base_delay from the timestamp difference of each packet, you can measure the current buffering delay on the socket. This measurement is called our_delay. It has a lot of noise but serves as a driver to determine whether to increase or decrease the sending window (controlling the sending rate).

CCONTROL_TARGET is the buffering delay that uTP accepts on the upstream link. Currently, the target delay is set to 100 milliseconds, and off_target is the distance between the measured delay and the target delay (calculated as CCONTROL_TARGET – our_delay).

The window size in the socket structure specifies the total number of outstanding (unacknowledged) bytes we may have on the connection. The sending rate is directly related to this window size. The more bytes in transit, the faster the sending rate. In the code, the window size is referred to as max_window. Its size is roughly controlled by the following expression:

delay_factor = off_target / CCONTROL_TARGET;
window_factor = outstanding_packet / max_window;
scaled_gain = MAX_CWND_INCREASE_PACKETS_PER_RTT * delay_factor * window_factor;

Where the first factor scales off_target to target delay units.

Then scaled_gain is added to max_window:

max_window += scaled_gain;

If off_target is greater than 0, this will shrink the window, and if off_target is less than 0, the window will grow.

If max_window is less than 0, it is set to 0. A window size of zero indicates that the socket may not send any packets. In this state, the socket will trigger a timeout and force the window size to one packet size and send one packet. For more details, see the section on timeouts.

Reference Links#

http://www.bittorrent.org/beps/bep_0029.html