TCP Congestion Control

Slow start, congestion avoidance, and fast recovery algorithms.

The Internet as a Highway System: An Analogy for Congestion

Imagine the internet as a massive, global system of highways. Data packets are like cars, and the connections between cities are the highways. When you want to download a file from a server in London while you are in New York, your packets (cars) travel across this vast network. During a quiet time, like 3 AM, the highways are clear, and your cars can travel at maximum speed, arriving quickly and in order.

However, during rush hour, thousands of other people are also trying to use the same highways. Intersections and key stretches of road, which correspond to routers and links in the network, become bottlenecks. If too many cars try to enter an intersection at once, a traffic jam occurs. Traffic slows to a crawl, and in the digital world, routers whose memory buffers are full will simply drop any new cars (packets) that arrive. This is network congestion.

This is where TCP Congestion Control becomes vital. It is fundamentally different from Flow Control. Flow Control is a one-on-one conversation between your car and the destination garage, ensuring you do not deliver cars faster than the garage can unload them. In contrast, Congestion Control is about your car being a good citizen on the public highway. It is a set of mechanisms that allow TCP to sense the level of traffic on the network and adjust its sending rate accordingly, not just to protect the receiver, but to protect the entire shared network from collapsing into gridlock.

How TCP Senses Congestion: Inferring from Packet Loss

In an ideal world, routers would send explicit warning signals to senders, like a digital traffic report saying, "Highway I-95 is getting busy, please slow down." While some modern mechanisms like ECN (Explicit Congestion Notification) exist, the original and still fundamental way TCP detects congestion is by inference. TCP assumes that the internet is a black box and can only deduce its internal state by observing what happens to the packets it sends.

The primary signal of congestion for TCP is packet loss. The reasoning is simple: the most common reason for a packet to disappear in the modern internet is that it arrived at a router whose internal buffer was already full. The router, having no more space to queue the packet, had no choice but to drop it. Therefore, TCP interprets packet loss as an implicit signal that there is a bottleneck somewhere on the path. TCP has two main ways of detecting that a packet has been lost:

  • Retransmission Timeout (RTO)

    For every segment TCP sends, it starts a timer. It expects to receive an acknowledgment (ACK) from the receiver before this timer expires. If the timer runs out, TCP assumes the segment was lost in the network. This is a severe congestion signal, as it implies no packets are getting through, or the network delay has increased dramatically.

  • Triple Duplicate ACKs

    This is a more subtle and more common signal of congestion. Imagine the sender sends segments 1 through 6. Segment 3 gets lost, but 4, 5, and 6 arrive at the receiver. TCP acknowledgments are cumulative: the receiver, upon getting segment 4, sees that it is still missing segment 3, so it immediately sends an ACK for the last in-order byte it received (the end of segment 2). Segments 5 and 6 each trigger the same duplicate ACK. If the sender receives three duplicate ACKs (four identical ACKs in total: the original plus three duplicates), it takes this as a strong hint that the segment immediately following the acknowledged data was lost, but that subsequent segments are getting through. This is a sign of mild congestion, not a total network failure.
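The sender-side counting described above can be sketched in Python. This is a simplified model (function and variable names are my own; real stacks track this state per connection):

```python
def update_dupack_state(state, ack_seq):
    """Count duplicate ACKs; return True when fast retransmit should fire.

    state holds 'last_ack' (highest cumulative ACK seen so far) and
    'dup_count' (repeats of that same ACK). Names are illustrative.
    """
    if ack_seq > state["last_ack"]:
        state["last_ack"] = ack_seq  # new data acknowledged
        state["dup_count"] = 0
    elif ack_seq == state["last_ack"]:
        state["dup_count"] += 1
        # The third duplicate (fourth identical ACK) triggers retransmit
        # of the segment that starts at byte ack_seq.
        return state["dup_count"] == 3
    return False

# The example from the text, with 1000-byte segments: segments 1 and 2
# arrive (ACKs 1000, 2000), segment 3 is lost, and segments 4-6 each
# echo a duplicate ACK 2000.
state = {"last_ack": 0, "dup_count": 0}
fired = [update_dupack_state(state, a)
         for a in (1000, 2000, 2000, 2000, 2000)]
# fast retransmit fires only on the third duplicate of ACK 2000
```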

The Congestion Control Toolkit: Key Variables

To implement its congestion control strategy, TCP maintains two critical state variables on the sender side for each connection.

  • Congestion Window (cwnd)

    The 'cwnd' is the sender's own limit on the number of bytes it can have in flight (sent but not yet acknowledged) at any given time. It is a dynamic value that TCP adjusts based on its perception of network congestion. It works in conjunction with the receiver's advertised window ('rwnd'). The actual number of bytes the sender is allowed to transmit is the minimum of 'cwnd' and 'rwnd'. This ensures that the sender respects both the network's capacity and the receiver's capacity.

  • Slow Start Threshold (ssthresh)

    The 'ssthresh' is a control variable used to determine when TCP should switch from its aggressive initial growth phase (Slow Start) to its more conservative linear growth phase (Congestion Avoidance). Initially, 'ssthresh' is often set to a very large value. However, when congestion is detected, it is typically set to half of the 'cwnd' value at the time the congestion occurred. It acts as a memory of the last known "safe" network capacity.
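How the two windows combine can be shown with a minimal Python sketch (the function name is illustrative, not a real stack's API):

```python
def usable_window(cwnd, rwnd, bytes_in_flight):
    """Bytes the sender may still transmit right now.

    The send limit is the minimum of the congestion window (the network
    constraint) and the receiver's advertised window (the receiver
    constraint), reduced by data already sent but not yet ACKed.
    """
    return max(0, min(cwnd, rwnd) - bytes_in_flight)

# With cwnd = 8 MSS of 1460 bytes, a receiver window of 6000 bytes,
# and 4380 bytes already in flight, the receiver is the tighter limit:
print(usable_window(8 * 1460, 6000, 4380))  # 1620
```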

Phase 1: Slow Start - The Search for Bandwidth

When a new TCP connection begins, the sender has no idea what the available capacity of the network path is. It could be a fast fiber optic link or a slow satellite connection. Starting to send data too aggressively could instantly cause congestion. Starting too cautiously would waste time on a fast network. TCP's solution is the Slow Start algorithm.

The name is famously misleading. The growth during this phase is anything but slow; it is exponential. It is called "slow" only in comparison to the alternative of immediately sending a full window of data.

Mechanism of Slow Start

  1. The connection begins with a small initial Congestion Window ('cwnd'), typically between 2 and 4 MSS.
  2. For every acknowledgment (ACK) received that confirms new data, the sender increases 'cwnd' by 1 MSS.
  3. Since a sender might send multiple segments in a single round-trip, it can receive multiple ACKs in response. The effect is that 'cwnd' approximately doubles every Round-Trip Time (RTT).

This exponential growth allows TCP to rapidly probe the network to find its approximate available capacity. The Slow Start phase continues until one of two things happens: 'cwnd' reaches the 'ssthresh' value, or a packet loss event occurs. Once 'ssthresh' is reached, TCP transitions to the more cautious Congestion Avoidance phase.
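The doubling behavior can be illustrated with a small Python simulation, under simplified assumptions (no loss, the whole window is acknowledged once per RTT, and every ACK adds one MSS):

```python
MSS = 1460  # bytes; a common value on Ethernet paths

def slow_start_rtts(cwnd, ssthresh):
    """Return cwnd at the end of each RTT during Slow Start.

    Simplification: one full window is ACKed per RTT and each ACK adds
    one MSS, so cwnd doubles every RTT until it reaches ssthresh.
    """
    history = []
    while cwnd < ssthresh:
        cwnd = min(cwnd * 2, ssthresh)
        history.append(cwnd)
    return history

# Starting from 2 MSS with ssthresh = 64 KiB, cwnd (in MSS) per RTT:
growth = [c // MSS for c in slow_start_rtts(2 * MSS, 64 * 1024)]
```

The exponential shape is visible in the output: cwnd doubles each round trip until it is clipped at 'ssthresh'.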

Phase 2: Congestion Avoidance - Gently Probing for More

After the initial exponential ramp-up of Slow Start, TCP enters a more conservative phase called Congestion Avoidance. The goal is no longer to rapidly discover bandwidth, but to gently probe for additional capacity without causing congestion.

This behavior is part of a strategy called Additive Increase, Multiplicative Decrease (AIMD). Congestion Avoidance is the "Additive Increase" part of the algorithm.

Mechanism of Congestion Avoidance

Instead of doubling its 'cwnd' every RTT, the sender now increases 'cwnd' in a linear, additive fashion. The goal is to increase the 'cwnd' by approximately 1 MSS per RTT.

This is achieved by increasing 'cwnd' by a small fraction for every ACK received. The exact formula is cwnd = cwnd + MSS × (MSS / cwnd) for each ACK. When an entire window of data has been acknowledged, the total increase will be about 1 MSS. This much slower, linear growth rate carefully tests the network's limits. The connection remains in the Congestion Avoidance state until a packet loss event is detected.
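The per-ACK increment can be checked with a short Python sketch (floats for clarity; real stacks count integer bytes and clamp the increment):

```python
MSS = 1460  # bytes

def on_ack_congestion_avoidance(cwnd):
    """Additive increase applied on every ACK: cwnd += MSS * (MSS / cwnd).

    Across one window's worth of ACKs the increments sum to roughly
    one MSS, giving linear growth of about 1 MSS per RTT.
    """
    return cwnd + MSS * (MSS / cwnd)

# One RTT's worth of ACKs starting from cwnd = 10 MSS:
cwnd = 10 * MSS
for _ in range(10):  # about cwnd / MSS ACKs arrive per RTT
    cwnd = on_ack_congestion_avoidance(cwnd)
# cwnd is now just under 11 MSS: almost exactly one MSS of growth
```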

Reacting to Congestion: Multiplicative Decrease

The "Multiplicative Decrease" half of AIMD dictates how TCP reacts when it finally detects congestion (packet loss). The reaction depends on how the loss was detected.

Reaction to a Timeout (RTO)

A Retransmission Timeout is a signal of severe congestion. TCP's reaction is aggressive:

  1. The 'ssthresh' is set to half of the current 'cwnd' value. This records a new, lower "safe" limit.
  2. The 'cwnd' is reset to its small initial value (e.g., 1 MSS).
  3. The connection re-enters the Slow Start phase.

This dramatic reduction and restart ensures that the sender quickly reduces its load on a heavily congested network.
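These steps can be sketched as a single function (a simplified model; RFC 5681 computes 'ssthresh' from the amount of data in flight rather than 'cwnd' itself, and floors it at 2 MSS):

```python
def on_rto_timeout(cwnd, mss=1460):
    """React to a retransmission timeout, classic Reno style.

    ssthresh remembers half the window at which loss occurred;
    cwnd restarts from one MSS and Slow Start begins again.
    """
    ssthresh = max(cwnd // 2, 2 * mss)  # floored at 2 MSS per RFC 5681
    cwnd = mss
    return cwnd, ssthresh

# A timeout at cwnd = 40 MSS drops cwnd to 1 MSS, ssthresh to 20 MSS:
cwnd, ssthresh = on_rto_timeout(40 * 1460)
print(cwnd // 1460, ssthresh // 1460)  # 1 20
```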

Reaction to Triple Duplicate ACKs: Fast Retransmit and Fast Recovery

Triple duplicate ACKs signal a less severe problem. It means a single segment was likely lost, but the network is still capable of delivering subsequent segments. Reacting with a full Slow Start would be an overreaction and would needlessly reduce throughput. To handle this more intelligently, TCP uses the Fast Retransmit and Fast Recovery algorithms.

  1. Fast Retransmit: Upon receiving the third duplicate ACK, the sender does not wait for the RTO timer to expire. It immediately retransmits the missing segment that the ACKs are pointing to.
  2. Fast Recovery: Instead of resetting 'cwnd' to 1, TCP performs a multiplicative decrease:
    • The 'ssthresh' is set to half of the current 'cwnd'.
    • The 'cwnd' is also set to this new 'ssthresh' value (plus a small amount for the duplicate ACKs already received).
    Then, importantly, TCP directly enters the Congestion Avoidance phase, continuing its linear growth from this new, lower starting point.

This approach avoids the performance penalty of a full Slow Start, allowing the connection to recover from minor packet loss much more quickly and keep the network pipe more consistently full. This is the behavior defined by the TCP Reno algorithm.
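The Reno reaction to three duplicate ACKs can be sketched alongside the timeout case (simplified; the extra 3 MSS reflects the three segments known, from the duplicate ACKs, to have left the network):

```python
def on_triple_dup_ack(cwnd, mss=1460):
    """Enter fast recovery: multiplicative decrease without Slow Start.

    ssthresh drops to half of cwnd; cwnd is set to ssthresh plus
    3 MSS for the three segments that generated the duplicate ACKs.
    Growth then continues linearly in Congestion Avoidance.
    """
    ssthresh = max(cwnd // 2, 2 * mss)
    cwnd = ssthresh + 3 * mss
    return cwnd, ssthresh

# Compare with a timeout: at cwnd = 40 MSS, fast recovery resumes
# from 23 MSS instead of collapsing to 1 MSS.
cwnd, ssthresh = on_triple_dup_ack(40 * 1460)
print(cwnd // 1460, ssthresh // 1460)  # 23 20
```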

The Evolution of Congestion Control

The classic TCP Reno algorithm (Slow Start, Congestion Avoidance, Fast Recovery) has been the workhorse of the internet for decades. However, as networks have evolved to become much faster and more complex (so-called Long Fat Networks, or LFNs), new challenges have emerged, leading to the development of more advanced algorithms.

  • TCP NewReno

    An improvement on TCP Reno that more gracefully handles multiple packet losses within a single window of data. It avoids dropping into Slow Start multiple times when multiple segments are lost, leading to smoother recovery.

  • CUBIC

    CUBIC is the default congestion control algorithm in modern Linux kernels (and thus many Android devices and web servers). Unlike the linear increase of Reno, CUBIC uses a cubic function to govern its window growth. It grows rapidly at first, then slows down as it approaches the last known maximum 'cwnd', and then probes aggressively again for new bandwidth. This behavior is better suited for high Bandwidth-Delay Product networks, allowing it to scale throughput more effectively and fairly.
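CUBIC's growth curve can be illustrated directly from its defining function in RFC 8312; the constants C = 0.4 and β = 0.7 are the RFC defaults, and this sketch ignores the TCP-friendly region and real-time clocking:

```python
def cubic_window(t, w_max, c=0.4, beta=0.7):
    """CUBIC window (in MSS) as a function of seconds since the last loss.

    W(t) = C * (t - K)^3 + W_max, where K is the time at which the
    curve climbs back to W_max, the window where loss last occurred.
    """
    k = (w_max * (1 - beta) / c) ** (1 / 3)
    return c * (t - k) ** 3 + w_max

# After a loss at W_max = 100 MSS: the window restarts at beta * W_max,
# plateaus near W_max around t = K, then probes aggressively beyond it.
w_max = 100
k = (w_max * 0.3 / 0.4) ** (1 / 3)  # ~4.2 s with these constants
samples = [round(cubic_window(t, w_max), 1) for t in (0, k, 2 * k)]
# samples traces the fast-flat-fast shape: 70 MSS, 100 MSS, 130 MSS
```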

  • BBR (Bottleneck Bandwidth and Round-trip propagation time)

    Developed by Google, BBR takes a fundamentally different approach. Instead of relying on packet loss as the primary signal of congestion, BBR attempts to build an explicit model of the network path. It continuously measures the two key parameters of the path: the bottleneck bandwidth (the slowest link) and the round-trip propagation time (the base latency). It then tries to adjust its sending rate to match the measured bandwidth, aiming to keep the network pipe full but without creating the excess queues in router buffers (a problem known as bufferbloat) that traditional loss-based algorithms can cause. It represents a major shift in congestion control philosophy.
