Streaming Compression
Compression with low latency for live video streaming and real-time communication.
The Fundamental Challenge: Streaming vs. Downloading
Before exploring the nuances of compression for streaming, it is vital to understand the difference between downloading a file and streaming it. When you download a file, say a movie, your computer saves the entire file to your hard drive. You must wait for the whole file to arrive before you can start watching. Streaming, on the other hand, is a continuous process. Your video player starts playing the beginning of the file almost immediately, while the rest of the file continues to arrive in the background. This creates the seamless experience we associate with services like YouTube or Netflix.
This "play-as-you-go" model introduces a critical variable that is less important for a simple download: time. For streaming to work, the data must arrive at your device at least as fast as you are watching it. If it does not, playback will stop, and you will see the dreaded buffering wheel. This constant race against the clock puts unique and intense pressures on the compression algorithms used for streaming content. They must not only make the file small but do so in a way that is optimized for real-time delivery and playback over potentially unreliable networks.
We can further divide streaming into two broad categories, each with its own set of challenges and compression strategies: on-demand streaming and live, real-time streaming.
The Great Trade-off: Compression Efficiency vs. Latency
The world of streaming compression is governed by a fundamental and often unforgiving trade-off. On one side, we have the goal of maximum compression efficiency, that is, making the video file as small as possible to save bandwidth. On the other side, we have the requirement of low latency, or minimizing the delay between the sender and the viewer. These two goals are almost always in direct conflict.
Why High-Efficiency Compression Creates Latency
The most powerful tools in a video codec's arsenal, which allow it to achieve incredible compression ratios, are based on inter-frame prediction, especially using B-frames.
- P-frames (Predicted frames): These frames save space by referencing a past frame and only encoding the differences. This introduces a small amount of latency because the decoder needs the complete past frame before it can construct the current P-frame.
- B-frames (Bi-directionally predicted frames): These frames are the masters of efficiency. They can reference both a past frame and a future frame to make their prediction. This bi-directional view often results in the smallest possible data size for that frame.
Herein lies the problem. For a decoder to construct a B-frame, it must not only have the past reference frame but also wait for the future reference frame to arrive. The video encoder must hold onto a buffer of frames, process them out of order to create these dependencies, and send them. The decoder must then receive and buffer these frames, reorder them, and finally decode and display the B-frame. This entire process of buffering and reordering introduces a significant delay. This is latency.
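A minimal sketch of that reordering, using an illustrative four-frame display sequence (I B B P), shows where the delay comes from: the P-frame must be sent before the B-frames that reference it, so each B-frame arrives later than its display slot.

```python
# Illustrative reordering delay for a tiny GOP fragment. In display
# order the viewer sees I0, B1, B2, P3 -- but B1 and B2 reference P3,
# so the encoder must transmit P3 before them.

display_order = ["I0", "B1", "B2", "P3"]   # order shown on screen
decode_order  = ["I0", "P3", "B1", "B2"]   # order sent in the bitstream

# The display pipeline must be delayed by at least the worst gap between
# a frame's arrival slot and its display slot.
delay = max(decode_order.index(f) - display_order.index(f)
            for f in display_order)
print(f"Reordering adds at least {delay} frame time(s) of latency")
# At 30 fps that is ~33 ms per frame of delay -- and real encoders often
# use longer runs of B-frames, multiplying this effect.
```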
For on-demand streaming like Netflix, this is perfectly acceptable. When you press play on a movie, a few seconds of initial buffering is a small price to pay for a high-quality, low-bandwidth stream. The system can use complex arrangements of B-frames to achieve maximum compression because a multi-second delay at the start is not a problem. However, for live, real-time communication like a Zoom call or a live sports broadcast on Twitch, this kind of delay is a complete disaster. A five-second delay would make a conversation impossible and would mean you hear your friends react to a goal long before you see it on screen. Therefore, compression strategies for live streaming must sacrifice some efficiency to keep latency to an absolute minimum.
Strategy 1 for Live Streaming: Tuning the Codec
The first step in reducing latency is to configure the video encoder itself to prioritize speed over ultimate compression. This involves carefully tuning how it uses different frame types.
Managing the Group of Pictures (GOP)
The repeating sequence of frame types in a video stream is known as the Group of Pictures (GOP). The structure and length of the GOP have a direct and dramatic impact on both latency and compressibility.
- Long GOP (High Latency, High Compression): Used for on-demand content. A typical GOP might be 2 seconds long, meaning an I-frame is sent only once every 60 frames (for 30 fps video). This GOP will be filled with many P- and B-frames. This allows for maximum compression, as most frames are highly compressed predictions, but it also creates high latency.
- Short GOP (Lower Latency, Lower Compression): For live broadcasting, the GOP size is often reduced to 1 second or even shorter. Sending I-frames more frequently reduces the dependency chain of predicted frames, which can help a viewer's stream recover more quickly from errors. This slightly increases the average bitrate but reduces latency and improves robustness.
- Zero-Latency Profiles (I-P-P-P...): For the lowest latency applications like video conferencing, B-frames are often completely disabled. The GOP consists of only I-frames and P-frames. Since there are no forward references, the need to buffer future frames is eliminated, drastically reducing latency.
- I-frame Only: In some specialized cases, like professional video editing or some high-security surveillance, the stream might consist entirely of I-frames. This is effectively Motion JPEG (M-JPEG). It provides the lowest possible latency and frame-accurate editing but results in a very high bitrate.
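As a concrete illustration of these knobs, here is how they might be set with the widely used ffmpeg encoder (invoked from Python for consistency with the other sketches; the file names are placeholders, and the exact values would depend on the application):

```python
# Hypothetical low-latency encode using ffmpeg's libx264 wrapper.
# -g sets the GOP length, -bf caps B-frames, and -tune zerolatency
# additionally disables the encoder's lookahead buffering.
import subprocess

subprocess.run([
    "ffmpeg",
    "-i", "camera_input.mp4",   # placeholder input
    "-c:v", "libx264",
    "-g", "30",                 # 1-second GOP at 30 fps
    "-bf", "0",                 # no B-frames: an I-P-P-P... structure
    "-tune", "zerolatency",
    "-preset", "veryfast",      # favor encoding speed over compression
    "low_latency_output.mp4",   # placeholder output
], check=True)
```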
Strategy 2: Adaptive Bitrate Streaming (ABR)
One of the biggest challenges for any kind of streaming is the unpredictability of the internet. A user's network connection can change from moment to moment. They might move from a strong Wi-Fi signal to a weak cellular signal, or someone else in the house might start a large download. If a server simply sent a single, high-quality stream, any dip in network performance would cause playback to freeze.
Adaptive Bitrate Streaming (ABR) is the elegant solution to this problem. Instead of creating just one version of the video, the server creates multiple versions, called renditions, at different qualities and bitrates (e.g., 480p, 720p, 1080p, 4K).
The process then works as follows:
- Segmentation: Each of these renditions is broken down into small, consistent-length chunks or segments, typically 2 to 10 seconds long.
- Manifest File: The server also provides a manifest file (like M3U8 for HLS or MPD for DASH). This file is the "map" for the video player.
- Intelligent Player: The video player on your device (e.g., in the YouTube app or a web browser) starts by downloading the manifest. It then constantly monitors your current network bandwidth.
- Dynamic Segment Requesting: Based on the network conditions, the player intelligently requests the next video segment from the highest-quality rendition it thinks it can download reliably without buffering. If your bandwidth is high, it requests the 1080p segment. If your connection suddenly worsens, it will request the 720p version for the next segment.
This all happens seamlessly in the background, resulting in a smooth viewing experience that automatically adapts to your connection speed, prioritizing continuous playback over maximum quality when necessary. ABR is a cornerstone of modern streaming services for both on-demand and live content.
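A minimal sketch of the player's decision logic (the renditions, bitrates, and safety margin are all illustrative; production players such as hls.js or dash.js use far more sophisticated estimators) might look like this:

```python
# Minimal sketch of an ABR player's rendition-selection loop.
# Rendition list and bitrates would normally come from a parsed manifest.

renditions = [                 # (name, bitrate in Mbps) -- illustrative values
    ("480p", 1.5),
    ("720p", 3.0),
    ("1080p", 6.0),
    ("4K", 16.0),
]

def pick_rendition(measured_bandwidth_mbps: float) -> str:
    """Choose the best rendition that fits safely within measured bandwidth."""
    safety_margin = 0.8        # only budget 80% of measured throughput
    budget = measured_bandwidth_mbps * safety_margin
    best = renditions[0][0]    # always fall back to the lowest rendition
    for name, bitrate in renditions:
        if bitrate <= budget:
            best = name
    return best

# The bandwidth estimate is typically updated after each segment download.
for bw in [10.0, 4.5, 2.0, 8.0]:
    print(f"bandwidth {bw} Mbps -> request next segment from {pick_rendition(bw)}")
```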
Strategy 3: Low-Latency Protocols
The Problem with TCP
The standard protocol for most internet data is TCP (Transmission Control Protocol). TCP is designed for reliability. If a packet is lost, TCP holds back all subsequent data and retransmits the lost packet to ensure that the file arrives perfectly, a behavior known as head-of-line blocking. For downloading a file or loading a webpage, this is exactly what you want. But for a live video call, this is a disaster. If a small packet of audio is lost, it is far better to just skip it and move on than to halt the entire conversation to wait for a retransmission of sound that would now be out of date.
The Rise of UDP and Real-Time Protocols
This is why real-time streaming applications almost universally use UDP (User Datagram Protocol). UDP is a fire-and-forget protocol. It sends packets without any guarantee of delivery or order. This sounds bad, but for live streaming, it is perfect. It allows data to flow as fast as possible without getting bogged down by retransmissions. The loss of a single packet might result in a tiny, momentary visual glitch, which is far preferable to a multi-second freeze.
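A few lines of Python show how little ceremony UDP involves: sendto() returns immediately, with no handshake and no acknowledgment (the address and payload here are placeholders):

```python
# UDP "fire-and-forget" in miniature: no connection setup, no delivery
# guarantee. The destination address and payload are placeholders.
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
payload = b"\x00" * 1200                        # media packets typically stay under the MTU
sock.sendto(payload, ("203.0.113.10", 5004))    # example address and port
sock.close()
```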
To add necessary functionality like timestamps and sequence numbers on top of UDP, streaming applications use higher-level protocols:
- RTP (Real-time Transport Protocol): This is the standard for carrying the actual media (audio and video) data. It adds sequence numbers so the receiver can detect packet loss and reorder packets that arrive out of sequence, and it adds timestamps to synchronize the audio and video streams.
- RTCP (RTP Control Protocol): This works alongside RTP. It sends periodic reports back to the sender containing statistics about the connection, such as packet loss rates and jitter. This feedback allows the sender to adapt its encoding bitrate in real-time.
- Newer Protocols (SRT, WebRTC): Technologies like Secure Reliable Transport (SRT) and Web Real-Time Communication (WebRTC) build upon these concepts to provide even lower latency (sub-second) and better performance over unreliable public internet connections.
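To make the RTP layer concrete, here is a minimal sketch that packs the fixed 12-byte RTP header defined in RFC 3550. The sequence-number and timestamp fields are exactly the mechanisms described above, while the payload type and SSRC values are illustrative:

```python
# Minimal RTP packet builder following the fixed header layout in
# RFC 3550. The payload type, SSRC, and clock rate are illustrative.
import struct

def build_rtp_packet(seq: int, timestamp: int, payload: bytes,
                     payload_type: int = 96, ssrc: int = 0x13579BDF) -> bytes:
    version = 2                                   # RTP version 2
    first_byte = version << 6                     # no padding, no extension, CC=0
    second_byte = payload_type & 0x7F             # marker bit = 0
    header = struct.pack("!BBHII",
                         first_byte, second_byte,
                         seq & 0xFFFF,            # sequence number: detect loss/reordering
                         timestamp & 0xFFFFFFFF,  # media timestamp: A/V synchronization
                         ssrc)                    # stream identifier
    return header + payload

# Sequence numbers increment per packet; the timestamp advances with the
# media clock (e.g., 90 kHz for video), not with wall-clock time.
pkt = build_rtp_packet(seq=1, timestamp=90_000, payload=b"...encoded frame data...")
print(len(pkt), "bytes including the 12-byte header")
```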