Video Compression
The magic behind video files: inter-frame and intra-frame compression in standards like MPEG and M-JPEG.
Why Video Compression is a Digital Miracle
If static image compression is about smartly packing a suitcase for a trip, then video compression is akin to orchestrating the logistics for a globe-trotting rock band tour. The scale of the data is monumentally larger, and the challenges are far more complex. Video is not just a collection of images; it is a sequence of images displayed in rapid succession, creating the illusion of motion. This sequential nature introduces a new dimension of data that must be managed: time.
Let's consider the raw data size of a short, uncompressed video clip. A single frame of a standard Full HD video (1920 × 1080 pixels) using 24-bit color (8 bits for each red, green, and blue channel) requires:

1920 × 1080 pixels × 24 bits = 49,766,400 bits

Or about 6.2 MB per frame.
Standard video runs at about 30 frames per second. So, for just one second of uncompressed video, we would need:

6.2 MB × 30 frames ≈ 187 MB per second
A one-minute video would be over 11 gigabytes. A two-hour movie would be over a terabyte. Streaming or storing such massive files would be utterly impractical for consumer applications. This is why effective video compression is not just a convenience; it is the core enabling technology behind streaming services like Netflix, video conferencing tools like Zoom, and even the simple act of sharing a video from your smartphone.
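These figures are easy to verify. Here is a minimal Python sketch of the arithmetic above; the constants are simply the Full HD parameters from the text:

```python
# Back-of-envelope check: raw size of uncompressed Full HD video
# at 24 bits per pixel and 30 frames per second.

WIDTH, HEIGHT = 1920, 1080      # Full HD resolution
BITS_PER_PIXEL = 24             # 8 bits each for R, G, B
FPS = 30                        # frames per second

bits_per_frame = WIDTH * HEIGHT * BITS_PER_PIXEL
mb_per_frame = bits_per_frame / 8 / 1e6          # megabytes per frame
mb_per_second = mb_per_frame * FPS
gb_per_minute = mb_per_second * 60 / 1e3

print(f"{mb_per_frame:.1f} MB per frame")        # ~6.2 MB
print(f"{mb_per_second:.0f} MB per second")      # ~187 MB
print(f"{gb_per_minute:.1f} GB per minute")      # ~11.2 GB
```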
Video compression algorithms achieve their incredible efficiency by exploiting two fundamental types of redundancy:
- Spatial Redundancy (Intra-frame): This is the redundancy within a single frame, just like in a static image. It refers to areas of the same or similar color, like a blue sky or a white wall.
- Temporal Redundancy (Inter-frame): This is the redundancy between consecutive frames. In most videos, the change from one frame to the next is very small: a person talks, a car drives across the screen, but the background remains static or changes predictably. This is where the biggest savings in video compression come from; the toy sketch after this list shows just how little changes.
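To make temporal redundancy concrete, here is a toy Python sketch with synthetic frames (not a real codec): a small object moves two pixels between two consecutive frames, and almost nothing else differs:

```python
import numpy as np

# Two consecutive "frames" that differ only where a small object moved.
frame1 = np.zeros((64, 64), dtype=np.int16)
frame1[20:28, 10:18] = 200          # a bright 8x8 object

frame2 = frame1.copy()
frame2[20:28, 10:18] = 0            # object leaves its old position...
frame2[20:28, 12:20] = 200          # ...and reappears 2 pixels to the right

residual = frame2 - frame1          # what actually changed between frames
changed = np.count_nonzero(residual)
print(f"{changed} of {residual.size} pixels changed "
      f"({100 * changed / residual.size:.1f}%)")   # under 1% of the frame
```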
Intra-frame vs. Inter-frame Compression: The Two Pillars of Video Codecs
Modern video codecs, the software or hardware that handles compression and decompression, employ a sophisticated combination of two distinct strategies. One handles each frame as a standalone picture, while the other cleverly exploits the similarities between frames over time.
Intra-frame Compression: Compressing the Still Picture
Intra-frame compression addresses spatial redundancy. It treats a single video frame as if it were a static photograph and compresses it on its own, without reference to any preceding or succeeding frames. The process is virtually identical to the JPEG compression algorithm:
- The frame is divided into small blocks of pixels (e.g., 8 × 8).
- Each block is processed by the Discrete Cosine Transform (DCT), which converts the spatial pixel values into frequency coefficients, separating the block's essential visual information (low frequencies) from its fine details (high frequencies).
- The coefficients are then quantized, a step where less perceptually important high-frequency details are discarded or represented with less precision. This is the primary source of data reduction and quality loss.
- The resulting quantized coefficients are finally compressed losslessly using techniques like Huffman coding.
Frames that are compressed using only this method are called I-frames (Intra-frames). They are the backbone of a video stream.
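As a rough illustration of the DCT-and-quantize steps above, here is a minimal Python sketch on a single 8 × 8 block. The single quantizer step `Q = 16` is an arbitrary assumption; real codecs use standard quantization matrices, zig-zag scanning, and entropy coding on top of this:

```python
import numpy as np

N = 8
# Orthonormal DCT-II basis matrix, so the 2-D DCT of B is C @ B @ C.T.
C = np.array([[np.sqrt((1 if k == 0 else 2) / N)
               * np.cos(np.pi * (2 * n + 1) * k / (2 * N))
               for n in range(N)] for k in range(N)])

block = np.tile(np.linspace(50, 80, N), (N, 1))   # smooth gradient block
coeffs = C @ block @ C.T                          # frequency coefficients

Q = 16                                            # assumed quantizer step
quantized = np.round(coeffs / Q).astype(int)

print(np.count_nonzero(quantized), "of 64 coefficients survive")
# For a smooth block, nearly all high-frequency coefficients quantize
# to zero, which is exactly where the data reduction comes from.
```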
Inter-frame Compression: Predicting the Future from the Past
This is where the true power of video compression lies. Inter-frame compression addresses temporal redundancy. Instead of encoding every frame from scratch, it predicts the content of the current frame based on one or more previously decoded frames (known as reference frames).
The core technique used here is Motion Compensation. Here is how it works:
- The encoder divides the current frame into blocks (often called macroblocks, e.g., 16 × 16 pixels).
- For each macroblock in the current frame, the encoder searches a reference frame to find the block of pixels that is the closest match.
- Instead of encoding the macroblock's actual pixels, the encoder records a motion vector. This vector is essentially an instruction like: "take the block from position (x, y) in the last frame and move it to position (x', y') in this frame."
- Often, the predicted block is not a perfect match. The encoder then calculates the difference between the actual macroblock and its motion-compensated prediction. This difference, called the residual, is what gets compressed (using DCT and quantization) and sent. Since the prediction is usually very good, the residual contains very little information and compresses very efficiently; a minimal sketch of the search follows this list.
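The following Python sketch shows exhaustive block matching with a sum-of-absolute-differences (SAD) cost, the simplest form of motion estimation. The frames, block size, and search radius are toy assumptions; real encoders use hierarchical search patterns and sub-pixel precision:

```python
import numpy as np

def find_motion_vector(reference, current, bx, by, block=16, radius=8):
    """Find (dx, dy) so the reference block at (bx+dx, by+dy)
    best predicts the current block at (bx, by)."""
    target = current[by:by + block, bx:bx + block].astype(np.int32)
    best = (0, 0, np.inf)
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            ry, rx = by + dy, bx + dx
            if (ry < 0 or rx < 0
                    or ry + block > reference.shape[0]
                    or rx + block > reference.shape[1]):
                continue  # candidate block falls outside the frame
            candidate = reference[ry:ry + block, rx:rx + block].astype(np.int32)
            sad = np.abs(target - candidate).sum()
            if sad < best[2]:
                best = (dx, dy, sad)
    return best

# Toy frames: a bright object moves 3 pixels right between frames.
ref = np.zeros((64, 64), dtype=np.uint8)
ref[24:40, 16:32] = 255
cur = np.roll(ref, 3, axis=1)

dx, dy, sad = find_motion_vector(ref, cur, bx=16, by=24)
# The vector (-3, 0) says: fetch the prediction from 3 pixels to the
# left in the reference frame. The residual SAD is 0 (a perfect match).
print(f"motion vector ({dx}, {dy}), residual SAD {sad}")
```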
[Interactive motion compensation explorer: select a macroblock to see where the encoder pulls its prediction from. Vectors point from the reference frame into the current frame, and the macroblocks covering the moving object share the same horizontal vector.]
Frames that are encoded using this predictive method are called P-frames (Predicted frames) and B-frames (Bi-directionally predicted frames).
The Cast of Characters: I-frames, P-frames, and B-frames
Not all frames in a compressed video stream are created equal. Modern codecs in the MPEG family use a mix of three frame types to strike a balance between compression ratio, quality, and functionality.
I-frames (Intra-frames or Keyframes)
I-frames are self-contained. They are compressed using only intra-frame (spatial) techniques and do not depend on any other frames for decoding. They are essentially complete, standalone images (like a JPEG) embedded in the video stream.
Role: They serve as anchor points in the video. They are necessary to start playback, to allow the viewer to seek to a specific point in the video, and to recover from transmission errors. Since they contain the full picture information, they are the largest of the three frame types.
P-frames (Predicted frames)
P-frames are more efficient. They are encoded using motion compensation from the most recent preceding I-frame or P-frame. A P-frame stores only motion vectors and the residual data (the differences). This makes them significantly smaller than I-frames.
Role: They carry the "story" of the motion forward from one keyframe to the next, drastically reducing the amount of data needed to describe the changes between frames.
B-frames (Bi-directionally predicted frames)
B-frames offer the highest level of compression. They use motion compensation by looking both backward to a previous reference frame (I or P) and forward to a future reference frame (I or P). By being able to reference information from two directions, the encoder can often find an even better match (an interpolated prediction), resulting in a very small residual.
Role: B-frames fill in the gaps between I- and P-frames with maximum efficiency. They are the smallest frame type. Their use comes with a trade-off: they introduce latency, as the decoder must wait for the future reference frame to arrive before it can decode the B-frame. This also means the frames must be transmitted and decoded in an order different from their display order.
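This reordering is easy to see with a small sketch. The GOP pattern and reordering rule below are illustrative assumptions (each B-frame is held back until the anchor it references has been sent):

```python
# Display order vs. transmission/decode order for a small GOP.
# B-frames reference a *future* anchor, so that anchor is emitted first.

display = ["I1", "B2", "B3", "P4", "B5", "B6", "P7"]

decode, pending_b = [], []
for frame in display:
    if frame.startswith("B"):
        pending_b.append(frame)       # hold B-frames until the next anchor
    else:
        decode.append(frame)          # send the anchor (I/P) first...
        decode.extend(pending_b)      # ...then the B-frames that need it
        pending_b = []
decode.extend(pending_b)

print("display order:", " ".join(display))  # I1 B2 B3 P4 B5 B6 P7
print("decode order: ", " ".join(decode))   # I1 P4 B2 B3 P7 B5 B6
```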
[Interactive GOP structure playground: inspect how I, P, and B frames are ordered and why they shrink bitrate. B-frames appear later in encoder order because they wait for future references; the I-frame anchors the GOP, so the decoder can start there.]
Comparing Standards: M-JPEG vs. MPEG
The choice of compression standard has a profound impact on file size, quality, and usability. Two major families of standards illustrate different philosophies: Motion JPEG and the MPEG family.
M-JPEG (Motion JPEG): The Simple Approach
M-JPEG is the most straightforward video compression method. It treats a video as nothing more than a series of independent JPEG images streamed one after another.
- Method: It uses only intra-frame compression. Every single frame is an I-frame. It completely ignores temporal redundancy between frames.
- Pros: Extremely simple to implement. Frame-accurate editing is effortless since every frame is self-contained. There is no latency from frame reordering. Error resilience is high: a corrupted frame affects only that one frame.
- Cons: Highly inefficient. File sizes are much larger than with MPEG-based codecs at the same visual quality because it fails to exploit the massive redundancy between frames (the back-of-envelope sketch at the end of this section quantifies the gap).
- Use Cases: Professional video editing workflows (as an intermediate format), some medical imaging systems, and high-end security cameras where the integrity of every individual frame is prioritized over file size.
MPEG Family (MPEG-2, H.264, H.265, etc.)
The MPEG (Moving Picture Experts Group) family of standards represents the dominant approach to video compression. These standards are built around the powerful concept of motion-compensated inter-frame prediction.
- Method: Uses a mix of intra-frame compression (for I-frames) and inter-frame compression (for P- and B-frames) to remove both spatial and temporal redundancy.
- Pros: Extremely high compression efficiency, leading to small file sizes suitable for streaming, broadcast, and storage on consumer media like Blu-ray discs.
- Cons: Much more computationally complex for both encoding and decoding. Editing is more difficult and usually requires cutting only at I-frames to avoid re-encoding large sections. More susceptible to transmission errors: a corrupted I- or P-frame can affect the decoding of all subsequent frames until the next I-frame.
- Use Cases: Virtually all modern video applications: online streaming (YouTube, Netflix), digital television broadcast, video conferencing, Blu-ray and DVD discs, and smartphone video recording.
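As a closing back-of-envelope comparison, the sketch below contrasts an all-I-frame stream (the M-JPEG approach) with a typical MPEG-style GOP. The per-frame sizes are illustrative assumptions, not measurements; the qualitative point is that P- and B-frames are far smaller than I-frames:

```python
# Assumed per-frame sizes: P-frames are often several times smaller
# than I-frames, and B-frames smaller still.
i_size = 100_000   # bytes per I-frame (assumed)
p_size = 30_000    # bytes per P-frame (assumed)
b_size = 10_000    # bytes per B-frame (assumed)

gop = "IBBPBBPBBPBB"                   # one 12-frame GOP pattern
size = {"I": i_size, "P": p_size, "B": b_size}

mjpeg_bytes = len(gop) * i_size                    # every frame is an I-frame
mpeg_bytes = sum(size[f] for f in gop)             # mixed I/P/B frames

print(f"M-JPEG: {mjpeg_bytes:,} bytes per GOP")    # 1,200,000
print(f"MPEG:   {mpeg_bytes:,} bytes per GOP")     # 270,000
print(f"inter-frame coding is ~{mjpeg_bytes / mpeg_bytes:.1f}x smaller here")
```

Even with these rough numbers, exploiting temporal redundancy cuts the stream to roughly a quarter of its all-I-frame size, which is why the MPEG approach dominates everywhere that bandwidth or storage matters.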