Video Compression
The magic behind video files: inter-frame and intra-frame compression in standards like MPEG and M-JPEG.
Why Video Compression is a Digital Miracle
If static image compression is about smartly packing a suitcase for a trip, then video compression is akin to orchestrating the logistics for a globe-trotting rock band tour. The scale of the data is monumentally larger, and the challenges are far more complex. Video is not just a collection of images; it is a sequence of images displayed in rapid succession, creating the illusion of motion. This sequential nature introduces a new dimension of data that must be managed: time.
Let's consider the raw data size of a short, uncompressed video clip. A single frame of a standard Full HD video (1920 × 1080 pixels) using 24-bit color (8 bits for each red, green, and blue channel) requires:

1920 × 1080 pixels × 24 bits = 49,766,400 bits

Or about 6.2 MB per frame.
Standard video runs at about 30 frames per second. So, for just one second of uncompressed video, we would need:

6.2 MB × 30 frames ≈ 187 MB per second
A one-minute video would be over 11 gigabytes. A two-hour movie would be over a terabyte. Streaming or storing such massive files would be utterly impractical for consumer applications. This is why effective video compression is not just a convenience; it is the core enabling technology behind streaming services like Netflix, video conferencing tools like Zoom, and even the simple act of sharing a video from your smartphone.
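These figures are easy to verify. Here is a minimal Python sketch of the arithmetic above; the constants are simply the Full HD parameters from the text:

```python
# Back-of-envelope check: raw size of uncompressed Full HD video
# at 24 bits per pixel and 30 frames per second.

WIDTH, HEIGHT = 1920, 1080      # Full HD resolution
BITS_PER_PIXEL = 24             # 8 bits each for R, G, B
FPS = 30                        # frames per second

bits_per_frame = WIDTH * HEIGHT * BITS_PER_PIXEL
mb_per_frame = bits_per_frame / 8 / 1e6          # megabytes per frame
mb_per_second = mb_per_frame * FPS
gb_per_minute = mb_per_second * 60 / 1e3

print(f"{mb_per_frame:.1f} MB per frame")        # ~6.2 MB
print(f"{mb_per_second:.0f} MB per second")      # ~187 MB
print(f"{gb_per_minute:.1f} GB per minute")      # ~11.2 GB
```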
Video compression algorithms achieve their incredible efficiency by exploiting two fundamental types of redundancy:
- Spatial Redundancy (Intra-frame): This is the redundancy within a single frame, just like in a static image. It refers to areas of the same or similar color, like a blue sky or a white wall.
- Temporal Redundancy (Inter-frame): This is the redundancy between consecutive frames. In most videos, the change from one frame to the next is very small: a person talks, a car drives across the screen, but the background remains static or changes predictably. This is where the biggest savings in video compression come from; the toy sketch after this list shows just how little changes.
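To make temporal redundancy concrete, here is a toy Python sketch with synthetic frames (not a real codec): a small object moves two pixels between two consecutive frames, and almost nothing else differs:

```python
import numpy as np

# Two consecutive "frames" that differ only where a small object moved.
frame1 = np.zeros((64, 64), dtype=np.int16)
frame1[20:28, 10:18] = 200          # a bright 8x8 object

frame2 = frame1.copy()
frame2[20:28, 10:18] = 0            # object leaves its old position...
frame2[20:28, 12:20] = 200          # ...and reappears 2 pixels to the right

residual = frame2 - frame1          # what actually changed between frames
changed = np.count_nonzero(residual)
print(f"{changed} of {residual.size} pixels changed "
      f"({100 * changed / residual.size:.1f}%)")   # under 1% of the frame
```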
Intra-frame vs. Inter-frame Compression: The Two Pillars of Video Codecs
Modern video codecs, the software or hardware that handles compression and decompression, employ a sophisticated combination of two distinct strategies. One handles each frame as a standalone picture, while the other cleverly exploits the similarities between frames over time.
Intra-frame Compression: Compressing the Still Picture
Intra-frame compression addresses spatial redundancy. It treats a single video frame as if it were a static photograph and compresses it on its own, without reference to any preceding or succeeding frames. The process is virtually identical to the JPEG compression algorithm:
- The frame is divided into small blocks of pixels (e.g., 8 × 8).
- Each block is processed by the Discrete Cosine Transform (DCT), which converts the spatial pixel values into frequency coefficients, separating the block's essential visual information (low frequencies) from its fine details (high frequencies).
- The coefficients are then quantized, a step where less perceptually important high-frequency details are discarded or represented with less precision. This is the primary source of data reduction and quality loss.
- The resulting quantized coefficients are finally compressed losslessly using techniques like Huffman coding.
Frames that are compressed using only this method are called I-frames (Intra-frames). They are the backbone of a video stream.
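As a rough illustration of the DCT-and-quantize steps above, here is a minimal Python sketch on a single 8 × 8 block. The single quantizer step `Q = 16` is an arbitrary assumption; real codecs use standard quantization matrices, zig-zag scanning, and entropy coding on top of this:

```python
import numpy as np

N = 8
# Orthonormal DCT-II basis matrix, so the 2-D DCT of B is C @ B @ C.T.
C = np.array([[np.sqrt((1 if k == 0 else 2) / N)
               * np.cos(np.pi * (2 * n + 1) * k / (2 * N))
               for n in range(N)] for k in range(N)])

block = np.tile(np.linspace(50, 80, N), (N, 1))   # smooth gradient block
coeffs = C @ block @ C.T                          # frequency coefficients

Q = 16                                            # assumed quantizer step
quantized = np.round(coeffs / Q).astype(int)

print(np.count_nonzero(quantized), "of 64 coefficients survive")
# For a smooth block, nearly all high-frequency coefficients quantize
# to zero, which is exactly where the data reduction comes from.
```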
Inter-frame Compression: Predicting the Future from the Past
This is where the true power of video compression lies. Inter-frame compression addresses temporal redundancy. Instead of encoding every frame from scratch, it predicts the content of the current frame based on one or more previously decoded frames (known as reference frames).
The core technique used here is Motion Compensation. Here is how it works:
- The encoder divides the current frame into blocks (often called macroblocks, e.g., 16 × 16 pixels).
- For each macroblock in the current frame, the encoder searches a reference frame to find the block of pixels that is the closest match.
- Instead of encoding the macroblock's actual pixels, the encoder records a motion vector. This vector is essentially an instruction like: "take the block from position (x, y) in the last frame and move it to position (x', y') in this frame."
- Often, the predicted block is not a perfect match. The encoder then calculates the difference between the actual macroblock and its motion-compensated prediction. This difference, called the residual, is what gets compressed (using DCT and quantization) and sent. Since the prediction is usually very good, the residual contains very little information and compresses very efficiently; a minimal sketch of the search follows this list.
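The following Python sketch shows exhaustive block matching with a sum-of-absolute-differences (SAD) cost, the simplest form of motion estimation. The frames, block size, and search radius are toy assumptions; real encoders use hierarchical search patterns and sub-pixel precision:

```python
import numpy as np

def find_motion_vector(reference, current, bx, by, block=16, radius=8):
    """Find (dx, dy) so the reference block at (bx+dx, by+dy)
    best predicts the current block at (bx, by)."""
    target = current[by:by + block, bx:bx + block].astype(np.int32)
    best = (0, 0, np.inf)
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            ry, rx = by + dy, bx + dx
            if (ry < 0 or rx < 0
                    or ry + block > reference.shape[0]
                    or rx + block > reference.shape[1]):
                continue  # candidate block falls outside the frame
            candidate = reference[ry:ry + block, rx:rx + block].astype(np.int32)
            sad = np.abs(target - candidate).sum()
            if sad < best[2]:
                best = (dx, dy, sad)
    return best

# Toy frames: a bright object moves 3 pixels right between frames.
ref = np.zeros((64, 64), dtype=np.uint8)
ref[24:40, 16:32] = 255
cur = np.roll(ref, 3, axis=1)

dx, dy, sad = find_motion_vector(ref, cur, bx=16, by=24)
# The vector (-3, 0) says: fetch the prediction from 3 pixels to the
# left in the reference frame. The residual SAD is 0 (a perfect match).
print(f"motion vector ({dx}, {dy}), residual SAD {sad}")
```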
[Interactive motion compensation explorer: select a macroblock to see where the encoder pulls its prediction from. Vectors point from the reference frame into the current frame, and the macroblocks covering the moving object share the same horizontal vector.]
Frames that are encoded using this predictive method are called P-frames (Predicted frames) and B-frames (Bi-directionally predicted frames).
The Cast of Characters: I-frames, P-frames, and B-frames
Not all frames in a compressed video stream are created equal. Modern codecs in the MPEG family use a mix of three frame types to strike a balance between compression ratio, quality, and functionality.
I-frames (Intra-frames or Keyframes)
I-frames are self-contained. They are compressed using only intra-frame (spatial) techniques and do not depend on any other frames for decoding. They are essentially complete, standalone images (like a JPEG) embedded in the video stream.
Role: They serve as anchor points in the video. They are necessary to start playback, to allow the viewer to seek to a specific point in the video, and to recover from transmission errors. Since they contain the full picture information, they are the largest of the three frame types.
P-frames (Predicted frames)
P-frames are more efficient. They are encoded using motion compensation from the most recent preceding I-frame or P-frame. A P-frame stores only motion vectors and the residual data (the differences). This makes them significantly smaller than I-frames.
Role: They carry the "story" of the motion forward from one keyframe to the next, drastically reducing the amount of data needed to describe the changes between frames.
B-frames (Bi-directionally predicted frames)
B-frames offer the highest level of compression. They use motion compensation by looking both backward to a previous reference frame (I or P) and forward to a future reference frame (I or P). By being able to reference information from two directions, the encoder can often find an even better match (an interpolated prediction), resulting in a very small residual.
Role: B-frames fill in the gaps between I- and P-frames with maximum efficiency. They are the smallest frame type. Their use comes with a trade-off: they introduce latency, as the decoder must wait for the future reference frame to arrive before it can decode the B-frame. This also means the frames must be transmitted and decoded in an order different from their display order.
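This reordering is easy to see with a small sketch. The GOP pattern and reordering rule below are illustrative assumptions (each B-frame is held back until the anchor it references has been sent):

```python
# Display order vs. transmission/decode order for a small GOP.
# B-frames reference a *future* anchor, so that anchor is emitted first.

display = ["I1", "B2", "B3", "P4", "B5", "B6", "P7"]

decode, pending_b = [], []
for frame in display:
    if frame.startswith("B"):
        pending_b.append(frame)       # hold B-frames until the next anchor
    else:
        decode.append(frame)          # send the anchor (I/P) first...
        decode.extend(pending_b)      # ...then the B-frames that need it
        pending_b = []
decode.extend(pending_b)

print("display order:", " ".join(display))  # I1 B2 B3 P4 B5 B6 P7
print("decode order: ", " ".join(decode))   # I1 P4 B2 B3 P7 B5 B6
```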
[Interactive GOP structure playground: inspect how I, P, and B frames are ordered and why they shrink bitrate. B-frames appear later in encoder order because they wait for future references; the I-frame anchors the GOP, so the decoder can start there.]
Comparing Standards: M-JPEG vs. MPEG
The choice of compression standard has a profound impact on file size, quality, and usability. Two major families of standards illustrate different philosophies: Motion JPEG and the MPEG family.
M-JPEG (Motion JPEG): The Simple Approach
M-JPEG is the most straightforward video compression method. It treats a video as nothing more than a series of independent JPEG images streamed one after another.
- Method: It uses only intra-frame compression. Every single frame is an I-frame. It completely ignores temporal redundancy between frames.
- Pros: Extremely simple to implement. Frame-accurate editing is effortless since every frame is self-contained. There is no latency from frame reordering. Error resilience is high: a corrupted frame affects only that one frame.
- Cons: Highly inefficient. File sizes are much larger than with MPEG-based codecs at the same visual quality because it fails to exploit the massive redundancy between frames (the back-of-envelope sketch at the end of this section quantifies the gap).
- Use Cases: Professional video editing workflows (as an intermediate format), some medical imaging systems, and high-end security cameras where the integrity of every individual frame is prioritized over file size.
MPEG Family (MPEG-2, H.264, H.265, etc.)
The MPEG (Moving Picture Experts Group) family of standards represents the dominant approach to video compression. These standards are built around the powerful concept of motion-compensated inter-frame prediction.
- Method: Uses a mix of intra-frame compression (for I-frames) and inter-frame compression (for P- and B-frames) to remove both spatial and temporal redundancy.
- Pros: Extremely high compression efficiency, leading to small file sizes suitable for streaming, broadcast, and storage on consumer media like Blu-ray discs.
- Cons: Much more computationally complex for both encoding and decoding. Editing is more difficult and usually requires cutting only at I-frames to avoid re-encoding large sections. More susceptible to transmission errors: a corrupted I- or P-frame can affect the decoding of all subsequent frames until the next I-frame.
- Use Cases: Virtually all modern video applications: online streaming (YouTube, Netflix), digital television broadcast, video conferencing, Blu-ray and DVD discs, and smartphone video recording.
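As a closing back-of-envelope comparison, the sketch below contrasts an all-I-frame stream (the M-JPEG approach) with a typical MPEG-style GOP. The per-frame sizes are illustrative assumptions, not measurements; the qualitative point is that P- and B-frames are far smaller than I-frames:

```python
# Assumed per-frame sizes: P-frames are often several times smaller
# than I-frames, and B-frames smaller still.
i_size = 100_000   # bytes per I-frame (assumed)
p_size = 30_000    # bytes per P-frame (assumed)
b_size = 10_000    # bytes per B-frame (assumed)

gop = "IBBPBBPBBPBB"                   # one 12-frame GOP pattern
size = {"I": i_size, "P": p_size, "B": b_size}

mjpeg_bytes = len(gop) * i_size                    # every frame is an I-frame
mpeg_bytes = sum(size[f] for f in gop)             # mixed I/P/B frames

print(f"M-JPEG: {mjpeg_bytes:,} bytes per GOP")    # 1,200,000
print(f"MPEG:   {mpeg_bytes:,} bytes per GOP")     # 270,000
print(f"inter-frame coding is ~{mjpeg_bytes / mpeg_bytes:.1f}x smaller here")
```

Even with these rough numbers, exploiting temporal redundancy cuts the stream to roughly a quarter of its all-I-frame size, which is why the MPEG approach dominates everywhere that bandwidth or storage matters.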