H.264/AVC Video Compression

Advanced video coding standard with motion compensation and prediction.

A New Era for Video: The Need for H.264

By the early 2000s, digital video had become mainstream, largely thanks to the MPEG-2 standard, the technology that powered DVDs and early digital television broadcasts. While revolutionary for its time, MPEG-2 had limitations. Its compression efficiency was not sufficient for the burgeoning world of the internet, where bandwidth was limited and the demand for streaming video was growing exponentially. A standard-definition movie on a DVD could take up several gigabytes; streaming that over a typical DSL connection of the era was simply not feasible.

This challenge spurred a collaborative effort between two major standards organizations: the ITU-T Video Coding Experts Group and the ISO/IEC Moving Picture Experts Group. Their goal was ambitious: to create a new video compression standard that could deliver the same visual quality as MPEG-2 while using only half the data. The result of this project, finalized in 2003, was a groundbreaking new standard known by two names: H.264, by the ITU-T, and MPEG-4 Part 10, Advanced Video Coding (or AVC), by ISO/IEC. This standard was so successful that it became the single most dominant technology for video distribution, forming the backbone of everything from Blu-ray Discs and online streaming (YouTube, Netflix) to video conferencing and mobile video recording.

The Core Philosophy: An Evolutionary Leap

H.264 did not reinvent video compression from the ground up. Instead, it took the proven concepts of its predecessors and enhanced them with a suite of sophisticated new tools and a much greater degree of flexibility. The fundamental principles of reducing spatial redundancy within frames and temporal redundancy between frames remained, but H.264 executed them with far greater precision and intelligence.

Quarter-pel motion explorer

[Interactive demo: move the motion vector and see how sub-pixel sampling shrinks residual energy. The demo compares a 6×6 reference block patch, the 4×4 current-frame ground truth, and the predicted block, and reports mean squared error, PSNR, residual variance, and the change in MSE versus half-pixel precision, all measured against a ground-truth motion vector of (0.75, 0.50) pixels. Quarter-pixel mode unlocks 0.25-pixel offset increments.]

Intra-frame Prediction: Smarter Still Images

Like older codecs, H.264 uses I-frames (Intra-frames) as periodic anchor points that are compressed without reference to other frames. However, H.264's I-frames are compressed much more efficiently. Instead of simply applying a JPEG-like DCT compression to a block, H.264 uses a clever technique called intra prediction. The encoder predicts the content of a block based on its already coded neighbors above and to the left. It can choose from several prediction modes (e.g., predicting a vertical pattern, a horizontal pattern, or just the average color) and only encodes the small difference between the actual block and its prediction. This significantly reduces the data needed even for keyframes.
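
To make this concrete, here is a minimal sketch of three of the simplest 4×4 intra prediction modes: vertical, horizontal, and DC (average). The mode names match H.264's, but the floating-point arithmetic, the tiny mode set, and the encoder loop are simplifications for illustration, not the standard's exact procedure.

```python
import numpy as np

def predict_block(above, left, mode):
    """Build a 4x4 prediction from already-decoded neighbor pixels."""
    if mode == "vertical":    # copy the row of pixels above into every row
        return np.tile(above, (4, 1))
    if mode == "horizontal":  # copy the column of pixels on the left into every column
        return np.tile(left.reshape(4, 1), (1, 4))
    if mode == "dc":          # flat block at the mean of the neighbors
        return np.full((4, 4), (above.mean() + left.mean()) / 2)
    raise ValueError(mode)

def best_intra_mode(block, above, left):
    """Pick the mode whose prediction leaves the least residual energy."""
    best = None
    for mode in ("vertical", "horizontal", "dc"):
        residual = block - predict_block(above, left, mode)
        energy = float((residual ** 2).sum())
        if best is None or energy < best[1]:
            best = (mode, energy, residual)
    return best

# A block dominated by vertical structure: vertical prediction should win,
# and only the small residual values would need to be transformed and coded.
above = np.array([50.0, 50.0, 200.0, 200.0])
left = np.array([120.0, 120.0, 120.0, 120.0])
block = np.tile(above, (4, 1)) + np.random.default_rng(0).normal(0, 2, (4, 4))
mode, energy, _ = best_intra_mode(block, above, left)
print(mode, round(energy, 1))
```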

Inter-frame Prediction: Motion Compensation on Steroids

The greatest advancements in H.264 are in how it handles P-frames and B-frames. It builds upon the idea of motion compensation but adds several powerful enhancements that make its predictions far more accurate. More accurate predictions mean the residual (the difference that needs to be encoded) is much smaller, leading to superior compression. These enhancements form the "toolkit" that gives H.264 its power.

The H.264 Toolkit: A Deep Dive into the Innovations

The roughly 50% efficiency gain of H.264 over MPEG-2 comes from a collection of powerful new features. Each one provides a way to make more accurate predictions and encode the resulting data more efficiently.

  • 1. Variable Block-Size Motion Compensation

    Older standards like MPEG-2 were rigid. They divided the image into fixed-size macroblocks, typically 16×16 pixels, and performed motion compensation for this entire block with a single motion vector. This worked well for large, uniformly moving areas, but it was inefficient for areas with complex motion or fine details.

    H.264 introduced the revolutionary concept of variable block sizes. A 16×16 macroblock can be partitioned into smaller blocks for more precise motion compensation. It can be divided into 16×8, 8×16, or 8×8 blocks. Each of those 8×8 sub-blocks can be further partitioned into 8×4, 4×8, or even tiny 4×4 blocks. This allows the encoder to use large blocks for simple backgrounds and very small blocks to track the intricate movement of small objects or detailed textures. This adaptability results in much more accurate predictions; a small partition-cost sketch appears after this list.

    [Interactive demo: Adaptive macroblock splitter. Inspect how H.264 retiles a 16×16 macroblock depending on local detail and motion: a single large block carries one motion vector with the lowest overhead, medium partitions balance detail against signalling cost, and small blocks track intricate textures or thin motion (shoulders against a moving background, window edges, logo animations). Finer partitions drive residual energy down, but the encoder must spend more bits on headers and CABAC syntax.]

  • 2. Quarter-Pixel Motion Vector Precision

    Another leap in accuracy came from motion vector precision. MPEG-2 could reference positions with half-pixel accuracy. H.264 pushes this to quarter-pixel precision.

    This does not mean the encoder sees imaginary pixels. Instead, it creates these "sub-pixel" positions by mathematically interpolating between the actual pixels of the reference frame. For example, a half-pixel position is calculated by filtering the neighboring full pixels (for luma, H.264 uses a six-tap filter rather than a simple average), and a quarter-pixel position is then calculated by averaging between a real pixel and a calculated half-pixel position. By being able to reference these much finer, interpolated positions, the encoder can find a much closer match for a block that has moved a very small, non-integer amount. This leads to a smaller prediction error (residual) and thus better compression; an interpolation sketch appears after this list.

  • 3. Multiple Reference Frames

    Older codecs were limited to referencing only the immediately preceding frame for motion prediction. H.264 breaks this limitation by allowing the encoder to store and reference multiple previous (and for B-frames, future) frames. An encoder might be configured to keep up to 16 reference frames in memory.

    This is incredibly powerful for several scenarios:

    • Occlusion: If an object is temporarily hidden behind another object and then reappears, the encoder can reference the frame from before it was hidden to find a perfect match.
    • Repetitive Motion: For oscillating motions like a waving hand or a bouncing ball, the encoder can reference an earlier frame where the object was in the same position.
    • Fading Scenes: During a cross-fade, referencing frames from both the outgoing and incoming scenes can create a more efficient prediction.
  • 4. In-Loop Deblocking Filter

    A common and visually jarring artifact of block-based compression (like DCT) is "blocking." Because each block is compressed independently, subtle discontinuities can appear at the block edges, making the image look like it is composed of a visible grid, especially at low bitrates.

    H.264 addresses this by including an adaptive deblocking filter as a mandatory part of the decoding process. This filter intelligently analyzes the pixels across block boundaries and applies a smoothing algorithm to reduce the visibility of these edges. Crucially, this filtering is done "in-loop," which means the cleaned-up, deblocked frame is then used as the reference for predicting future frames. This not only improves the visual quality of the current frame but also provides a higher-quality reference for the next frames, improving overall compression efficiency and preventing the accumulation of artifacts; a toy boundary filter appears after this list.

  • 5. Advanced Entropy Coding (CABAC)

    The final step of compression is entropy coding, which losslessly packs the quantized data. While H.264 supports a simpler method (CAVLC) for less demanding applications, its high-efficiency mode uses Context-Adaptive Binary Arithmetic Coding (CABAC); a toy illustration of the context-adaptation idea follows this list.

    Unlike Huffman coding, which uses a fixed statistical model for the entire frame, CABAC is adaptive. It understands that the probability of a symbol occurring depends on its context, that is, the values of its neighbors. For example, if the coefficients of neighboring blocks were large, it is more likely that the current block's coefficients will also be large. CABAC dynamically updates its probability models as it processes the data, using the most appropriate context to predict and encode each bit. This context adaptation allows it to squeeze out more redundancy, providing a compression gain of 10 to 15 percent over older entropy coding methods for the same visual quality.
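
First, the partition-size tradeoff from item 1. The sketch below scores a 16×16 block coded as one, four, or sixteen motion partitions, charging each partition an assumed fixed signalling cost; MV_BITS, LAMBDA, and SEARCH are illustrative constants, not values from the standard. With the synthetic uniform motion used here, the single large partition should win; content whose parts move differently would tip the balance toward smaller blocks.

```python
import numpy as np

MV_BITS = 7     # assumed average bits to signal one motion vector + header
MV_LAMBDA = 2.0 # assumed rate-distortion weight converting bits to distortion units
SEARCH = 4      # integer-pel search range in each direction

def sad(a, b):
    """Sum of absolute differences: a cheap residual-energy proxy."""
    return float(np.abs(a - b).sum())

def best_match(block, window):
    """Exhaustively search `window` for the lowest-SAD match to `block`."""
    n = block.shape[0]
    best = float("inf")
    for dy in range(window.shape[0] - n + 1):
        for dx in range(window.shape[1] - n + 1):
            best = min(best, sad(block, window[dy:dy+n, dx:dx+n]))
    return best

def partition_cost(block16, ref_padded, size):
    """Rate-distortion cost of coding a 16x16 block as size x size partitions."""
    total = 0.0
    for y in range(0, 16, size):
        for x in range(0, 16, size):
            window = ref_padded[y : y + size + 2*SEARCH,
                                x : x + size + 2*SEARCH]
            total += best_match(block16[y:y+size, x:x+size], window)
            total += MV_LAMBDA * MV_BITS  # every partition pays for its own vector
    return total

rng = np.random.default_rng(1)
ref_padded = rng.normal(128, 30, (16 + 2*SEARCH, 16 + 2*SEARCH))
# Current block: the reference content shifted by (2, 1) pixels, plus noise.
block16 = ref_padded[SEARCH+2 : SEARCH+18, SEARCH+1 : SEARCH+17] \
          + rng.normal(0, 1, (16, 16))

for size in (16, 8, 4):
    print(f"{size}x{size} partitions: cost {partition_cost(block16, ref_padded, size):.0f}")
```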
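
Next, the sub-pixel interpolation from item 2, reduced to one dimension. The six-tap weights (1, -5, 20, 20, -5, 1)/32 are H.264's actual luma half-pixel filter; the floating-point arithmetic and the missing clipping are simplifications of the standard's integer procedure.

```python
import numpy as np

TAPS = np.array([1, -5, 20, 20, -5, 1]) / 32.0  # H.264's six-tap half-pel filter

def half_pel(samples, i):
    """Interpolate the half-pixel position between samples[i] and samples[i+1]."""
    return float(TAPS @ samples[i-2 : i+4])     # six integer pixels around the gap

def quarter_pel(samples, i):
    """Quarter-pixel position: average of a real pixel and a half-pel value."""
    return 0.5 * (samples[i] + half_pel(samples, i))

row = np.array([60, 64, 70, 78, 88, 100, 114, 130], dtype=float)
print(half_pel(row, 3))     # value at position 3.5, between pixels 78 and 88
print(quarter_pel(row, 3))  # value at position 3.25
```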
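
The deblocking filter from item 4 can be caricatured in a few lines: smooth across a block boundary only when the step there is small enough to look like a coding artifact rather than a genuine image edge. The thresholds and the update rule below are illustrative; the real filter derives its thresholds from the quantization parameter and classifies each edge by boundary strength.

```python
import numpy as np

ALPHA, BETA = 12, 4  # illustrative thresholds, not the standard's QP-derived ones

def deblock_boundary(row, edge):
    """Soften the block edge between row[edge-1] and row[edge]."""
    p1, p0, q0, q1 = row[edge-2 : edge+2]
    # Filter only if the discontinuity is small and both sides are locally smooth.
    if abs(p0 - q0) < ALPHA and abs(p1 - p0) < BETA and abs(q1 - q0) < BETA:
        delta = (q0 - p0) / 4.0   # pull the two sides toward each other
        row[edge-1] += delta
        row[edge] -= delta
    return row

artifact = np.array([100, 101, 102, 110, 111, 112], dtype=float)   # small step
real_edge = np.array([100, 101, 102, 180, 181, 182], dtype=float)  # large step
print(deblock_boundary(artifact.copy(), 3))   # step is softened
print(deblock_boundary(real_edge.copy(), 3))  # genuine edge passes untouched
```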
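
Finally, the context-adaptation idea behind CABAC from item 5, stripped of binarization and the arithmetic coder itself. The toy model below keeps a separate adaptive probability estimate per context and sums the ideal coding cost of each bit; the context names are hypothetical stand-ins for "neighbors had large coefficients" versus "neighbors were flat."

```python
import math
from collections import defaultdict

class AdaptiveModel:
    """Per-context running estimate of P(bit = 1), updated after every symbol."""
    def __init__(self):
        self.counts = defaultdict(lambda: [1, 1])  # Laplace-smoothed [zeros, ones]

    def cost_and_update(self, context, bit):
        zeros, ones = self.counts[context]
        p = (ones if bit else zeros) / (zeros + ones)  # model's probability of this bit
        self.counts[context][bit] += 1                 # adapt to what was seen
        return -math.log2(p)                           # ideal arithmetic-coding cost

# Bits whose statistics depend on context: "busy" neighborhoods are mostly 1s,
# "flat" neighborhoods mostly 0s, but the overall mix is a useless 50/50.
stream = ([("busy", 1)] * 90 + [("busy", 0)] * 10
          + [("flat", 0)] * 90 + [("flat", 1)] * 10)

with_ctx, without_ctx = AdaptiveModel(), AdaptiveModel()
bits_ctx = sum(with_ctx.cost_and_update(ctx, b) for ctx, b in stream)
bits_one = sum(without_ctx.cost_and_update("global", b) for _, b in stream)
print(f"context-adaptive: {bits_ctx:.0f} bits; single model: {bits_one:.0f} bits")
```

Splitting the stream by context roughly halves the coded size in this toy; the same mechanism, applied across CABAC's many contexts, is where its 10 to 15 percent gain comes from.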

Profiles and Levels: A Standard for All Needs

H.264 is not a single, monolithic standard. It is a flexible framework with different tiers of complexity, defined by profiles and levels.

  • Profiles: A profile defines which parts of the H.264 "toolkit" are used. Different applications have different needs. A simple mobile video chat does not need the same complex features as a high-definition Blu-ray movie. Common profiles include:
    • Baseline Profile: Uses the simplest features. Does not use B-frames or CABAC. Designed for low-power devices and low-latency applications like video conferencing.
    • Main Profile: A middle ground used for standard-definition digital TV broadcasts. It adds support for B-frames and more advanced coding techniques.
    • High Profile: The most common profile for modern high-definition content. It includes all the advanced features, such as more flexible intra-prediction and quantization matrices, and is used for Blu-ray Discs and most online streaming services.
  • Levels: A level defines the performance limits for a given profile. It sets constraints on parameters like maximum resolution, frame rate, and bitrate. For example, Level 4.1 might support 1920×1080 at 30 frames per second, while a higher Level 5.1 would be needed for 4K resolution. This system ensures that a device, like a smartphone or a smart TV, only needs to be powerful enough to handle the levels it explicitly claims to support.
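
As a closing illustration, a level can be thought of as a capability contract that a decoder checks before accepting a stream. The limits below are a simplified, non-normative sketch echoing the examples above; real H.264 levels bound macroblock throughput, decoded-picture-buffer size, and bitrate rather than raw width and height.

```python
# Illustrative (not normative) per-level limits, echoing the text's examples.
LEVEL_LIMITS = {
    "4.1": {"max_width": 1920, "max_height": 1080, "max_fps": 30},
    "5.1": {"max_width": 4096, "max_height": 2160, "max_fps": 30},
}

def fits_level(level, width, height, fps):
    """Check a stream's basic parameters against a level's constraints."""
    lim = LEVEL_LIMITS[level]
    return (width <= lim["max_width"]
            and height <= lim["max_height"]
            and fps <= lim["max_fps"])

print(fits_level("4.1", 1920, 1080, 30))  # True: 1080p30 fits Level 4.1
print(fits_level("4.1", 3840, 2160, 30))  # False: 4K needs a higher level
```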