Arithmetic Coding

Advanced entropy coding technique achieving near-optimal compression ratios.

Beyond Integer Bits: A New Paradigm in Compression

In our exploration of lossless compression, we delved into Huffman coding, a brilliant method that assigns variable-length codes to symbols based on their frequency. It is optimal in a specific sense: no other prefix code can achieve a better average code length. However, it has an inherent limitation. Huffman coding must assign a whole, integer number of bits to each symbol. For example, a character might be assigned a 3-bit code or a 4-bit code, but never a 3.5-bit code.

Yet, as we learned from Shannon's Information Theory, the true information content of a symbol is often not an integer number of bits. A symbol might theoretically carry 2.58 bits of information. By being constrained to integer-length codes, Huffman coding can only approximate this theoretical limit. This raises a natural question: is there a method that can break free from this integer constraint and get even closer to the true entropy of the data? The answer is yes, and it is a powerful and elegant technique known as arithmetic coding.

Arithmetic coding represents a fundamentally different and more powerful approach to entropy coding. Instead of replacing each symbol with a distinct binary code, it takes an entire message and represents it as a single, high-precision fractional number between 0 and 1. This ingenious method effectively allows for a fractional number of bits to be allocated to each symbol, enabling compression ratios that are consistently closer to the theoretical limit defined by entropy.

The Core Concept: A Message as a Point on the Number Line

To understand arithmetic coding, it is best to discard the idea of one-to-one code substitution and instead visualize the process as progressively narrowing down a range on the number line.

The Initial Range

The process begins by defining an initial range, typically from 0.0 up to (but not including) 1.0, which we can write as [0, 1). This range represents the entire universe of possible messages that could be encoded.

Subdividing the Range by Probability

The next step is to divide this initial range into smaller sub-ranges, one for each possible symbol in our alphabet. Crucially, the size of each symbol's sub-range is directly proportional to its probability of occurrence.

  • A symbol with a high probability (like 'e' in English) will be allocated a large sub-range.
  • A symbol with a low probability (like 'z') will be allocated a very small sub-range.

This initial partitioning of the [0, 1) range forms the basis of our statistical model, which both the encoder and decoder must share.
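
As a minimal sketch of this idea (the alphabet and probabilities below are invented for illustration, and a real implementation would typically work from integer frequency counts), the model can be turned into cumulative intervals like this:

```python
# A sketch of turning a probability model into sub-ranges of [0, 1).
# The symbols and probabilities below are purely illustrative.
def build_intervals(probabilities):
    """Map each symbol to its [cumulative_low, cumulative_high) slice of [0, 1)."""
    intervals = {}
    cumulative = 0.0
    for symbol, p in probabilities.items():
        intervals[symbol] = (cumulative, cumulative + p)
        cumulative += p
    return intervals

print(build_intervals({'e': 0.5, 't': 0.375, 'z': 0.125}))
# -> {'e': (0.0, 0.5), 't': (0.5, 0.875), 'z': (0.875, 1.0)}
```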

Encoding as a Zooming Process

As the algorithm processes the input message symbol by symbol, it successively narrows down the range.

  1. When the first symbol of the message is read, the algorithm selects the sub-range corresponding to that symbol. This sub-range now becomes the new "current range".
  2. For the second symbol, the algorithm takes this new, smaller current range and subdivides it in exactly the same proportions as the original [0, 1) range.
  3. It then selects the sub-sub-range corresponding to the second symbol. This becomes the next current range.
  4. This process continues, with each symbol causing the algorithm to "zoom in" on a smaller and smaller portion of the number line.

After the last symbol has been processed, the algorithm is left with a final, extremely narrow range of numbers. Any number that falls within this final range uniquely identifies the original message. The output of the encoder is simply a binary representation of any such number, chosen to be as short as possible while still lying within that final range.
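
A toy Python sketch of this narrowing loop, assuming an interval table like the one sketched earlier (real coders use fixed-precision integer arithmetic with renormalization rather than Python floats, but the zooming logic is the same):

```python
def encode_range(message, intervals):
    """Narrow [0, 1) down to the final [low, high) range that identifies `message`.

    `intervals` maps each symbol to its (cumulative_low, cumulative_high)
    share of [0, 1), e.g. the output of a model like the one sketched above.
    """
    low, high = 0.0, 1.0
    for symbol in message:
        width = high - low
        sym_low, sym_high = intervals[symbol]
        # Zoom in: keep only this symbol's slice of the current range.
        low, high = low + width * sym_low, low + width * sym_high
    return low, high

# Illustrative alphabet; the intervals are made up for the example.
print(encode_range("et", {'e': (0.0, 0.5), 't': (0.5, 0.875), 'z': (0.875, 1.0)}))
# -> (0.25, 0.4375)
```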

A Detailed Encoding Example

Let's walk through the encoding process with a concrete example. We will encode the word "CACAO".

Step 0: Establish the Probability Model

First, we need the probabilities for each character in our alphabet. For this example, let's assume the following model:

Symbol   Probability   Initial Range
'C'      0.3           [0.0, 0.3)
'A'      0.2           [0.3, 0.5)
'O'      0.5           [0.5, 1.0)

We will track our current range using two variables: low and high. Initially, low = 0.0 and high = 1.0. The current range width is range = high - low.

Step 1: Encode the first symbol, 'C'

The first symbol is 'C'. According to our model, the range for 'C' is [0.0, 0.3). We select the corresponding sub-range within the current range (scaling according to the current range width).

  • Current range: [0.0, 1.0), so range = 1.0 - 0.0 = 1.0.
  • The new low bound becomes the old low bound plus the range multiplied by the start of C's range (its cumulative start on the 0–1 scale): 0.0 + (1.0 × 0.0) = 0.0.
  • The new high bound becomes the old low bound plus the range multiplied by the end of C's range (its cumulative end on the 0–1 scale): 0.0 + (1.0 × 0.3) = 0.3.
  • New Current Range: [0.0, 0.3)

Step 2: Encode the second symbol, 'A'

Now we work entirely within our new range [0.0, 0.3). We subdivide this smaller range using the same original proportions: we take the symbol's cumulative start and end on the 0–1 scale and scale them by the current range width.

  • Current range: [0.0, 0.3), so range = 0.3 - 0.0 = 0.3.
  • The range for 'A' is [0.3, 0.5): on the 0–1 scale, 'A' starts at 0.3 and ends at 0.5. We multiply these values by the current range width and add them to the low bound.
  • The new low bound: 0.0 + (0.3 × 0.3) = 0.0 + 0.09 = 0.09.
  • The new high bound: 0.0 + (0.3 × 0.5) = 0.0 + 0.15 = 0.15.
  • New Current Range: [0.09, 0.15)

Step 3, 4, 5: Repeat for 'C', 'A', 'O'

The process continues for the remaining characters: each time we take the symbol's cumulative start and end (on the 0–1 scale), multiply them by the current width, and add the current low bound to obtain the new sub-range. A short code check of these steps follows the list.

  • After 'C': the range becomes [0.09, 0.108). (width = 0.15 - 0.09 = 0.06; for 'C' [0.0, 0.3): low = 0.09 + 0.06 × 0.0 = 0.09; high = 0.09 + 0.06 × 0.3 = 0.09 + 0.018 = 0.108)
  • After 'A': the range becomes [0.0954, 0.099). (width = 0.108 - 0.09 = 0.018; for 'A' [0.3, 0.5): low = 0.09 + 0.018 × 0.3 = 0.09 + 0.0054 = 0.0954; high = 0.09 + 0.018 × 0.5 = 0.09 + 0.009 = 0.099)
  • After 'O': Final Range: [0.0972, 0.099). (width = 0.099 - 0.0954 = 0.0036; for 'O' [0.5, 1.0): low = 0.0954 + 0.0036 × 0.5 = 0.0954 + 0.0018 = 0.0972; high = 0.0954 + 0.0036 × 1.0 = 0.0954 + 0.0036 = 0.099)
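
A short, self-contained check of these hand calculations in toy floating-point Python (the printed values match the steps above up to rounding):

```python
# Reproduce the "CACAO" walkthrough with the Step 0 model.
intervals = {'C': (0.0, 0.3), 'A': (0.3, 0.5), 'O': (0.5, 1.0)}
low, high = 0.0, 1.0
for symbol in "CACAO":
    width = high - low
    sym_low, sym_high = intervals[symbol]
    low, high = low + width * sym_low, low + width * sym_high
    print(f"after {symbol!r}: [{low:.6f}, {high:.6f})")
# The last line printed should be (up to rounding) [0.097200, 0.099000).
```
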
Step 6: Finalizing and Outputting the Code

Our entire message "CACAO" is now represented by the final range [0.0972, 0.099). We can choose any number within this range as our encoded message. In practice, we choose the number whose binary fractional representation is shortest among numbers falling within this range.

  • For example, the number 0.09765625 falls within this range. Its binary fractional representation is 0.00011001. This binary string is the compressed output (one way to find such a number is sketched below).
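
One way to see where such a short codeword comes from is to search for the dyadic fraction (a number of the form k/2^n) with the fewest bits that lands inside the final range. The helper below is only a toy illustration of that idea; practical coders emit bits incrementally instead of searching:

```python
import math

def shortest_binary_fraction(low, high):
    """Return the fractional bits of a dyadic number k / 2**n inside [low, high),
    using as few bits as possible (a toy search, not a production bit emitter)."""
    n = 1
    while True:
        k = math.ceil(low * 2 ** n)   # smallest k with k / 2**n >= low
        if k / 2 ** n < high:
            return format(k, f"0{n}b")
        n += 1

print(shortest_binary_fraction(0.0972, 0.099))  # -> '00011001', i.e. 0.00011001 in binary
```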

The Decompression Process: Reversing the Zoom

The decompression process is the mirror image of encoding. The decoder must start with the exact same initial probability model. It receives the single encoded number (e.g., 0.09765625).

  1. Decode Symbol 1: The decoder looks at its initial range [0, 1). It checks which symbol's sub-range the value 0.09765625 falls into. It falls within [0.0, 0.3), so the first symbol is 'C'.
  2. Update Range: The decoder now knows the first symbol was 'C', so it updates its current range to [0.0, 0.3), exactly as the encoder did.
  3. Renormalize Value and Decode Symbol 2: To find the next symbol, the decoder remaps the received value into this new range. It calculates (0.09765625 - 0.0) / 0.3 ≈ 0.3255. It then checks where this new value falls in the original model: 0.3255 falls within the range for 'A', [0.3, 0.5). The second symbol is 'A'.
  4. Repeat: The process continues. The decoder zooms into A's range, renormalizes the value again, discovers that it corresponds to 'C', and so on, until the entire message "CACAO" is reconstructed perfectly; a toy decoder along these lines is sketched below.
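
A matching decoder sketch in the same toy floating-point style (here the decoder is simply told how many symbols to read; real implementations signal the end of the message with a dedicated end-of-message symbol or an explicit length):

```python
def decode(value, intervals, length):
    """Recover `length` symbols from a single fraction `value` in [0, 1).

    `intervals` is the same model the encoder used; the decoder must also know
    when to stop (here via `length`).
    """
    message = []
    for _ in range(length):
        for symbol, (sym_low, sym_high) in intervals.items():
            if sym_low <= value < sym_high:
                message.append(symbol)
                # Renormalize: stretch this symbol's sub-range back onto [0, 1).
                value = (value - sym_low) / (sym_high - sym_low)
                break
    return "".join(message)

intervals = {'C': (0.0, 0.3), 'A': (0.3, 0.5), 'O': (0.5, 1.0)}
print(decode(0.09765625, intervals, 5))  # -> 'CACAO'
```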

Arithmetic Coding vs. Huffman Coding: A Deeper Comparison

Now we can fully appreciate why arithmetic coding is theoretically superior to Huffman coding.

  • Bit allocation: Huffman coding assigns a fixed, whole number of bits to each symbol. Arithmetic coding effectively assigns a fractional number of bits per symbol, amortized over the whole message.
  • Optimality: Huffman coding is optimal among prefix codes, but the integer-bit constraint means it can only approximate the true entropy limit. Arithmetic coding consistently achieves compression ratios much closer to the theoretical Shannon entropy limit.
  • Model and coder separation: In Huffman coding, the statistical model (the tree) and the coding process are tightly intertwined, and changing the model requires rebuilding the tree. Arithmetic coding cleanly separates the statistical model from the coding mechanism, so the same arithmetic coder can be driven by different, highly complex adaptive models (such as Markov models).
  • Speed and complexity: Huffman coding is very fast, involving simple counting, sorting, and tree traversal. Arithmetic coding is slower, requiring high-precision arithmetic (multiplication and division) at each step.

In essence, the number of bits needed to encode a symbol with arithmetic coding is approximately -log2(P(symbol)), which directly reflects its information content; Huffman coding is forced to spend a whole number of bits on each symbol instead. This separation of the statistical model from the bit-encoding engine is a major advantage of arithmetic coding, making it highly flexible and adaptable.
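
For the model used in the walkthrough, this gap can be made concrete. Assuming the usual Huffman construction over these three probabilities, which assigns 1 bit to 'O' and 2 bits each to 'C' and 'A', the per-symbol averages compare as follows:

```python
import math

model = {'C': 0.3, 'A': 0.2, 'O': 0.5}   # probabilities from the example
huffman_bits = {'C': 2, 'A': 2, 'O': 1}  # code lengths a Huffman tree would assign

entropy = sum(-p * math.log2(p) for p in model.values())
huffman_avg = sum(model[s] * huffman_bits[s] for s in model)
print(f"entropy (ideal):  {entropy:.4f} bits/symbol")    # about 1.4855
print(f"Huffman average:  {huffman_avg:.4f} bits/symbol")  # 1.5000
```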

Practical Realities and Modern Use

Despite its theoretical superiority, arithmetic coding took longer to be widely adopted than Huffman coding. This was partly due to its higher computational complexity, but also significantly due to patent issues. Key aspects of the algorithm were patented by companies like IBM and AT&T in the 1980s, which discouraged its use in open standards and free software.

With the expiration of these patents in the 2000s, arithmetic coding has become a vital component of many state-of-the-art compression standards where achieving the best possible compression is paramount. It is a key part of:

  • Modern Image Compression: It is a core component of the JPEG 2000 image standard.
  • High-Efficiency Video Coding: It is used in the context-adaptive binary arithmetic coding (CABAC) entropy coding scheme within the H.264/AVC and H.265/HEVC video standards, which power most modern video streaming.
  • Other Specialized Compressors: It is used in various other high-performance data compressors where computational cost is secondary to compression ratio.

Arithmetic coding remains a beautiful example of how a profound mathematical idea can provide a near-perfect solution to a practical engineering problem. By reimagining an entire message as a single point in a numerical space, it offers the most efficient possible expression of data according to its statistical properties, solidifying its place as one of the most important algorithms in the history of data compression.
