Dictionary Compression

LZ77, LZ78, and modern variants like DEFLATE and LZ4.

A New Strategy for Finding Redundancy

In our exploration of lossless compression, we've encountered statistical methods like Huffman and Arithmetic coding, which achieve compression by focusing on the frequency of individual symbols. This approach is highly effective but overlooks a more structured, and often more significant, type of redundancy found in real-world data: the repetition of entire phrases, words, and byte sequences.

Consider a simple text file containing the sentence: "The quick brown fox jumps over the lazy dog." A statistical compressor would note the high frequency of characters like 'o' and the space, assigning them short codes. However, it would fail to capitalize on the fact that the word "the" appears twice (once capitalized). Intuitively, it seems inefficient to re-encode this common word character-by-character on its second appearance. Why not simply say "repeat the word we saw earlier"? This simple idea is the cornerstone of dictionary compression, arguably the most versatile and widely used family of lossless compression algorithms today. These algorithms, pioneered by Jacob Ziv and Abraham Lempel in the 1970s, have become the engines behind universal standards like ZIP, GZIP, and PNG.

The LZ77 Algorithm: Compression with a Sliding Window

Published in 1977, the LZ77 algorithm is the first of the two foundational works by Lempel and Ziv. Its genius lies in its simplicity and efficiency. Instead of building a massive, explicit dictionary of every phrase encountered, it uses a clever, constantly updating "memory" of the data it has just processed.

The Core Mechanism: The Sliding Window

The algorithm operates using a concept called a sliding window, which is a conceptual frame that moves over the input data. This window is divided into two parts:

  • Search Buffer: This is the "history" or the "implicit dictionary." It is a buffer of a fixed size (e.g., a few kilobytes) containing the most recently encoded data. The algorithm will search this buffer for repeated sequences.
  • Look-ahead Buffer: This is a smaller buffer containing the next block of data that is about to be encoded. The algorithm takes characters from this buffer and tries to find a match in the search buffer.

The LZ77 Encoding Process in Detail

The encoding proceeds in a continuous loop:

  1. The algorithm looks at the characters at the beginning of the look-ahead buffer.
  2. It then diligently searches the entire search buffer, from start to finish, to find the longest possible match for the sequence starting in the look-ahead buffer.
  3. Based on the search result, it makes a decision and outputs one of two types of tokens:
    • If NO match is found (or the match is shorter than a minimum threshold, e.g., 3 characters): the algorithm outputs a literal token. This is typically just the uncompressed character from the look-ahead buffer, often preceded by a flag bit to distinguish it from a pointer.
    • If a match IS found: The algorithm outputs a pointer token. This is not the data itself, but a compact reference that says, "go back and copy from the past." This pointer typically consists of three parts: a flag bit, an offset (or distance), and a length. For example: (0, distance, length).
      • The distance indicates how many bytes to go back into the search buffer to find the start of the match.
      • The length indicates how many bytes to copy from that point.
  4. Finally, the algorithm slides the window forward. It shifts the processed data (either the single literal character or the entire matched sequence of length L) from the look-ahead buffer into the search buffer, and new data from the input file is read to refill the look-ahead buffer. The process then repeats.
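To make the loop concrete, here is a minimal Python sketch of the encoder just described. The brute-force search, the window and length limits, and the exact token tuples are illustrative choices (following the flag convention from the text, with 0 marking a pointer); production encoders use hash chains and bit-packed output instead.

```python
def lz77_encode(data: bytes, window: int = 4096,
                min_len: int = 3, max_len: int = 15):
    """Toy LZ77 encoder. Emits literal tokens (1, byte) and
    pointer tokens (0, distance, length)."""
    tokens, pos = [], 0
    while pos < len(data):
        best_len = best_dist = 0
        start = max(0, pos - window)          # left edge of the search buffer
        for cand in range(start, pos):        # brute-force scan; real codecs hash
            length = 0
            # The match may run past `pos` into the look-ahead buffer:
            # this is what permits self-overlapping matches.
            while (length < max_len and pos + length < len(data)
                   and data[cand + length] == data[pos + length]):
                length += 1
            if length > best_len:
                best_len, best_dist = length, pos - cand
        if best_len >= min_len:
            tokens.append((0, best_dist, best_len))   # pointer: copy from history
            pos += best_len
        else:
            tokens.append((1, data[pos]))             # literal: pass byte through
            pos += 1
    return tokens

print(lz77_encode(b"abababababab"))
# [(1, 97), (1, 98), (0, 2, 10)] -- two literals, then "go back 2, copy 10"
```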

The Magic of Self-Overlapping Matches

A truly powerful feature of LZ77 is its ability to handle runs of repeating patterns with incredible efficiency. Consider the string "abababababab". After the initial "ab" has been processed and is in the search buffer, the look-ahead buffer holds "ababababab". The algorithm finds the sequence "ab" in the search buffer, and although only 2 bytes of history exist, it can issue a pointer that says "go back 2 bytes and copy 10 bytes."

A naive decompressor might get confused, but the LZ77 decompressor works byte-by-byte. It goes back 2 bytes, copies 'a', then 'b'. By the time it needs to copy the third byte, the first 'a' it just wrote is now in the history, available to be copied. This allows the decompressor to replicate long runs using a single, short pointer, making LZ77 extremely effective for data with simple repetitions.

Decompression: Simple and Fast

The beauty of LZ77 lies in its fast and simple decompression. The decompressor also maintains a search buffer (which is just a history of what it has already written to the output). When it receives a token:

  • If it is a literal token, it simply appends the literal character to its output.
  • If it is a pointer token (distance, length), it goes back 'distance' characters in its own output and copies 'length' characters to the end.

This process is extremely fast as it involves only simple memory-to-memory copies, which is a major reason for the popularity of LZ77-based schemes.
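A matching toy decoder (reusing the token format from the encoder sketch above) shows how short this hot loop really is, and how the byte-by-byte copy handles the self-overlapping matches discussed earlier:

```python
def lz77_decode(tokens) -> bytes:
    out = bytearray()                      # doubles as the decoder's history
    for token in tokens:
        if token[0] == 1:                  # literal token: append the byte
            out.append(token[1])
        else:                              # pointer token: (0, distance, length)
            _, distance, length = token
            start = len(out) - distance
            for i in range(length):
                # Byte-by-byte copy: when length > distance, bytes written
                # earlier in this same loop become part of the history and
                # are re-read, expanding self-overlapping matches correctly.
                out.append(out[start + i])
    return bytes(out)

# Round-trip check with the encoder sketch from the previous section.
assert lz77_decode(lz77_encode(b"abababababab")) == b"abababababab"
```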

The LZ78 and LZW Family: Using an Explicit Dictionary

The second foundational algorithm, LZ78 (published in 1978), and its highly successful successor, LZW (Lempel-Ziv-Welch, developed in 1984), take a different approach. Instead of using recent output as an implicit dictionary, they build an explicit table of phrases encountered so far.

The Core Mechanism: Building and Referencing a Phrasebook

We have covered the LZW algorithm in detail in its own dedicated section. To recap its core philosophy and contrast it with LZ77:

  1. Initialization: The algorithm starts with a dictionary pre-filled with all possible single-character symbols.
  2. Building Phrases: It reads the input and finds the longest sequence of characters that is currently present in the dictionary.
  3. Output and Update: It outputs the dictionary code for that longest found sequence. Then, it creates a new dictionary entry consisting of that sequence plus the very next character from the input.
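For contrast with the LZ77 sketch, here is a minimal Python version of that recap. The integer code values and table handling are simplified for illustration; real LZW implementations also pack codes into variable-width bit fields and cap or reset the table when it fills.

```python
def lzw_encode(data: bytes):
    """Toy LZW encoder: returns a list of integer codes."""
    dictionary = {bytes([i]): i for i in range(256)}   # step 1: all single bytes
    next_code, codes, phrase = 256, [], b""
    for value in data:
        candidate = phrase + bytes([value])
        if candidate in dictionary:        # step 2: extend the longest known phrase
            phrase = candidate
        else:
            codes.append(dictionary[phrase])     # step 3: emit code for the phrase...
            dictionary[candidate] = next_code    # ...and add phrase + next character
            next_code += 1
            phrase = bytes([value])
    if phrase:
        codes.append(dictionary[phrase])
    return codes

print(lzw_encode(b"ababab"))   # [97, 98, 256, 256]
```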

LZ77 vs. LZW: Key Differences in Approach

While both are dictionary coders, their strategies are distinct. LZ77 replaces repetitions with a (distance, length) pointer back into the raw data stream. LZW replaces repetitions with a single integer code that refers to an entry in a separately managed dictionary of phrases. This means LZW can be more compact if the exact same multi-character "words" appear often throughout the file, while LZ77 might be more effective at catching ad-hoc repetitions that do not necessarily form consistent "words".

Modern Hybrids: The Best of All Worlds

Today's most successful and widely used general-purpose lossless compressors are not pure implementations of these early algorithms. Instead, they are sophisticated hybrids that combine the phrase-matching power of dictionary coders with the statistical prowess of entropy coders.

DEFLATE: The Workhorse of the Internet (ZIP, GZIP, PNG)

The DEFLATE algorithm is arguably the most successful compression algorithm in history. It is the engine inside the ubiquitous ZIP and GZIP file formats and is also used for lossless compression in the PNG image format. DEFLATE is a brilliant combination of LZ77 and Huffman coding.

  1. Stage 1: LZ77 Duplicate Elimination: The first stage of DEFLATE is a standard LZ77 compressor. It scans the data with a sliding window (typically 32KB) and replaces repeated strings with (length, distance) pointers. Any character that is not part of a repeated sequence is passed through as a literal, uncompressed character.
  2. Stage 2: Huffman Entropy Coding: The output of the LZ77 stage is a mixed stream of literals and pointers. This stream is itself highly redundant. For example, certain match lengths are much more common than others, and certain literal characters (like 'e' and 'space') will be very frequent. DEFLATE exploits this by passing this mixed stream through a Huffman coding stage. It dynamically builds two Huffman trees: one to compress the literals and match lengths, and a second one to compress the match distances.

This two-stage approach is incredibly effective. The LZ77 stage handles large-scale, structural repetition, while the Huffman stage handles the small-scale statistical redundancy of the remaining data and the pointers themselves.
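Because Python's standard-library zlib module wraps DEFLATE, the whole two-stage pipeline can be exercised in a few lines (the sizes in the comment below are indicative, not exact guarantees):

```python
import zlib

# Repetitive input: the LZ77 stage collapses the repeats into pointers,
# then the Huffman stage compresses the literal/length/distance symbols.
data = b"The quick brown fox jumps over the lazy dog. " * 100

compressed = zlib.compress(data, 9)        # zlib is a thin wrapper around DEFLATE
print(len(data), "->", len(compressed))    # 4500 -> roughly a hundred bytes
assert zlib.decompress(compressed) == data
```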

LZ4: The Pursuit of Ultimate Speed

While DEFLATE provides excellent compression, its search for the longest match can be time-consuming. In many modern applications, especially those dealing with massive amounts of in-memory data, the speed of decompression is the most critical factor. For this need, algorithms like LZ4 were developed.

LZ4 is an LZ77 variant that prioritizes speed above all else. It achieves decompression speeds that are often an order of magnitude faster than DEFLATE. It does this through several optimizations, including using a hash table to find potential matches very quickly (rather than scanning the whole search buffer) and having a very simple output format that is easy for modern CPUs to parse. Its compression ratio is not as high as that of more complex algorithms, but for scenarios like database compression, real-time file systems (e.g., ZFS, Btrfs), and loading game assets, its incredible speed is a game-changer.
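For a quick comparison, the widely used Python bindings for LZ4 expose a similarly simple API. This sketch assumes the third-party lz4 package is installed:

```python
# Assumes the third-party `lz4` bindings are available (pip install lz4).
import lz4.frame

data = b"abcdefgh" * 10_000               # 80 KB of trivially repetitive data

compressed = lz4.frame.compress(data)
print(len(data), "->", len(compressed))   # the repeats collapse dramatically
assert lz4.frame.decompress(compressed) == data
```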
