LZW Algorithm

Dictionary-based compression used in GIF images and Unix compress.

A New Approach to Redundancy: Beyond Single Symbols

In our exploration of lossless compression, we have examined statistical methods like Huffman coding, which brilliantly reduce data size by assigning shorter codes to more frequent symbols. This approach is powerful, but it focuses on the probability of individual characters. However, real-world data contains a different, more structured type of redundancy: the repetition of entire sequences, phrases, and patterns. Consider the English sentence, "The quick brown fox jumps over the lazy dog"; the word "the" appears twice. A statistical method would compress each instance based on the high frequency of the letters t, h, and e, but it would miss the opportunity to recognize that the entire three-character sequence has reappeared.

To exploit this kind of structural redundancy, we need a different class of algorithms known as . The philosophy behind these methods is simple and intuitive: why write out the same long phrase again and again when you can simply point to the first time you wrote it? One of the most elegant and historically significant algorithms in this family is the Lempel-Ziv-Welch, or LZW, algorithm.

The LZW Algorithm: Building a Dictionary On the Fly

Developed by Terry Welch in 1984, LZW is an enhancement of the earlier LZ78 algorithm created by Abraham Lempel and Jacob Ziv. Its genius lies in its ability to build a dictionary of recurring phrases adaptively, as it processes the input data. This means it requires only a single pass over the data and, crucially, does not need to transmit the dictionary itself alongside the compressed output. The decompressor is able to perfectly reconstruct the exact same dictionary from the compressed data alone.

The Core Mechanism

The LZW algorithm maintains a string table, or dictionary, which maps sequences of characters to numeric codes. The process works as follows:

Initialize the Dictionary: The dictionary is pre-populated with all possible single-character strings. For standard 8-bit data, this means the dictionary initially contains 256 entries, mapping each character from code 0 to 255.
Process the Input Stream: The algorithm reads the input character by character, attempting to find the longest possible sequence that already exists in its dictionary.
The Main Loop: Let us say the longest sequence found so far is P. The algorithm reads the next character, C. It then checks if the combined sequence P + C exists in the dictionary.
- If YES: The sequence is recognized. The algorithm updates P to be P + C and waits for the next character, continuing its search for an even longer known sequence.
- If NO: The end of a recognized sequence has been reached. The algorithm performs two actions:
  1. It outputs the numeric code for the last known sequence, P.
  2. It adds the new, unrecognized sequence P + C to the dictionary with the next available numeric code.
Reset and Repeat: The algorithm then resets, starting a new search with the single character C as the new initial sequence P. This loop continues until the entire input stream has been processed.

LZW Encoding: A Detailed Walkthrough

The abstract rules become much clearer with a step-by-step example. Let's compress the string TOBEORNOTTOBEORTOBEORNOT.

Step 0: Initialize the Dictionary

Our dictionary begins with entries for each character in our alphabet. We will assign the next available code starting at 256. For simplicity, let us just list the characters we will use:

T = code 84, O = code 79, B = code 66, E = code 69, R = code 82, N = code 78

Encoding Process Table

We will track the state of the encoder step by step.

Current Sequence (P)	Next Char (C)	P + C in Dictionary?	Output Code	New Dictionary Entry
T	O	No	84 (T)	256: TO
O	B	No	79 (O)	257: OB
B	E	No	66 (B)	258: BE
E	O	No	69 (E)	259: EO
O	R	No	79 (O)	260: OR
R	N	No	82 (R)	261: RN
N	O	No	78 (N)	262: NO
O	T	No	79 (O)	263: OT
T	T	No	84 (T)	264: TT
T	O	Yes (256)	-	-
TO	B	No	256 (TO)	265: TOB
B	E	Yes (258)	-	-
BE	O	No	258 (BE)	266: BEO
O	R	Yes (260)	-	-
OR	T	No	260 (OR)	267: ORT
T	O	Yes (256)	-	-
TO	B	Yes (265)	-	-
TOB	E	No	265 (TOB)	268: TOBE
E	O	Yes (259)	-	-
EO	R	No	259 (EO)	269: EOR
R	N	Yes (261)	-	-
RN	O	No	261 (RN)	270: RNO
O	T	Yes (263)	-	-
OT	end	-	263 (OT)	-

Final Result: The original 26-character string has been encoded into a sequence of 18 codes: [84, 79, 66, 69, 79, 82, 78, 79, 84, 256, 258, 260, 265, 259, 261, 263]. Initially, the output is longer in terms of items, but as the dictionary grows with repeating patterns, the codes begin to represent longer and longer sequences, leading to significant compression. The codes themselves would be represented in binary with a variable number of bits; as the dictionary grows past 511 entries, for example, the codes would switch from 9-bit to 10-bit numbers.

The LZW Decompression Process

The elegance of LZW is fully revealed during decompression. The decoder receives the sequence of codes and, starting with the same initial dictionary of single characters, perfectly reconstructs the original text. It does this by inferring the new dictionary entries from the incoming codes.

The general rule is: for each code received, output the string it represents from the current dictionary. Then, create a new dictionary entry by concatenating the string you just output with the first character of the next string you are about to output.

Input Code	Output String	New Dictionary Entry
84	T	...
79	O	256: T + O = TO
66	B	257: O + B = OB
69	E	258: B + E = BE
...	...	... continues ...
263	OT	... (end of stream)

There is a special case (the KWKWK or string-table-string problem) where the decoder receives a code that is not yet in its dictionary. This only happens when a sequence is created and then immediately used by the encoder. The decoder can handle this by knowing that the unknown character must be the same as the first character of the previously output string. This ensures the lock-step synchronization between encoder and decoder is never broken.

Practical Application: The GIF Image Format

LZW became widely known through its implementation in the GIF (Graphics Interchange Format). It is particularly well-suited for this format for several reasons:

Limited Color Palette: GIF images are limited to a maximum of 256 colors. The LZW algorithm does not operate on the raw color values but on the 8-bit index values that point to a color in the palette. This means the initial dictionary size is small and fixed.
Spatial Redundancy: Computer-generated images, like logos or diagrams, often have large areas of solid color or repeating patterns. When the image data is read line by line, these spatial patterns turn into long, repetitive sequences of color indices in the input stream, which LZW can compress very effectively.

Strengths, Weaknesses, and Legacy

LZW's legacy in the world of compression is significant, but it is also important to understand its trade-offs.

Strengths

One-Pass Algorithm: It does not need to analyze the entire file beforehand. It builds its model adaptively, making it suitable for streaming applications.
No Dictionary Transmission: The decompressor's ability to rebuild the dictionary is a major advantage, eliminating significant overhead.
Relative Simplicity and Speed: The algorithm is conceptually straightforward and generally fast, particularly during decompression.

Weaknesses

Full Dictionary Problem: The dictionary has a fixed size (e.g., 4096 entries for 12-bit codes). If the input data is very diverse, the dictionary can fill up. Different implementations handle this in different ways: some stop adding new entries, some reset the dictionary, and some use more complex pruning strategies. This can affect compression efficiency.
Sensitivity to Errors: A single bit error in the compressed code stream can cause the decompressor's dictionary to become desynchronized from the encoder's, leading to catastrophic failure of the decompression from that point forward.
Patent History: For many years, the LZW algorithm was protected by patents, most notably one held by Unisys. This led to controversy and licensing issues, particularly for the GIF format, and spurred the development of patent-free alternatives like the PNG format, which uses the Deflate algorithm (a combination of LZ77 and Huffman coding).

Though now free from patent restrictions, LZW has been largely succeeded by more advanced LZ variants and other algorithms like Deflate. However, its historical importance is undeniable, and its core principles of adaptive dictionary-based compression remain a cornerstone of lossless data reduction.