Entropy and Redundancy

Measuring information content and data redundancy in digital communications.

Introduction: The Currency of Compression

Having established the two great domains of data compression, lossless and lossy, we now delve deeper into the engine that drives lossless compression. We know that lossless algorithms work by perfectly reconstructing the original data. This implies they are not throwing information away, but rather finding a more efficient way to represent it. This efficiency comes from one key resource: redundancy.

Redundancy is the "fat" in the data, the predictable and repetitive parts that can be squeezed out without losing the essential substance. The true, essential substance of the data is its information content. The mathematical field of Information Theory, pioneered by Claude Shannon, provides us with the tools to precisely measure both of these quantities. In this section, we will explore the concepts of Entropy, which measures the true information, and Redundancy, which measures the compressible waste. Understanding the relationship between these two is the key to understanding why, how, and by how much any piece of data can be losslessly compressed. Without redundancy, lossless compression is impossible.

A Deeper Look at Entropy: The Measure of True Information

As introduced previously, entropy is the average information content per symbol of a data source. It quantifies the level of uncertainty or randomness inherent in the data. Let us revisit the formula and explore its properties to gain a more profound understanding. The entropy $H(X)$ of a source $X$ that produces symbols $x_i$ with probabilities $P(x_i)$ is given by:

$$H(X) = -\sum_{i=1}^{n} P(x_i) \cdot \log_2(P(x_i))$$
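To make the formula concrete, here is a minimal sketch in Python of the calculation it describes; the `entropy` helper name and the fair-coin example are chosen purely for illustration.

```python
import math

def entropy(probabilities):
    """Shannon entropy in bits per symbol for a list of probabilities.

    Terms with probability 0 contribute nothing, by the convention
    0 * log2(0) = 0.
    """
    return sum(-p * math.log2(p) for p in probabilities if p > 0)

# Example: a fair coin carries exactly 1 bit of information per toss.
print(entropy([0.5, 0.5]))  # 1.0
```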

Key Properties of Entropy

This formula has several important properties that align perfectly with our intuitive understanding of information and uncertainty:

  • Maximum Entropy: For a source that can produce a fixed number $n$ of different symbols, the entropy is at its maximum when all symbols are equally likely to occur. In this state of maximum unpredictability, each symbol has a probability of $P(x_i) = 1/n$. The maximum entropy is then $H_{\text{max}}(X) = \log_2(n)$. A source at maximum entropy is like a perfectly random sequence; it contains no statistical patterns and is therefore the hardest to compress.
  • Minimum Entropy (Zero Entropy): The entropy is zero if and only if there is no uncertainty. This happens when one symbol is certain to occur (probability of 1) and all other symbols are impossible (probability of 0). A message from such a source is completely predictable. Since there is no surprise, there is no information. For example, a source that only ever produces the letter 'A' has an entropy of zero.
  • Non-negativity: The value of entropy is always zero or positive. It is not possible to have a negative amount of information.
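To make the first two properties concrete, here is a small numerical check using an illustrative `entropy` helper of the same form as above (not a function from any particular library).

```python
import math

def entropy(probabilities):
    # Shannon entropy in bits/symbol; zero-probability terms are skipped.
    return sum(-p * math.log2(p) for p in probabilities if p > 0)

n = 4
uniform = [1 / n] * n           # all symbols equally likely
certain = [1.0, 0.0, 0.0, 0.0]  # one symbol is guaranteed

print(entropy(uniform))  # 2.0 == log2(4), the maximum for four symbols
print(entropy(certain))  # 0.0 -- no uncertainty, no information
```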

Case Study: An Alphabet of Four Symbols

Let's consider a simple data source that generates messages using only four possible symbols: A, B, C, and D.

Scenario 1: Maximum Uncertainty

Assume the source is completely random, meaning each symbol has an equal chance of appearing:

$$P(A) = P(B) = P(C) = P(D) = 0.25$$

The maximum entropy for this four-symbol source is:

$$H_{\text{max}} = \log_2(4) = 2 \text{ bits/symbol}$$

A simple, fixed-length code to represent these symbols would be A=00, B=01, C=10, D=11. The average length of this code is exactly 2 bits per symbol. Since this average length equals the entropy, this coding is perfectly optimal. There is no redundancy to remove; the source is already as compact as it can possibly be.

Scenario 2: Introduced Predictability

Now, let's assume the source is biased. Certain symbols are more likely than others, introducing a pattern and some predictability.

$$P(A) = 0.5,\quad P(B) = 0.25,\quad P(C) = 0.125,\quad P(D) = 0.125$$

The entropy for this biased source is calculated as the weighted average of the self-information of each symbol:

$$H(X) = -(0.5 \log_2 0.5 + 0.25 \log_2 0.25 + 0.125 \log_2 0.125 + 0.125 \log_2 0.125)$$
$$H(X) = -(0.5(-1) + 0.25(-2) + 0.125(-3) + 0.125(-3))$$
$$H(X) = -(-0.5 - 0.5 - 0.375 - 0.375) = 1.75 \text{ bits/symbol}$$

Notice that the entropy (1.75 bits/symbol) is now less than the maximum possible entropy (2 bits/symbol). This drop in entropy is a direct measure of the predictability we introduced. If we continue to use our simple 2-bit code (A=00, B=01, etc.), we are now using more bits on average than is theoretically necessary. The gap between the 2 bits we are using and the 1.75 bits of true information content is the redundancy.
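The same arithmetic can be verified in a few lines of code; a minimal sketch using the Scenario 2 probabilities:

```python
import math

probs = {"A": 0.5, "B": 0.25, "C": 0.125, "D": 0.125}

# Entropy of the biased source: the true information content per symbol.
H = -sum(p * math.log2(p) for p in probs.values())
print(H)  # 1.75 bits/symbol

# A fixed-length code spends 2 bits on every symbol regardless of probability.
fixed_length = 2
print(fixed_length - H)  # 0.25 bits/symbol of waste
```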

Redundancy: The Measure of Waste

Redundancy is the "other side of the coin" to entropy. While entropy measures the essential, unpredictable information, redundancy measures the predictable, repetitive, and ultimately compressible portion of the data. It is the difference between how data is currently represented and the most compact possible representation defined by its entropy.

Quantifying Redundancy

We can quantify redundancy both in absolute and relative terms. Let us consider a data source $X$ with entropy $H(X)$, using an encoding scheme where the average number of bits per symbol is $L_{\text{avg}}$. For a simple fixed-length code with $n$ symbols, this average length is $L_{\text{avg}} = \log_2(n)$ bits (when $n$ is a power of two), which is equal to the maximum possible entropy $H_{\text{max}}(X)$.

  • Absolute Redundancy ($R$): This is the average number of wasted bits per symbol.
    $$R = L_{\text{avg}} - H(X)$$
  • Relative Redundancy ($r$): This expresses the wasted portion as a percentage of the total data size, providing a more intuitive measure of compressibility.
    $$r = \frac{R}{L_{\text{avg}}} = \frac{L_{\text{avg}} - H(X)}{L_{\text{avg}}} = 1 - \frac{H(X)}{L_{\text{avg}}}$$

The ultimate goal of any lossless compression algorithm is to create a new representation of the data where $L_{\text{avg}}$ is as close as possible to $H(X)$, which is equivalent to driving the redundancy $R$ as close as possible to zero.
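These definitions translate directly into two small helper functions. The sketch below is illustrative only; the function names are chosen here for clarity and do not come from any particular library.

```python
def absolute_redundancy(avg_code_length, entropy_bits):
    """Average number of wasted bits per symbol: R = L_avg - H(X)."""
    return avg_code_length - entropy_bits

def relative_redundancy(avg_code_length, entropy_bits):
    """Fraction of the encoded data that is redundant: r = 1 - H(X) / L_avg."""
    return 1 - entropy_bits / avg_code_length

# Four-symbol source from Scenario 2: L_avg = 2 bits, H(X) = 1.75 bits.
print(absolute_redundancy(2, 1.75))  # 0.25 bits/symbol
print(relative_redundancy(2, 1.75))  # 0.125, i.e. 12.5%
```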

Example Revisited: Redundancy in Our Four-Symbol Alphabet

Let's apply these formulas to Scenario 2 of our four-symbol alphabet, where we used a fixed-length 2-bit code ($L_{\text{avg}} = 2$) and calculated the entropy to be $H(X) = 1.75$ bits/symbol.

  • Absolute Redundancy: $R = 2 - 1.75 = 0.25$ bits/symbol. On average, we are wasting a quarter of a bit for every symbol we encode.
  • Relative Redundancy: $r = 1 - \frac{1.75}{2} = 1 - 0.875 = 0.125$, or 12.5%. This tells us that 12.5% of our encoded data is pure redundancy, which could theoretically be removed. This value also represents the maximum potential size reduction we could achieve with a perfect lossless compression algorithm for this data source (a code that achieves it is sketched after this list).
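To see how that gap can actually be closed, here is a minimal sketch of a variable-length prefix code matched to the Scenario 2 probabilities. The particular codeword assignment (A=0, B=10, C=110, D=111) is one possible optimal code for this source, shown here for illustration.

```python
import math

probs = {"A": 0.5, "B": 0.25, "C": 0.125, "D": 0.125}

# Shorter codewords for more probable symbols; no codeword is a prefix
# of another, so the bit stream can be decoded unambiguously.
code = {"A": "0", "B": "10", "C": "110", "D": "111"}

# Average code length: sum over symbols of P(x) * codeword length.
avg_len = sum(probs[s] * len(code[s]) for s in probs)
H = -sum(p * math.log2(p) for p in probs.values())

print(avg_len)  # 1.75 bits/symbol
print(H)        # 1.75 bits/symbol -- the redundancy has been driven to zero
```

Because each codeword length equals its symbol's self-information (1, 2, 3, and 3 bits), the average length meets the entropy exactly and the redundancy of this representation is zero.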

Sources and Types of Redundancy in Real-World Data

The simple statistical models we have used so far are useful for understanding the theory. However, in real data, redundancy is far more complex and appears in many forms.

Redundancy in Text

  • Statistical Redundancy: The most obvious type. As noted, letters and words do not appear with equal frequency. In English, common letter sequences (digrams) like "th," "he," "in," and common words like "the," "and," "a" create immense statistical redundancy. A compression algorithm can exploit this by using shorter codes for these common elements.
  • Syntactic Redundancy: The rules of grammar make language predictable. A noun is likely to be followed by a verb. An adjective is likely to precede a noun. After the word "the," the probability of the next word being another "the" is virtually zero. This grammatical structure limits the possible valid sequences, creating another layer of redundancy.
  • Semantic Redundancy: The meaning of the text adds yet another layer of predictability. Consider the sentence: "Roses are red, violets are...". The last word is highly predictable ("blue") due to the semantic context of the common nursery rhyme. This highly probable word carries very little new information (low self-information) and contributes to the overall redundancy of the message.
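To see statistical redundancy in a concrete (if toy) form, the sketch below estimates the per-character entropy of a short sample string from its observed character frequencies and compares it with the 8 bits a plain fixed-length byte encoding spends on each character; the sample text is arbitrary.

```python
import math
from collections import Counter

text = "the quick brown fox jumps over the lazy dog the end"

# Empirical symbol probabilities from observed character counts.
counts = Counter(text)
total = sum(counts.values())
probs = [c / total for c in counts.values()]

H = -sum(p * math.log2(p) for p in probs)

print(f"estimated entropy: {H:.2f} bits/character")
print(f"a fixed 8-bit encoding wastes roughly {8 - H:.2f} bits/character")
```

Note that this zeroth-order estimate only counts individual characters; capturing the syntactic and semantic redundancy described above requires the context models discussed later in this section.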

Redundancy in Images and Video

  • Spatial Redundancy: Within a single image or video frame, neighboring pixels are highly correlated. A pixel in the middle of a blue sky is extremely likely to be the same shade of blue as the pixels surrounding it. This spatial predictability is a huge source of redundancy that algorithms like RLE, or the predictive coding used in the PNG format, are designed to eliminate.
  • Temporal Redundancy: This is unique to video and audio. In a video stream, consecutive frames are often almost identical. In a scene with a person talking against a static background, the only thing that changes frame-to-frame are the pixels around the person's mouth and eyes. Temporal redundancy is the single largest source of "waste" in raw video data, and its removal through techniques like motion compensation is the primary job of video codecs like H.264.
  • Spectral Redundancy: In color images, the Red, Green, and Blue (RGB) components are often strongly related. For example, in a grayscale image, $R = G = B$. Knowing the value of one color channel provides a lot of information about the likely values of the others.
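A simple way to expose spatial redundancy is to predict each pixel from its left neighbor and store only the difference, which is the idea behind PNG's Sub filter. A minimal sketch on a made-up scanline of 8-bit gray values:

```python
# A made-up scanline from a smooth region: neighboring values are close.
scanline = [200, 201, 201, 202, 203, 203, 204, 204]

# Predict each pixel from its left neighbor and keep only the difference
# (the first pixel has no neighbor, so it is stored as-is).
residuals = [scanline[0]] + [
    cur - prev for prev, cur in zip(scanline, scanline[1:])
]

print(residuals)  # [200, 1, 0, 1, 1, 0, 1, 0]
# The residuals cluster tightly around zero, so they have much lower
# entropy than the raw pixel values and compress far better.
```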

The Role of Data Models

To effectively exploit these complex forms of redundancy, compression algorithms use data models. A model is a set of rules and statistical data that attempts to describe the structure of the source information. A better model can identify deeper patterns and thus enable better compression.

  • Memoryless Models: The simple entropy calculation we performed assumes the source is "memoryless," meaning the probability of a symbol does not depend on previous symbols. This captures only basic statistical redundancy.
  • Markov Models: These are more sophisticated "sources with memory." They capture syntactic redundancy by considering the context. For example, a first-order Markov model calculates the probability of the next character based on the immediately preceding character. This leads to the concept of conditional entropy, which is the entropy of a source given that we know its context. Conditional entropy is always lower than or equal to simple entropy, and this difference quantifies the additional redundancy captured by the more advanced model. Modern compression algorithms like PPM (Prediction by Partial Matching) use sophisticated context models to achieve very high compression ratios for text.
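The sketch below illustrates the point numerically: it compares the zeroth-order (memoryless) entropy of a string with its conditional entropy given the preceding character, using empirical counts as a toy first-order context model; the sample string is arbitrary.

```python
import math
from collections import Counter, defaultdict

def entropy(counts):
    """Shannon entropy in bits of the empirical distribution given by counts."""
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

text = "abababababababcababababababab"

# Zeroth-order model: how surprising is each character on its own?
h_simple = entropy(Counter(text))

# First-order model: how surprising is each character given the previous one?
followers = defaultdict(Counter)
for prev, cur in zip(text, text[1:]):
    followers[prev][cur] += 1

total_pairs = len(text) - 1
h_conditional = sum(
    (sum(ctr.values()) / total_pairs) * entropy(ctr)
    for ctr in followers.values()
)

print(f"simple entropy:      {h_simple:.3f} bits/char")
print(f"conditional entropy: {h_conditional:.3f} bits/char (lower, as expected)")
```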

In summary, entropy provides the theoretical lower limit for lossless compression by quantifying the true, irreducible information in a data source. Redundancy is the measurable "waste" in the data's representation that compression algorithms seek to eliminate. The more complex the redundancy, the more sophisticated the data model required to identify and remove it effectively.
