Information Theory

Beyond the Bits: Quantifying Information Itself

In our exploration of data compression, we have established the fundamental difference between lossless and lossy techniques. Lossless methods perfectly preserve every piece of original data by cleverly removing redundancy, while lossy methods achieve greater size reduction by discarding data deemed unimportant. This raises a profound question: Is there a way to measure the actual amount of information contained within a piece of data? Can we determine a theoretical limit, a point beyond which no lossless compression algorithm can possibly shrink a file further?

The answer is a resounding yes, and it comes from a field of mathematics and computer science known as Information Theory. Pioneered by Claude E. Shannon in the 1940s, this theory provides the essential mathematical tools to analyze and quantify information. It allows us to move from an intuitive understanding of compression to a precise, scientific one. To understand the limits of compression, we must first learn how to measure information. Information Theory teaches us that information is fundamentally a measure of unpredictability or, more simply, surprise.

Self-Information: Measuring the Surprise of a Single Event

Before we can measure the information in an entire file or message, we need to start with the smallest component: a single event or symbol. The measure for this is called self-information.

The Intuition Behind Surprise

Imagine two statements about tomorrow's weather:

  1. The sun will rise in the east.
  2. A rare tropical storm will suddenly form and bring record snowfall.

The first statement is completely predictable. It is a certain event. Hearing it provides you with essentially zero new information. The second statement is extremely unlikely and shocking. Hearing it conveys a massive amount of new information. This intuitive relationship is the core of self-information: the amount of information an event carries is inversely related to its probability of occurring. Rare events are surprising and thus informative. Common events are predictable and thus uninformative.

The Mathematical Formulation

Shannon captured this relationship in a simple, elegant formula. The self-information I(x) of an event x with probability P(x) is defined as:

I(x) = \log_2\left(\frac{1}{P(x)}\right) = -\log_2(P(x))

Let us break down this formula to understand why it works so perfectly:

  • P(x): This is the probability of the event x happening. It is a number between 0 (impossible) and 1 (certain).
  • \frac{1}{P(x)}: This term directly captures the idea of surprise. If an event is very likely (e.g., P(x) = 0.99), this value is small (approximately 1.01). If an event is very unlikely (e.g., P(x) = 0.0001), this value is very large (10,000).
  • \log_2: The logarithm base 2 is used to measure information in the most fundamental unit: the bit. A logarithm answers the question: "To what power must we raise the base (2) to get our number?" In this context, it scales the "surprise" value into a more manageable measure. It also has the beautiful property of making information additive: the information of two independent events combined is the sum of their individual information contents.
  • The negative sign (-): Probabilities are always between 0 and 1, and the logarithm of a number in this range is always negative or zero. The negative sign in the formula simply flips the result to a positive number, which is more intuitive for measuring a quantity like information.
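
To make the formula concrete, here is a minimal Python sketch of the calculation. The function name self_information is an illustrative choice, not part of any standard library.

    import math

    def self_information(p: float) -> float:
        """Self-information, in bits, of an event with probability p (0 < p <= 1)."""
        if not 0 < p <= 1:
            raise ValueError("probability must be in the interval (0, 1]")
        return -math.log2(p)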

Example 1: The Fair Coin Toss

Let us apply this to the simplest possible scenario with uncertainty: a single toss of a fair coin.

  • There are two possible outcomes: Heads (H) and Tails (T).
  • The probability of getting heads is P(H) = 0.5.
  • The probability of getting tails is P(T) = 0.5.

Now, let's calculate the self-information for observing heads:

I(H) = -\log_2(P(H)) = -\log_2(0.5) = -\log_2\left(\frac{1}{2}\right) = -(-1) = 1 \text{ bit}

Unsurprisingly, the self-information for tails is also 1 bit. This result is profound: it tells us that resolving the uncertainty of a perfectly balanced, two-outcome event provides exactly one bit of information. This perfectly aligns the mathematical definition of information with the fundamental unit of digital data.
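
Using the sketch above, the fair-coin case checks out numerically:

    print(self_information(0.5))   # 1.0 bit, matching the calculation above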

Example 2: The Biased Coin Toss

Now, let's consider a biased or "unfair" coin, where one outcome is much more likely than the other.

  • Let's say the coin is weighted to land on Tails most of the time.
  • Probability of Heads: P(H) = 0.125 (or 1/8).
  • Probability of Tails: P(T) = 0.875 (or 7/8).

Let's calculate the self-information for each outcome:

I(H) = -\log_2(0.125) = -\log_2\left(\frac{1}{8}\right) = -(-3) = 3 \text{ bits}
I(T) = -\log_2(0.875) \approx -(-0.1926) \approx 0.1926 \text{ bits}

This example beautifully illustrates the concept of information as surprise. Observing the rare event (Heads) provides a significant amount of information (3 bits). Conversely, observing the highly predictable event (Tails) provides very little new information (about 0.19 bits). We already expected it to be tails, so confirming that fact is not very informative.
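
The same sketch reproduces the biased-coin numbers:

    print(self_information(0.125))   # 3.0 bits for the rare outcome (Heads)
    print(self_information(0.875))   # about 0.1926 bits for the likely outcome (Tails)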

Entropy: The Average Information of a Data Source

Self-information is powerful, but it only describes a single event. A data file, like a text document or an image, is a long sequence of many symbols (characters, pixel values). To understand the potential for compressing the entire file, we need to know the average information per symbol. This measure is called entropy.

Entropy is essentially the expected value of the self-information of a source. It answers the question: "On average, how much surprise or information does each symbol produced by this source carry?"

The Mathematical Formulation of Entropy

The entropy of a data source X, which produces a set of symbols {x_1, x_2, ..., x_n} with corresponding probabilities P(x_i), is denoted by H(X) and calculated by summing the self-information of each symbol, weighted by its probability of occurrence:

H(X) = \sum_{i=1}^{n} P(x_i) \cdot I(x_i) = -\sum_{i=1}^{n} P(x_i) \cdot \log_2(P(x_i))

The unit of entropy is bits per symbol. It gives us a single, powerful number that characterizes the average unpredictability and information content of the entire source.
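
Continuing the Python sketch, entropy is the probability-weighted sum of the terms -\log_2(P(x)); the function name and the convention of skipping zero-probability symbols are illustrative choices.

    import math

    def entropy(probabilities: list[float]) -> float:
        """Entropy, in bits per symbol, of a source with the given symbol probabilities."""
        # Terms with P(x) = 0 contribute nothing to the sum, by convention.
        return -sum(p * math.log2(p) for p in probabilities if p > 0)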

Example 1: The Entropy of a Fair Coin

Using our fair coin example, where P(H) = 0.5, I(H) = 1 bit, and P(T) = 0.5, I(T) = 1 bit:

H(X) = (P(H) \cdot I(H)) + (P(T) \cdot I(T)) = (0.5 \cdot 1) + (0.5 \cdot 1) = 1 \text{ bit/symbol}

The entropy is 1 bit per symbol. This is the maximum possible entropy for a source with two outcomes. It means the source is completely random and unpredictable. There is no redundancy to exploit.
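
The entropy sketch above confirms this:

    print(entropy([0.5, 0.5]))   # 1.0 bit/symbol, the maximum for a two-outcome source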

Example 2: The Entropy of a Biased Coin

For our biased coin, with P(H) = 0.125, I(H) = 3 bits, and P(T) = 0.875, I(T) ≈ 0.1926 bits:

H(X) = (P(H) \cdot I(H)) + (P(T) \cdot I(T)) = (0.125 \cdot 3) + (0.875 \cdot 0.1926) \approx 0.375 + 0.1685 \approx 0.544 \text{ bits/symbol}

The entropy here is much lower, about half a bit per symbol. This low value quantifies the source's predictability. Even though a rare event carries a lot of information, it happens so infrequently that the average information content of the source is low. This indicates significant redundancy, which means there is a large potential for lossless compression.
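
And for the biased coin:

    print(entropy([0.125, 0.875]))   # about 0.544 bits/symbol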

Entropy as the Ultimate Limit of Lossless Compression

This brings us to the most crucial conclusion of information theory for data compression. Shannon's Source Coding Theorem proves that the entropy H(X) of a source establishes a fundamental, unbreakable lower bound for lossless compression.

It is impossible to losslessly compress data from a source X such that the average number of bits per symbol in the compressed output is less than the entropy H(X) of the source.

Entropy, therefore, represents the true, irreducible information content of a data source. Any part of the data representation beyond its entropy is redundancy. For example, English text is typically stored using 8 bits per character (ASCII/Unicode). However, its entropy is much lower, estimated to be around 1 to 1.5 bits per character due to statistical patterns. The difference between the 8 bits we use and the ~1.5 bits of true information is the redundancy that compression algorithms like ZIP can remove. The entropy sets the theoretical goal: we know we can never losslessly compress English text to be, on average, smaller than ~1.5 bits per character.
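
As a rough illustration of this gap, the sketch below (reusing the entropy function from above) estimates the per-character entropy of a sample string from its character frequencies alone. This zeroth-order estimate ignores the dependencies between characters, so it lands near 4 bits per character rather than the ~1 to 1.5 bits obtained when context is taken into account; the sample string here is an arbitrary choice for illustration.

    from collections import Counter

    def empirical_char_entropy(text: str) -> float:
        """Zeroth-order entropy estimate, in bits per character, from character frequencies."""
        counts = Counter(text)
        return entropy([count / len(text) for count in counts.values()])

    sample = "information theory tells us how far lossless compression can go"
    print(empirical_char_entropy(sample))   # roughly 3.9 bits/character, well below the 8 bits of ASCII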
