Data Compression - Throwing Away What You Can Reconstruct // Megha Bose

Helpful context:

Shannon’s source coding theorem gives you the floor: no lossless code can compress a source $X$ below $H(X)$ bits per symbol on average. But knowing the floor exists is different from knowing how to build a ladder that reaches it. In 1948, the proof was existential - there are codes that get close, but the construction was not practical. The history of data compression since then is the story of building practical codes that approach, and in some cases effectively reach, Shannon’s limit.

The story splits early. Huffman coding, invented in 1952, gets within 1 bit per symbol of entropy - not bad, but that overhead is significant for short messages and small alphabets. Arithmetic coding, developed in the 1970s and 1980s, eliminates most of that gap by coding entire messages rather than individual symbols. Lempel and Ziv showed in 1977 and 1978 that you do not even need to know the source distribution - you can learn it on the fly and still converge to the entropy rate. These three threads - optimal symbol codes, range coding, and universal coding - are the foundations of every modern compression format.

The second half of the story is lossy compression, where you are willing to accept some distortion in exchange for much higher compression ratios. JPEG images, MP3 audio, and H.265 video all discard information you were not going to notice anyway. The theory behind this trade-off - when you can discard, and how much - is rate-distortion theory. But understanding lossless compression is the prerequisite.

Why Huffman Leaves a Gap

Huffman coding assigns each source symbol an integer-length codeword. If $p(x)$ is the probability of symbol $x$, the ideal codeword length is $-\log_2 p(x)$ bits. But codeword lengths must be integers, so the actual length is $\lceil -\log_2 p(x) \rceil$, which rounds up by at most 1 bit.

For a single symbol, the expected length satisfies

$$H(X) \leq L_{\text{Huffman}} < H(X) + 1.$$

The upper bound of $H(X) + 1$ is tight. Consider a source with two symbols: $p(a) = 0.01$, $p(b) = 0.99$. Entropy is $H \approx 0.081$ bits, but any binary Huffman code assigns at least 1 bit per symbol - nearly a factor of 12 overhead.

The fix is to code blocks. If you encode $n$ symbols at a time, treating sequences of $n$ symbols as a single “super-symbol,” the overhead per symbol drops to less than $1/n$ bits. In the limit $n \to \infty$, Huffman codes on blocks approach $H(X)$. But the alphabet of $n$-symbol blocks has size $|\mathcal{X}|^n$, so the table size explodes. Arithmetic coding avoids this by working directly on infinite-length sequences.

Arithmetic Coding: Approaching the Limit

Arithmetic coding encodes an entire message as a single number in $Mutual Information & Capacity - What One Variable Tells You About Another

Rate-Distortion Theory - The Fundamental Tradeoff Between Size and Fidelity