Huffman Coding - Shorter Codes for More Frequent Symbols // Megha Bose

Helpful context:

Every time you send a file over the internet, open a JPEG, or decompress a ZIP archive, a version of the same idea is at work: some symbols in the data appear far more often than others, and it would be wasteful to give every symbol the same number of bits. The letter “e” appears in English text about twelve times as often as “z”. Why should both cost eight bits? The answer, developed by David Huffman as a graduate student in 1952, is that they shouldn’t - and that there is a precise, optimal way to assign variable-length codes that exploits this imbalance.

The remarkable thing about Huffman’s algorithm is how little machinery it requires. No dynamic programming table, no complex recurrence relation, no intricate bookkeeping. Just a priority queue and a single greedy rule, applied repeatedly: always merge the two least frequent things. The proof that this is optimal is equally clean. It rests on two elementary observations about binary trees that, taken together, leave no room for a better strategy. The algorithm and its proof are a small masterpiece of algorithmic thinking - simple enough to describe in a paragraph, rigorous enough to certify that nothing better is possible.

The connection to information theory is what gives the algorithm its depth. Shannon showed in 1948 that there is a fundamental limit on how compactly information can be encoded: the entropy $H(X) = -\sum_i p_i \log_2 p_i$ bits per symbol. No lossless code can do better. Huffman coding comes within one bit of this limit, and for symbols drawn from a known distribution it is optimal among all prefix-free codes. Understanding why requires tracing the path from Shannon’s theorem through the structure of binary trees to the greedy algorithm itself.

The Problem: Prefix-Free Variable-Length Codes

The goal is to represent a set of symbols $\{s_1, s_2, \ldots, s_n\}$ as binary strings - codewords - in a way that allows unambiguous decoding. If we use fixed-length codes, every symbol costs $\lceil \log_2 n \rceil$ bits. For an alphabet of 256 symbols this is 8 bits each, regardless of how often each symbol appears.

Variable-length codes can do better, but they introduce a problem. Suppose A maps to “0”, B maps to “01”, and C maps to “1”. Then the encoded string “001” could be decoded as “A A C” or “A B” - there is no way to tell where one codeword ends and the next begins without reading ahead. The decoder needs separators, which waste bits, or some other convention.

The elegant solution is a prefix-free code: no codeword is a prefix of any other codeword. If “0” is a codeword, then “01” and “011” cannot be codewords. This means that as soon as the decoder has seen enough bits to match a codeword, it knows unambiguously that this codeword has ended and the next one begins. No separator, no lookahead, no ambiguity. The decoder simply walks down a binary tree from the root, going left on “0” and right on “1”, until it reaches a leaf. The leaf identifies the symbol. It then returns to the root and continues. Because no codeword is a prefix of another, no leaf is an ancestor of another leaf, so the walk is always unambiguous.

This binary tree representation is not just a convenient picture - it is the correct mathematical object. Every prefix-free binary code corresponds to a binary tree where the leaves are labeled with symbols and the paths from root to leaf define the codewords. The length of symbol $s_i$’s codeword is the depth of the corresponding leaf. The average code length is:

$$L = \sum_{i=1}^n p_i \cdot \ell_i$$

where $p_i$ is the probability (or relative frequency) of $s_i$ and $\ell_i$ is the depth of its leaf. We want to minimize $L$.

Why Entropy Is the Limit

Shannon’s source coding theorem says that for any prefix-free code, $L \geq H(X)$, where the entropy is:

$$H(X) = -\sum_{i=1}^n p_i \log_2 p_i$$

The intuition is this: a symbol with probability $p_i$ carries $-\log_2 p_i$ bits of “surprise”. A codeword of length $\ell_i$ uses exactly $\ell_i$ bits. The prefix-free constraint - no codeword is a prefix of another - is equivalent, via the Kraft inequality, to:

$$\sum_{i=1}^n 2^{-\ell_i} \leq 1$$

This inequality says the “fraction of the binary tree” consumed by all codewords cannot exceed 1. Using Lagrange multipliers or a simple convexity argument, the minimum of $\sum p_i \ell_i$ subject to the Kraft inequality is achieved when $\ell_i = -\log_2 p_i$. This gives $L = H(X)$. Since $-\log_2 p_i$ is not generally an integer, we cannot always achieve this exactly, but we can always achieve $L < H(X) + 1$ by rounding up: $\ell_i = \lceil -\log_2 p_i \rceil$.

Huffman coding achieves this bound. It is optimal among all prefix-free codes.

Huffman’s Greedy Insight

Why does a greedy algorithm work here? The key is two structural facts about optimal trees.

Fact 1: The two rarest symbols should have the longest codewords.

More precisely, in any optimal tree, the two symbols with the smallest probabilities $p_{\min}$ and $p_{\min}'$ can always be assigned the longest codewords (the deepest leaves). Suppose in some optimal tree the deepest leaves are not the two rarest symbols. Then swapping the deepest leaf with the rarest symbol either decreases $L$ or leaves it unchanged - because moving a rarer symbol to a deeper position and a more common symbol to a shallower position never increases the average:

$$\Delta L = p_{\text{rare}} \cdot \ell_{\text{deep}} + p_{\text{common}} \cdot \ell_{\text{shallow}} - p_{\text{rare}} \cdot \ell_{\text{shallow}} - p_{\text{common}} \cdot \ell_{\text{deep}}$$ $$= (p_{\text{rare}} - p_{\text{common}})(\ell_{\text{deep}} - \ell_{\text{shallow}}) \leq 0$$

because $p_{\text{rare}} \leq p_{\text{common}}$ and $\ell_{\text{deep}} \geq \ell_{\text{shallow}}$. So swapping is always at least as good. This means we can assume, without any loss of optimality, that the two rarest symbols occupy the deepest positions.

Fact 2: The two rarest symbols can be siblings.

In a full binary tree (every internal node has exactly two children), every leaf either has a sibling leaf or a sibling subtree. The deepest leaf must have a sibling at the same depth, because if it had no sibling we could remove it and the tree would still be full, contradicting maximality. So the rarest symbol, being at the deepest level, has a sibling. We can always swap symbols so that the second-rarest symbol is that sibling - this is the same swap argument as Fact 1. The two rarest symbols can always be placed as siblings.

The greedy rule follows immediately. Since we can always find an optimal tree where the two rarest symbols are sibling leaves at the deepest level, we can “merge” them: replace both with a single internal node whose probability is $p_1 + p_2$. This reduces an $n$-symbol problem to an $(n-1)$-symbol problem. If we solve the smaller problem optimally, and then add two children to the node corresponding to the merged probability, we get an optimal solution to the original problem. Induction on $n$ completes the proof. The greedy rule - always merge the two smallest - is not just a heuristic. It is a certified strategy.

The Algorithm

Input: Symbols $s_1, \ldots, s_n$ with frequencies $f_1, \ldots, f_n$ (positive reals, not necessarily summing to 1).

Procedure:

Create a leaf node for each symbol, storing its frequency.
Insert all leaf nodes into a min-heap keyed by frequency.
Repeat $n - 1$ times:
- Extract the two nodes with minimum frequency: call them $x$ and $y$.
- Create a new internal node $z$ with $\text{freq}(z) = \text{freq}(x) + \text{freq}(y)$.
- Set $x$ and $y$ as children of $z$ (left $= 0$, right $= 1$ by convention).
- Insert $z$ into the heap.
The single remaining node is the root of the Huffman tree.
For each leaf, the codeword is the sequence of left/right choices on the path from root to that leaf, encoding left as “0” and right as “1”.

Worked Example

Consider five symbols with the following probabilities:

Symbol	Probability
A	0.4
B	0.2
C	0.2
D	0.1
E	0.1

The entropy is:

$$H = -(0.4 \log_2 0.4 + 0.2 \log_2 0.2 + 0.2 \log_2 0.2 + 0.1 \log_2 0.1 + 0.1 \log_2 0.1)$$ $$= -(0.4 \cdot (-1.322) + 0.2 \cdot (-2.322) + 0.2 \cdot (-2.322) + 0.1 \cdot (-3.322) + 0.1 \cdot (-3.322))$$ $$= 0.529 + 0.464 + 0.464 + 0.332 + 0.332 = 2.121 \text{ bits}$$

Step-by-Step Trace

Initial heap: D(0.1), E(0.1), B(0.2), C(0.2), A(0.4)

Iteration 1. Extract D(0.1) and E(0.1). Create internal node $z_1$ with frequency $0.1 + 0.1 = 0.2$. $z_1$ has children D (left, “0”) and E (right, “1”). Insert $z_1$.

Heap: B(0.2), C(0.2), $z_1$(0.2), A(0.4)

Iteration 2. Extract B(0.2) and C(0.2) (or $z_1$, depending on tie-breaking - the result is equivalent). Extract B(0.2) and C(0.2). Create internal node $z_2$ with frequency $0.2 + 0.2 = 0.4$. $z_2$ has children B (left, “0”) and C (right, “1”). Insert $z_2$.

Heap: $z_1$(0.2), A(0.4), $z_2$(0.4)

Iteration 3. Extract $z_1$(0.2) and A(0.4). Create internal node $z_3$ with frequency $0.2 + 0.4 = 0.6$. $z_3$ has children $z_1$ (left, “0”) and A (right, “1”). Insert $z_3$.

Heap: $z_2$(0.4), $z_3$(0.6)

Iteration 4. Extract $z_2$(0.4) and $z_3$(0.6). Create root node $z_4$ with frequency $0.4 + 0.6 = 1.0$. $z_4$ has children $z_2$ (left, “0”) and $z_3$ (right, “1”). Insert $z_4$.

Heap: $z_4$(1.0) - done.

The Huffman Tree

         z4 (1.0)
        /         \
    z2 (0.4)    z3 (0.6)
    /     \     /       \
  B(0.2) C(0.2) z1(0.2) A(0.4)
                /     \
             D(0.1)  E(0.1)

Extracting Codewords

Reading the path from root to each leaf:

Symbol	Path	Codeword	Length
B	left, left	00	2
C	left, right	01	2
D	right, left, left	100	3
E	right, left, right	101	3
A	right, right	11	2

Average Code Length

$$L = 0.2 \cdot 2 + 0.2 \cdot 2 + 0.1 \cdot 3 + 0.1 \cdot 3 + 0.4 \cdot 2$$ $$= 0.4 + 0.4 + 0.3 + 0.3 + 0.8 = 2.2 \text{ bits}$$

The entropy was $H \approx 2.121$ bits. Huffman coding achieves $L = 2.2$ bits, within $0.079$ bits of the entropy bound. The gap is small because the probabilities are close to powers of $\frac{1}{2}$ (in particular, $p_A = 0.4 \approx \frac{1}{4}$ and $p_B = p_C = 0.2 = \frac{1}{5}$ are not exact powers, so we lose a little). Whenever the probabilities are exactly powers of $\frac{1}{2}$, Huffman coding achieves entropy exactly.

Complexity

Building the initial heap: Given $n$ symbols, inserting all of them into a min-heap takes $O(n)$ time using the standard heapify procedure (or $O(n \log n)$ if you insert one at a time, though $O(n)$ is achievable via bottom-up heapification).

The merge loop: We perform exactly $n - 1$ iterations. Each iteration extracts the two minimum elements and inserts one new node. Each extract-min costs $O(\log n)$ (the heap has at most $n$ elements throughout, since we remove two and add one per iteration, so the size decreases). Each insert costs $O(\log n)$. Total: $O(n \log n)$.

Overall time complexity: $O(n \log n)$.

For small alphabets this is fast; for ASCII (256 symbols) or Unicode (over a million code points) it is essentially instantaneous.

Decoding: Given the Huffman tree and an encoded bitstream of length $m$, decoding is $O(m)$. For each bit, walk one step in the tree (left on “0”, right on “1”). When a leaf is reached, output the symbol, return to the root, and continue. No backtracking, no lookahead - the prefix-free property guarantees that reaching a leaf uniquely identifies the end of a codeword.

Optimality Proof Sketch

We want to show that the average code length $L$ of any Huffman code satisfies:

$$H(X) \leq L < H(X) + 1$$

Lower bound. This is Shannon’s source coding theorem. Any prefix-free code has codeword lengths $\ell_1, \ldots, \ell_n$ satisfying the Kraft inequality $\sum_i 2^{-\ell_i} \leq 1$. By the concavity of $\log$:

$$L - H(X) = \sum_i p_i \ell_i + \sum_i p_i \log_2 p_i = \sum_i p_i \log_2 \frac{p_i}{2^{-\ell_i}}$$

This is the Kullback-Leibler divergence $D_{\mathrm{KL}}(p | q)$ where $q_i = 2^{-\ell_i} / Z$ and $Z = \sum_i 2^{-\ell_i} \leq 1$. The KL divergence is always non-negative (by Jensen’s inequality applied to $-\log$), with equality iff $p_i = q_i$ for all $i$. Therefore $L \geq H(X)$.

Upper bound. Choose $\ell_i = \lceil -\log_2 p_i \rceil$. These lengths satisfy the Kraft inequality (since $\lceil -\log_2 p_i \rceil \leq -\log_2 p_i + 1$, so $\sum_i 2^{-\ell_i} \leq \sum_i p_i = 1$), and define a valid prefix-free code. The average length is:

$$L = \sum_i p_i \ell_i = \sum_i p_i \lceil -\log_2 p_i \rceil < \sum_i p_i (-\log_2 p_i + 1) = H(X) + 1$$

So there always exists a prefix-free code with $L < H(X) + 1$. Huffman coding achieves the minimum $L$ over all prefix-free codes - this is proved by the exchange argument given in the Greedy Insight section, applied inductively. Therefore Huffman coding is optimal among prefix-free codes, and its average length lies in $[H(X), H(X) + 1)$.

Read Next:

Data Compression - Throwing Away What You Can Reconstruct