Entropy & Information Theory - The Mathematics of Surprise
Helpful context:
- Probability Distributions - The Shapes That Randomness Takes
- Expectation, Variance & Covariance - The Center, the Spread, and the Relationship
In 1948, a mathematician and engineer at Bell Labs named Claude Shannon published a paper that would reshape the century. The question he was trying to answer sounds almost philosophical: how much information is in a message?
Not “how long is the message” or “how important is the message” - but how much genuine, irreducible information does it carry? Can you compress it, and if so, by how much? Is there a hard floor below which no compression scheme can go?
Shannon found that the answer is a precise number, computable from the probabilities of the symbols in the message. That number - entropy - now governs everything from ZIP files to how neural networks learn to classify images. Understanding it means understanding why some things compress easily and others do not, and why cross-entropy is the loss function of choice in machine learning.
The Key Intuition: Rare Events Carry More Information
Before any formulas, consider two statements:
- “The sun rose this morning.”
- “It snowed in Cairo in July.”
The first carries almost no information. You already knew the sun would rise. Learning it happened tells you nothing new. The second is shocking - it is an extremely rare event, and learning it genuinely updates your picture of the world in a large way.
This is the core intuition: information is inversely related to probability. Certain events carry zero information. Rare events carry a lot.
We want a function $I(x)$ that measures the information content of observing outcome $x$ with probability $p(x)$. It should satisfy:
- $I(x) \geq 0$ - information is non-negative.
- If $p(x) = 1$ (certain), then $I(x) = 0$.
- If $p(x)$ is very small, $I(x)$ is very large.
- If two independent events both occur, their information contents add: $I(x, y) = I(x) + I(y)$ when $X \perp Y$.
The only function satisfying all four properties is the logarithm:
$$I(x) = -\log P(x) = \log \frac{1}{P(x)}.$$
This is called the self-information of the event $x$. In base 2, the unit is bits; in base $e$ (natural log), the unit is nats. The choice of base is a choice of units - like measuring distance in miles versus kilometers.
A fair coin flip has self-information $-\log_2(1/2) = 1$ bit. A fair die showing 3 has self-information $\log_2 6 \approx 2.58$ bits. An event with probability $1/1024$ carries exactly 10 bits.
Shannon Entropy: The Expected Information
If $I(x) = -\log p(x)$ is the information in a single outcome, what is the average information across all possible outcomes? This is the Shannon entropy:
$$H(X) = E[-\log P(X)] = -\sum_{x \in \mathcal{X}} p(x) \log p(x).$$
By convention, $0 \log 0 = 0$ (since $p \log p \to 0$ as $p \to 0^+$, so isolated impossible events contribute nothing).
Entropy measures the average unpredictability of a random variable. A high-entropy variable surprises you a lot, on average. A low-entropy variable is mostly predictable.
Examples:
A fair coin ($p = 1/2$ for heads and tails): $H = -\frac{1}{2}\log_2\frac{1}{2} - \frac{1}{2}\log_2\frac{1}{2} = 1$ bit. One bit of uncertainty per flip.
A biased coin with $p = 0.9$ for heads: $H = -0.9\log_2(0.9) - 0.1\log_2(0.1) \approx 0.469$ bits. Knowing the coin is heavily biased toward heads, you are already less surprised on average.
A uniform distribution over $n$ outcomes: $H = -\sum_{i=1}^n \frac{1}{n}\log\frac{1}{n} = \log_2 n$ bits. This is the maximum possible entropy for $n$ outcomes.
A certain outcome ($p(x_0) = 1$): $H = -1 \cdot \log 1 = 0$. No uncertainty, no information.
The maximum entropy principle: entropy is maximized by the uniform distribution. When you know nothing about which outcomes are more likely, you are maximally uncertain, and the uniform distribution reflects this. Any deviation from uniformity - any lumpiness in probability - reduces entropy.
Joint and Conditional Entropy
When we have two random variables $X$ and $Y$, we need to track uncertainty about each, and about their relationship.
The joint entropy is just the entropy of the pair, treating $(X, Y)$ as a single random variable:
$$H(X, Y) = -\sum_{x, y} p(x, y) \log p(x, y).$$
The conditional entropy $H(X \mid Y)$ measures how much uncertainty about $X$ remains, on average, after you learn $Y$:
$$H(X \mid Y) = \sum_y p(y) H(X \mid Y = y) = -\sum_{x, y} p(x, y) \log p(x \mid y).$$
These are related by the chain rule:
$$H(X, Y) = H(Y) + H(X \mid Y).$$
Reading this left to right: the total uncertainty in the pair equals the uncertainty in $Y$ plus the remaining uncertainty in $X$ once you know $Y$. Proof: write $\log p(x, y) = \log p(y) + \log p(x \mid y)$ and take expectations under the joint distribution.
A key consequence: $H(X \mid Y) \leq H(X)$, with equality if and only if $X$ and $Y$ are independent. Learning something can only reduce or maintain uncertainty - it cannot increase it. Conditioning on extra information never makes things more confusing on average.
Mutual Information
How much does knowing $Y$ tell you about $X$? The answer is mutual information:
$$I(X; Y) = H(X) - H(X \mid Y).$$
This is the reduction in uncertainty about $X$ that comes from observing $Y$. By symmetry of the formula $H(X) + H(Y) - H(X, Y)$ (which you get by applying the chain rule twice), mutual information is symmetric: $I(X; Y) = I(Y; X)$. Knowing $X$ tells you exactly as much about $Y$ as knowing $Y$ tells you about $X$ - a non-obvious fact.
Mutual information is always non-negative: $I(X; Y) \geq 0$, with equality if and only if $X$ and $Y$ are independent. This follows from properties of KL divergence that we will come to shortly.
Unlike correlation, mutual information captures nonlinear dependence - two variables can have zero correlation but positive mutual information, if they are related in a nonlinear way.
Cross-Entropy: The Wrong Code
Return to the compression picture. If $X$ has distribution $p$, the optimal code assigns outcome $x$ a codeword of length $-\log_2 p(x)$. The expected length of this code is $H(p) = -\sum_x p(x) \log p(x)$ - Shannon’s entropy.
Now suppose you do not know $p$. Instead, you have a model distribution $q$, and you design your code based on $q$ - assigning codeword lengths $-\log q(x)$. But the true outcomes are actually drawn from $p$. The expected code length under this mismatch is the cross-entropy:
$$H(p, q) = -\sum_x p(x) \log q(x).$$
Cross-entropy is always at least as large as entropy: $H(p, q) \geq H(p)$, with equality if and only if $p = q$. Using the wrong distribution to design your code can only waste bits, never save them.
In machine learning, cross-entropy is the standard loss function for classification. Your model produces a probability distribution $q$ over classes; the true label is a one-hot distribution $p$. The cross-entropy loss $H(p, q) = -\log q(y_{\text{true}})$ penalizes the model for assigning low probability to the correct class.
KL Divergence: The Gap Between Distributions
The excess cost of using $q$ instead of $p$ is the difference $H(p, q) - H(p)$. This quantity is the Kullback-Leibler divergence (also called relative entropy):
$$D_{\text{KL}}(p | q) = H(p, q) - H(p) = \sum_x p(x) \log \frac{p(x)}{q(x)}.$$
KL divergence measures how different $q$ is from $p$, from $p$’s point of view. Key properties:
- Non-negativity: $D_{\text{KL}}(p | q) \geq 0$ for all distributions $p, q$. This follows from Jensen’s inequality applied to the convex function $-\log$.
- Zero iff equal: $D_{\text{KL}}(p | q) = 0$ if and only if $p = q$ everywhere.
- Asymmetry: $D_{\text{KL}}(p | q) \neq D_{\text{KL}}(q | p)$ in general.
Discomfort check. KL divergence is not a distance in the mathematical sense. It is not symmetric, and it does not satisfy the triangle inequality. The direction matters. $D_{\text{KL}}(p | q)$ - “forward KL” or “inclusive KL” - penalizes cases where $p(x) > 0$ but $q(x) \approx 0$: $q$ must cover every mode that $p$ has. This pushes $q$ to spread its mass broadly. $D_{\text{KL}}(q | p)$ - “reverse KL” - penalizes cases where $q(x) > 0$ but $p(x) \approx 0$: $q$ must not put mass where $p$ does not. This pushes $q$ to concentrate on $p$’s modes. In variational inference, minimizing the reverse KL leads to mode-seeking behavior (the approximate posterior collapses onto one mode of the true posterior); minimizing the forward KL leads to mean-seeking behavior (the approximate posterior spreads to cover all modes).
Why Minimizing Cross-Entropy is Maximum Likelihood
Suppose your model is a parameterized distribution $q_\theta(x)$ and you observe data $x_1, \ldots, x_n$ drawn i.i.d. from the true distribution $p$. The empirical distribution of your data is $\hat{p}(x) = \frac{1}{n}\sum_{i=1}^n \mathbf{1}[X_i = x]$.
The cross-entropy of $\hat{p}$ and $q_\theta$ is:
$$H(\hat{p}, q_\theta) = -\sum_x \hat{p}(x) \log q_\theta(x) = -\frac{1}{n}\sum_{i=1}^n \log q_\theta(x_i).$$
Minimizing this over $\theta$ is identical to maximizing $\frac{1}{n}\sum_i \log q_\theta(x_i)$, which is the log-likelihood. Minimizing cross-entropy loss equals maximum likelihood estimation. The two are mathematically equivalent; cross-entropy is just the language of information theory, while likelihood is the language of statistics.
Equivalently, minimizing $H(\hat{p}, q_\theta)$ minimizes $D_{\text{KL}}(\hat{p} | q_\theta) + H(\hat{p})$, and since $H(\hat{p})$ is fixed, this minimizes the KL divergence between your model and the empirical data distribution.
Shannon’s Source Coding Theorem
Shannon’s source coding theorem makes precise the connection between entropy and compression:
Theorem. For a discrete source $X$ with entropy $H(X)$ bits, no lossless code can achieve an expected code length below $H(X)$ bits per symbol. Moreover, there exist codes achieving expected length less than $H(X) + 1$ bits.
In other words, $H(X)$ is the absolute minimum average description length of samples from $X$. Entropy is the theoretical compression floor. You cannot do better; you can get arbitrarily close.
Huffman coding achieves the upper bound (at most $H(X) + 1$ bits). Arithmetic coding approaches $H(X)$ for long sequences. Practical compressors like gzip and lz77 approximate these optimal codes by modeling the source distribution - they work better on English text than on random binary data because English has much lower entropy than a uniform distribution over all byte sequences. A file full of zeros compresses almost to nothing; a file of truly random bytes cannot be compressed at all.
The deeper point: if a file compresses well, it has structure - its distribution has low entropy. If it does not compress, it looks random - its distribution is nearly uniform. Compression and information content are two sides of the same coin.
Summary
| Concept | Formula |
|---|---|
| Self-information | $I(x) = -\log P(x)$ |
| Shannon entropy | $H(X) = -\sum_x p(x)\log p(x)$ |
| Maximum entropy | Achieved by uniform distribution: $H = \log n$ |
| Chain rule | $H(X, Y) = H(Y) + H(X \mid Y)$ |
| Mutual information | $I(X;Y) = H(X) - H(X \mid Y)$ |
| Cross-entropy | $H(p, q) = -\sum_x p(x)\log q(x) \geq H(p)$ |
| KL divergence | $D_{\text{KL}}(p | q) = \sum_x p(x)\log\frac{p(x)}{q(x)} \geq 0$ |
| Cross-entropy loss | Equivalent to maximum likelihood estimation |
Shannon’s framework started as a theory of communication over noisy channels. It turned out to be a theory of information itself - applicable wherever probability and uncertainty appear. The loss functions of modern deep learning, the foundations of Bayesian inference, the principles of data compression, and the analysis of communication systems all speak this language.
Read next: