Information theory asks a deceptively simple question: how much “information” does an event carry? The answer, due to Claude Shannon (1948), rests on a single formula that appears across machine learning, statistics, compression, and communication theory.

Shannon Entropy

Definition. Let $X$ be a discrete random variable taking values in $\mathcal{X}$ with probability mass function $p(x) = P(X = x)$. The Shannon entropy of $X$ is:

$$H(X) = -\sum_{x \in \mathcal{X}} p(x) \log p(x)$$

When the logarithm is base 2 the unit is bits; when it is the natural logarithm the unit is nats. By convention $0 \log 0 = 0$ (since $\lim_{p \to 0^+} p \log p = 0$).

Intuitively, $-\log p(x)$ is the “surprise” of observing outcome $x$: rare events are more surprising. Entropy is the expected surprise.
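As a quick numerical illustration, here is a minimal Python/NumPy sketch (the `entropy` helper is written for this example, not taken from any library): a fair coin carries exactly 1 bit, while a heavily biased coin carries far less.

```python
import numpy as np

def entropy(p, base=2):
    """Shannon entropy of a pmf given as an array of probabilities."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                      # convention: 0 log 0 = 0
    return float(-np.sum(p * np.log(p)) / np.log(base))

print(entropy([0.5, 0.5]))            # fair coin: 1.0 bit
print(entropy([0.99, 0.01]))          # biased coin: ~0.08 bits
```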

Properties of Entropy

Non-negativity. $H(X) \geq 0$ for all $X$, with equality if and only if $X$ is deterministic (one outcome has probability 1).

Proof. Each term $-p(x)\log p(x) \geq 0$ because $p(x) \in [0,1]$ implies $\log p(x) \leq 0$. Equality holds iff $p(x) \in \{0,1\}$ for all $x$, meaning exactly one outcome has probability 1. $\square$

Maximum entropy: uniform distribution. Among all distributions on $|\mathcal{X}| = n$ outcomes, entropy is maximized by the uniform distribution, giving $H(X) = \log n$.

Proof via Lagrange multipliers. Maximize $H = -\sum_i p_i \log p_i$ (take the logarithm to be natural; changing base only rescales $H$ and does not move the maximizer) subject to $\sum_i p_i = 1$. Form the Lagrangian $\mathcal{L} = -\sum_i p_i \log p_i - \lambda(\sum_i p_i - 1)$. Setting $\partial \mathcal{L}/\partial p_i = 0$:

$$-\log p_i - 1 - \lambda = 0 \implies p_i = e^{-1-\lambda}$$

This is constant in $i$, so $p_i = 1/n$ for all $i$. Substituting gives $H = \log n$. Since $H$ is strictly concave in $(p_1, \ldots, p_n)$, this critical point is a global maximum. $\square$
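A small numerical check of this result (Python/NumPy sketch; the Dirichlet draw is just a convenient way to sample random pmfs): no randomly drawn distribution on $n = 8$ outcomes exceeds $\log_2 8 = 3$ bits, and the uniform distribution attains it.

```python
import numpy as np

def H(p):                                 # entropy in bits, skipping zero entries
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

n = 8
print(H(np.full(n, 1 / n)), np.log2(n))   # uniform: both equal 3.0 bits

rng = np.random.default_rng(0)
samples = rng.dirichlet(np.ones(n), size=1000)   # 1000 random pmfs on n outcomes
print(max(H(p) for p in samples))                # strictly below 3.0
```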

Joint and Conditional Entropy

Definition. The joint entropy of $(X, Y)$ is:

$$H(X, Y) = -\sum_{x,y} p(x,y) \log p(x,y)$$

Definition. The conditional entropy of $X$ given $Y$ is:

$$H(X \mid Y) = \sum_y p(y)\, H(X \mid Y = y) = -\sum_{x,y} p(x,y) \log p(x \mid y)$$

Theorem (Chain Rule). $H(X, Y) = H(Y) + H(X \mid Y)$.

Proof. Write $\log p(x,y) = \log p(y) + \log p(x \mid y)$ and take expectations under $p(x,y)$:

$$H(X,Y) = -\mathbb{E}[\log p(X,Y)] = -\mathbb{E}[\log p(Y)] - \mathbb{E}[\log p(X \mid Y)] = H(Y) + H(X \mid Y). \quad \square$$

Corollary. $H(X \mid Y) \leq H(X)$, with equality iff $X$ and $Y$ are independent. Conditioning reduces uncertainty on average.
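Both the chain rule and the corollary are easy to verify numerically for any joint table. A small sketch (the $3 \times 2$ joint pmf is arbitrary, chosen only for illustration):

```python
import numpy as np

def H(p):                                  # entropy in bits of any pmf array
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

pxy = np.array([[0.10, 0.20],              # rows index x, columns index y
                [0.25, 0.05],
                [0.30, 0.10]])
px, py = pxy.sum(axis=1), pxy.sum(axis=0)

# H(X|Y) computed directly as -sum_{x,y} p(x,y) log p(x|y)
H_X_given_Y = float(-(pxy * np.log2(pxy / py)).sum())

print(np.isclose(H(pxy), H(py) + H_X_given_Y))   # chain rule: True
print(H_X_given_Y <= H(px))                      # conditioning reduces entropy: True
```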

Mutual Information

Definition. The mutual information between $X$ and $Y$ is:

$$I(X; Y) = H(X) - H(X \mid Y) = H(X) + H(Y) - H(X, Y)$$

Equivalently, $I(X; Y) = D_{KL}\big(p(x,y) \,\|\, p(x)p(y)\big)$, the KL divergence between the joint and the product of marginals.

Symmetry. $I(X; Y) = I(Y; X)$, which follows immediately from the formula $H(X) + H(Y) - H(X,Y)$.

Non-negativity. $I(X; Y) \geq 0$, with equality iff $X \perp Y$. This follows from non-negativity of KL divergence.
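Both identities can be checked on a joint table: computing $I(X;Y)$ from entropies and as a KL divergence gives the same non-negative number. A brief sketch (again with an arbitrary toy joint pmf):

```python
import numpy as np

def H(p):
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

pxy = np.array([[0.10, 0.20],
                [0.25, 0.05],
                [0.30, 0.10]])
px, py = pxy.sum(axis=1), pxy.sum(axis=0)

I_entropy = H(px) + H(py) - H(pxy)                           # H(X) + H(Y) - H(X,Y)
I_kl = float((pxy * np.log2(pxy / np.outer(px, py))).sum())  # KL(p(x,y) || p(x)p(y))
print(I_entropy, I_kl)                                       # equal, and >= 0
```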

Data Processing Inequality

Theorem. If $X \to Y \to Z$ form a Markov chain (so $Z \perp X \mid Y$), then $I(X; Y) \geq I(X; Z)$.

Proof. By the chain rule for mutual information applied to $(Y, Z)$:

$$I(X; Y, Z) = I(X; Z) + I(X; Y \mid Z) = I(X; Y) + I(X; Z \mid Y)$$

Since $X \perp Z \mid Y$ we have $I(X; Z \mid Y) = 0$. Since mutual information is non-negative, $I(X; Y \mid Z) \geq 0$. Therefore:

$$I(X; Y) = I(X; Z) + I(X; Y \mid Z) \geq I(X; Z). \quad \square$$

This is fundamental: no deterministic function $Z = f(Y)$ can increase information about $X$ beyond what $Y$ already contains. Processing can only lose information.
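A concrete check (Python sketch; the distribution of $X$ and the two channel matrices are made-up numbers): push a source through one channel to get $Y$, then through a second channel to get $Z$, and compare $I(X;Y)$ with $I(X;Z)$.

```python
import numpy as np

def H(p):
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def mutual_info(pab):
    return H(pab.sum(axis=1)) + H(pab.sum(axis=0)) - H(pab)

px = np.array([0.3, 0.7])                    # distribution of X
P_y_given_x = np.array([[0.9, 0.1],          # channel X -> Y
                        [0.2, 0.8]])
P_z_given_y = np.array([[0.7, 0.3],          # post-processing channel Y -> Z
                        [0.4, 0.6]])

pxy = px[:, None] * P_y_given_x              # p(x, y) = p(x) p(y | x)
pxz = pxy @ P_z_given_y                      # p(x, z) = sum_y p(x, y) p(z | y)
print(mutual_info(pxy), mutual_info(pxz))    # I(X;Y) >= I(X;Z)
```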

Differential Entropy

For a continuous random variable $X$ with density $f$, the differential entropy is:

$$h(X) = -\int_{-\infty}^{\infty} f(x) \log f(x)\, dx$$

Unlike discrete entropy, differential entropy can be negative (e.g., a uniform on $[0, \delta]$ has $h = \log \delta < 0$ for $\delta < 1$). It is not invariant under reparameterisation.

Theorem. Among all densities with fixed variance $\sigma^2$, the Gaussian $\mathcal{N}(0, \sigma^2)$ maximises differential entropy, with $h(X) = \frac{1}{2}\log(2\pi e\sigma^2)$.

Proof. Let $g$ be the $\mathcal{N}(0,\sigma^2)$ density. For any density $f$ with variance $\sigma^2$, the KL divergence satisfies $D_{KL}(f \,\|\, g) \geq 0$, i.e. $-h(f) \leq -\int f \log g$. Computing the right side:

$$-\int f(x)\log g(x)\,dx = \frac{1}{2}\log(2\pi\sigma^2) + \frac{1}{2\sigma^2}\mathbb{E}_f[X^2] = \frac{1}{2}\log(2\pi\sigma^2) + \frac{1}{2} = \frac{1}{2}\log(2\pi e\sigma^2)$$

Therefore $h(f) \leq \frac{1}{2}\log(2\pi e\sigma^2)$, achieved when $f = g$. $\square$
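The theorem can be sanity-checked with closed-form differential entropies: at a fixed variance $\sigma^2$, the uniform and Laplace densities both fall below the Gaussian value $\frac{1}{2}\log(2\pi e\sigma^2)$. A short sketch (entropies in nats; the chosen $\sigma$ is arbitrary):

```python
import numpy as np

sigma = 1.5                                  # common standard deviation

h_gauss   = 0.5 * np.log(2 * np.pi * np.e * sigma**2)   # (1/2) log(2 pi e sigma^2)
h_uniform = np.log(sigma * np.sqrt(12))      # uniform of width w: var w^2/12, h = log w
h_laplace = 1 + np.log(sigma * np.sqrt(2))   # Laplace scale b: var 2b^2, h = 1 + log(2b)

print(h_gauss, h_laplace, h_uniform)         # the Gaussian value is the largest
```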

Source Coding Theorem

Theorem (Shannon, 1948). For a discrete source $X$ with entropy $H(X)$ bits, no lossless uniquely-decodable binary code can achieve expected code length below $H(X)$. Moreover, there exists a code achieving expected length less than $H(X) + 1$:

$$H(X) \leq \mathbb{E}[L(X)] < H(X) + 1$$

Lower bound. By Kraft’s inequality, any uniquely-decodable code satisfies $\sum_x 2^{-L(x)} \leq 1$. Define $q(x) = 2^{-L(x)}/Z$ where $Z = \sum_x 2^{-L(x)} \leq 1$, so that $L(x) = -\log_2 q(x) - \log_2 Z$ with $-\log_2 Z \geq 0$. Then:

$$\mathbb{E}[L(X)] = \sum_x p(x) L(x) = -\sum_x p(x)\log_2 q(x) - \log_2 Z \geq H(X) + D_{KL}(p \,\|\, q) \geq H(X). \quad \square$$

Huffman coding attains the optimal expected length among prefix codes and satisfies this upper bound; arithmetic coding approaches $H(X)$ per symbol on long sequences, realising the theoretical minimum asymptotically.
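A self-contained sketch makes the two-sided bound concrete: build a binary Huffman code for a small pmf and compare its expected length with the entropy (the pmf below is arbitrary):

```python
import heapq
import numpy as np

def huffman_lengths(p):
    """Codeword lengths of a binary Huffman code for a pmf p (all entries > 0)."""
    heap = [(pi, i, [i]) for i, pi in enumerate(p)]   # (prob, tiebreak, symbols in subtree)
    heapq.heapify(heap)
    lengths = [0] * len(p)
    counter = len(p)
    while len(heap) > 1:
        p1, _, s1 = heapq.heappop(heap)               # merge the two least likely subtrees
        p2, _, s2 = heapq.heappop(heap)
        for s in s1 + s2:                             # every symbol in the merge gains one bit
            lengths[s] += 1
        heapq.heappush(heap, (p1 + p2, counter, s1 + s2))
        counter += 1
    return lengths

p = np.array([0.4, 0.2, 0.15, 0.15, 0.1])
L = huffman_lengths(p)
H = float(-(p * np.log2(p)).sum())
EL = float(np.dot(p, L))
print(H, EL)                                          # H(X) <= E[L] < H(X) + 1
```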

Examples

Feature selection via mutual information. Given features $X_1, \ldots, X_d$ and label $Y$, the score $I(X_j; Y)$ measures statistical dependence without assuming linearity, capturing polynomial and threshold relationships that Pearson correlation would miss. A greedy forward selection procedure picks, at each step, the feature maximising the conditional mutual information $I(X_j; Y \mid X_{\text{selected}})$, which reduces redundancy among the selected set. This is the idea behind the mRMR (minimum redundancy, maximum relevance) family of methods widely used in bioinformatics and NLP.
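A minimal sketch of the scoring step, using a hand-rolled plug-in estimator for discrete data (the toy features and the helper names are ours; real pipelines would typically use a library estimator and handle continuous features via discretisation or k-NN methods):

```python
import numpy as np

def mi_discrete(x, y):
    """Plug-in estimate of I(X;Y) in bits for two discrete-valued arrays."""
    _, xi = np.unique(x, return_inverse=True)
    _, yi = np.unique(y, return_inverse=True)
    joint = np.zeros((xi.max() + 1, yi.max() + 1))
    np.add.at(joint, (xi, yi), 1)                   # contingency table of counts
    joint /= joint.sum()
    px = joint.sum(axis=1, keepdims=True)
    py = joint.sum(axis=0, keepdims=True)
    nz = joint > 0
    return float(np.sum(joint[nz] * np.log2(joint[nz] / (px @ py)[nz])))

# Toy data: x2 duplicates x1, x3 is noise, y is a threshold of x1 (nonlinear in x1).
rng = np.random.default_rng(1)
x1 = rng.integers(0, 4, size=5000)
x2 = x1.copy()
x3 = rng.integers(0, 4, size=5000)
y = (x1 >= 2).astype(int)

for name, x in [("x1", x1), ("x2", x2), ("x3", x3)]:
    print(name, round(mi_discrete(x, y), 3))        # x1 and x2 score ~1 bit, x3 ~0

# Conditional MI via the chain rule: I(X2; Y | X1) = I((X1, X2); Y) - I(X1; Y).
pair = x1 * 4 + x2                                  # encode the pair as one discrete variable
print(round(mi_discrete(pair, y) - mi_discrete(x1, y), 3))   # ~0: x2 is redundant given x1
```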

Information bottleneck. Given inputs $X$ and labels $Y$, the information bottleneck method finds a compressed representation $T$ of $X$ that maximally preserves information about $Y$ while compressing $X$. It minimises $I(X; T) - \beta\, I(T; Y)$ over stochastic encoders $p(t \mid x)$, subject to the Markov structure $Y \leftarrow X \to T$. The data processing inequality guarantees $I(T; Y) \leq I(X; Y)$, bounding what any representation can retain. As $\beta$ varies, the optimal encoders trace a tradeoff curve in the information plane, the plane with axes $I(X; T)$ and $I(T; Y)$. This framework has been used to interpret the representations learned by deep neural networks, with layers argued to first increase $I(T; Y)$ during fitting and then reduce $I(X; T)$ during compression, though the empirical claims remain an active area of debate.
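To make the objective concrete, a small sketch evaluates $I(X;T)$, $I(T;Y)$, and the bottleneck objective for one hand-picked stochastic encoder $p(t \mid x)$ (all numbers below are illustrative; a full IB solver would optimise the encoder, e.g. by a Blahut-Arimoto-style iteration):

```python
import numpy as np

def H(p):
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def mutual_info(pab):
    return H(pab.sum(axis=1)) + H(pab.sum(axis=0)) - H(pab)

pxy = np.array([[0.20, 0.05],                # toy joint p(x, y): 4 inputs, 2 labels
                [0.15, 0.10],
                [0.05, 0.20],
                [0.05, 0.20]])
px = pxy.sum(axis=1)

p_t_given_x = np.array([[0.9, 0.1],          # stochastic encoder: 4 inputs -> 2 clusters
                        [0.8, 0.2],
                        [0.1, 0.9],
                        [0.1, 0.9]])

pxt = px[:, None] * p_t_given_x              # p(x, t) = p(x) p(t | x)
pty = p_t_given_x.T @ pxy                    # p(t, y) = sum_x p(t | x) p(x, y)

beta = 2.0
print(mutual_info(pxt), mutual_info(pty))    # I(T;Y) <= I(X;Y), by data processing
print(mutual_info(pxt) - beta * mutual_info(pty))   # the bottleneck objective
```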


The natural companion to entropy is KL divergence: a directed measure of how much one distribution deviates from another, built from the same logarithmic foundation developed here.