Mutual Information & Capacity - What One Variable Tells You About Another
Helpful context:
Shannon entropy $H(X) = -\sum_x p(x) \log p(x)$ measures the average uncertainty in a single random variable. But the most important questions in communication and learning involve two variables at once: how much does observing the output of a noisy channel tell you about the input? How much does a feature $Y$ reduce your uncertainty about a label $X$? These questions have a precise answer, and it is called mutual information.
The key observation is that uncertainty about $X$ before observing $Y$ is $H(X)$, and the average uncertainty remaining after observing $Y$ is the conditional entropy $H(X \mid Y)$. The gap between them - the uncertainty that $Y$ removes - is a quantity that is always non-negative, always symmetric, and connects directly to the deepest results in communication theory. Channel capacity, the data processing inequality, the information bottleneck, and the Shannon-Hartley theorem all flow from this one definition.
What makes this not just a definition but a theorem is Shannon’s channel coding theorem: the capacity $C = \max_{P(X)} I(X;Y)$ is not merely an information-theoretic quantity but a hard operational limit. Any rate below $C$ is achievable with vanishing error. Any rate above $C$ is not, by any code of any complexity. That is the result that turned information theory from a mathematical curiosity into the foundation of every modern communication system.
Mutual Information
Given two random variables $X$ and $Y$ with joint distribution $P(X, Y)$, the mutual information is
$$I(X; Y) = H(X) - H(X \mid Y).$$
This is the reduction in uncertainty about $X$ from learning $Y$. Three equivalent forms are all useful:
$$I(X; Y) = H(X) - H(X \mid Y) = H(Y) - H(Y \mid X) = H(X) + H(Y) - H(X, Y).$$
The second equality establishes symmetry: $I(X; Y) = I(Y; X)$. Knowing $X$ tells you exactly as much about $Y$ as knowing $Y$ tells you about $X$. This is not obvious from the definition, but it falls out of the chain rule $H(X, Y) = H(X) + H(Y \mid X) = H(Y) + H(X \mid Y)$.
The KL divergence interpretation. There is a third formula that reveals the geometric meaning:
$$I(X; Y) = D_{\text{KL}}\left(P(X,Y) | P(X) P(Y)\right) = \sum_{x,y} p(x,y) \log \frac{p(x,y)}{p(x)p(y)}.$$
Mutual information is the KL divergence between the joint distribution and the product of the marginals. The product $P(X)P(Y)$ is what the joint would look like if $X$ and $Y$ were independent. So $I(X;Y)$ measures how far the actual joint distribution is from independence. Since KL divergence is always non-negative, $I(X; Y) \geq 0$, with equality if and only if $X$ and $Y$ are independent.
Examples.
If $Y = X$ (perfect knowledge), then $H(X \mid Y) = 0$ and $I(X; Y) = H(X)$. Learning $X$ perfectly removes all uncertainty.
If $X \perp Y$ (independence), then $H(X \mid Y) = H(X)$ and $I(X; Y) = 0$.
If $X$ is a bit and $Y = X \oplus N$ where $N \sim \text{Bernoulli}(p)$ is an independent noise bit, then $I(X; Y) = 1 - H_b(p)$ where $H_b(p) = -p\log p - (1-p)\log(1-p)$ is the binary entropy function. At $p = 0$ (no noise), $I = 1$ bit. At $p = 1/2$ (pure noise), $I = 0$.
The Data Processing Inequality
If $X \to Y \to Z$ is a Markov chain - meaning $Z$ is conditionally independent of $X$ given $Y$ - then
$$I(X; Z) \leq I(X; Y).$$
Proof sketch. By the chain rule for mutual information applied in two ways:
$$I(X; Y, Z) = I(X; Y) + I(X; Z \mid Y) = I(X; Z) + I(X; Y \mid Z).$$
The Markov condition $X \to Y \to Z$ means $I(X; Z \mid Y) = 0$ (given $Y$, knowing $Z$ adds no information about $X$). Also $I(X; Y \mid Z) \geq 0$ always. Combining: $I(X; Y) = I(X; Z) + I(X; Y \mid Z) \geq I(X; Z)$. $\square$
What this means. Any deterministic or randomized processing of $Y$ can only decrease or maintain the information about $X$. You cannot create information about $X$ by transforming $Y$. The further downstream you go, the less you know.
Machine learning implication. In a classification pipeline, let $X$ be the input, $Y$ the raw features, and $Z$ the learned representation. Then $I(Z; \text{label}) \leq I(Y; \text{label}) \leq I(X; \text{label})$. Every processing step - preprocessing, dimensionality reduction, intermediate layers - can only lose information about the target. A representation cannot be more informative about the label than the raw input. This bounds what any learned model can achieve.
Channel Capacity
A discrete memoryless channel is a conditional distribution $P(Y \mid X)$ mapping input alphabet $\mathcal{X}$ to output alphabet $\mathcal{Y}$. The channel is fixed - you cannot change $P(Y \mid X)$ - but you can choose the input distribution $P(X)$.
The capacity of the channel is
$$C = \max_{P(X)} I(X; Y).$$
Capacity is the maximum mutual information over all possible input distributions. It measures, in bits per channel use, how much information the channel can carry.
Why this is the right definition is Shannon’s channel coding theorem (below). But intuitively: by choosing $P(X)$, you control how much entropy you inject into the channel. The output $Y$ receives some of that entropy. The capacity is the maximum you can arrange to “get through.”
The Binary Symmetric Channel
The simplest non-trivial channel: input $X \in \{0,1\}$, output $Y \in \{0,1\}$, and each bit is flipped independently with probability $p$. So $P(Y \neq X) = p$.
The conditional distributions are $P(Y = x \mid X = x) = 1 - p$ and $P(Y = 1-x \mid X = x) = p$.
To find capacity, we maximize $I(X; Y) = H(Y) - H(Y \mid X)$ over $P(X)$.
The conditional entropy $H(Y \mid X)$ does not depend on $P(X)$: given $X$, the output is just $X$ flipped with probability $p$, so $H(Y \mid X) = H_b(p)$ regardless of the input distribution.
The output entropy $H(Y)$ is maximized when $Y$ is uniform, which happens when $P(X = 0) = P(X = 1) = 1/2$.
Therefore:
$$C = \max_{P(X)} I(X;Y) = 1 - H_b(p).$$
Sanity checks: At $p = 0$ (no flips), $H_b(0) = 0$ so $C = 1$ bit per use. At $p = 1/2$ (flips are pure noise), $H_b(1/2) = 1$ so $C = 0$. At $p = 1$ (every bit flipped), $H_b(1) = 0$ so $C = 1$ again - you know every output is flipped, so you can correct. The channel is most useless at $p = 1/2$, not $p = 1$.
The Gaussian Channel and Shannon-Hartley
The additive white Gaussian noise (AWGN) channel is
$$Y = X + N, \quad N \sim \mathcal{N}(0, \sigma^2), \quad E[X^2] \leq P.$$
The input has a power constraint $P$ (average energy per symbol), and noise variance $\sigma^2$.
Theorem (Shannon-Hartley). The capacity of the AWGN channel is
$$C = \frac{1}{2} \log_2\left(1 + \frac{P}{\sigma^2}\right) \text{ bits per channel use.}$$
The ratio $P/\sigma^2$ is the signal-to-noise ratio (SNR).
Proof sketch. The mutual information $I(X; Y) = h(Y) - h(Y \mid X) = h(Y) - h(N)$, where $h$ denotes differential entropy. Since $h(N) = \frac{1}{2}\log_2(2\pi e \sigma^2)$ is fixed, we want to maximize $h(Y)$. The output $Y = X + N$ has variance $\text{Var}(Y) = P + \sigma^2$ under the power constraint. For a fixed variance, Gaussian maximizes differential entropy, so $h(Y) \leq \frac{1}{2}\log_2(2\pi e (P + \sigma^2))$, with equality when $X \sim \mathcal{N}(0, P)$. Subtracting $h(N)$ gives $C = \frac{1}{2}\log_2(1 + P/\sigma^2)$.
What the formula tells you:
- Doubling the power ($P \to 2P$) adds approximately $\frac{1}{2}\log_2(1 + 2P/\sigma^2) - \frac{1}{2}\log_2(1 + P/\sigma^2)$, which at high SNR is roughly $\frac{1}{2}$ bit per channel use.
- Doubling the bandwidth: with bandwidth $B$ Hz, the channel supports $2B$ independent uses per second (Nyquist), giving capacity $B\log_2(1 + P/(\sigma^2))$ bits/second. Doubling $B$ doubles capacity proportionally.
- This is why bandwidth is more precious than power in wireless systems: you get more from doubling bandwidth than from doubling power.
The Shannon-Hartley theorem sets the theoretical maximum throughput for every WiFi connection, 5G link, and optical fiber cable. Modern codes (LDPC, turbo, polar) approach within a fraction of a dB of this limit.
Shannon’s Channel Coding Theorem
Theorem (Shannon, 1948). For a channel with capacity $C$:
- For any rate $R < C$ and any $\varepsilon > 0$, there exist codes of rate $R$ (i.e., $2^{nR}$ messages encoded in $n$ channel uses) with error probability less than $\varepsilon$ for sufficiently large $n$.
- For any rate $R > C$, the error probability of any code is bounded away from zero.
This theorem has two parts, and both matter. The achievability part says reliable communication at any rate below capacity is possible - you just need long enough codewords. The converse says rates above capacity are fundamentally impossible, not just hard.
The proof of achievability uses random coding: draw codewords i.i.d. from $P(X)^n$ and show that, with high probability, the nearest-neighbor decoder succeeds. The proof of the converse uses Fano’s inequality.
The practical gap. For decades after 1948, the best known codes were far below capacity. In the 1990s, turbo codes (Berrou and Glavieux, 1993) and LDPC codes (rediscovered from Gallager, 1963) were found to approach capacity within a fraction of a dB. Polar codes (Arikan, 2009) are provably capacity-achieving with efficient encoding and decoding. The gap between Shannon’s theorem and practical codes was an engineering frontier for 50 years.
The Information Bottleneck
A powerful application of mutual information to representation learning: you want to learn a compressed representation $Z$ of input $X$ that is maximally predictive of target $Y$. The data processing inequality says $I(Z; Y) \leq I(X; Y)$, so you cannot do better than the raw features. But you want to find the minimal $Z$ that preserves as much relevant information as possible.
The information bottleneck Lagrangian (Tishby, Pereira, Bialek, 1999) is:
$$\min_{P(Z \mid X)} ; I(X; Z) - \beta I(Z; Y).$$
The term $I(X; Z)$ penalizes the amount of information $Z$ retains about $X$ (compression). The term $I(Z; Y)$ rewards how much $Z$ predicts $Y$ (relevance). The tradeoff parameter $\beta$ controls the balance.
At $\beta \to 0$, the optimal $Z$ is a constant - no information at all. At $\beta \to \infty$, $Z$ retains everything about $X$ that is relevant to $Y$. The information bottleneck curve traces the Pareto frontier of compression versus relevance as $\beta$ varies.
This framework has been connected (with some debate) to understanding why deep networks generalize: intermediate layers may learn to discard information about $X$ that is irrelevant to $Y$, compressing toward the information bottleneck solution.
Summary
| Concept | Formula | Key fact |
|---|---|---|
| Mutual information | $I(X;Y) = H(X) - H(X \mid Y)$ | Symmetric: $I(X;Y) = I(Y;X)$ |
| KL divergence form | $I(X;Y) = D_{\text{KL}}(P(X,Y) | P(X)P(Y))$ | Zero iff $X \perp Y$ |
| Data processing | $X \to Y \to Z \Rightarrow I(X;Z) \leq I(X;Y)$ | Processing only loses information |
| Channel capacity | $C = \max_{P(X)} I(X;Y)$ | Hard operational limit |
| Binary symmetric channel | $C = 1 - H_b(p)$ | Zero capacity at $p = 1/2$ |
| Gaussian channel | $C = \frac{1}{2}\log_2(1 + P/\sigma^2)$ | Bandwidth beats power at high SNR |
| Shannon coding theorem | Rate $R < C$: reliable; $R > C$: impossible | Capacity is achievable and tight |
| Information bottleneck | $\min I(X;Z) - \beta I(Z;Y)$ | Compression-relevance tradeoff |
Read next: