Helpful context:


Shannon’s source coding theorem answers a clean question: what is the minimum number of bits needed to describe a source with zero error? The answer is entropy $H(X)$. But in practice, zero error is rarely necessary. A JPEG image does not need to reconstruct every pixel exactly. An MP3 does not need to reproduce every audio sample precisely. As long as the reconstruction is close enough - close enough that a human viewer or listener cannot tell the difference - you can use far fewer bits than $H(X)$.

Rate-distortion theory is Shannon’s answer to this generalized question. Given a source $X$ and a tolerance for inaccuracy, what is the minimum rate (bits per source symbol) needed to achieve distortion at most $D$? The result is a function $R(D)$ - the rate-distortion function - that tells you the fundamental trade-off between bit budget and reconstruction quality. Just as channel capacity is an upper bound that no communication system can exceed, $R(D)$ is a lower bound that no compression system can beat.

What makes this more than a definition is Shannon’s lossy source coding theorem: $R(D)$ is tight. Any rate above $R(D)$ is achievable with distortion at most $D$. Any rate below $R(D)$ forces distortion strictly greater than $D$. The boundary is exact, and it is what JPEG encoders, video codecs, and neural image compression systems are all trying to approach.


Distortion Measures

A distortion measure is a function $d: \mathcal{X} \times \hat{\mathcal{X}} \to [0, \infty)$, where $\hat{\mathcal{X}}$ is the reconstruction alphabet. The quantity $d(x, \hat{x})$ is the cost of reproducing $x$ as $\hat{x}$.

The two canonical choices:

Squared error (for continuous sources): $d(x, \hat{x}) = (x - \hat{x})^2$. This penalizes large errors quadratically and is the standard for audio, image, and signal sources.

Hamming distortion (for discrete sources): $d(x, \hat{x}) = \mathbf{1}[x \neq \hat{x}]$. Every mismatch costs 1. This is the right measure when errors are equally bad regardless of magnitude - for instance, when source symbols are unordered categories.

The expected distortion of a code is $E[d(X, \hat{X})]$, where the expectation is over the source distribution and the randomness of the encoder and decoder.


The Rate-Distortion Function

A rate-$R$ code for source $X$ with blocklength $n$ maps sequences $x^n = (x_1, \ldots, x_n)$ to one of $2^{nR}$ codewords and then to a reconstruction $\hat{x}^n$. The rate is $R$ bits per source symbol.

The rate-distortion function is

$$R(D) = \min_{\substack{P(\hat{X} \mid X) \\ E[d(X, \hat{X})] \leq D}} I(X; \hat{X}).$$

The minimization is over all conditional distributions $P(\hat{X} \mid X)$ - all “soft” lossy encoders - subject to the distortion constraint. The minimum achievable mutual information, given that you can afford distortion $D$, is the minimum number of bits per symbol you need.

Properties:

  • $R(D)$ is non-increasing: more distortion tolerance means fewer bits needed.
  • $R(D)$ is convex: the function curves downward; there are diminishing returns to accepting more distortion.
  • $R(0) = H(X)$ for discrete sources (zero distortion requires lossless coding).
  • $R(D) = 0$ for $D \geq D_{\max}$, where $D_{\max} = \min_{\hat{x}} E[d(X, \hat{x})]$ is the distortion achievable by ignoring the source entirely and outputting the best constant reconstruction.

The convexity of $R(D)$ has a practical meaning: the first bits you save (by tolerating small distortion) come cheaply, but pushing to very low rate requires accepting disproportionately large distortion.


Gaussian Source: The Explicit Formula

For a Gaussian source $X \sim \mathcal{N}(0, \sigma^2)$ with squared error distortion, the rate-distortion function has a closed form:

$$R(D) = \max\left(0, \frac{1}{2} \log_2 \frac{\sigma^2}{D}\right).$$

This holds for $0 \leq D \leq \sigma^2$; for $D > \sigma^2$, $R(D) = 0$ (just output zero).

Reading the formula:

  • At $D = \sigma^2$ (allowed distortion equals the variance), $R = 0$: outputting the mean (zero) achieves distortion $\sigma^2$ at zero rate.
  • As $D \to 0$, $R \to \infty$: arbitrarily accurate reconstruction requires arbitrarily many bits.
  • Halving the distortion ($D \to D/2$) costs exactly 1 extra bit per sample: $R(D/2) = R(D) + \frac{1}{2}\log_2 2 = R(D) + \frac{1}{2}$.

This $+\frac{1}{2}$ bit per halving is the rate-distortion analog of Shannon-Hartley’s relationship between power and capacity.

The optimal encoder for a Gaussian source is a quantizer that maps $X$ to the nearest point in a codebook drawn from $\mathcal{N}(0, \sigma^2 - D)$. The reconstruction error is $\hat{X} - X$, which has variance $D$ and is independent of $\hat{X}$ when the code achieves $R(D)$ exactly. This independence - the “innovation” is orthogonal to the reconstruction - is the Gaussian version of sufficiency.


Water-Filling for Non-i.i.d. Sources

For a Gaussian source with spectral density $S(f)$ (not white noise), the optimal rate allocation across frequencies follows the water-filling principle.

Imagine the spectral density $S(f)$ as the floor of a container. You “pour water” to a uniform level $\lambda$. Each frequency component $f$ gets rate allocation

$$R(f) = \max\left(0, \frac{1}{2}\log_2 \frac{\lambda}{S(f)}\right),$$

where $\lambda$ is chosen so the total rate equals the target $R$. High-power components (tall $S(f)$) get more bits; components with $S(f) > \lambda$ get zero bits (not worth encoding).

The total distortion is $D = \int \min(S(f), \lambda) df$: frequencies coded at rate $R(f) > 0$ contribute distortion $\lambda$; frequencies not coded contribute their full power $S(f)$.

Why this appears in practice. JPEG implicitly performs water-filling: the DCT decomposes the image into frequency components (the $8 \times 8$ block coefficients), and quantization allocates more bits to low-frequency components (large $S(f)$, carrying most of the image energy) and fewer to high-frequency components (small $S(f)$, fine detail). Audio codecs like MP3 do the same explicitly in the frequency domain.


Shannon’s Lossy Source Coding Theorem

Theorem. For a source $X$ with rate-distortion function $R(D)$:

  • For any $R > R(D)$, there exist codes of rate $R$ achieving expected distortion at most $D$.
  • For any code of rate $R < R(D)$, the expected distortion strictly exceeds $D$.

The achievability proof parallels the channel coding proof: generate $2^{nR}$ codewords i.i.d. from the distribution $P(\hat{X})$ that achieves $R(D)$. For a typical source sequence $x^n$, there exists a codeword $\hat{x}^n$ that is jointly typical with $x^n$ when $R > I(X; \hat{X})$. The expected distortion of this code converges to $E[d(X, \hat{X})]$ as $n \to \infty$.

The converse uses the fact that $I(X^n; \hat{X}^n) \leq nR$ for a rate-$R$ code, combined with the convexity of $R(D)$ to bound the achievable distortion below.

Like channel capacity, $R(D)$ is an information-theoretic limit: it says nothing about computational complexity, latency, or practical code design. Achieving $R(D)$ in practice requires long block codes and increasingly complex encoding. The gap between $R(D)$ and what practical codecs achieve is an ongoing engineering challenge.


Source-Channel Duality

Rate-distortion theory and channel capacity are dual problems:

Channel coding Source coding (rate-distortion)
Capacity $C = \max_{P(X)} I(X;Y)$ Rate-distortion $R(D) = \min_{P(\hat{X} \mid X)} I(X;\hat{X})$
Maximize mutual information Minimize mutual information
Over input distributions Over reconstruction channels
Limit: achievable rates Limit: achievable distortions

Channel capacity answers: given a channel, what is the maximum information I can get through? Rate-distortion answers: given a source and a distortion budget, what is the minimum information I need to preserve? One is a maximum; the other is a minimum. Both are mutual informations. The Shannon-Hartley theorem and the Gaussian rate-distortion formula are related by this duality: $C = \frac{1}{2}\log(1 + P/\sigma^2)$ and $R(D) = \frac{1}{2}\log(\sigma^2/D)$.


Applications: JPEG and Neural Image Compression

JPEG operates on an $R$-$D$ trade-off. The “quality” parameter controls quantization step sizes in the DCT domain. A coarser quantization step means more reconstruction error (higher $D$) but fewer bits after entropy coding (lower $R$). Choosing a quality parameter is choosing a point on the $R(D)$ curve - though JPEG does not explicitly compute $R(D)$ and does not achieve it optimally.

Neural image compression (Balle et al., Google, 2017 onwards) directly optimizes the rate-distortion Lagrangian. The encoder $f_\theta$ compresses image $X$ to a latent $Z = f_\theta(X)$; a quantizer and entropy model estimate the rate $R = -\log P(Z)$; the decoder $g_\phi$ reconstructs $\hat{X} = g_\phi(Z)$. Training minimizes

$$\mathcal{L} = R + \lambda D = -\log P(Z) + \lambda d(X, g_\phi(Z))$$

jointly over encoder, decoder, and entropy model parameters. Different values of $\lambda$ trace out different points on the $R$-$D$ curve. At high enough quality (low $\lambda$), neural codecs now outperform HEVC on standard benchmarks, achieving lower distortion at the same rate. They approach the true $R(D)$ more closely than handcrafted codecs because the encoder and decoder are jointly optimized.


Summary

Concept Formula Interpretation
Rate-distortion function $R(D) = \min_{P(\hat{X}\mid X): E[d] \leq D} I(X;\hat{X})$ Min bits/symbol at distortion $D$
Gaussian source, squared error $R(D) = \frac{1}{2}\log_2(\sigma^2/D)$ Halving distortion costs 1 bit
Properties $R(D)$ non-increasing, convex First bits cheapest; zero rate at $D_{\max}$
Water-filling $R(f) = \frac{1}{2}\log_2(\lambda/S(f))$ Allocate bits to high-power components
Shannon’s theorem $R > R(D)$ achievable; $R < R(D)$ not $R(D)$ is a tight operational limit
Neural compression Minimize $R + \lambda D$ end-to-end Approaches $R(D)$ with learned encoder/decoder

The rate-distortion function is as fundamental as channel capacity. It tells you not just how to compress, but how much compression is possible - and what you must sacrifice to get there.


Read next: