KL Divergence
KL divergence quantifies how much one probability distribution differs from a reference distribution. It appears in variational inference, reinforcement learning, information theory, and the analysis of statistical estimators - always as the natural measure of the “cost” of using the wrong distribution.
Definition
Definition. Let $P$ and $Q$ be probability distributions over the same space $\mathcal{X}$. The Kullback-Leibler divergence (also called relative entropy) from $P$ to $Q$ is:
$$D_{KL}(P | Q) = \sum_{x \in \mathcal{X}} P(x) \log \frac{P(x)}{Q(x)}$$
for discrete distributions, and for continuous distributions with densities $p$, $q$:
$$D_{KL}(P | Q) = \int p(x) \log \frac{p(x)}{q(x)} \, dx$$
By convention $0 \log (0/q) = 0$ and $p \log(p/0) = +\infty$ whenever $p > 0$. Thus $D_{KL}(P | Q) = +\infty$ whenever $Q$ assigns zero probability to an event that $P$ does not - $P$ must be absolutely continuous with respect to $Q$ for the divergence to be finite.
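As a quick sanity check of the definition and these conventions, here is a minimal NumPy sketch (the function name and the example distributions are illustrative, not from the text above):

```python
import numpy as np

def kl_divergence(p, q):
    """Discrete D_KL(P | Q) in nats, following the conventions above."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    if np.any((p > 0) & (q == 0)):
        return np.inf                 # P is not absolutely continuous w.r.t. Q
    mask = p > 0                      # 0 * log(0/q) = 0 by convention
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

p = [0.5, 0.5, 0.0]
q = [0.4, 0.4, 0.2]
print(kl_divergence(p, q))            # finite: Q covers P's support
print(kl_divergence(q, p))            # inf: P puts zero mass where Q does not
```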
Non-negativity: The Information Inequality
Theorem (Information Inequality). For all distributions $P$ and $Q$:
$$D_{KL}(P | Q) \geq 0$$
with equality if and only if $P = Q$ almost everywhere.
Proof via Jensen’s inequality. The function $f(t) = -\log t$ is strictly convex (since $f''(t) = 1/t^2 > 0$). Write:
$$D_{KL}(P | Q) = -\sum_x P(x) \log \frac{Q(x)}{P(x)} = \mathbb{E}_P\left[-\log \frac{Q(X)}{P(X)}\right]$$
By Jensen’s inequality applied to the convex function $-\log$:
$$\mathbb{E}_P\left[-\log \frac{Q(X)}{P(X)}\right] \geq -\log \mathbb{E}_P\left[\frac{Q(X)}{P(X)}\right] = -\log \sum_x P(x) \cdot \frac{Q(x)}{P(x)} = -\log 1 = 0$$
Equality in Jensen holds iff $Q(x)/P(x)$ is constant $P$-almost everywhere; since both distributions sum to 1, the constant must be 1, which forces $P = Q$. $\square$
Asymmetry
KL divergence is not symmetric: $D_{KL}(P | Q) \neq D_{KL}(Q | P)$ in general. Consider $P = \mathcal{N}(0, 1)$ and $Q = \mathcal{N}(0, 4)$. Then $D_{KL}(P | Q) = \frac{1}{2}(\log 4 - 1 + 1/4) \approx 0.32$ while $D_{KL}(Q | P) = \frac{1}{2}(\log(1/4) - 1 + 4) \approx 0.81$. Because it is asymmetric and does not satisfy the triangle inequality, $D_{KL}$ is not a metric, and it is called a divergence rather than a distance.
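The asymmetry can be verified numerically using the standard closed form for the KL divergence between univariate Gaussians (a small sketch; the helper name is ours):

```python
import numpy as np

def gaussian_kl(mu1, var1, mu2, var2):
    """D_KL(N(mu1, var1) | N(mu2, var2)) for univariate Gaussians, in nats."""
    return 0.5 * (np.log(var2 / var1) + (var1 + (mu1 - mu2) ** 2) / var2 - 1.0)

# P = N(0, 1), Q = N(0, 4)
print(gaussian_kl(0, 1, 0, 4))   # ~0.32
print(gaussian_kl(0, 4, 0, 1))   # ~0.81 -- the two directions disagree
```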
Forward vs. Reverse KL
The direction of KL divergence matters for approximation. In variational inference we approximate a target $P$ with a simpler $Q$.
Forward KL (moment-matching or mean-seeking): minimise $D_{KL}(P | Q)$. This penalises $Q$ heavily wherever $Q(x) \approx 0$ but $P(x) > 0$, forcing $Q$ to cover all modes of $P$. Minimising the forward KL is equivalent to maximising $\mathbb{E}_P[\log Q(X)]$, and when $Q$ is an exponential family the optimum matches $Q$'s expected sufficient statistics to those under $P$ - hence “moment-matching.”
Reverse KL (mode-seeking): minimise $D_{KL}(Q | P)$. This penalises $Q$ wherever $Q(x) > 0$ but $P(x) \approx 0$, encouraging $Q$ to concentrate on a single mode of $P$ rather than spreading across all modes. Variational inference in the standard ELBO sense minimises the reverse KL.
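The contrast is easy to see numerically. The sketch below fits a single Gaussian to an assumed bimodal target under each objective by brute-force search (the grid choices and helper names are illustrative): the forward-KL fit spreads over both modes, while the reverse-KL fit locks onto one.

```python
import numpy as np

def normal_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def kl(a, b):
    mask = a > 0
    return np.sum(a[mask] * np.log(a[mask] / b[mask]))

# Target P: equal mixture of N(-3, 1) and N(3, 1), discretised on a grid.
x = np.linspace(-10, 10, 2001)
p = 0.5 * normal_pdf(x, -3, 1) + 0.5 * normal_pdf(x, 3, 1)
p /= p.sum()

# Fit a single Gaussian Q by brute-force search over (mu, sigma).
best_fwd = (np.inf, 0.0, 0.0)
best_rev = (np.inf, 0.0, 0.0)
for mu in np.linspace(-5, 5, 101):
    for sigma in np.linspace(0.5, 5, 46):
        q = normal_pdf(x, mu, sigma)
        q /= q.sum()
        fwd, rev = kl(p, q), kl(q, p)
        if fwd < best_fwd[0]:
            best_fwd = (fwd, mu, sigma)
        if rev < best_rev[0]:
            best_rev = (rev, mu, sigma)

print("forward KL optimum (mu, sigma):", best_fwd[1:])  # wide, covers both modes
print("reverse KL optimum (mu, sigma):", best_rev[1:])  # narrow, sits on one mode
```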
Cross-Entropy and Its Relation to KL
Definition. The cross-entropy of $Q$ relative to $P$ is:
$$H(P, Q) = -\sum_x P(x) \log Q(x) = \mathbb{E}_P[-\log Q(X)]$$
Theorem. $H(P, Q) = H(P) + D_{KL}(P | Q)$.
Proof.
$$H(P, Q) = -\sum_x P(x)\log Q(x) = -\sum_x P(x)\log P(x) + \sum_x P(x)\log\frac{P(x)}{Q(x)} = H(P) + D_{KL}(P | Q). \quad \square$$
In classification, $P$ is the one-hot label distribution and $Q$ is the model’s predicted distribution. Minimising the cross-entropy loss $H(P, Q)$ is equivalent to minimising $D_{KL}(P | Q)$ since $H(P)$ is constant with respect to model parameters.
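A small numerical check of the decomposition and of the one-hot special case (the example distributions are made up):

```python
import numpy as np

def entropy(p):
    mask = p > 0
    return -np.sum(p[mask] * np.log(p[mask]))

def cross_entropy(p, q):
    mask = p > 0
    return -np.sum(p[mask] * np.log(q[mask]))

def kl(p, q):
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

p = np.array([0.7, 0.2, 0.1])    # "true" distribution
q = np.array([0.5, 0.3, 0.2])    # model's predicted distribution
assert np.isclose(cross_entropy(p, q), entropy(p) + kl(p, q))

# One-hot label: H(P) = 0, so the cross-entropy loss equals D_KL(P | Q).
label = np.array([0.0, 1.0, 0.0])
print(cross_entropy(label, q), kl(label, q))   # both equal -log q[1]
```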
Jensen-Shannon Divergence
To recover symmetry, define $M = (P + Q)/2$. The Jensen-Shannon divergence is:
$$\text{JSD}(P | Q) = \frac{1}{2}D_{KL}(P | M) + \frac{1}{2}D_{KL}(Q | M)$$
Properties.
- Symmetric: $\text{JSD}(P | Q) = \text{JSD}(Q | P)$.
- Bounded: $0 \leq \text{JSD}(P | Q) \leq \log 2$ (using natural logarithm; bounded by 1 in bits).
- Its square root $\sqrt{\text{JSD}}$ is a metric, satisfying the triangle inequality.
JSD is the divergence used by generative adversarial networks (GANs): the original GAN objective can be shown to minimise $\text{JSD}(P_{\text{data}} | P_G)$ between the data distribution and the generator’s distribution.
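A minimal sketch of the JSD computation, illustrating the symmetry and the $\log 2$ bound for distributions with disjoint supports (helper names are ours):

```python
import numpy as np

def kl(p, q):
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

def jsd(p, q):
    """Jensen-Shannon divergence in nats; symmetric and always finite."""
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p = np.array([1.0, 0.0])
q = np.array([0.0, 1.0])
print(jsd(p, q), np.log(2))   # disjoint supports hit the upper bound log 2
print(jsd(q, p))              # same value in the other direction, unlike KL
```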
Rényi Divergence
KL divergence is a special case of the broader Rényi divergence family, parametrised by an order $\alpha \in (0, 1) \cup (1, \infty)$ (and extended to $0$, $1$, and $\infty$ by taking limits):
$$D_\alpha(P | Q) = \frac{1}{\alpha - 1} \log \sum_x P(x)^\alpha Q(x)^{1-\alpha}$$
As $\alpha \to 1$, $D_\alpha(P | Q) \to D_{KL}(P | Q)$, as can be checked with L’Hôpital’s rule. At $\alpha = 2$, it is a monotone transform of the $\chi^2$-divergence: $D_2(P | Q) = \log(1 + \chi^2(P | Q))$. Rényi divergences appear in differential privacy (where the privacy budget is often measured in terms of a Rényi divergence) and in the analysis of Monte Carlo estimators.
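Both facts can be checked numerically on assumed example distributions:

```python
import numpy as np

def renyi(p, q, alpha):
    """Rényi divergence D_alpha(P | Q) for discrete distributions, in nats."""
    return np.log(np.sum(p ** alpha * q ** (1 - alpha))) / (alpha - 1)

def kl(p, q):
    return np.sum(p * np.log(p / q))

p = np.array([0.6, 0.3, 0.1])
q = np.array([0.4, 0.4, 0.2])

print(renyi(p, q, 1 - 1e-6), kl(p, q))        # alpha -> 1 recovers the KL divergence
chi2 = np.sum((p - q) ** 2 / q)
print(renyi(p, q, 2.0), np.log(1 + chi2))     # D_2 = log(1 + chi^2)
```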
KL in Variational Inference
In variational inference we want to approximate the posterior $p(z \mid x)$ with a tractable family $q_\phi(z)$. Marginalising over the latent variable gives $\log p(x) = \log \int p(x, z) \, dz$. Introducing $q_\phi$:
$$\log p(x) = \mathbb{E}_{q_\phi}\left[\log \frac{p(x, z)}{q_\phi(z)}\right] + D_{KL}(q_\phi(z) | p(z \mid x))$$
Since $D_{KL} \geq 0$, the first term is a lower bound on $\log p(x)$, called the Evidence Lower BOund (ELBO):
$$\mathcal{L}(\phi) = \mathbb{E}_{q_\phi}[\log p(x \mid z)] - D_{KL}(q_\phi(z) | p(z)) \leq \log p(x)$$
Maximising the ELBO therefore tightens the lower bound on $\log p(x)$ while minimising the KL divergence between the approximate and true posterior; for fixed model parameters $\log p(x)$ is constant, so maximising the ELBO over $\phi$ is exactly minimising that KL.
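The identity $\log p(x) = \mathcal{L}(\phi) + D_{KL}(q_\phi(z) | p(z \mid x))$ can be verified exactly on a tiny discrete model; the sketch below uses made-up numbers:

```python
import numpy as np

# A tiny discrete model: latent z takes three values, x is a single observation.
prior = np.array([0.5, 0.3, 0.2])      # p(z)
lik = np.array([0.9, 0.2, 0.4])        # p(x | z) evaluated at the observed x
joint = prior * lik                     # p(x, z)
log_px = np.log(joint.sum())            # log evidence
posterior = joint / joint.sum()         # p(z | x)

q = np.array([0.6, 0.3, 0.1])           # an arbitrary variational q(z)

elbo = np.sum(q * np.log(joint / q))            # E_q[log p(x, z) - log q(z)]
kl_q_post = np.sum(q * np.log(q / posterior))   # D_KL(q | p(z | x))

print(log_px, elbo + kl_q_post)   # identical: log p(x) = ELBO + KL
print(elbo <= log_px)             # True, since the KL term is non-negative
```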
Examples
VAE loss decomposition. A Variational Autoencoder maximises the ELBO over both encoder $q_\phi(z \mid x)$ and decoder $p_\theta(x \mid z)$. The loss is $-\mathcal{L} = -\mathbb{E}_{q_\phi}[\log p_\theta(x \mid z)] + D_{KL}(q_\phi(z \mid x) | p(z))$. The first term is a reconstruction loss; the second is an explicit KL penalty that regularises the latent space toward the prior $p(z) = \mathcal{N}(0, I)$. For a Gaussian encoder $q_\phi(z \mid x) = \mathcal{N}(\mu, \sigma^2 I)$, the KL term has a closed form: $\frac{1}{2}\sum_j (\mu_j^2 + \sigma_j^2 - \log\sigma_j^2 - 1)$.
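A sketch checking the closed-form Gaussian KL term against a Monte Carlo estimate (the function name and the example $\mu$, $\sigma$ are illustrative):

```python
import numpy as np

def kl_to_standard_normal(mu, sigma):
    """Closed-form D_KL(N(mu, diag(sigma^2)) | N(0, I)), as in the VAE loss."""
    return 0.5 * np.sum(mu ** 2 + sigma ** 2 - np.log(sigma ** 2) - 1.0)

mu = np.array([0.5, -1.0])
sigma = np.array([0.8, 1.5])

# Monte Carlo check: E_q[log q(z) - log p(z)] over samples z ~ q.
rng = np.random.default_rng(0)
z = mu + sigma * rng.standard_normal((200_000, 2))
log_q = -0.5 * np.sum(((z - mu) / sigma) ** 2 + np.log(2 * np.pi * sigma ** 2), axis=1)
log_p = -0.5 * np.sum(z ** 2 + np.log(2 * np.pi), axis=1)
print(kl_to_standard_normal(mu, sigma), np.mean(log_q - log_p))   # ~0.89 both
```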
PPO KL penalty. Proximal Policy Optimisation constrains policy updates using a KL penalty term $-\beta \, D_{KL}(\pi_{\theta_{\text{old}}} | \pi_\theta)$ added to the policy objective. This prevents the updated policy $\pi_\theta$ from deviating too far from the reference policy $\pi_{\theta_{\text{old}}}$, avoiding destructively large updates in policy gradient methods. The coefficient $\beta$ is adapted over training to keep the divergence near a target value. In RLHF pipelines, a KL penalty against the pretrained reference model plays the same role: it prevents the fine-tuned language model from drifting too far from the reference, preserving fluency while aligning to human preferences.
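A minimal sketch of the mechanism, assuming categorical action distributions and an illustrative adaptive-$\beta$ rule (the thresholds and factor are made up, in the spirit of the adaptive KL-penalty variant of PPO rather than its exact constants):

```python
import numpy as np

def categorical_kl(p_old, p_new):
    """Per-state D_KL(pi_old | pi_new) for categorical action distributions."""
    return np.sum(p_old * np.log(p_old / p_new), axis=-1)

def adapt_beta(beta, observed_kl, target, factor=1.5):
    """Raise beta when the measured KL overshoots the target, lower it when it
    undershoots (illustrative thresholds)."""
    if observed_kl > 2.0 * target:
        return beta * factor
    if observed_kl < 0.5 * target:
        return beta / factor
    return beta

pi_old = np.array([[0.7, 0.2, 0.1], [0.4, 0.4, 0.2]])   # two example states
pi_new = np.array([[0.6, 0.3, 0.1], [0.3, 0.5, 0.2]])
mean_kl = categorical_kl(pi_old, pi_new).mean()
beta = adapt_beta(1.0, mean_kl, target=0.01)
print(mean_kl, beta)   # the penalty subtracted from the objective is beta * mean_kl
```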
With entropy and KL divergence in hand, the tools of information theory connect naturally to learning theory - where the question becomes not “how much information?” but “how much data is needed to learn a concept class?”