Prerequisite: Maximum Likelihood Estimation


Maximum likelihood estimation is straightforward when you can write down the likelihood and differentiate. But many models include latent variables - unobserved quantities that explain the structure of the data. Marginalising them out often makes the likelihood intractable. The Expectation-Maximisation (EM) algorithm provides a principled iterative method for finding local maxima of the marginal likelihood in these settings, with the guarantee that the likelihood never decreases from one iteration to the next.

The Problem: MLE with Latent Variables

Given observed data $X = \{x_1, \ldots, x_N\}$ and latent variables $Z = \{z_1, \ldots, z_N\}$, we want to maximise the marginal log-likelihood:

$$\log p(X \mid \theta) = \log \sum_Z p(X, Z \mid \theta)$$

The sum inside the log prevents closed-form solutions in most cases: the log cannot be pushed past the sum, so the stationarity conditions couple all the parameters. Direct gradient ascent is possible in principle, but it is awkward because the gradient involves the same posterior quantities that EM computes, and constraints (such as mixing proportions summing to one) must be enforced explicitly.
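For instance, with one latent component indicator per (i.i.d.) observation, the objective becomes

$$\log p(X \mid \theta) = \sum_{n=1}^N \log \sum_{k=1}^K \pi_k \, p(x_n \mid \theta_k)$$

Differentiating with respect to any one component's parameters leaves that component entangled with all the others through the inner sum, so there is no closed-form stationary point.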

The ELBO: Evidence Lower BOund

For any distribution $q(Z)$ over the latent variables, Jensen’s inequality gives:

$$\log p(X \mid \theta) = \log \sum_Z q(Z) \frac{p(X, Z \mid \theta)}{q(Z)} \geq \sum_Z q(Z) \log \frac{p(X, Z \mid \theta)}{q(Z)} =: \mathcal{L}(q, \theta)$$

The bound $\mathcal{L}$ is the Evidence Lower Bound (ELBO). Expanding:

$$\mathcal{L}(q, \theta) = \mathbb{E}_q[\log p(X, Z \mid \theta)] - \mathbb{E}_q[\log q(Z)] = \mathbb{E}_q[\log p(X, Z \mid \theta)] + H(q)$$

where $H(q)$ is the entropy of $q$. The gap between the log-likelihood and the ELBO is the KL divergence:

$$\log p(X \mid \theta) - \mathcal{L}(q, \theta) = \text{KL}\big(q(Z) \,\|\, p(Z \mid X, \theta)\big) \geq 0$$

with equality if and only if $q(Z) = p(Z \mid X, \theta)$.
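As a concrete check of the bound, here is a small numerical verification for a single observation with a binary latent variable; the probability values below are arbitrary illustrative choices, not taken from any model in this note (Python/NumPy sketch):

import numpy as np

# Joint p(x, z) for one observed x and a binary latent z (illustrative numbers)
p_z = np.array([0.3, 0.7])           # prior p(z)
p_x_given_z = np.array([0.9, 0.2])   # likelihood p(x | z) at the observed x
p_xz = p_z * p_x_given_z             # joint p(x, z)

log_px = np.log(p_xz.sum())          # evidence log p(x)
posterior = p_xz / p_xz.sum()        # p(z | x)

def elbo(q):
    # E_q[log p(x, z)] + H(q)
    return np.sum(q * (np.log(p_xz) - np.log(q)))

q = np.array([0.5, 0.5])                     # an arbitrary q(z)
assert elbo(q) <= log_px + 1e-12             # the ELBO never exceeds the evidence
assert np.isclose(elbo(posterior), log_px)   # tight when q(z) = p(z | x)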

EM as Coordinate Ascent on the ELBO

EM alternates between two steps that each maximise the ELBO with respect to one set of variables while holding the other fixed.

E-step (set $q$, fix $\theta = \theta^{\text{old}}$):

$$q(Z) \leftarrow p(Z \mid X, \theta^{\text{old}})$$

Setting $q$ to the posterior over $Z$ makes $\text{KL}(q \,\|\, p) = 0$, so the ELBO equals the log-likelihood: the bound is tight.

M-step (optimise $\theta$, fix $q$):

$$\theta^{\text{new}} \leftarrow \arg\max_\theta \, \mathbb{E}_q[\log p(X, Z \mid \theta)]$$

Since $H(q)$ does not depend on $\theta$, this is equivalent to maximising $\mathcal{L}(q, \theta)$ over $\theta$. The objective being maximised in the M-step is traditionally written:

$$Q(\theta, \theta^{\text{old}}) = \mathbb{E}_{p(Z \mid X, \theta^{\text{old}})}[\log p(X, Z \mid \theta)]$$

Convergence Proof

Claim. $\log p(X \mid \theta^{\text{new}}) \geq \log p(X \mid \theta^{\text{old}})$ - the log-likelihood is non-decreasing.

Proof. After the E-step the ELBO is tight: $\mathcal{L}(q, \theta^{\text{old}}) = \log p(X \mid \theta^{\text{old}})$. The M-step finds $\theta^{\text{new}}$ with $\mathcal{L}(q, \theta^{\text{new}}) \geq \mathcal{L}(q, \theta^{\text{old}})$. Since the ELBO is always a lower bound: $\log p(X \mid \theta^{\text{new}}) \geq \mathcal{L}(q, \theta^{\text{new}}) \geq \mathcal{L}(q, \theta^{\text{old}}) = \log p(X \mid \theta^{\text{old}})$. $\square$

EM converges to a local maximum (or saddle point) of the log-likelihood. The global maximum is not guaranteed.

Gaussian Mixture Models

A GMM with $K$ components models the data as:

$$p(x) = \sum_{k=1}^K \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k)$$

where $\pi_k \geq 0$, $\sum_k \pi_k = 1$. The latent variable $z_n \in \{1, \ldots, K\}$ indicates which component generated observation $x_n$.
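The generative story behind this density makes the role of the latent variable explicit: draw a component index, then draw the observation from that component. A minimal NumPy sketch (the parameter values are placeholders):

import numpy as np

rng = np.random.default_rng(0)
pi = np.array([0.5, 0.3, 0.2])                         # mixing weights, sum to 1
mu = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]])    # component means
Sigma = np.stack([np.eye(2)] * 3)                      # component covariances

N = 500
z = rng.choice(3, size=N, p=pi)                        # latent component for each point
X = np.stack([rng.multivariate_normal(mu[k], Sigma[k]) for k in z])
# EM sees only X; the z values are exactly what the E-step must infer (softly).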

E-step: compute responsibilities.

$$r_{nk} = \frac{\pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)}{\sum_{j=1}^K \pi_j \, \mathcal{N}(x_n \mid \mu_j, \Sigma_j)}$$

$r_{nk}$ is the posterior probability that component $k$ generated $x_n$. This is a “soft assignment”.
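When a point lies far from every component, both the numerator and denominator of this ratio can underflow; computing responsibilities in log-space avoids that. A sketch using SciPy (the function name and array shapes are my own choices):

import numpy as np
from scipy.stats import multivariate_normal

def responsibilities(X, pi, mu, Sigma):
    # X: (N, d), pi: (K,), mu: (K, d), Sigma: (K, d, d)
    N, K = X.shape[0], len(pi)
    log_r = np.empty((N, K))
    for k in range(K):
        log_r[:, k] = np.log(pi[k]) + multivariate_normal.logpdf(X, mu[k], Sigma[k])
    log_r -= log_r.max(axis=1, keepdims=True)   # shift before exponentiating, for stability
    r = np.exp(log_r)
    return r / r.sum(axis=1, keepdims=True)     # each row sums to 1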

M-step: update parameters. Let $N_k = \sum_n r_{nk}$ (effective number of points assigned to cluster $k$):

$$\mu_k^{\text{new}} = \frac{1}{N_k} \sum_{n=1}^N r_{nk} \, x_n$$

$$\Sigma_k^{\text{new}} = \frac{1}{N_k} \sum_{n=1}^N r_{nk} (x_n - \mu_k^{\text{new}})(x_n - \mu_k^{\text{new}})^\top$$

$$\pi_k^{\text{new}} = \frac{N_k}{N}$$

Each iteration costs $O(N \cdot K \cdot d^2)$ where $d$ is the data dimension (the $d^2$ factor comes from updating full covariance matrices). With diagonal covariances the cost is $O(N \cdot K \cdot d)$.

Pseudocode:

EM_GMM(X, K, max_iter, eps):
  // Initialise mu, Sigma, pi (e.g. randomly, or from a K-means++ run)
  prev_log_lik = -infinity
  for iter = 1 to max_iter:
    // E-step
    log_lik = 0
    for n = 1 to N:
      for k = 1 to K:
        r[n][k] = pi[k] * Gaussian(x[n], mu[k], Sigma[k])
      log_lik += log(sum_k r[n][k])   // the normaliser is p(x[n] | theta)
      normalise r[n] to sum to 1
    // M-step
    for k = 1 to K:
      N_k = sum_n r[n][k]
      mu[k]    = (1/N_k) * sum_n r[n][k] * x[n]
      Sigma[k] = (1/N_k) * sum_n r[n][k] * outer(x[n]-mu[k], x[n]-mu[k])
      pi[k]    = N_k / N
    if log_lik - prev_log_lik < eps: break
    prev_log_lik = log_lik
  return mu, Sigma, pi
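A runnable NumPy version of the same loop, offered as a sketch rather than a reference implementation; the initialisation, tolerance and the small covariance regulariser are arbitrary choices:

import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, max_iter=100, eps=1e-6, seed=0):
    rng = np.random.default_rng(seed)
    N, d = X.shape
    mu = X[rng.choice(N, size=K, replace=False)]      # random data points as initial means
    Sigma = np.stack([np.cov(X, rowvar=False)] * K)   # shared broad initial covariance
    pi = np.full(K, 1.0 / K)
    prev_ll = -np.inf
    for _ in range(max_iter):
        # E-step: unnormalised responsibilities, then log-likelihood, then normalise
        # (see the log-space variant above for better numerical behaviour)
        r = np.stack([pi[k] * multivariate_normal.pdf(X, mu[k], Sigma[k])
                      for k in range(K)], axis=1)
        ll = np.log(r.sum(axis=1)).sum()
        r /= r.sum(axis=1, keepdims=True)
        # M-step: weighted maximum-likelihood updates
        Nk = r.sum(axis=0)
        mu = (r.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mu[k]
            Sigma[k] = (r[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(d)
        pi = Nk / N
        if ll - prev_ll < eps:
            break
        prev_ll = ll
    return mu, Sigma, pi

The small term added to each covariance guards against the singularities discussed under Practical Considerations below.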

HMM Parameter Learning: Baum-Welch

For HMMs, the E-step computes forward-backward probabilities (also called $\alpha$ and $\beta$ variables) using two passes of dynamic programming, yielding state occupation counts $\gamma_t(j)$ and transition counts $\xi_t(i, j)$ in $O(K^2 T)$. The M-step updates transition and emission matrices by normalised counting. This is the Baum-Welch algorithm - a specialised EM for HMMs.
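Concretely, writing $A_{ij}$ for the transition probabilities, $B_j(v)$ for discrete emission probabilities and $o_{1:T}$ for the observed sequence (notation not used elsewhere in this note), the M-step is the expected-count analogue of the GMM updates:

$$\hat{A}_{ij} = \frac{\sum_{t=1}^{T-1} \xi_t(i, j)}{\sum_{t=1}^{T-1} \gamma_t(i)}, \qquad \hat{B}_j(v) = \frac{\sum_{t=1}^{T} \gamma_t(j) \, [o_t = v]}{\sum_{t=1}^{T} \gamma_t(j)}$$

where $[o_t = v]$ is 1 when the observation at time $t$ equals symbol $v$ and 0 otherwise. Multiple training sequences simply add their expected counts before normalising.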

Practical Considerations

Initialisation. EM is sensitive to initialisation because it finds local optima. Common strategies: random restarts, K-means initialisation for GMMs, or short runs of a simpler model.

Convergence. Monitor the log-likelihood $\log p(X \mid \theta)$ after each iteration. It must be non-decreasing; any decrease signals a bug. Stop when the change is below a tolerance $\varepsilon$ (e.g. $10^{-6}$) or after a maximum number of iterations.

Singularities. In GMMs, if a component collapses onto a single data point, its covariance approaches zero and its likelihood goes to infinity. Fix by adding a small regularisation term $\lambda I$ to covariances or by restarting collapsed components.
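A minimal sketch of the restart strategy, assuming the NumPy variables from the EM-GMM sketch above and with arbitrary detection thresholds:

# after the M-step: re-seed any component that has collapsed onto a single point
for k in range(K):
    if Nk[k] < 1.0 or np.linalg.det(Sigma[k]) < 1e-12:
        mu[k] = X[rng.integers(N)]             # move the mean to a random data point
        Sigma[k] = np.cov(X, rowvar=False)     # reset to a broad covariance
        pi[k] = 1.0 / K
pi = pi / pi.sum()                             # keep the mixing weights normalised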

Generalised EM. The M-step only requires an increase in $Q(\theta, \theta^{\text{old}})$, not a global maximum. Any ascent step preserves convergence. This is exploited in models where exact M-step maximisation is intractable.

Examples

Unsupervised clustering. GMM-EM is a probabilistic alternative to K-means that provides soft assignments and fits ellipsoidal clusters (not just spherical). Used in Gaussian background models for video surveillance and in medical image segmentation.

Missing data imputation. If some features $x_{nd}$ are missing, the E-step marginalises over them and the M-step updates parameters using only observed values (weighted by responsibilities). EM handles missing-data MLE in a unified framework without ad-hoc imputation.

Topic models. Probabilistic Latent Semantic Analysis (PLSA) is an EM algorithm where documents are mixtures of topics and words are emissions. The E-step computes topic assignments for each word occurrence; the M-step updates topic-word and document-topic distributions.


Read Next: GMMs