Helpful context:


Imagine you have height measurements of 1000 people, but you don’t know their genders. You suspect the data comes from two Gaussians - one for each gender. How do you fit both Gaussians when you don’t know which person belongs to which?

You can’t - unless you guess. Use your guess to improve your estimate. Use the better estimate to improve your guess. Repeat until it stops changing.

This is EM: the Expectation-Maximization algorithm. It’s how you fit models with hidden structure.


The Incomplete Data Problem

When fitting a single Gaussian to data, maximum likelihood gives you exact formulas: the mean is the sample mean, the variance is the sample variance. Clean and closed-form.

With two Gaussians and unknown labels, you’re stuck. If you knew which Gaussian each point came from, you could fit each one separately. If you had both Gaussians, you could figure out which one each point came from. You need each answer to compute the other.

This is the incomplete data problem. The observed data $X = \{x_1, \ldots, x_N\}$ is the complete picture only if you also knew the hidden labels $Z = \{z_1, \ldots, z_N\}$. The marginal log-likelihood you want to maximize is:

$$\log p(X \mid \theta) = \log \sum_Z p(X, Z \mid \theta)$$

The sum inside the log prevents closed-form optimization in most cases.


The EM Template

EM resolves the circularity by alternating between two steps.

E-step (Expectation). Given your current parameter estimate $\theta^\text{old}$, compute the expected value of the hidden variables. More precisely, compute the expected complete-data log-likelihood under the posterior over $Z$:

$$Q(\theta, \theta^\text{old}) = \mathbb{E}_{p(Z \mid X, \theta^\text{old})}[\log p(X, Z \mid \theta)]$$

This is the “soft assignment” step. Instead of hard labels (point $x_n$ definitely belongs to cluster $k$), you compute soft responsibilities: the probability that each point came from each component, given where the components currently are.

M-step (Maximization). Maximize $Q$ over $\theta$:

$$\theta^\text{new} = \arg\max_\theta Q(\theta, \theta^\text{old})$$

This is the “update parameters” step. With soft assignments in hand, you re-fit the model parameters as weighted averages - each point contributes to each component proportionally to its responsibility.

Repeat until the log-likelihood converges.


Gaussian Mixture Models: The Full Example

A GMM with $K$ components models data as:

$$p(x) = \sum_{k=1}^{K} \pi_k \mathcal{N}(x \mid \mu_k, \Sigma_k)$$

where $\pi_k$ are mixing weights summing to 1. The hidden variable $z_n \in \{1, \ldots, K\}$ says which component generated $x_n$.

E-step: compute responsibilities.

$$r_{nk} = \frac{\pi_k \mathcal{N}(x_n \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \mathcal{N}(x_n \mid \mu_j, \Sigma_j)}$$

$r_{nk}$ is the posterior probability that component $k$ generated $x_n$. This is just Bayes' theorem: prior is $\pi_k$, likelihood is the Gaussian, normalize.

M-step: update parameters. Let $N_k = \sum_{n=1}^N r_{nk}$ (the effective number of points assigned to component $k$):

$$\mu_k^\text{new} = \frac{1}{N_k} \sum_{n=1}^{N} r_{nk} x_n$$

$$\Sigma_k^\text{new} = \frac{1}{N_k} \sum_{n=1}^{N} r_{nk} (x_n - \mu_k^\text{new})(x_n - \mu_k^\text{new})^\top$$

$$\pi_k^\text{new} = \frac{N_k}{N}$$

These are weighted versions of the usual MLE formulas, with responsibility weights $r_{nk}$ standing in for the unknown hard labels.

Each iteration costs $O(N \cdot K \cdot d^2)$ for full covariance matrices (the $d^2$ term comes from the outer product in the covariance update). With diagonal covariances: $O(NKd)$.


Why It Converges (But Not to the Global Optimum)

EM is coordinate ascent on the Evidence Lower Bound (ELBO):

$$\mathcal{L}(q, \theta) = \mathbb{E}_q[\log p(X, Z \mid \theta)] - \mathbb{E}_q[\log q(Z)]$$

The E-step tightens the bound by setting $q(Z) = p(Z \mid X, \theta^\text{old})$, making the ELBO equal to the log-likelihood. The M-step maximizes the ELBO over $\theta$. Because the ELBO always lower-bounds the log-likelihood, this guarantees the log-likelihood increases at every step.

Discomfort check. EM is guaranteed to converge - but to what? Why can’t it find the global maximum?

EM monotonically increases the log-likelihood at each step. So it converges. But “converges” just means it reaches a stationary point - it could be a local maximum, not the global one. The log-likelihood surface of a GMM with $K$ components has many local maxima. Which one you reach depends entirely on where you start.

The practical fix: run EM from multiple random initializations, take the solution with the highest converged log-likelihood. K-means++ initialization (choosing starting means by a weighted random procedure) also tends to find better local optima than uniform random initialization.


K-Means as Hard-Assignment EM

K-means is EM with two restrictions:

  1. Hard assignments: $r_{nk} \in \{0, 1\}$ - each point belongs to exactly one cluster.
  2. Spherical covariance: $\Sigma_k = \sigma^2 I$ - all clusters have the same radius.

The E-step becomes: assign each point to its nearest centroid. The M-step becomes: move each centroid to the mean of its assigned points. Everything is exact and cheap, which is why K-means scales to large datasets. But it misses elongated or overlapping clusters that GMM handles naturally.


Applications

Speech recognition. Training an HMM-based speech recognizer: the observation model parameters (Gaussian means, covariances) are learned by Baum-Welch, which is EM for HMMs. The E-step computes state occupation probabilities via the forward-backward algorithm. The M-step re-estimates emission and transition parameters.

Topic modeling. In probabilistic Latent Semantic Analysis (PLSA), documents are mixtures of topics and words are emissions from topics. The E-step assigns words to topics (soft assignments). The M-step re-estimates topic-word and document-topic distributions. Latent Dirichlet Allocation (LDA) extends this with a Bayesian prior.

Medical imaging. PET scan reconstruction: the observed data is photon counts (incomplete, noisy), the hidden data is the true activity distribution in the patient’s body. EM alternates between estimating the activity distribution and updating the reconstruction. The ML-EM algorithm is the standard reconstruction method for PET.

Astronomy. Fitting star cluster membership: given photometric measurements of stars, determine which belong to a cluster and which are background stars. EM fits a mixture of cluster and background distributions, assigning soft membership probabilities.


Summary

Concept Detail
Problem Maximize $\log p(X \mid \theta)$ with hidden variables $Z$
E-step Compute $Q(\theta, \theta^\text{old}) = \mathbb{E}_{p(Z \mid X, \theta^\text{old})}[\log p(X, Z \mid \theta)]$
M-step $\theta^\text{new} = \arg\max_\theta Q(\theta, \theta^\text{old})$
GMM E-step Compute soft responsibilities $r_{nk} = p(z_n = k \mid x_n, \theta)$
GMM M-step Weighted mean, covariance, mixing weight
Convergence Log-likelihood increases monotonically per iteration
Limitation Converges to local optimum; initialization matters
K-means EM with hard assignments and spherical covariance

Read next: