What Is Conjugacy?

Computing the posterior $p(\theta \mid x) \propto p(x \mid \theta)p(\theta)$ often leads to an intractable normalizing constant. Conjugate priors sidestep this: a prior $p(\theta)$ is conjugate to a likelihood $p(x \mid \theta)$ if the posterior $p(\theta \mid x)$ belongs to the same parametric family as the prior. Only the hyperparameters change; the functional form does not.

This yields closed-form updates and makes sequential Bayesian inference computationally trivial: the posterior after $n$ observations becomes the prior for the $(n+1)$th.

Conjugacy is not a restriction imposed on nature - it is a modeling choice that trades expressiveness for analytical tractability. When the true posterior is poorly approximated by a conjugate family, we must use variational methods or MCMC.

Beta-Binomial Conjugacy

Setup

Let $X \mid p \sim \text{Binomial}(n, p)$ and $p \sim \text{Beta}(\alpha, \beta)$. The Beta density is

$$p(p) = \frac{p^{\alpha-1}(1-p)^{\beta-1}}{B(\alpha,\beta)}, \quad p \in (0,1),$$

where $B(\alpha,\beta) = \Gamma(\alpha)\Gamma(\beta)/\Gamma(\alpha+\beta)$.

Posterior Derivation

Observing $X = k$ successes:

$$p(p \mid X = k) \propto \binom{n}{k}p^k(1-p)^{n-k} \cdot p^{\alpha-1}(1-p)^{\beta-1}.$$

The binomial coefficient $\binom{n}{k}$ does not depend on $p$, so

$$p(p \mid X = k) \propto p^{\alpha + k - 1}(1-p)^{\beta + n - k - 1}.$$

This is the kernel of $\text{Beta}(\alpha + k,\, \beta + n - k)$. No integration required; the normalizing constant is $B(\alpha+k, \beta+n-k)^{-1}$ by recognition.

Interpretation. The hyperparameters $\alpha$ and $\beta$ act as pseudo-counts: $\alpha - 1$ prior successes and $\beta - 1$ prior failures. Observing $k$ successes and $n-k$ failures simply increments these counts. The posterior mean is

$$E[p \mid X = k] = \frac{\alpha + k}{\alpha + \beta + n},$$

a convex combination of the prior mean $\alpha/(\alpha+\beta)$ and the MLE $k/n$.
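As a concrete check, here is a minimal Python sketch of the update (the prior parameters and counts below are illustrative assumptions, not values from the text). It confirms that the posterior mean falls between the prior mean and the MLE.

```python
# Beta-Binomial update: prior Beta(alpha, beta), data k successes out of n trials.
alpha, beta = 2.0, 2.0     # illustrative prior pseudo-counts
n, k = 20, 14              # illustrative data

alpha_post, beta_post = alpha + k, beta + n - k   # conjugate update

prior_mean = alpha / (alpha + beta)
mle = k / n
post_mean = alpha_post / (alpha_post + beta_post)

print(f"prior mean = {prior_mean:.3f}, MLE = {mle:.3f}, posterior mean = {post_mean:.3f}")
# The posterior mean lies between the prior mean and the MLE (convex combination).
assert min(prior_mean, mle) <= post_mean <= max(prior_mean, mle)
```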

Dirichlet-Multinomial Conjugacy

The Dirichlet-Multinomial pair generalizes Beta-Binomial to $K$ categories. Let $(X_1, \ldots, X_K) \mid \theta \sim \text{Multinomial}(n, \theta)$ and $\theta \sim \text{Dirichlet}(\alpha_1, \ldots, \alpha_K)$. The posterior after observing counts $(c_1, \ldots, c_K)$ is

$$\theta \mid (c_1,\ldots,c_K) \sim \text{Dirichlet}(\alpha_1 + c_1, \ldots, \alpha_K + c_K).$$

The posterior mean for class $k$ is $(\alpha_k + c_k)/(\sum_j \alpha_j + n)$ - add-$\alpha$ smoothing. With $\alpha_k = 1$ for all $k$, this reduces to Laplace (add-one) smoothing, the foundation of smoothed count-based language models.
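A short sketch of the same update in code; the counts and the choice of $\alpha_k = 1$ are illustrative assumptions.

```python
import numpy as np

# Dirichlet prior pseudo-counts and observed category counts (illustrative values)
alpha = np.array([1.0, 1.0, 1.0])   # alpha_k = 1 reproduces Laplace (add-one) smoothing
counts = np.array([8, 2, 0])        # observed counts c_k, with n = 10

# Conjugate update: posterior is Dirichlet(alpha + counts)
alpha_post = alpha + counts

# Posterior mean for each class: (alpha_k + c_k) / (sum_j alpha_j + n)
post_mean = alpha_post / alpha_post.sum()
print(post_mean)   # the third class gets nonzero probability despite a zero count
```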

Normal-Normal (Known Variance)

Let $X_1, \ldots, X_n \mid \mu \sim \mathcal{N}(\mu, \sigma^2)$ with $\sigma^2$ known and prior $\mu \sim \mathcal{N}(\mu_0, \tau_0^2)$.

After observing $\bar{x}$ (sufficient statistic), the posterior is $\mu \mid x \sim \mathcal{N}(\mu_n, \tau_n^2)$ where

$$\frac{1}{\tau_n^2} = \frac{1}{\tau_0^2} + \frac{n}{\sigma^2}, \qquad \mu_n = \tau_n^2\left(\frac{\mu_0}{\tau_0^2} + \frac{n\bar{x}}{\sigma^2}\right).$$

Precision (inverse variance) is additive: the posterior precision equals prior precision plus data precision $n/\sigma^2$. The posterior mean is a precision-weighted average of prior mean and data mean.
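A minimal sketch of the precision-weighted update, assuming NumPy; the prior and data values are illustrative.

```python
import numpy as np

def normal_known_var_update(mu0, tau0_sq, sigma_sq, x):
    """Posterior N(mu_n, tau_n_sq) for the mean, given a N(mu0, tau0_sq) prior
    and i.i.d. N(mu, sigma_sq) observations x with sigma_sq known."""
    n = len(x)
    xbar = np.mean(x)
    # Precisions add: posterior precision = prior precision + n / sigma^2
    post_precision = 1.0 / tau0_sq + n / sigma_sq
    tau_n_sq = 1.0 / post_precision
    # Posterior mean is the precision-weighted average of prior mean and sample mean
    mu_n = tau_n_sq * (mu0 / tau0_sq + n * xbar / sigma_sq)
    return mu_n, tau_n_sq

# Illustrative usage
rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=1.0, size=50)
print(normal_known_var_update(mu0=0.0, tau0_sq=10.0, sigma_sq=1.0, x=x))
```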

Normal-InverseGamma (Unknown Mean and Variance)

When both $\mu$ and $\sigma^2$ are unknown, the conjugate prior is the Normal-InverseGamma:

$$\sigma^2 \sim \text{InverseGamma}(a_0, b_0), \qquad \mu \mid \sigma^2 \sim \mathcal{N}\!\left(\mu_0, \frac{\sigma^2}{\kappa_0}\right).$$

After observing $n$ data points with sample mean $\bar{x}$ and sum of squares $S = \sum(x_i - \bar{x})^2$, the posterior parameters update as:

$$\kappa_n = \kappa_0 + n, \qquad \mu_n = \frac{\kappa_0 \mu_0 + n\bar{x}}{\kappa_n},$$

$$a_n = a_0 + \frac{n}{2}, \qquad b_n = b_0 + \frac{S}{2} + \frac{\kappa_0 n(\bar{x} - \mu_0)^2}{2\kappa_n}.$$

The update for $b_n$ includes a term accounting for the discrepancy between the prior mean $\mu_0$ and the data mean $\bar{x}$.
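A sketch of these four updates in code; the function name `nig_update` and all numeric values are assumptions for illustration.

```python
import numpy as np

def nig_update(mu0, kappa0, a0, b0, x):
    """Normal-InverseGamma posterior hyperparameters, following the updates above."""
    n = len(x)
    xbar = np.mean(x)
    S = np.sum((x - xbar) ** 2)          # sum of squared deviations
    kappa_n = kappa0 + n
    mu_n = (kappa0 * mu0 + n * xbar) / kappa_n
    a_n = a0 + n / 2.0
    # The last term penalizes disagreement between the prior mean and the data mean
    b_n = b0 + S / 2.0 + kappa0 * n * (xbar - mu0) ** 2 / (2.0 * kappa_n)
    return mu_n, kappa_n, a_n, b_n

# Illustrative usage
rng = np.random.default_rng(1)
x = rng.normal(loc=5.0, scale=2.0, size=100)
print(nig_update(mu0=0.0, kappa0=1.0, a0=1.0, b0=1.0, x=x))
```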

Gamma-Poisson Conjugacy

Let $X_1, \ldots, X_n \mid \lambda \sim \text{Poisson}(\lambda)$ and $\lambda \sim \text{Gamma}(a, b)$ (shape $a$, rate $b$). The posterior is

$$\lambda \mid x \sim \text{Gamma}\!\left(a + \sum_{i=1}^n x_i,\; b + n\right).$$

The posterior mean is $(a + \sum x_i)/(b + n)$, a shrinkage of the MLE $\bar{x}$ toward the prior mean $a/b$.
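The same shrinkage in code, with illustrative prior parameters and counts.

```python
import numpy as np

# Gamma(a, b) prior (shape a, rate b) and Poisson counts (illustrative values)
a, b = 2.0, 1.0
x = np.array([3, 5, 4, 6, 2])

# Conjugate update: posterior is Gamma(a + sum(x), b + n)
a_post, b_post = a + x.sum(), b + len(x)

prior_mean = a / b
mle = x.mean()
post_mean = a_post / b_post   # shrinks the MLE toward the prior mean
print(f"prior mean = {prior_mean:.3f}, MLE = {mle:.3f}, posterior mean = {post_mean:.3f}")
```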

Table of Conjugate Pairs

| Likelihood | Conjugate Prior | Posterior |
|---|---|---|
| Binomial$(n,p)$ | Beta$(\alpha, \beta)$ | Beta$(\alpha+k,\ \beta+n-k)$ |
| Multinomial$(n,\theta)$ | Dirichlet$(\alpha)$ | Dirichlet$(\alpha + c)$ |
| Poisson$(\lambda)$ | Gamma$(a,b)$ | Gamma$(a+\sum x_i,\ b+n)$ |
| Gaussian$(\mu, \sigma^2)$, $\sigma^2$ known | Gaussian$(\mu_0, \tau_0^2)$ | Gaussian$(\mu_n, \tau_n^2)$ |
| Gaussian$(\mu, \sigma^2)$ | Normal-InverseGamma | Normal-InverseGamma |
| Exponential$(\lambda)$ | Gamma$(a,b)$ | Gamma$(a+n,\ b+\sum x_i)$ |
| Geometric$(p)$ | Beta$(\alpha,\beta)$ | Beta$(\alpha+n,\ \beta+\sum x_i)$ |

Variational Bayes When Conjugacy Fails

When the model is not in an exponential family or when latent variables break conjugacy, the posterior is intractable. Variational Bayes (VB) approximates the posterior with a tractable distribution $q(\theta)$ from a family $\mathcal{Q}$ by minimizing the KL divergence:

$$q^\ast = \arg\min_{q \in \mathcal{Q}} D_{\text{KL}}\big(q(\theta) \,\|\, p(\theta \mid x)\big).$$

This is equivalent to maximizing the evidence lower bound (ELBO):

$$\mathcal{L}(q) = E_q[\log p(x, \theta)] - E_q[\log q(\theta)] \leq \log p(x).$$

Under the mean-field assumption $q(\theta) = \prod_j q_j(\theta_j)$, the optimal factor satisfies

$$\log q_j^\ast(\theta_j) = E_{q_{-j}}[\log p(x, \theta)] + \text{const}.$$

When each complete conditional belongs to an exponential family (conditional conjugacy), each factor update is available in closed form, yielding coordinate-ascent variational inference (CAVI) - a scalable alternative to MCMC.
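As a sketch of how these updates look in practice, here is CAVI for the standard textbook example of a univariate Gaussian with unknown mean and precision under a Normal-Gamma prior. This specific model, the function name, and all parameter values are assumptions chosen for illustration, not taken from the text above.

```python
import numpy as np

def cavi_normal(x, mu0=0.0, lam0=1.0, a0=1.0, b0=1.0, iters=50):
    """Mean-field CAVI for x_i ~ N(mu, 1/tau) with a Normal-Gamma prior:
    mu | tau ~ N(mu0, 1/(lam0 * tau)), tau ~ Gamma(a0, b0).
    Factors: q(mu) = N(mu_n, 1/lam_n), q(tau) = Gamma(a_n, b_n)."""
    n, xbar = len(x), np.mean(x)

    # These two parameters have closed forms that never change across iterations.
    mu_n = (lam0 * mu0 + n * xbar) / (lam0 + n)
    a_n = a0 + (n + 1) / 2.0

    # Initialize the remaining variational parameters and iterate the coupled updates.
    b_n = b0
    for _ in range(iters):
        e_tau = a_n / b_n                                  # E_q[tau]
        lam_n = (lam0 + n) * e_tau                         # precision of q(mu)
        e_sq_dev = np.sum((x - mu_n) ** 2) + n / lam_n     # E_q[sum_i (x_i - mu)^2]
        e_prior_dev = (mu_n - mu0) ** 2 + 1.0 / lam_n      # E_q[(mu - mu0)^2]
        b_n = b0 + 0.5 * (lam0 * e_prior_dev + e_sq_dev)
    return mu_n, lam_n, a_n, b_n

# Illustrative usage
rng = np.random.default_rng(2)
x = rng.normal(loc=1.5, scale=0.5, size=200)
print(cavi_normal(x))
```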

Examples

Online Bayesian updating. Conjugacy makes sequential learning trivial. A Beta$(1,1)$ (uniform) prior on $p$, updated after each Bernoulli trial, gives the posterior Beta$(1 + \text{successes}, 1 + \text{failures})$. No re-processing of past data is needed; the current posterior is sufficient. This is the Bayesian analog of online learning.
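A minimal sketch of such an online update loop; the particular sequence of outcomes is an illustrative assumption.

```python
# Sequential Beta-Bernoulli updating: each observation increments a single count,
# so the running (alpha, beta) pair is all that needs to be stored.
trials = [1, 0, 1, 1, 0, 1]        # illustrative Bernoulli outcomes (1 = success)

alpha, beta = 1.0, 1.0             # Beta(1, 1) = uniform prior
for outcome in trials:
    if outcome == 1:
        alpha += 1                 # posterior Beta(1 + successes, 1 + failures)
    else:
        beta += 1

print(alpha, beta)                 # Beta(5, 3) after 4 successes and 2 failures
```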

Posterior updating in A/B testing. In a conversion rate experiment, users in variant A convert with rate $p_A$ and in variant B with rate $p_B$. With independent Beta priors on each, the posteriors after observing $k_A$ of $n_A$ and $k_B$ of $n_B$ conversions are:

$$p_A \mid \text{data} \sim \text{Beta}(\alpha + k_A, \beta + n_A - k_A), \quad p_B \mid \text{data} \sim \text{Beta}(\alpha + k_B, \beta + n_B - k_B).$$

The probability $P(p_B > p_A \mid \text{data})$ can be computed by Monte Carlo sampling from both posteriors. Unlike frequentist A/B testing, this Bayesian approach allows early stopping with valid uncertainty quantification and does not require specifying a sample size in advance.
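A minimal Monte Carlo sketch of this comparison, assuming NumPy and illustrative conversion counts.

```python
import numpy as np

# Illustrative conversion data for the two variants (not from the text)
alpha, beta = 1.0, 1.0             # shared Beta(1, 1) prior
k_A, n_A = 120, 1000
k_B, n_B = 145, 1000

rng = np.random.default_rng(0)
samples = 100_000
# Draw from each Beta posterior and estimate P(p_B > p_A | data) by Monte Carlo
p_A = rng.beta(alpha + k_A, beta + n_A - k_A, size=samples)
p_B = rng.beta(alpha + k_B, beta + n_B - k_B, size=samples)
print(f"P(p_B > p_A | data) ~ {np.mean(p_B > p_A):.3f}")
```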

