Bayesian inference is powerful in principle. In practice, there is a problem. The posterior is:

$$P(\theta \mid \text{data}) = \frac{P(\text{data} \mid \theta) P(\theta)}{\int P(\text{data} \mid \theta) P(\theta) d\theta}.$$

That denominator is an integral over $\theta$. In low dimensions with simple models, it is often tractable. In anything realistic - continuous parameters, multiple parameters, complex likelihoods - it is not. You would need numerical integration over a high-dimensional space, which becomes exponentially expensive as dimension grows.

Conjugate priors are a mathematical trick that sidesteps the problem entirely. The idea: choose your prior from a family of distributions such that the posterior lands in the same family. Then Bayesian updating is not an integral - it is just a parameter update. No integrals, no numerical methods, exact arithmetic.


The Definition

A prior $P(\theta)$ is conjugate to a likelihood $P(x \mid \theta)$ if the posterior $P(\theta \mid x)$ is in the same parametric family as the prior.

“Same family” means same functional form, just with different parameters. If your prior was Beta and your posterior is also Beta, the Beta family is conjugate to that likelihood.

The payoff is dramatic: Bayesian updating reduces to “update the parameters.” With conjugate priors, you can often do exact Bayesian inference with nothing more than addition.


Beta-Bernoulli: The Fundamental Example

Let us work through the most important conjugate pair in full detail.

The setup

You observe coin flips. Each flip is Bernoulli($p$): the coin lands heads with probability $p$. You do not know $p$. After $n$ flips with $k$ heads, you want to update your beliefs about $p$.

The likelihood for $k$ heads in $n$ flips is:

$$P(\text{data} \mid p) = \binom{n}{k} p^k (1-p)^{n-k} \propto p^k (1-p)^{n-k}.$$

The Beta prior

Choose the prior $p \sim \text{Beta}(\alpha, \beta)$, defined by:

$$P(p) = \frac{1}{B(\alpha,\beta)} p^{\alpha-1}(1-p)^{\beta-1}, \quad p \in [0,1],$$

where $B(\alpha,\beta) = \Gamma(\alpha)\Gamma(\beta)/\Gamma(\alpha+\beta)$ is the normalizing constant.

The Beta distribution lives on $[0,1]$, making it a natural distribution over probabilities. Its mean is $\alpha/(\alpha+\beta)$. When $\alpha=\beta$, it is symmetric about 0.5. When $\alpha > \beta$, its mass shifts toward 1; when $\alpha < \beta$, toward 0. When $\alpha=\beta=1$, it is the Uniform distribution - every value of $p$ equally likely.

Interpretation of the parameters. Think of $\alpha$ and $\beta$ as pseudo-counts: the prior behaves as if you had already seen $\alpha - 1$ successes and $\beta - 1$ failures. Under this convention $\text{Beta}(1,1)$ encodes zero prior data - a flat prior - and larger values of $\alpha + \beta$ encode a more confident prior.

The posterior update

Now compute the posterior:

$$P(p \mid \text{data}) \propto P(\text{data} \mid p) \cdot P(p) = p^k (1-p)^{n-k} \cdot p^{\alpha-1}(1-p)^{\beta-1} = p^{(\alpha+k)-1}(1-p)^{(\beta+n-k)-1}.$$

This is the kernel of $\text{Beta}(\alpha + k, \beta + n - k)$.

The posterior is also a Beta distribution, with updated parameters:

$$\text{Beta}(\alpha, \beta) \xrightarrow{k \text{ heads in } n \text{ flips}} \text{Beta}(\alpha + k, \beta + n - k).$$

The parameters just add up. Prior successes plus observed successes. Prior failures plus observed failures.

Discomfort check. This looks too simple. Where did the integral go? The normalizing constant $\int P(\text{data} \mid p) P(p) dp$ has not disappeared - we just recognized that the unnormalized posterior has the form $p^{(\alpha+k)-1}(1-p)^{(\beta+n-k)-1}$, whose integral over $[0,1]$ is $B(\alpha+k, \beta+n-k)$ by definition of the Beta function. So we know the normalizing constant without computing the integral - we read it off from the functional form.
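The claim that the integral can be read off is easy to check numerically. The sketch below is illustrative only: the prior $\text{Beta}(2,2)$, the data (3 heads in 5 flips), and the helper names `log_beta` and `kernel_integral` are assumptions chosen for the example, and the midpoint rule is just one simple way to approximate the integral.

```python
import math

def log_beta(a: float, b: float) -> float:
    """log B(a, b) computed via log-gamma to avoid overflow for large arguments."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def kernel_integral(a: float, b: float, steps: int = 100_000) -> float:
    """Midpoint-rule estimate of the integral of p^(a-1) (1-p)^(b-1) over [0, 1]."""
    h = 1.0 / steps
    total = 0.0
    for i in range(steps):
        p = (i + 0.5) * h  # midpoint of the i-th subinterval
        total += p ** (a - 1) * (1 - p) ** (b - 1)
    return total * h

# Hypothetical values: Beta(2, 2) prior, then k = 3 heads in n = 5 flips.
alpha, beta, k, n = 2.0, 2.0, 3, 5

numeric = kernel_integral(alpha + k, beta + n - k)       # brute-force integral
closed_form = math.exp(log_beta(alpha + k, beta + n - k))  # read off from the form

print(numeric, closed_form)
```

The two numbers should agree to many digits. The brute-force route needs a hundred thousand function evaluations for one parameter; the closed form needs three calls to `math.lgamma`.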

Concrete example

Start with $\text{Beta}(1,1)$ - a flat prior. Observe 3 heads in 5 flips.

  • Posterior: $\text{Beta}(4, 3)$.
  • Posterior mean: $4/7 \approx 0.571$.
  • Posterior mode (MAP): $(4-1)/(4+3-2) = 3/5 = 0.6$ - this equals the MLE, as expected with a flat prior.

Now update again. Observe 7 more heads in 10 more flips (total: 10 heads in 15 flips from scratch, or use the previous posterior).

  • New posterior: $\text{Beta}(4+7, 3+3) = \text{Beta}(11, 6)$.
  • Posterior mean: $11/17 \approx 0.647$.

This is sequential Bayesian updating: today’s posterior becomes tomorrow’s prior, and the parameters just accumulate.
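The worked example above can be reproduced in a few lines. This is a minimal sketch, not a library API: `beta_update`, `beta_mean`, and `beta_mode` are hypothetical helper names, and the mode formula is only valid when both parameters exceed 1.

```python
from typing import Tuple

def beta_update(alpha: float, beta: float, heads: int, flips: int) -> Tuple[float, float]:
    """Conjugate update: add observed heads to alpha and observed tails to beta."""
    if not 0 <= heads <= flips:
        raise ValueError("heads must be between 0 and flips")
    return alpha + heads, beta + (flips - heads)

def beta_mean(alpha: float, beta: float) -> float:
    """Mean of Beta(alpha, beta)."""
    return alpha / (alpha + beta)

def beta_mode(alpha: float, beta: float) -> float:
    """Mode (MAP estimate) of Beta(alpha, beta); requires alpha > 1 and beta > 1."""
    if alpha <= 1 or beta <= 1:
        raise ValueError("mode formula requires alpha > 1 and beta > 1")
    return (alpha - 1) / (alpha + beta - 2)

prior = (1.0, 1.0)  # flat prior

# Sequential route: 3 heads in 5 flips, then 7 heads in 10 more flips.
after_first = beta_update(*prior, heads=3, flips=5)
after_second = beta_update(*after_first, heads=7, flips=10)

# Batch route: all 10 heads in 15 flips at once.
batch = beta_update(*prior, heads=10, flips=15)

print(after_first, beta_mean(*after_first), beta_mode(*after_first))
print(after_second, beta_mean(*after_second))
print(after_second == batch)
```

The sequential and batch routes land on the same parameters, which is what "the parameters just accumulate" means in practice: the order and grouping of the observations do not matter.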


Dirichlet-Categorical: The Generalization

The Beta-Bernoulli pair generalizes naturally to more than two outcomes.

Suppose each observation falls into one of $K$ categories with probabilities $\mathbf{p} = (p_1, \ldots, p_K)$ where $\sum p_k = 1$. The outcome is Categorical($\mathbf{p}$): a single draw takes value $k$ with probability $p_k$.

The conjugate prior for $\mathbf{p}$ is the Dirichlet distribution:

$$P(\mathbf{p}) = \frac{\Gamma(\sum_k \alpha_k)}{\prod_k \Gamma(\alpha_k)} \prod_k p_k^{\alpha_k - 1}.$$

After observing $n_k$ occurrences of category $k$ (with $\sum_k n_k = n$ total observations):

$$\text{Dir}(\alpha_1, \ldots, \alpha_K) \xrightarrow{\text{data}} \text{Dir}(\alpha_1 + n_1, \ldots, \alpha_K + n_K).$$

Same principle: parameters add up. The Dirichlet is the $K$-dimensional generalization of the Beta; with $K=2$ it reduces to Beta.
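A minimal sketch of the same update for $K$ categories follows. The category labels, the draws, and the helper names `dirichlet_update` and `dirichlet_mean` are made up for illustration; the posterior mean used here is the standard Dirichlet mean $\alpha_k / \sum_j \alpha_j$.

```python
from collections import Counter
from typing import Dict, Iterable

def dirichlet_update(alphas: Dict[str, float], observations: Iterable[str]) -> Dict[str, float]:
    """Conjugate update: add each category's observed count to its alpha."""
    counts = Counter(observations)
    unknown = set(counts) - set(alphas)
    if unknown:
        raise ValueError(f"observations contain unknown categories: {sorted(unknown)}")
    # Counter returns 0 for categories that were never observed.
    return {cat: a + counts[cat] for cat, a in alphas.items()}

def dirichlet_mean(alphas: Dict[str, float]) -> Dict[str, float]:
    """Mean of the Dirichlet: each alpha divided by the sum of all alphas."""
    total = sum(alphas.values())
    return {cat: a / total for cat, a in alphas.items()}

# Hypothetical example: flat prior over three colours, six observed draws.
prior = {"red": 1.0, "green": 1.0, "blue": 1.0}
draws = ["red", "red", "blue", "red", "green", "red"]

posterior = dirichlet_update(prior, draws)
print(posterior)
print(dirichlet_mean(posterior))
```

With only two categories the dictionary has two entries and the update is exactly the Beta-Bernoulli rule.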

The Dirichlet-Categorical conjugacy is central to several important models. In Latent Dirichlet Allocation (LDA), documents are modeled as mixtures of topics, topics as distributions over words - both use Dirichlet priors over the simplex. Additive (Laplace) smoothing of word probabilities in n-gram language models is the posterior mean under a symmetric Dirichlet prior. Bayesian mixture models commonly place a Dirichlet prior on the mixture weights.


Normal-Normal: Known Variance

Suppose you observe data $x_1, \ldots, x_n$ from $\mathcal{N}(\mu, \sigma^2)$ with known $\sigma^2$. You want to estimate $\mu$.

The likelihood for the sample mean $\bar{x}$ is:

$$P(\bar{x} \mid \mu) \propto \exp\left(-\frac{n(\bar{x} - \mu)^2}{2\sigma^2}\right).$$

Choose a Gaussian prior on $\mu$: $\mu \sim \mathcal{N}(\mu_0, \sigma_0^2)$.

The posterior is also Gaussian: $\mu \mid \text{data} \sim \mathcal{N}(\mu_n, \sigma_n^2)$ where:

$$\frac{1}{\sigma_n^2} = \frac{1}{\sigma_0^2} + \frac{n}{\sigma^2}, \qquad \mu_n = \sigma_n^2 \left(\frac{\mu_0}{\sigma_0^2} + \frac{n\bar{x}}{\sigma^2}\right).$$

The key quantity is precision (inverse variance), not variance. Precisions add: posterior precision equals prior precision plus data precision. The posterior mean is a precision-weighted average of the prior mean and the sample mean.

If the prior is very tight ($\sigma_0^2$ small, high prior precision), the posterior stays near $\mu_0$ until the data precision $n/\sigma^2$ becomes comparable to the prior precision. If the prior is vague ($\sigma_0^2 \to \infty$, zero prior precision), the posterior mean approaches $\bar{x}$ and the posterior precision approaches $n/\sigma^2$ - you recover the MLE with the correct uncertainty.

As $n \to \infty$, the data precision $n/\sigma^2$ dominates and the prior is forgotten. The posterior concentrates at the true $\mu$.
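The precision arithmetic can be sketched directly. The function name `normal_update` and all numeric values below are assumptions made for the example; the formulas are the ones given above for $\sigma_n^2$ and $\mu_n$.

```python
from typing import Sequence, Tuple

def normal_update(mu0: float, var0: float, data: Sequence[float], noise_var: float) -> Tuple[float, float]:
    """Posterior (mean, variance) for mu given a N(mu0, var0) prior and known noise variance."""
    n = len(data)
    if n == 0:
        return mu0, var0  # no data: the posterior is the prior
    xbar = sum(data) / n
    prior_precision = 1.0 / var0
    data_precision = n / noise_var
    post_precision = prior_precision + data_precision  # precisions add
    post_var = 1.0 / post_precision
    # Precision-weighted average of the prior mean and the sample mean.
    post_mean = post_var * (prior_precision * mu0 + data_precision * xbar)
    return post_mean, post_var

# Hypothetical example: prior N(0, 1), known noise variance 4, four observations.
data = [2.0, 4.0, 3.0, 3.0]
post_mean, post_var = normal_update(mu0=0.0, var0=1.0, data=data, noise_var=4.0)
print(post_mean, post_var)

# A very vague prior: the posterior should sit close to the sample mean.
vague_mean, vague_var = normal_update(mu0=0.0, var0=1e12, data=data, noise_var=4.0)
print(vague_mean, vague_var)
```

In the first call the prior precision and the data precision are both 1, so the posterior mean lands halfway between the prior mean 0 and the sample mean 3. In the second call the prior precision is negligible and the result is essentially the sample mean with variance $\sigma^2/n$.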


Gamma-Poisson: Count Data

Suppose you observe count data $x_1, \ldots, x_n$ from $\text{Poisson}(\lambda)$: events that occur with rate $\lambda$ per unit time. The likelihood is:

$$P(\text{data} \mid \lambda) \propto \lambda^{\sum x_i} e^{-n\lambda}.$$

The conjugate prior for $\lambda$ is the Gamma distribution, $\lambda \sim \text{Gamma}(\alpha, \beta)$:

$$P(\lambda) \propto \lambda^{\alpha-1} e^{-\beta \lambda}, \quad \lambda > 0.$$

Here $\alpha$ is the shape and $\beta$ is the rate parameter. The Gamma prior has mean $\alpha/\beta$.

After observing $n$ observations with total count $S = \sum_{i=1}^n x_i$:

$$\text{Gamma}(\alpha, \beta) \xrightarrow{\text{data}} \text{Gamma}(\alpha + S, \beta + n).$$

The update: add the total observed count to the shape parameter, add the number of observations to the rate parameter. The posterior mean is $(\alpha + S)/(\beta + n)$, a smoothed version of the MLE $S/n$.

Interpretation. The prior $\text{Gamma}(\alpha, \beta)$ encodes the belief that you have seen $\alpha$ total events in $\beta$ time periods (pseudo-observations). Add real data: $\alpha + S$ total events in $\beta + n$ time periods.
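A minimal sketch of the Gamma-Poisson update, using the shape and rate parameterization from above. The helper name `gamma_update`, the prior, and the counts are hypothetical values chosen for illustration.

```python
from typing import Sequence, Tuple

def gamma_update(shape: float, rate: float, counts: Sequence[int]) -> Tuple[float, float]:
    """Conjugate update: add the total count to the shape and the number of observations to the rate."""
    if any(c < 0 for c in counts):
        raise ValueError("Poisson counts must be non-negative")
    return shape + sum(counts), rate + len(counts)

def gamma_mean(shape: float, rate: float) -> float:
    """Mean of Gamma(shape, rate)."""
    return shape / rate

# Hypothetical example: prior Gamma(2, 1), i.e. "2 events in 1 time period".
prior = (2.0, 1.0)
counts = [3, 5, 4, 2]  # events observed in four time periods

posterior = gamma_update(*prior, counts)
mle = sum(counts) / len(counts)

print(posterior, gamma_mean(*posterior), mle)
```

The posterior mean falls between the prior mean and the MLE, and moves toward the MLE as more periods are observed.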


The Deeper Pattern: Exponential Families

These examples are not isolated coincidences. Every exponential family likelihood has a conjugate prior, and the structure is always the same.

An exponential family likelihood has the form:

$$P(x \mid \eta) = h(x) \exp\left(\eta^\top T(x) - A(\eta)\right),$$

where $\eta$ are the natural parameters, $T(x)$ are the sufficient statistics, and $A(\eta)$ is the log-partition function.

The conjugate prior for $\eta$ is:

$$P(\eta \mid \chi, \nu) \propto \exp\left(\chi^\top \eta - \nu A(\eta)\right).$$

After observing $n$ data points with sufficient statistics $\sum T(x_i)$:

$$(\chi, \nu) \xrightarrow{\text{data}} \left(\chi + \sum_{i=1}^n T(x_i),\; \nu + n\right).$$

The posterior parameters are the sum of prior parameters and sufficient statistics. This is why conjugate updates always reduce to addition - you are adding sufficient statistics to the prior pseudo-counts.

The Beta-Bernoulli, Dirichlet-Categorical, Normal-Normal, and Gamma-Poisson pairs are all instances of this single exponential-family framework.
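To see one instance concretely, here is the Bernoulli case worked through as a sketch. The change of variables from $\eta$ to $p$ is a step not spelled out above, and the identification $\alpha = \chi$, $\beta = \nu - \chi$ is introduced here for illustration.

Write the Bernoulli likelihood in exponential-family form:

$$P(x \mid p) = p^x (1-p)^{1-x} = \exp\left(x \log\frac{p}{1-p} + \log(1-p)\right),$$

so $\eta = \log\frac{p}{1-p}$, $T(x) = x$, $h(x) = 1$, and $A(\eta) = -\log(1-p) = \log(1 + e^{\eta})$.

The conjugate prior on $\eta$ is then:

$$P(\eta \mid \chi, \nu) \propto \exp\left(\chi \eta - \nu A(\eta)\right) = \left(\frac{p}{1-p}\right)^{\chi} (1-p)^{\nu} = p^{\chi}(1-p)^{\nu - \chi}.$$

Changing variables from $\eta$ to $p$ brings in the Jacobian $d\eta/dp = 1/(p(1-p))$, which gives:

$$P(p \mid \chi, \nu) \propto p^{\chi - 1}(1-p)^{\nu - \chi - 1},$$

the kernel of $\text{Beta}(\alpha, \beta)$ with $\alpha = \chi$ and $\beta = \nu - \chi$. After $n$ flips with $k = \sum_i x_i$ heads, the general rule gives $(\chi + k, \nu + n)$, that is:

$$\text{Beta}(\chi + k,\; (\nu + n) - (\chi + k)) = \text{Beta}(\alpha + k,\; \beta + n - k),$$

which is the Beta-Bernoulli update derived earlier.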


The Practical Limitation

Discomfort check. Conjugate priors are chosen for mathematical convenience, not because they accurately represent your actual beliefs. A Beta prior may be the wrong shape for your beliefs about $p$. A Gamma prior may not capture your uncertainty about $\lambda$. Using a conjugate prior when your actual beliefs are non-conjugate is a modeling error - you are fitting your beliefs to a mathematically convenient shape rather than representing them honestly.

How much does it matter? With large data, the likelihood dominates and the posterior is nearly independent of the prior. The prior matters when:

  1. Data is scarce - fewer observations than you would need for the likelihood to overwhelm prior assumptions.
  2. The prior is far from the truth - a strongly misspecified prior can take many observations to correct.
  3. You are in a tail - the prior’s behavior in extreme regions (high or low $p$, large $\lambda$) can still matter at large sample sizes if the truth is in a tail, because the informative events remain rare.

In practice: use conjugate priors when you want exact, cheap updates and either (a) you have enough data, or (b) the conjugate prior is actually a reasonable model of your beliefs. Use non-conjugate priors when faithfulness to your actual beliefs is more important than analytical tractability, and accept that you will need approximate inference.
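How quickly the likelihood overwhelms the prior can be sketched with the Beta-Bernoulli pair. Everything here is a hypothetical illustration: the two priors, the assumed 30% head rate, and the sample sizes are made up, and `posterior_mean` is just the mean of the updated Beta.

```python
def posterior_mean(alpha: float, beta: float, heads: int, flips: int) -> float:
    """Mean of Beta(alpha + heads, beta + flips - heads)."""
    return (alpha + heads) / (alpha + beta + flips)

flat = (1.0, 1.0)          # no prior opinion
opinionated = (20.0, 2.0)  # strong prior belief that p is near 0.9

gaps = {}
for flips in (10, 100, 10_000):
    heads = flips * 3 // 10  # data in which exactly 30% of flips are heads
    mean_flat = posterior_mean(*flat, heads, flips)
    mean_opinionated = posterior_mean(*opinionated, heads, flips)
    gaps[flips] = abs(mean_flat - mean_opinionated)
    print(flips, round(mean_flat, 4), round(mean_opinionated, 4), round(gaps[flips], 4))
```

With 10 flips the two analysts disagree substantially; with 10,000 flips their posterior means are nearly identical. The misspecified prior is corrected, but only after the data outweigh its 22 pseudo-counts many times over.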


Connection to Regularization

We noted in "Bayesian Inference - Updating Your Beliefs the Mathematically Correct Way" that MAP estimation with a Gaussian prior is equivalent to L2 regularization. Conjugate priors make this connection concrete and general.

L2 regularization (ridge regression): minimize $-\log P(\text{data} \mid \mathbf{w}) + \frac{\lambda}{2}\|\mathbf{w}\|^2$. This is MAP with a Gaussian prior $P(\mathbf{w}) \propto \exp(-\|\mathbf{w}\|^2 / (2\sigma^2))$, where $\lambda = 1/\sigma^2$. The Gaussian is the conjugate prior for the mean of a Normal likelihood with known variance.

L1 regularization (lasso): minimize $-\log P(\text{data} \mid \mathbf{w}) + \lambda \|\mathbf{w}\|_1$. This is MAP with a Laplace prior $P(\mathbf{w}) \propto \exp(-\lambda\|\mathbf{w}\|_1)$. The Laplace prior is not a conjugate prior for the Normal likelihood, but it is the natural choice if you want sparsity - its density has a sharp peak at zero, so the MAP estimate can set some weights exactly to zero.

The regularization coefficient is not arbitrary: it is set by the scale of the prior. With the negative log-likelihood left unscaled as above, $\lambda = 1/\sigma^2$ for the Gaussian prior and $\lambda = 1/b$ for the Laplace prior (where $b$ is the Laplace scale). Cross-validation tunes the regularization coefficient; in Bayesian terms, this plays a role similar to empirical Bayes - using data to choose the prior hyperparameters.
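The ridge and MAP correspondence can be checked on a toy problem. The sketch below assumes a hypothetical one-weight linear model $y_i = w x_i + \text{noise}$ with known noise variance and a zero-mean Gaussian prior on $w$; the data, the variances, and the helper names are all made up for illustration. It minimizes the penalized loss numerically with a ternary search and compares the result with the closed-form conjugate posterior mean.

```python
from typing import Callable, Sequence

def penalized_loss(w: float, xs: Sequence[float], ys: Sequence[float],
                   noise_var: float, lam: float) -> float:
    """Gaussian negative log-likelihood (up to a constant) plus (lam / 2) * w^2."""
    nll = sum((y - w * x) ** 2 for x, y in zip(xs, ys)) / (2.0 * noise_var)
    return nll + 0.5 * lam * w * w

def minimize_convex(f: Callable[[float], float], lo: float, hi: float, iters: int = 200) -> float:
    """Ternary search for the minimizer of a convex function on [lo, hi]."""
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if f(m1) < f(m2):
            hi = m2
        else:
            lo = m1
    return 0.5 * (lo + hi)

def gaussian_posterior_mean(xs: Sequence[float], ys: Sequence[float],
                            noise_var: float, prior_var: float) -> float:
    """Posterior mean of w under a N(0, prior_var) prior: precision-weighted as in Normal-Normal."""
    precision = 1.0 / prior_var + sum(x * x for x in xs) / noise_var
    return (sum(x * y for x, y in zip(xs, ys)) / noise_var) / precision

# Hypothetical data and hyperparameters.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]
noise_var = 1.0
prior_var = 0.5
lam = 1.0 / prior_var  # regularization strength implied by the prior

w_ridge = minimize_convex(lambda w: penalized_loss(w, xs, ys, noise_var, lam), -10.0, 10.0)
w_bayes = gaussian_posterior_mean(xs, ys, noise_var, prior_var)

print(w_ridge, w_bayes)
```

The two estimates agree up to the tolerance of the search. For a Gaussian posterior the mode and the mean coincide, which is why the MAP estimate from the optimizer matches the posterior mean from the conjugate formula.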


Summary

| Likelihood | Conjugate prior | Posterior update |
|---|---|---|
| Bernoulli($p$) | Beta($\alpha, \beta$) | Beta($\alpha + k$, $\beta + n - k$) |
| Categorical($\mathbf{p}$) | Dirichlet($\boldsymbol{\alpha}$) | Dirichlet($\boldsymbol{\alpha} + \mathbf{n}$) |
| Normal($\mu$, $\sigma^2$), known $\sigma^2$ | Normal($\mu_0$, $\sigma_0^2$) | Normal($\mu_n$, $\sigma_n^2$): precisions add, mean is a precision-weighted average |
| Poisson($\lambda$) | Gamma($\alpha$, $\beta$) | Gamma($\alpha + S$, $\beta + n$) |
| Any exponential family | Natural conjugate ($\chi$, $\nu$) | ($\chi + \sum_i T(x_i)$, $\nu + n$): sufficient statistics add |

Conjugate priors turn Bayesian inference into parameter arithmetic. They are mathematically elegant and computationally cheap. Their limitation is that they may not represent your actual beliefs. With large data, the distinction usually does not matter. With small data, the choice of prior - conjugate or not - is a genuine modeling decision that affects your conclusions.

Use them deliberately, not by default.

