Bayesian Inference
The Bayesian Paradigm
In frequentist statistics, the parameter $\theta$ is a fixed unknown constant. Probability statements refer to the long-run frequency of procedures - confidence intervals, p-values - not to beliefs about $\theta$ itself.
The Bayesian paradigm treats $\theta$ as a random variable with a prior distribution $p(\theta)$ encoding beliefs before observing data. After observing $x = (x_1, \ldots, x_n)$, Bayes' theorem updates these beliefs to yield the posterior distribution:
$$p(\theta \mid x) = \frac{p(x \mid \theta)\,p(\theta)}{p(x)}.$$
Since $p(x) = \int p(x \mid \theta)\,p(\theta)\,d\theta$ does not depend on $\theta$, the posterior is proportional to
$$p(\theta \mid x) \propto p(x \mid \theta)\,p(\theta) = L(\theta; x)\,p(\theta).$$
The posterior combines the likelihood with the prior, balancing data evidence against prior beliefs.
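As a concrete illustration, the update $p(\theta \mid x) \propto p(x \mid \theta)\,p(\theta)$ can be carried out numerically on a grid. The sketch below uses an illustrative Beta(2, 2) prior and a small set of made-up coin flips; none of these numbers come from the text.

```python
import numpy as np

# Minimal sketch: posterior over a Bernoulli success probability on a grid.
# The Beta(2, 2) prior and the data below are illustrative choices.
theta = np.linspace(0.001, 0.999, 999)
dtheta = theta[1] - theta[0]
prior = theta * (1 - theta)                              # unnormalized Beta(2, 2) density
x = np.array([1, 0, 1, 1, 0, 1, 1, 1])                   # observed Bernoulli outcomes
likelihood = theta**x.sum() * (1 - theta)**(len(x) - x.sum())

unnormalized = likelihood * prior                        # p(x | theta) p(theta)
evidence = unnormalized.sum() * dtheta                   # numerical approximation of p(x)
posterior = unnormalized / evidence                      # normalized posterior on the grid

print("posterior mean:", (theta * posterior).sum() * dtheta)
```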
Prior Choice
Informative Priors
An informative prior encodes genuine domain knowledge. For example, a physician’s prior on a disease prevalence might be concentrated near 0.01 based on historical data. Informative priors can dramatically improve inference when data are scarce.
Non-Informative Priors
When prior knowledge is limited, we seek a prior that is “non-informative.” A naive choice - flat (uniform) priors - is problematic because flatness is not preserved under reparametrization. If $p(\theta) \propto 1$ and we reparametrize to $\phi = \log \theta$, then $p(\phi) \propto e^\phi$, which is far from flat.
Jeffreys prior resolves this:
$$p(\theta) \propto \sqrt{I(\theta)},$$
where $I(\theta) = -E\!\left[\frac{\partial^2 \log p(X \mid \theta)}{\partial \theta^2}\right]$ is the Fisher information. Jeffreys prior is invariant under reparametrization: if $\phi = g(\theta)$, the induced prior on $\phi$ is also the Jeffreys prior for the $\phi$-parametrized model.
Why? Under $\phi = g(\theta)$, $I_\phi(\phi) = I_\theta(\theta)\bigl(d\theta/d\phi\bigr)^2$, so $\sqrt{I_\phi(\phi)} = \sqrt{I_\theta(\theta)}\,|d\theta/d\phi|$, which is exactly the change-of-variables formula for densities.
For the Bernoulli model, $I(p) = 1/[p(1-p)]$, giving the Jeffreys prior $p(p) \propto p^{-1/2}(1-p)^{-1/2}$, which is $\text{Beta}(1/2, 1/2)$.
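A quick numerical check of this result: the unnormalized Jeffreys prior $\sqrt{I(p)}$ should agree with the $\text{Beta}(1/2, 1/2)$ density up to a constant (the normalizing constant $B(1/2, 1/2) = \pi$). The sketch below assumes only the Fisher information formula above.

```python
import numpy as np
from scipy import stats

# Sketch: sqrt(I(p)) for the Bernoulli model has the same shape as the
# Beta(1/2, 1/2) density; the normalizing constant is B(1/2, 1/2) = pi.
p = np.linspace(0.01, 0.99, 99)
jeffreys_unnormalized = np.sqrt(1.0 / (p * (1 - p)))   # sqrt of the Fisher information
beta_half = stats.beta(0.5, 0.5).pdf(p)                # Beta(1/2, 1/2) density

ratio = jeffreys_unnormalized / beta_half
print(np.allclose(ratio, ratio[0]))                    # constant ratio -> same shape
print(ratio[0], np.pi)                                 # the constant is pi
```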
Posterior Summaries
The full posterior $p(\theta \mid x)$ is the complete inferential object. Common point summaries:
- Posterior mean: $E[\theta \mid x] = \int \theta\,p(\theta \mid x)\,d\theta$. Minimizes posterior expected squared error.
- MAP estimate: $\hat{\theta}_{\text{MAP}} = \arg\max_\theta p(\theta \mid x) = \arg\max_\theta [\log p(x \mid \theta) + \log p(\theta)]$. Equivalent to penalized MLE with penalty $-\log p(\theta)$.
- Posterior median: minimizes posterior expected absolute error.
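For a concrete posterior these summaries are easy to compute. The sketch below uses an illustrative $\text{Beta}(3, 9)$ posterior, whose mode has the closed form $(\alpha-1)/(\alpha+\beta-2)$.

```python
from scipy import stats
from scipy.optimize import minimize_scalar

# Sketch: point summaries of an illustrative Beta(3, 9) posterior.
posterior = stats.beta(3, 9)

post_mean = posterior.mean()       # minimizes posterior expected squared error
post_median = posterior.median()   # minimizes posterior expected absolute error
# MAP = posterior mode; found here by maximizing the density numerically
post_map = minimize_scalar(lambda t: -posterior.pdf(t), bounds=(0, 1), method="bounded").x

print(post_mean, post_median, post_map, (3 - 1) / (3 + 9 - 2))  # closed-form mode = 0.2
```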
Bayesian Credible Intervals
A 95% credible interval $[a, b]$ satisfies $P(a \leq \theta \leq b \mid x) = 0.95$. This is a direct probability statement about $\theta$ - fundamentally different from a frequentist confidence interval, which says the procedure covers the true $\theta$ in 95% of repeated experiments.
The highest posterior density (HPD) interval $\{\theta : p(\theta \mid x) \geq c\}$, where $c$ is chosen so that the set has the desired posterior probability, is the shortest credible interval at a given level.
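For a unimodal posterior, the HPD interval can be found by searching over lower tail probabilities for the shortest interval containing 95% of the posterior mass. The sketch below uses the same illustrative $\text{Beta}(3, 9)$ posterior and compares the HPD interval with the equal-tailed one.

```python
from scipy import stats
from scipy.optimize import minimize_scalar

# Sketch: 95% HPD interval for an illustrative, unimodal Beta(3, 9) posterior,
# found by minimizing interval length over the lower tail probability.
posterior = stats.beta(3, 9)
level = 0.95

def width(lower_tail):
    return posterior.ppf(lower_tail + level) - posterior.ppf(lower_tail)

res = minimize_scalar(width, bounds=(0, 1 - level), method="bounded")
hpd = (posterior.ppf(res.x), posterior.ppf(res.x + level))
equal_tailed = (posterior.ppf(0.025), posterior.ppf(0.975))
print("HPD:", hpd)                     # shortest 95% interval
print("equal-tailed:", equal_tailed)   # slightly wider for a skewed posterior
```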
The Normal-Normal Model
Let $X_1, \ldots, X_n \mid \mu \sim \mathcal{N}(\mu, \sigma^2)$ with $\sigma^2$ known. Use the prior $\mu \sim \mathcal{N}(\mu_0, \tau^2)$.
The posterior is:
$$p(\mu \mid x) \propto \exp\!\left(-\frac{(\mu - \mu_0)^2}{2\tau^2}\right) \exp\!\left(-\frac{1}{2\sigma^2}\sum_{i=1}^n(x_i - \mu)^2\right).$$
Completing the square in $\mu$: the exponent is $-\frac{1}{2}\left(\frac{1}{\tau^2} + \frac{n}{\sigma^2}\right)\mu^2 + \left(\frac{\mu_0}{\tau^2} + \frac{n\bar{x}}{\sigma^2}\right)\mu + \text{const}$.
This is a Gaussian in $\mu$ with posterior precision $\kappa = \frac{1}{\tau^2} + \frac{n}{\sigma^2}$ and posterior mean
$$\mu_n = \frac{1}{\kappa}\left(\frac{\mu_0}{\tau^2} + \frac{n\bar{x}}{\sigma^2}\right) = \frac{\tau^2}{\sigma^2/n + \tau^2}\bar{x} + \frac{\sigma^2/n}{\sigma^2/n + \tau^2}\mu_0.$$
The posterior mean is a precision-weighted average of the MLE $\bar{x}$ and the prior mean $\mu_0$. As $n \to \infty$, $\mu_n \to \bar{x}$ (data overwhelms the prior). As $\tau^2 \to \infty$ (diffuse prior), $\mu_n \to \bar{x}$ as well.
The posterior variance is $1/\kappa = \frac{\sigma^2\tau^2}{\sigma^2 + n\tau^2}$, strictly smaller than both $\tau^2$ (prior variance) and $\sigma^2/n$ (MLE variance).
Posterior Predictive Distribution
The posterior predictive for a new observation $\tilde{x}$ integrates out the parameter:
$$p(\tilde{x} \mid x) = \int p(\tilde{x} \mid \theta)\,p(\theta \mid x)\,d\theta.$$
This automatically accounts for parameter uncertainty. In the Normal-Normal model, $\tilde{X} \mid x \sim \mathcal{N}(\mu_n, \sigma^2 + 1/\kappa)$, where the extra $1/\kappa$ reflects uncertainty in $\mu$.
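A Monte Carlo check of this result: drawing $\mu$ from the posterior and then $\tilde{x} \mid \mu$ from the likelihood should reproduce the predictive variance $\sigma^2 + 1/\kappa$. The sketch below reuses the illustrative Normal-Normal setup from above.

```python
import numpy as np

# Sketch: Monte Carlo check of the predictive variance sigma^2 + 1/kappa in the
# Normal-Normal model; hyperparameters and data are illustrative.
rng = np.random.default_rng(1)
sigma2, mu0, tau2 = 4.0, 0.0, 1.0
x = rng.normal(1.5, np.sqrt(sigma2), size=20)
n, xbar = len(x), x.mean()
kappa = 1 / tau2 + n / sigma2
mu_n = (mu0 / tau2 + n * xbar / sigma2) / kappa

mu_draws = rng.normal(mu_n, np.sqrt(1 / kappa), size=200_000)   # mu ~ p(mu | x)
x_tilde = rng.normal(mu_draws, np.sqrt(sigma2))                 # x_tilde ~ p(x_tilde | mu)
print(x_tilde.var(), sigma2 + 1 / kappa)                        # should agree closely
```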
Bayesian Decision Theory
Given a loss function $L(\theta, a)$ for taking action $a$ when the true parameter is $\theta$, the Bayes optimal action minimizes the posterior expected loss:
$$a^\ast = \arg\min_a E_{\theta \mid x}[L(\theta, a)] = \arg\min_a \int L(\theta, a)\,p(\theta \mid x)\,d\theta.$$
Under squared error loss $L(\theta, a) = (\theta - a)^2$, the Bayes optimal action is the posterior mean. Under absolute error loss, it is the posterior median. Under 0-1 loss, it is the MAP estimate.
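These facts can be verified by brute force: discretize the action space, compute the posterior expected loss for each action, and take the minimizer. The sketch below does this for an illustrative $\text{Beta}(3, 9)$ posterior.

```python
import numpy as np
from scipy import stats

# Sketch: minimize posterior expected loss numerically and compare with the
# known optimal summaries (posterior mean and median).
posterior = stats.beta(3, 9)
theta = np.linspace(0.0005, 0.9995, 2000)
dtheta = theta[1] - theta[0]
dens = posterior.pdf(theta)
actions = np.linspace(0, 1, 1001)

expected_sq_loss = [((theta - a) ** 2 * dens).sum() * dtheta for a in actions]
expected_abs_loss = [(np.abs(theta - a) * dens).sum() * dtheta for a in actions]

print(actions[np.argmin(expected_sq_loss)], posterior.mean())     # ~ posterior mean
print(actions[np.argmin(expected_abs_loss)], posterior.median())  # ~ posterior median
```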
Examples
Coin flipping with Beta-Binomial. Let $X \mid p \sim \text{Binomial}(n, p)$ with prior $p \sim \text{Beta}(\alpha, \beta)$. The posterior is
$$p(p \mid X = k) \propto p^k(1-p)^{n-k} \cdot p^{\alpha-1}(1-p)^{\beta-1} = p^{\alpha+k-1}(1-p)^{\beta+n-k-1},$$
which is $\text{Beta}(\alpha + k, \beta + n - k)$. The posterior mean is $\frac{\alpha+k}{\alpha+\beta+n}$, a shrinkage of the MLE $k/n$ toward the prior mean $\alpha/(\alpha+\beta)$. With $\alpha = \beta = 1$ (uniform prior), the posterior mean is $(k+1)/(n+2)$ - Laplace’s rule of succession.
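In code, the conjugate update is just a change of Beta parameters. The sketch below uses illustrative counts ($k = 7$ successes in $n = 10$ trials) under the uniform prior and recovers Laplace's rule of succession.

```python
from scipy import stats

# Sketch of the conjugate Beta-Binomial update; the counts are illustrative.
alpha, beta_prior = 1.0, 1.0        # uniform prior, Beta(1, 1)
n, k = 10, 7                        # 7 successes in 10 trials

posterior = stats.beta(alpha + k, beta_prior + n - k)
print(posterior.mean(), (k + 1) / (n + 2))   # Laplace's rule of succession
print(k / n)                                 # MLE, shrunk toward the prior mean 1/2
print(posterior.interval(0.95))              # equal-tailed 95% credible interval
```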
Bayesian linear regression. In the model $y = X\beta + \varepsilon$, $\varepsilon \sim \mathcal{N}(0, \sigma^2 I)$, with prior $\beta \sim \mathcal{N}(0, \lambda^{-1}I)$, the posterior mean is
$$\hat{\beta}_{\text{Bayes}} = (X^\top X + \lambda\sigma^2 I)^{-1}X^\top y,$$
which is exactly the ridge regression estimator. Bayesian inference thus reveals ridge regression as MAP estimation under a Gaussian prior, connecting regularization to Bayesian shrinkage.
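A quick numerical check of this equivalence, with illustrative dimensions, noise level, and prior precision $\lambda$: the Gaussian posterior mean and the ridge closed form agree to machine precision.

```python
import numpy as np

# Sketch: the posterior mean under beta ~ N(0, lambda^{-1} I) equals the ridge estimator.
# Dimensions, noise level sigma^2, and prior precision lambda are illustrative.
rng = np.random.default_rng(2)
n, d, sigma2, lam = 50, 5, 0.5, 2.0
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + rng.normal(scale=np.sqrt(sigma2), size=n)

# Gaussian posterior: Sigma_post = (lambda I + X^T X / sigma^2)^{-1},
#                     mu_post    = Sigma_post X^T y / sigma^2
Sigma_post = np.linalg.inv(lam * np.eye(d) + X.T @ X / sigma2)
mu_post = Sigma_post @ X.T @ y / sigma2

# Ridge closed form with penalty lambda * sigma^2
beta_ridge = np.linalg.solve(X.T @ X + lam * sigma2 * np.eye(d), X.T @ y)
print(np.allclose(mu_post, beta_ridge))      # True
```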