Bayesian Inference - Updating Your Beliefs the Mathematically Correct Way
Helpful context:
- Statistics - Turning Data Into Defensible Claims
- Conditional Probability - What You Already Know Changes Everything
- Probability Distributions - The Shapes That Randomness Takes
Sherlock Holmes never waits for certainty. In A Study in Scarlet, he meets Watson for the first time and immediately deduces he has returned from Afghanistan. He does not say “I cannot determine your history without more data.” He reasons from available clues - tan line, posture, arm injury - to the most probable explanation, then updates as new evidence arrives.
This is Bayesian inference, made mathematically precise. It is a different philosophy from classical statistics, and understanding both matters - because they are answering subtly different questions.
Two Philosophies
Before we do any math, we need to understand what the two camps are actually disagreeing about.
Frequentist statistics treats parameters as fixed, unknown constants. The word “probability” means long-run frequency - the fraction of times an event occurs in infinitely many repetitions of an experiment. The parameter $\theta$ is not random; it has some true fixed value. Data are random (because they come from a random experiment). You never assign a probability to the statement “$\theta = 0.6$” because $\theta$ is fixed; asking for its probability is a category error.
Bayesian statistics treats parameters as uncertain quantities, described by probability distributions. The word “probability” means degree of belief - how confident you are that something is true, given what you know. The parameter $\theta$ is random in the sense that you are uncertain about it. Data are also random, but once observed they become fixed. You update your beliefs about $\theta$ in light of the data.
Neither philosophy is “correct.” They are different frameworks for different questions. The frequentist asks: “If this experiment were repeated many times, how would my estimator behave?” The Bayesian asks: “Given the data I actually observed, what should I believe now?”
Bayes' Theorem for Inference
The engine of Bayesian inference is Bayes' theorem, applied to the parameter $\theta$ and the data:
$$P(\theta \mid \text{data}) = \frac{P(\text{data} \mid \theta) \cdot P(\theta)}{P(\text{data})}.$$
Each piece has a name and a role.
Prior $P(\theta)$: your belief about $\theta$ before seeing any data. It encodes what you knew going in - from domain knowledge, previous experiments, or deliberate ignorance.
Likelihood $P(\text{data} \mid \theta)$: how probable the observed data would be if the parameter were $\theta$. This is the same object frequentists work with; the philosophies diverge in what they do with it.
Posterior $P(\theta \mid \text{data})$: your updated belief about $\theta$ after seeing the data. This is what you want.
Normalizing constant $P(\text{data}) = \int P(\text{data} \mid \theta) P(\theta) \, d\theta$: a constant that makes the posterior integrate to 1. It is often intractable to compute in closed form, which is why much practical Bayesian computation relies on sampling methods like MCMC. Because it is only a scaling factor, we often write:
$$P(\theta \mid \text{data}) \propto P(\text{data} \mid \theta) \cdot P(\theta).$$
The proportionality sign means “equal up to a constant not depending on $\theta$.” Since $P(\text{data})$ does not depend on $\theta$, it is just a normalizing factor we can often ignore when we only care about the shape of the posterior.
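The proportionality can be made concrete with a grid approximation: evaluate likelihood times prior on a grid, then divide by a numerical estimate of $P(\text{data})$. The sketch below is illustrative only; the function name `grid_posterior`, the toy likelihood $\theta^3(1-\theta)^2$, and the grid size are assumptions, not part of the original text.

```python
def grid_posterior(likelihood, prior, grid):
    """Approximate the posterior density on an evenly spaced grid.

    Multiplies likelihood by prior at each grid point, then divides by a
    Riemann-sum estimate of the normalizing constant P(data).
    """
    step = grid[1] - grid[0]
    unnormalized = [likelihood(t) * prior(t) for t in grid]
    evidence = sum(unnormalized) * step
    return [u / evidence for u in unnormalized]


# Hypothetical setup: theta in [0, 1], toy likelihood theta^3 (1 - theta)^2, flat prior.
grid = [(i + 0.5) / 1000 for i in range(1000)]  # midpoints of 1000 equal cells
step = grid[1] - grid[0]
posterior = grid_posterior(lambda t: t**3 * (1 - t) ** 2, lambda t: 1.0, grid)

total_mass = sum(posterior) * step
posterior_mean = sum(t * d for t, d in zip(grid, posterior)) * step
print(f"mass={total_mass:.6f}  mean={posterior_mean:.4f}")
```

Only the shape of likelihood times prior is needed to build the posterior; the normalizing constant falls out of a single sum at the end. This works well in one dimension, but grids become impractical as the number of parameters grows.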
A Worked Example: Coin Flipping
Let us go through a complete example with enough detail to make each piece concrete.
You have a coin. You do not know whether it is fair. You flip it $n = 5$ times and observe $k = 3$ heads. What should you believe about $p$, the probability of heads?
Step 1: Choose a likelihood
Each flip is Bernoulli($p$). With $n$ flips and $k$ heads:
$$P(\text{data} \mid p) = \binom{n}{k} p^k (1-p)^{n-k} \propto p^k (1-p)^{n-k}.$$
The binomial coefficient does not depend on $p$, so we absorb it into the proportionality constant.
Step 2: Choose a prior
You have no strong prior beliefs. You want a prior that is flat over $[0,1]$ - every value of $p$ equally plausible. This is the Uniform distribution on $[0,1]$, which is the same as $\text{Beta}(1,1)$:
$$P(p) = 1 \quad \text{for } p \in [0,1].$$
Step 3: Compute the posterior
$$P(p \mid \text{data}) \propto p^k (1-p)^{n-k} \cdot 1 = p^k (1-p)^{n-k}.$$
This is proportional to $p^{(k+1)-1}(1-p)^{(n-k+1)-1}$, which is the kernel of a $\text{Beta}(k+1, n-k+1)$ distribution. With $k=3, n=5$:
$$P(p \mid \text{data}) = \text{Beta}(4, 3).$$
The Beta distribution is defined on $[0,1]$, has a tractable normalizing constant $B(\alpha,\beta) = \Gamma(\alpha)\Gamma(\beta)/\Gamma(\alpha+\beta)$, and has mean $\alpha/(\alpha+\beta)$. So the posterior mean is $4/(4+3) = 4/7 \approx 0.571$.
What the posterior tells you
With $k=3, n=5$: the posterior is $\text{Beta}(4,3)$. It peaks around 0.6 but is spread out - you have only five data points, so uncertainty is high.
Now imagine you observed $k=30$ heads in $n=50$ flips. Same ratio, more data. The posterior is $\text{Beta}(31, 21)$. It still peaks near 0.6 but is now much sharper. The data has overwhelmed the prior: the posterior is almost entirely determined by the likelihood.
This is a general principle: with enough data, the posteriors from all reasonable priors converge. The prior matters most when data is scarce.
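To put numbers on "much sharper", the two posteriors can be compared through their standard deviations. This is a small sketch using the standard Beta mean and variance formulas; the helper name `beta_mean_sd` is an assumption made for illustration.

```python
import math


def beta_mean_sd(alpha, beta):
    """Mean and standard deviation of a Beta(alpha, beta) distribution."""
    total = alpha + beta
    mean = alpha / total
    variance = alpha * beta / (total**2 * (total + 1))
    return mean, math.sqrt(variance)


# Flat Beta(1, 1) prior, so the posterior is Beta(k + 1, n - k + 1).
small = beta_mean_sd(3 + 1, 5 - 3 + 1)      # 3 heads in 5 flips
large = beta_mean_sd(30 + 1, 50 - 30 + 1)   # 30 heads in 50 flips

for label, (mean, sd) in [("n=5", small), ("n=50", large)]:
    print(f"{label}: mean={mean:.3f}  sd={sd:.3f}")
```

With ten times the data, the posterior standard deviation drops from roughly 0.17 to roughly 0.07, while the center stays near 0.6.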
Discomfort check. The posterior $\text{Beta}(k+1, n-k+1)$ came out to be in the same family as the prior $\text{Beta}(1,1)$. Is that a coincidence? No - it is a property called conjugacy. The Beta distribution is the conjugate prior for the Bernoulli likelihood. The math works out cleanly because $p^k(1-p)^{n-k}$ and $p^{\alpha-1}(1-p)^{\beta-1}$ have the same functional form. We explore this systematically in Conjugate Priors - Bayesian Updates That Stay in the Same Family.
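For reference, here is the same calculation with a general $\text{Beta}(\alpha, \beta)$ prior instead of the flat one. It is a short derivation sketch, using only the likelihood from Step 1 and the Beta kernel quoted above:

$$P(p \mid \text{data}) \propto \underbrace{p^k (1-p)^{n-k}}_{\text{likelihood}} \cdot \underbrace{p^{\alpha-1}(1-p)^{\beta-1}}_{\text{prior}} = p^{(\alpha+k)-1}(1-p)^{(\beta+n-k)-1},$$

which is the kernel of $\text{Beta}(\alpha + k, \beta + n - k)$. Setting $\alpha = \beta = 1$ recovers $\text{Beta}(k+1, n-k+1)$.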
Point Estimates from the Posterior
Sometimes you need a single number rather than a full distribution. There are three natural candidates.
MAP estimate (Maximum A Posteriori): the mode of the posterior - the value of $\theta$ that maximizes $P(\theta \mid \text{data})$. Since $P(\theta \mid \text{data}) \propto P(\text{data} \mid \theta) P(\theta)$, the MAP estimate maximizes the product of likelihood and prior. With a uniform prior, the MAP equals the maximum likelihood estimate (MLE), because the prior contributes a constant and does not change where the maximum is.
Posterior mean: $\mathbb{E}[\theta \mid \text{data}] = \int \theta P(\theta \mid \text{data}) d\theta$. In the coin example with a $\text{Beta}(\alpha, \beta)$ prior, the posterior is $\text{Beta}(\alpha + k, \beta + n - k)$, so the posterior mean is $(\alpha + k)/(\alpha + \beta + n)$. With a uniform prior ($\alpha=\beta=1$), this is $(k+1)/(n+2)$, a smoothed version of the MLE $k/n$. The posterior mean minimizes expected squared-error loss, which often makes it a better choice for prediction than the MAP estimate.
Posterior median: the value that splits the posterior into equal halves. It minimizes expected absolute error. It is less commonly used but sometimes more robust.
Which estimate to report depends on your loss function and use case. The full posterior is always more informative than any point estimate - a point estimate discards the uncertainty.
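The three estimates can be computed for the $\text{Beta}(4,3)$ posterior from the coin example. The sketch below uses only the standard library. It relies on the binomial-tail identity for the Beta CDF, which holds for integer parameters only, and on a simple bisection for the median. The helper names `beta_cdf` and `beta_quantile` are assumptions made for illustration.

```python
from math import comb


def beta_cdf(x, a, b):
    """CDF of Beta(a, b) for positive integers a, b.

    Uses the identity I_x(a, b) = P(Binomial(a + b - 1, x) >= a).
    """
    n = a + b - 1
    return sum(comb(n, j) * x**j * (1 - x) ** (n - j) for j in range(a, n + 1))


def beta_quantile(q, a, b, tol=1e-12):
    """Invert the CDF by bisection on [0, 1]."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if beta_cdf(mid, a, b) < q:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


a, b = 4, 3  # posterior from 3 heads in 5 flips with a flat prior
map_estimate = (a - 1) / (a + b - 2)  # mode of Beta(a, b), valid for a, b > 1
posterior_mean = a / (a + b)
posterior_median = beta_quantile(0.5, a, b)

print(f"MAP={map_estimate:.4f}  mean={posterior_mean:.4f}  median={posterior_median:.4f}")
```

Because this posterior is skewed toward lower values, the three estimates differ: the mean is smallest, the mode is largest, and the median sits between them.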
Credible Intervals
The Bayesian analogue of a confidence interval is a credible interval. The 95% credible interval $[a, b]$ satisfies:
$$P(a \leq \theta \leq b \mid \text{data}) = 0.95.$$
This is the natural statement most people want to make: given what we observed, there is a 95% probability that $\theta$ lies in this range. Crucially, this interpretation is correct for credible intervals but not for frequentist confidence intervals (see Confidence Intervals - The Estimate and Its Honest Margin of Error for why).
A common choice is the Highest Density Interval (HDI): the shortest interval containing 95% of the posterior mass. The alternative is the equal-tailed interval, which cuts 2.5% of the mass from each tail. For symmetric unimodal posteriors, the two coincide.
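Both kinds of interval can be computed for the $\text{Beta}(4,3)$ coin posterior. This is a rough sketch: the CDF identity is valid only for integer Beta parameters, and the HDI is found by a brute-force scan over how much mass to leave in the lower tail. The function names and the scan resolution are assumptions, not a standard API.

```python
from math import comb


def beta_cdf(x, a, b):
    """CDF of Beta(a, b) for positive integers a, b (binomial-tail identity)."""
    n = a + b - 1
    return sum(comb(n, j) * x**j * (1 - x) ** (n - j) for j in range(a, n + 1))


def beta_quantile(q, a, b, tol=1e-12):
    """Invert the CDF by bisection on [0, 1]."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if beta_cdf(mid, a, b) < q:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def equal_tailed(a, b, level=0.95):
    """Interval that leaves (1 - level) / 2 of the mass in each tail."""
    tail = (1 - level) / 2
    return beta_quantile(tail, a, b), beta_quantile(1 - tail, a, b)


def hdi(a, b, level=0.95, steps=400):
    """Shortest interval with the given mass, found by scanning the lower-tail mass."""
    best = None
    for i in range(steps + 1):
        lower_mass = (1 - level) * i / steps
        lo = beta_quantile(lower_mass, a, b)
        hi = beta_quantile(lower_mass + level, a, b)
        if best is None or hi - lo < best[1] - best[0]:
            best = (lo, hi)
    return best


et_interval = equal_tailed(4, 3)
hd_interval = hdi(4, 3)
print("equal-tailed:", tuple(round(v, 3) for v in et_interval))
print("HDI:         ", tuple(round(v, 3) for v in hd_interval))
```

Both intervals contain 95% of the posterior mass. The HDI is never wider than the equal-tailed interval, and for a skewed posterior like this one it is strictly shorter.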
Predictive Distributions
The posterior is not just for estimating $\theta$ - it is also for making predictions. Suppose you want to predict a new observation $x_\text{new}$ given data. The Bayesian approach is:
$$P(x_\text{new} \mid \text{data}) = \int P(x_\text{new} \mid \theta) P(\theta \mid \text{data}) d\theta.$$
This averages predictions over all possible values of $\theta$, weighted by how probable each $\theta$ is given the data. This is the posterior predictive distribution. It is more honest than plugging in a single estimate of $\theta$, because it accounts for parameter uncertainty.
With the coin example: to predict the outcome of the next flip, you integrate over all possible values of $p$ weighted by $\text{Beta}(k+1, n-k+1)$. The result: $P(\text{heads} \mid \text{data}) = (k+1)/(n+2)$, the posterior mean. Prediction and posterior mean coincide for Bernoulli outcomes - a pleasing consistency.
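The averaging can also be done by simulation: draw $p$ from the posterior, flip a coin with that $p$, and repeat. The sketch below is a Monte Carlo check of the closed-form result, not a required step; the function name, number of draws, and seed are arbitrary assumptions.

```python
import random


def simulate_next_flip(k, n, draws=200_000, seed=0):
    """Estimate P(heads on the next flip | data) by sampling from the posterior."""
    rng = random.Random(seed)
    heads = 0
    for _ in range(draws):
        p = rng.betavariate(k + 1, n - k + 1)  # draw p from the Beta posterior
        heads += rng.random() < p              # flip one coin with that p
    return heads / draws


estimate = simulate_next_flip(k=3, n=5)
exact = (3 + 1) / (5 + 2)
print(f"simulated={estimate:.4f}  exact={exact:.4f}")
```

The simulated frequency should land close to $(k+1)/(n+2)$, with the gap shrinking as the number of draws grows.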
“But Isn’t the Prior Subjective?”
Discomfort check. Yes. Different priors give different posteriors, particularly with small datasets. This is the central objection to Bayesian methods. Here is how to think about it.
First: with enough data, the likelihood dominates and all reasonable priors give nearly the same posterior. The prior’s influence shrinks like $1/n$. If you have 10,000 observations and someone else has a mildly different prior, your posteriors will be essentially identical.
Second: the prior is at least explicit. Frequentist methods also make modeling assumptions - they assume a likelihood function, a test statistic, a significance threshold - they just do not make the prior visible. The Bayesian approach forces you to state your assumptions, which makes them auditable and debatable. An explicit, defensible prior is better than an implicit, hidden one.
Third: in machine learning, the prior corresponds to regularization. This is not a metaphor - it is exact.
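The first point above is easy to check numerically with the coin model. The sketch compares posterior means under a flat $\text{Beta}(1,1)$ prior and a hypothetical optimistic $\text{Beta}(8,2)$ prior as the sample size grows, holding the observed frequency at 60% heads. Both the second prior and the sample sizes are assumptions chosen for illustration.

```python
def posterior_mean(alpha, beta, k, n):
    """Posterior mean of p under a Beta(alpha, beta) prior after k heads in n flips."""
    return (alpha + k) / (alpha + beta + n)


gaps = []
for n in (10, 100, 10_000):
    k = (6 * n) // 10  # 60% heads at every sample size
    flat = posterior_mean(1, 1, k, n)
    optimistic = posterior_mean(8, 2, k, n)  # hypothetical prior favoring heads
    gaps.append(abs(flat - optimistic))
    print(f"n={n:>6}: flat={flat:.4f}  optimistic={optimistic:.4f}  gap={gaps[-1]:.4f}")
```

The gap between the two analysts shrinks by roughly a factor of the sample size, so at 10,000 observations the disagreement is in the fourth decimal place.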
Bayesian Inference and Regularization
The connection between Bayesian MAP estimation and regularization is one of the most useful facts in machine learning.
Suppose you have a model with weights $\mathbf{w}$. You observe data and want to maximize the posterior:
$$\mathbf{w}_\text{MAP} = \arg\max_{\mathbf{w}} P(\mathbf{w} \mid \text{data}) = \arg\max_{\mathbf{w}} \left[ \log P(\text{data} \mid \mathbf{w}) + \log P(\mathbf{w}) \right].$$
Gaussian prior $P(\mathbf{w}) \propto \exp(-\|\mathbf{w}\|_2^2 / (2\sigma^2))$: the log prior is $-\|\mathbf{w}\|_2^2/(2\sigma^2)$ plus a constant. Adding this to the log likelihood gives you L2 regularization (ridge regression, weight decay). If the penalty is written as $\lambda \|\mathbf{w}\|_2^2$ subtracted from the log likelihood, the regularization coefficient is $\lambda = 1/(2\sigma^2)$.
Laplace prior $P(\mathbf{w}) \propto \exp(-\|\mathbf{w}\|_1/b)$: the log prior is $-\|\mathbf{w}\|_1/b$ plus a constant. Adding this gives you L1 regularization (lasso), which promotes sparsity.
Regularization, usually introduced as a computational trick to prevent overfitting, is Bayesian MAP estimation in disguise. The strength of regularization encodes the strength of the prior. Tuning the regularization coefficient via cross-validation is, in Bayesian terms, empirically estimating the prior hyperparameters.
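The Gaussian-prior case can be checked on a one-weight linear model. The sketch assumes $y = wx + \varepsilon$ with unit-variance Gaussian noise and a $\mathcal{N}(0, \sigma^2)$ prior on $w$; the data values are made up, and the helper names are hypothetical. It minimizes the negative log posterior numerically and compares the result with the closed-form ridge solution.

```python
def neg_log_posterior(w, xs, ys, sigma):
    """Negative log posterior (up to a constant) for y = w*x + unit-variance noise."""
    squared_error = sum((y - w * x) ** 2 for x, y in zip(xs, ys))
    return 0.5 * squared_error + w**2 / (2 * sigma**2)  # NLL + Gaussian-prior penalty


def minimize_1d(f, lo=-10.0, hi=10.0, iters=200):
    """Ternary search; valid here because the objective is convex in w."""
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if f(m1) < f(m2):
            hi = m2
        else:
            lo = m1
    return (lo + hi) / 2


# Made-up data for illustration only.
xs = [0.5, 1.0, 1.5, 2.0]
ys = [1.1, 1.9, 3.2, 3.9]
sigma = 1.0  # prior standard deviation, so lambda = 1 / (2 * sigma**2) = 0.5

sxy = sum(x * y for x, y in zip(xs, ys))
sxx = sum(x * x for x in xs)

w_mle = sxy / sxx                      # no prior, no penalty
w_ridge = sxy / (sxx + 1 / sigma**2)   # closed-form ridge solution
w_map = minimize_1d(lambda w: neg_log_posterior(w, xs, ys, sigma))

print(f"MLE={w_mle:.4f}  ridge={w_ridge:.4f}  MAP={w_map:.4f}")
```

The numerically found MAP weight matches the ridge solution, and both are pulled toward zero relative to the MLE. A smaller `sigma` (a tighter prior) corresponds to a larger penalty and more shrinkage.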
When Does the Philosophy Matter?
In many practical settings, Bayesian and frequentist methods give the same point estimates (MAP = MLE with uniform prior). The differences become important in specific situations.
Small data. With few observations, the prior exerts substantial influence on the posterior. If your prior is well-chosen, this is a feature (you are incorporating domain knowledge). If poorly chosen, it is a liability.
Sequential updating. Bayesian inference is naturally sequential: today’s posterior becomes tomorrow’s prior. You can update beliefs as data arrives without rerunning the entire analysis. Frequentist methods require more careful handling of sequential experiments (see optional stopping rules and their pitfalls).
Nuisance parameters. If $\theta = (\psi, \lambda)$ and you only care about $\psi$ while $\lambda$ is a nuisance, the Bayesian approach marginalizes over $\lambda$: $P(\psi \mid \text{data}) = \int P(\psi, \lambda \mid \text{data}) d\lambda$. Frequentist approaches to nuisance parameters (profile likelihood, conditioning) can be more ad hoc.
Interval interpretation. Credible intervals mean what you want them to mean. Confidence intervals are a property of the procedure, not of the specific interval you computed. When the audience expects probability statements about parameters, credible intervals are more honest.
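The sequential-updating point can be illustrated with the coin model: updating one flip at a time, with each posterior reused as the next prior, ends at the same posterior as processing all flips at once. This is a minimal sketch assuming the Beta-Bernoulli model from the worked example; the `update` helper and the flip sequence are hypothetical.

```python
def update(prior, flips):
    """Return Beta posterior parameters after observing flips (1 = heads, 0 = tails)."""
    alpha, beta = prior
    heads = sum(flips)
    return alpha + heads, beta + len(flips) - heads


flips = [1, 0, 1, 1, 0]  # 3 heads in 5 flips, in some arbitrary order

batch = update((1, 1), flips)  # all data at once

sequential = (1, 1)
for flip in flips:
    sequential = update(sequential, [flip])  # today's posterior is tomorrow's prior

print("batch:     ", batch)
print("sequential:", sequential)
```

The order of the flips does not matter either, because the update depends only on the counts of heads and tails.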
Summary
| Concept | Definition |
|---|---|
| Prior $P(\theta)$ | Belief about $\theta$ before seeing data |
| Likelihood $P(\text{data}\mid\theta)$ | Probability of the observed data if the parameter were $\theta$ |
| Posterior $P(\theta\mid\text{data})$ | Updated belief after data, $\propto$ likelihood $\times$ prior |
| MAP estimate | Mode of the posterior |
| Posterior mean | $\mathbb{E}[\theta\mid\text{data}]$, minimizes expected squared error |
| Credible interval | $P(a \leq \theta \leq b \mid \text{data}) = 0.95$ |
| Posterior predictive | $\int P(x_\text{new}\mid\theta)P(\theta\mid\text{data})d\theta$ |
| MAP with Gaussian prior | Equivalent to L2 regularization |
| MAP with Laplace prior | Equivalent to L1 regularization |
Bayesian inference is not a replacement for frequentist methods - it is a different answer to a different question. The frequentist asks about the long-run behavior of a procedure. The Bayesian asks about what to believe, right now, given the evidence in hand.
Both matter. Both appear throughout statistics and machine learning. Knowing which question you are answering - and which framework answers it - is what separates a careful analyst from a confused one.