Confidence Intervals
A confidence interval converts a point estimate into a range that accounts for sampling variability. Understanding what it does - and does not - guarantee is essential for interpreting any empirical result correctly.
Formal Definition
Let $X_1, \ldots, X_n$ be i.i.d. from a distribution indexed by parameter $\theta$. A $(1-\alpha)$ confidence interval is a pair of statistics $\hat{L} = \hat{L}(\mathbf{X})$ and $\hat{U} = \hat{U}(\mathbf{X})$ satisfying
$$P(\hat{L} \leq \theta \leq \hat{U}) = 1 - \alpha \quad \text{for all } \theta$$
The interval $[\hat{L}, \hat{U}]$ is random - it varies from sample to sample - while $\theta$ is a fixed (unknown) constant. The correct frequentist interpretation is: if we repeated the experiment many times, $(1-\alpha)$ of the resulting intervals would contain $\theta$.
It is incorrect to say “there is a $(1-\alpha)$ probability that $\theta$ lies in this particular interval.” Once the data are observed, the interval either contains $\theta$ or it does not; there is no probability involved for a fixed realization.
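The repeated-sampling interpretation is easy to check by simulation. The sketch below (illustrative values: $\mu = 5$, $\sigma = 2$, $n = 30$) builds a 95% $z$-interval from each of many samples and counts how often the interval covers the true mean:

```python
import random
import math

# Simulate the repeated-sampling interpretation: draw many samples from
# N(mu, sigma^2), build a 95% z-interval from each, and count how often
# the interval covers the true mean mu.
random.seed(0)
mu, sigma, n, z = 5.0, 2.0, 30, 1.96  # z = z_{0.025} for a 95% interval

trials = 10_000
covered = 0
for _ in range(trials):
    sample = [random.gauss(mu, sigma) for _ in range(n)]
    xbar = sum(sample) / n
    half_width = z * sigma / math.sqrt(n)
    if xbar - half_width <= mu <= xbar + half_width:
        covered += 1

print(covered / trials)  # close to 0.95, as the definition promises
```

Each individual interval either covers $\mu$ or it does not; the 0.95 describes the long-run frequency across repetitions, not any single realized interval.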
The Pivotal Quantity Method
A pivotal quantity $Q = Q(\mathbf{X}, \theta)$ is a function of data and the parameter whose distribution does not depend on $\theta$. Given a pivot with quantiles $q_{\alpha/2}$ and $q_{1-\alpha/2}$:
$$P(q_{\alpha/2} \leq Q(\mathbf{X}, \theta) \leq q_{1-\alpha/2}) = 1 - \alpha$$
Inverting the inequalities in $\theta$ yields the confidence interval. This inversion step is the mechanical core of CI construction.
CI for the Mean
Known Variance
Let $X_1, \ldots, X_n \overset{iid}{\sim} N(\mu, \sigma^2)$ with $\sigma^2$ known. The pivot is
$$Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \sim N(0,1)$$
Solving $P(-z_{\alpha/2} \leq Z \leq z_{\alpha/2}) = 1-\alpha$ for $\mu$ gives
$$\bar{X} \pm z_{\alpha/2} \frac{\sigma}{\sqrt{n}}$$
The margin of error $z_{\alpha/2}\,\sigma/\sqrt{n}$ shrinks at rate $1/\sqrt{n}$: to halve the margin of error, one must quadruple the sample size.
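The $1/\sqrt{n}$ rate is worth internalizing; a two-line check with illustrative values ($\sigma = 2$, 95% level) makes it concrete:

```python
import math

# Margin of error of the known-sigma z-interval; quadrupling n halves it.
def margin(n, sigma=2.0, z=1.96):
    return z * sigma / math.sqrt(n)

m100 = margin(100)  # 1.96 * 2 / 10  = 0.392
m400 = margin(400)  # 1.96 * 2 / 20  = 0.196, exactly half of m100
print(m100, m400)
```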
Unknown Variance
When $\sigma^2$ is unknown, the pivot becomes
$$T = \frac{\bar{X} - \mu}{S/\sqrt{n}} \sim t_{n-1}$$
and the interval is $\bar{X} \pm t_{n-1,\alpha/2} \cdot S/\sqrt{n}$, where $t_{n-1,\alpha/2}$ is the upper $\alpha/2$ quantile of the $t$-distribution with $n-1$ degrees of freedom. Since $t_{n-1}$ has heavier tails than $N(0,1)$, this interval is wider, correctly reflecting the extra uncertainty from estimating $\sigma$.
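A minimal sketch of the $t$-interval, assuming NumPy and SciPy are available; the simulated data are illustrative:

```python
import numpy as np
from scipy import stats

# t-interval for the mean when sigma is unknown; data are illustrative.
rng = np.random.default_rng(0)
x = rng.normal(loc=10.0, scale=3.0, size=25)

n = len(x)
xbar = x.mean()
s = x.std(ddof=1)                      # S uses the n - 1 denominator
t_crit = stats.t.ppf(0.975, df=n - 1)  # upper 0.025 quantile of t_{24}
half_width = t_crit * s / np.sqrt(n)

ci = (xbar - half_width, xbar + half_width)
print(ci)
```

Note that `t_crit` exceeds the normal quantile 1.96, which is exactly the "wider interval" the heavier tails imply.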
CI for a Proportion
For $X \sim \text{Binomial}(n, p)$, the natural estimator is $\hat{p} = X/n$. By the CLT, the Wald interval is
$$\hat{p} \pm z_{\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$$
The Wald interval has poor coverage when $p$ is near 0 or 1 or $n$ is small. The Wilson interval, obtained by inverting the score test, achieves substantially better coverage:
$$\frac{\hat{p} + \frac{z^2}{2n}}{1 + \frac{z^2}{n}} \pm \frac{z}{1 + \frac{z^2}{n}}\sqrt{\frac{\hat{p}(1-\hat{p})}{n} + \frac{z^2}{4n^2}}$$
where $z = z_{\alpha/2}$.
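Both formulas are short enough to implement directly. The sketch below compares them at an extreme case (illustrative values $\hat{p} = 0.05$, $n = 20$), where the Wald interval's poor behavior is visible: its lower endpoint goes negative, while the Wilson interval stays inside $[0, 1]$:

```python
import math

# Wald and Wilson 95% intervals for a binomial proportion.
def wald(p_hat, n, z=1.96):
    half = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half, p_hat + half

def wilson(p_hat, n, z=1.96):
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(
        p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)
    )
    return center - half, center + half

# Near p = 0 with small n, Wald spills outside [0, 1]; Wilson does not.
print(wald(0.05, 20))    # lower endpoint is negative
print(wilson(0.05, 20))  # both endpoints lie in [0, 1]
```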
Relationship to Hypothesis Tests
There is a fundamental duality between confidence intervals and hypothesis tests:
Theorem. The $(1-\alpha)$ confidence interval for $\theta$ is exactly the set of parameter values that would not be rejected by a level-$\alpha$ test given the observed data:
$$[\hat{L}, \hat{U}] = \{\theta_0 : \text{the test of } H_0\colon \theta = \theta_0 \text{ fails to reject at level } \alpha\}$$
Equivalently, $H_0: \theta = \theta_0$ is rejected at level $\alpha$ if and only if $\theta_0 \notin [\hat{L}, \hat{U}]$.
This duality means a CI communicates everything a hypothesis test does, and more: it shows not just whether an effect is detectable but how large it is.
Bootstrap Confidence Intervals
When the sampling distribution of an estimator $\hat{\theta}$ is analytically intractable, the bootstrap provides a nonparametric alternative.
Algorithm (Percentile Bootstrap):
- Draw $B$ bootstrap samples $\mathbf{X}^{*(b)}$ by sampling $n$ observations with replacement from the original data.
- Compute $\hat{\theta}^{*(b)}$ on each bootstrap sample.
- The $(1-\alpha)$ CI is $[\hat{\theta}^{(\alpha/2)}, \hat{\theta}^{(1-\alpha/2)}]$, i.e., the $\alpha/2$ and $1-\alpha/2$ empirical quantiles of $\{\hat{\theta}^{*(b)}\}$.
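The steps above can be sketched in a few lines of standard-library Python; the exponential data and the choice of the median as the statistic are illustrative:

```python
import random
import statistics

# Percentile bootstrap CI for the median, an estimator whose sampling
# distribution has no convenient closed form. Data are illustrative.
random.seed(0)
data = [random.expovariate(1.0) for _ in range(200)]  # a skewed sample

B, alpha = 2000, 0.05
boot_stats = []
for _ in range(B):
    resample = random.choices(data, k=len(data))  # sample with replacement
    boot_stats.append(statistics.median(resample))

boot_stats.sort()
lo = boot_stats[int((alpha / 2) * B)]        # empirical 2.5% quantile
hi = boot_stats[int((1 - alpha / 2) * B) - 1]  # empirical 97.5% quantile
print((lo, hi))
```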
The BCa (bias-corrected and accelerated) bootstrap adjusts for bias and skewness in the sampling distribution, achieving better second-order coverage accuracy. The percentile bootstrap can be unreliable for heavily skewed estimators or small samples.
Coverage Probability
The coverage probability of a CI procedure is
$$C(\theta) = P_\theta(\hat{L} \leq \theta \leq \hat{U})$$
A procedure is exact if $C(\theta) = 1-\alpha$ for all $\theta$, and conservative if $C(\theta) \geq 1-\alpha$ for all $\theta$. Many textbook intervals are only approximate, with coverage converging to $1-\alpha$ as $n \to \infty$ by the CLT.
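Actual coverage can be estimated by simulation. The sketch below (illustrative values $p = 0.05$, $n = 30$) shows the nominal-95% Wald interval falling well short, making concrete the warning from the proportion section:

```python
import math
import random

# Estimate C(p) for the nominal-95% Wald interval by simulation.
# Near p = 0.05 with n = 30 the actual coverage is far below 0.95
# (e.g., when X = 0 the interval degenerates to [0, 0] and misses p).
random.seed(0)
p_true, n, z, trials = 0.05, 30, 1.96, 20_000

covered = 0
for _ in range(trials):
    x = sum(random.random() < p_true for _ in range(n))
    p_hat = x / n
    half = z * math.sqrt(p_hat * (1 - p_hat) / n)
    if p_hat - half <= p_true <= p_hat + half:
        covered += 1

print(covered / trials)  # noticeably below the nominal 0.95
```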
Bayesian Credible Intervals vs Frequentist CIs
A Bayesian credible interval $[a, b]$ satisfies
$$P(\theta \in [a,b] \mid \mathbf{X}) = 1 - \alpha$$
where the probability is over the posterior distribution of $\theta$ given data. The highest posterior density (HPD) interval is the shortest such interval.
The interpretation difference is fundamental. The Bayesian credible interval does allow the statement “the probability that $\theta$ lies in $[a,b]$ is $1-\alpha$,” but this probability is subjective - it depends on the prior. The frequentist CI makes no probability statement about $\theta$ conditional on observed data; the randomness is over hypothetical repeated samples.
When the prior is uninformative, Bayesian and frequentist intervals often coincide numerically, but they carry different epistemological meanings.
Examples
Polling Margin of Error. A poll of $n = 1000$ voters finds 52% support for a candidate. The 95% Wilson CI is approximately $(48.9\%, 55.1\%)$. The headline "52% ± 3.1%" is the Wald margin of error $z_{0.025}\sqrt{\hat{p}(1-\hat{p})/n} \approx 0.031$.
ML Model Performance CIs. After testing a classifier on a held-out set of $n = 500$ examples with accuracy $\hat{p} = 0.876$, a 95% CI is approximately $(0.848, 0.904)$. Comparing two models by whether their CIs overlap is conservative; the correct approach is to form a CI for the difference in accuracies, or use the duality with a two-sample proportion test directly.
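A minimal sketch of the recommended approach, a Wald-style interval for the difference in accuracies, assuming the two models are evaluated on independent test sets (with a shared test set, a paired method such as McNemar's test is more appropriate); the second model's accuracy of 0.842 is illustrative:

```python
import math

# 95% Wald-style CI for the difference in accuracies of two classifiers
# evaluated on independent test sets. The counts below are illustrative.
def diff_ci(p1, n1, p2, n2, z=1.96):
    d = p1 - p2
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    return d - z * se, d + z * se

lo, hi = diff_ci(0.876, 500, 0.842, 500)
print((lo, hi))  # if the interval excludes 0, the difference is
                 # significant at the 5% level, by the test/CI duality
```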