Confidence Intervals - The Estimate and Its Honest Margin of Error // Megha Bose

A poll says 52% of voters support the candidate, with a ±3% margin of error. What does that mean?

Most people read it as: “There is a 95% chance the true support is between 49% and 55%.” That interpretation feels natural, it feels like what you want to know, and it is wrong.

Understanding why it is wrong - and what the correct interpretation actually says - is one of the sharper conceptual distinctions in statistics. It matters not just for polls but for every confidence interval in science, medicine, and machine learning.

Building the Interval from the CLT

Let us construct a confidence interval from scratch. You have a random sample $x_1, \ldots, x_n$ from a population with true mean $\mu$ and known standard deviation $\sigma$. The sample mean $\bar{x}$ is your estimator.

By the Central Limit Theorem, for large $n$:

$$\bar{x} \approx \mathcal{N}\left(\mu, \frac{\sigma^2}{n}\right).$$

Standardize: $Z = \frac{\bar{x} - \mu}{\sigma/\sqrt{n}} \approx \mathcal{N}(0,1)$.

Now find the 95% central region of the standard normal. The critical value $z_{0.025} = 1.96$ satisfies $P(-1.96 \leq Z \leq 1.96) = 0.95$. Substituting back:

$$P\left(-1.96 \leq \frac{\bar{x} - \mu}{\sigma/\sqrt{n}} \leq 1.96\right) = 0.95.$$

Rearrange the inequality to isolate $\mu$:

$$P\left(\bar{x} - \frac{1.96\sigma}{\sqrt{n}} \leq \mu \leq \bar{x} + \frac{1.96\sigma}{\sqrt{n}}\right) = 0.95.$$

This gives the 95% confidence interval:

$$\left[\bar{x} - \frac{1.96\sigma}{\sqrt{n}},\; \bar{x} + \frac{1.96\sigma}{\sqrt{n}}\right].$$
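As a minimal sketch of the formula above, the interval can be computed with the Python standard library alone. The function name `z_interval` and the sample values are hypothetical, and $\sigma$ is assumed known, exactly as in the derivation.

```python
import math
from statistics import NormalDist, mean

def z_interval(sample, sigma, level=0.95):
    """Known-sigma confidence interval for the mean (relies on the CLT)."""
    n = len(sample)
    x_bar = mean(sample)
    # Two-sided critical value; about 1.96 when level = 0.95.
    z = NormalDist().inv_cdf(0.5 + level / 2)
    half_width = z * sigma / math.sqrt(n)
    return x_bar - half_width, x_bar + half_width

# Hypothetical data: 25 measurements, with sigma assumed known to be 2.0.
sample = [10.0 + 0.1 * i for i in range(25)]
low, high = z_interval(sample, sigma=2.0)
print(f"95% CI: [{low:.3f}, {high:.3f}]")
```

The interval is centred on the sample mean, and its half-width depends only on $\sigma$, $n$, and the chosen level, not on the observed values themselves.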

The Correct Interpretation

Here is the statement that produced this interval: the probability is 0.95, before you collect data, that the random interval $[\bar{x} - 1.96\sigma/\sqrt{n}, \bar{x} + 1.96\sigma/\sqrt{n}]$ will contain $\mu$.

The randomness is in $\bar{x}$, not in $\mu$. The true mean $\mu$ is a fixed unknown constant. The interval is random because $\bar{x}$ varies from sample to sample.

The correct interpretation: If you repeated this experiment many times and computed a 95% CI each time, about 95% of those intervals would contain the true $\mu$.

The incorrect interpretation: There is a 95% probability that $\mu$ is in the specific interval $[49\%, 55\%]$.

The distinction: once you compute the specific interval $[49\%, 55\%]$, the true proportion is either in that interval or not. There is no randomness left. The 95% is a statement about the method - the procedure that generates intervals - not about this particular interval.
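The "repeated experiments" reading can be checked by simulation. The sketch below assumes normally distributed data with a known $\sigma$; the values of `mu`, `sigma`, `n`, and the number of trials are arbitrary choices for illustration. Each trial draws a fresh sample, builds the interval, and records whether it contains the fixed true mean.

```python
import math
import random

def covers(mu, sigma, n, rng):
    """Draw one sample, build the 95% z-interval, report whether it contains mu."""
    sample = [rng.gauss(mu, sigma) for _ in range(n)]
    x_bar = sum(sample) / n
    half_width = 1.96 * sigma / math.sqrt(n)
    return x_bar - half_width <= mu <= x_bar + half_width

rng = random.Random(0)  # fixed seed so the run is repeatable
trials = 10_000
hits = sum(covers(mu=5.0, sigma=2.0, n=30, rng=rng) for _ in range(trials))
coverage = hits / trials
print(f"fraction of intervals containing mu: {coverage:.3f}")
```

The printed fraction should land close to 0.95. Note that `mu` never changes inside the loop: only the interval moves from trial to trial.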

Discomfort check. This feels like philosophical hairsplitting. In practice, many researchers say “95% confidence interval” and mean “I am 95% confident the true value is in here,” which is the Bayesian credible interval interpretation. The frequentist CI does not support this interpretation. In many settings the numerical results are similar enough that it does not matter. But in settings with strong prior information or very small samples, the two can differ substantially. And in principle, the correct interpretation of what you computed matters.

Why the Distinction Matters

The Bayesian credible interval has the natural interpretation: $P(a \leq \mu \leq b \mid \text{data}) = 0.95$. This is a probability statement about $\mu$ given the observed data, which requires a prior.

The frequentist confidence interval has the procedural interpretation: the procedure produces correct intervals 95% of the time.

Both are useful. Neither is “better” - they answer different questions. The frequentist confidence interval requires no prior; the Bayesian credible interval requires a prior but gives you the probability statement you intuitively want.
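To see how close the two can be in a benign setting, here is a rough sketch comparing a frequentist interval with a credible interval for a proportion. The poll counts are hypothetical, and the uniform Beta(1, 1) prior is an assumption made purely for illustration. With that prior the posterior is Beta(1 + k, 1 + n - k), and its quantiles are approximated by sampling.

```python
import math
import random

k, n = 520, 1000  # hypothetical poll: 520 of 1000 respondents in favour

# Frequentist normal-approximation interval for the proportion.
p_hat = k / n
half_width = 1.96 * math.sqrt(p_hat * (1 - p_hat) / n)
frequentist = (p_hat - half_width, p_hat + half_width)

# Bayesian credible interval under an ASSUMED uniform Beta(1, 1) prior.
# Posterior: Beta(1 + k, 1 + n - k); quantiles estimated by Monte Carlo.
rng = random.Random(0)
draws = sorted(rng.betavariate(1 + k, 1 + n - k) for _ in range(20_000))
credible = (draws[int(0.025 * len(draws))], draws[int(0.975 * len(draws))])

print(f"confidence interval: [{frequentist[0]:.3f}, {frequentist[1]:.3f}]")
print(f"credible interval:   [{credible[0]:.3f}, {credible[1]:.3f}]")
```

With a flat prior and a large sample the two intervals nearly coincide. A strongly informative prior or a much smaller sample would pull them apart.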

When you report a CI, be clear about which one you computed and what claim it supports. Most published CIs are frequentist, but most readers interpret them as Bayesian - a pervasive mismatch.

CI for a Proportion

The drug company wants to know what fraction $p$ of patients respond to treatment. You observe $k$ responses in $n$ trials, giving $\hat{p} = k/n$.

By the CLT, $\hat{p} \approx \mathcal{N}(p, p(1-p)/n)$. Since $p$ is unknown, substitute $\hat{p}$ for $p$ in the variance:

$$\hat{p} \pm 1.96\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}.$$

This is the Wald interval. The term $1.96\sqrt{\hat{p}(1-\hat{p})/n}$ is the margin of error - the ±3% in the poll example.
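A minimal sketch of the Wald computation follows. The counts are hypothetical, chosen so that $\hat{p}$ is about 52% and the margin comes out near 3%; the original poll's sample size is not given.

```python
import math

def wald_interval(k, n, z=1.96):
    """Return the estimate p_hat and the Wald margin of error for k successes in n trials."""
    p_hat = k / n
    margin = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat, margin

# Hypothetical poll: 555 of 1067 respondents support the candidate.
p_hat, margin = wald_interval(k=555, n=1067)
print(f"estimate {p_hat:.1%}, margin of error +/- {margin:.1%}")
```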

Several features of the margin of error are worth noting.

Widest at $\hat{p} = 0.5$. The quantity $\hat{p}(1-\hat{p})$ is maximized when $\hat{p} = 1/2$, giving margin of error $1.96/(2\sqrt{n})$. Polls often quote the “maximum margin of error” corresponding to $\hat{p} = 0.5$, which is a conservative bound valid regardless of the true proportion.

Narrowest near 0 or 1. If $\hat{p} = 0.02$, the standard error is small and the CI is narrow - but the normal approximation can also be poor. The Wald interval is unreliable near the boundary. Alternative intervals (Wilson, Clopper-Pearson) perform better in this regime.
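The boundary problem is easy to reproduce. The sketch below compares the Wald interval with the Wilson score interval for a hypothetical 1 response in 50 trials; the function names are illustrative.

```python
import math

def wald(k, n, z=1.96):
    """Wald interval: p_hat +/- z * sqrt(p_hat * (1 - p_hat) / n)."""
    p = k / n
    half = z * math.sqrt(p * (1 - p) / n)
    return p - half, p + half

def wilson(k, n, z=1.96):
    """Wilson score interval, which stays inside [0, 1]."""
    p = k / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half

# Hypothetical: 1 response in 50 trials, so p_hat = 0.02.
print("Wald:  ", wald(1, 50))
print("Wilson:", wilson(1, 50))
```

For these counts the Wald lower bound is negative, which is impossible for a proportion, while the Wilson interval stays within $[0, 1]$ and is not symmetric around $\hat{p}$.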

Depends on $n$, not on the population size. A poll of $n = 1000$ people has essentially the same margin of error whether the population is 100,000 or 100 million. This surprises many people, but it follows directly from the standard error formula: what matters is the number of independent observations, not the size of the population they came from, as long as the sample is a small fraction of that population.

t-intervals: Unknown Standard Deviation

In the construction above, we assumed $\sigma$ was known. In practice it almost never is. We replace $\sigma$ with $s = \sqrt{\frac{1}{n-1}\sum(x_i - \bar{x})^2}$:

$$T = \frac{\bar{x} - \mu}{s/\sqrt{n}} \sim t_{n-1}$$

when the data are normally distributed (see Hypothesis Testing & p-values - The Formal Language of Probably Not a Coincidence). The t-distribution has heavier tails than the normal, reflecting the extra uncertainty from estimating $\sigma$.

The 95% t-interval is:

$$\bar{x} \pm t_{n-1, 0.025} \cdot \frac{s}{\sqrt{n}},$$

where $t_{n-1, 0.025}$ is the 97.5th percentile of the $t_{n-1}$ distribution. For large $n$, $t_{n-1, 0.025} \approx 1.96$. For small samples, it is notably larger:

$n$	$t_{n-1, 0.025}$
5	2.776
10	2.262
30	2.045
60	2.001
$\infty$	1.960

With only 5 observations, you need to extend the interval to $\pm 2.776$ standard errors to achieve 95% coverage - substantially wider than the asymptotic $\pm 1.96$. The t-distribution is quantifying your uncertainty not just about $\mu$ but also about $\sigma$.
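A short sketch makes the widening concrete for $n = 5$. The data are hypothetical, and the critical value 2.776 is passed in by hand rather than computed, to keep the example free of external libraries.

```python
import math
from statistics import mean, stdev

def interval(sample, critical_value):
    """x_bar +/- critical_value * s / sqrt(n), with s using the n - 1 denominator."""
    n = len(sample)
    half_width = critical_value * stdev(sample) / math.sqrt(n)
    x_bar = mean(sample)
    return x_bar - half_width, x_bar + half_width

sample = [4.1, 5.3, 4.8, 5.9, 4.4]     # hypothetical measurements, n = 5
t_low, t_high = interval(sample, 2.776)  # t critical value for 4 degrees of freedom
z_low, z_high = interval(sample, 1.96)   # what the normal critical value would give

print(f"t-interval: [{t_low:.2f}, {t_high:.2f}]")
print(f"z-style:    [{z_low:.2f}, {z_high:.2f}]  (too narrow for n = 5)")
```

Using 1.96 with an estimated $s$ at this sample size would understate the uncertainty; the t critical value corrects for that.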

How Width Scales with Sample Size

The CI half-width is proportional to $\sigma/\sqrt{n}$ (or $s/\sqrt{n}$ in the t-case). This has a direct practical implication:

To halve the margin of error, you need four times as much data.

If $n = 100$ gives a margin of error of ±3%, then $n = 400$ gives ±1.5%, and $n = 1600$ gives ±0.75%.

This is the fundamental cost of precision. The $\sqrt{n}$ in the denominator comes directly from the CLT - averaging $n$ independent observations reduces variance by a factor of $n$, so standard deviation shrinks by $\sqrt{n}$.
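The scaling can be turned around to ask how much data a target margin requires. The sketch below does this for a proportion, assuming the worst case $p = 0.5$ by default; the function name `required_n` is illustrative.

```python
import math

def required_n(margin, p=0.5, z=1.96):
    """Smallest n whose normal-approximation margin of error is at most `margin`."""
    return math.ceil((z / margin) ** 2 * p * (1 - p))

for margin in (0.03, 0.015, 0.0075):
    print(f"margin +/- {margin:.2%}: n >= {required_n(margin)}")
```

Each halving of the target margin roughly quadruples the required sample size.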

The implication for experiment design: tightening the margin of error from ±10% to ±5% is not twice the cost, it is four times the cost. Unreasonable precision requirements are expensive.

Bootstrap Confidence Intervals

Both the z-interval and t-interval rely on assumptions: approximate normality of $\bar{x}$ (via CLT), or exact normality of the data (for small-sample t). What if your statistic is not a sample mean? What if you want a CI for a median, a correlation coefficient, a model accuracy metric, or some other complex quantity?

The bootstrap provides a distribution-free alternative.

Procedure:

You have a dataset of $n$ observations.
Draw $B$ bootstrap samples: each is a sample of size $n$ drawn with replacement from your original data.
Compute your statistic (mean, median, correlation, etc.) on each bootstrap sample. Call these $\hat{\theta}^{*}_1, \ldots, \hat{\theta}^{*}_B$.
The 95% CI is the interval from the 2.5th to 97.5th percentile of $\{\hat{\theta}^{*}_b\}$.

The key insight: the distribution of $\hat{\theta}^* - \hat{\theta}$ (where $\hat{\theta}$ is the estimate from your original data) approximates the distribution of $\hat{\theta} - \theta$ (the true sampling error). You use the bootstrap distribution as a proxy for the sampling distribution.

The bootstrap requires no normality assumption and no formula for the variance of your statistic. It works for a wide range of statistics, though it can fail for extremes such as the sample maximum. The main costs are computational ($B = 1000$ to $10000$ resamples is typical) and the requirement that your sample be a reasonable representation of the population (the bootstrap cannot fix a biased sample).
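The percentile procedure described above can be sketched in a few lines of standard-library Python. The data, the function name `bootstrap_ci`, and the choice of 2000 resamples are all illustrative assumptions; the statistic here is the median, for which no simple variance formula exists.

```python
import random
from statistics import median

def bootstrap_ci(data, statistic, n_resamples=2000, level=0.95, seed=0):
    """Percentile bootstrap CI: resample with replacement, take empirical quantiles."""
    rng = random.Random(seed)
    n = len(data)
    # Statistic recomputed on each bootstrap sample, sorted for quantile lookup.
    replicates = sorted(
        statistic(rng.choices(data, k=n)) for _ in range(n_resamples)
    )
    alpha = 1 - level
    lower_index = int((alpha / 2) * n_resamples)
    upper_index = int((1 - alpha / 2) * n_resamples) - 1
    return replicates[lower_index], replicates[upper_index]

# Hypothetical skewed data with two large outliers.
data = [2.1, 3.4, 1.9, 8.7, 2.8, 3.1, 2.5, 15.2, 2.2, 3.0, 2.7, 4.1]
low, high = bootstrap_ci(data, median)
print(f"sample median {median(data):.2f}, 95% bootstrap CI [{low:.2f}, {high:.2f}]")
```

Swapping `median` for any other function of a list gives a CI for that statistic with no change to the resampling logic. This simple indexing scheme is one of several reasonable ways to pick empirical quantiles.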

Discomfort check. A narrower confidence interval is not always better. If you misspecify the model - assume normality when data is heavy-tailed, assume independent observations when they are correlated - the CI can be far too narrow, giving false precision. The CI is only as reliable as the assumptions underlying it. A bootstrap CI is more robust to distributional assumptions but still requires that the sample be representative and observations be independent.

CI and Hypothesis Testing: Two Views of the Same Thing

Confidence intervals and hypothesis tests are deeply connected. A 95% t-interval for $\mu$ contains exactly those values of $\mu_0$ that would not be rejected at significance level $\alpha = 0.05$ by a two-sided t-test.

This means:

If the CI does not contain zero, the two-sided test at $\alpha = 0.05$ would reject $H_0: \mu = 0$.
If the CI contains zero, the test would fail to reject.

The CI gives strictly more information than the test: instead of a binary reject/fail-to-reject decision, it shows you the entire range of parameter values consistent with the data. Reporting a CI is generally preferred to reporting only a p-value because it communicates effect size and uncertainty simultaneously.
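The equivalence can be verified numerically. In this sketch the sample is hypothetical and the critical value 2.776 (for $n = 5$, so 4 degrees of freedom) is supplied by hand. For each candidate $\mu_0$, it checks whether the value lies in the interval and whether the two-sided t-test rejects it.

```python
import math
from statistics import mean, stdev

T_CRIT = 2.776  # two-sided 95% critical value for 4 degrees of freedom

def standard_error(sample):
    return stdev(sample) / math.sqrt(len(sample))

def t_statistic(sample, mu0):
    return (mean(sample) - mu0) / standard_error(sample)

sample = [0.8, 1.9, -0.3, 1.2, 2.4]  # hypothetical data, n = 5
half_width = T_CRIT * standard_error(sample)
low, high = mean(sample) - half_width, mean(sample) + half_width

results = []
for mu0 in (0.0, 0.5, 1.0, 2.0, 3.0):
    in_ci = low <= mu0 <= high
    rejected = abs(t_statistic(sample, mu0)) > T_CRIT
    results.append((mu0, in_ci, rejected))
    print(f"mu0 = {mu0}: inside CI = {in_ci}, test rejects = {rejected}")
```

In every row the two columns are opposites: a value is inside the interval exactly when the test fails to reject it.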

Connection to Machine Learning

Confidence intervals appear throughout ML evaluation, often without being labeled as such.

A/B testing. After running an A/B test on a product change, you compute a CI for the difference in conversion rates. The CI tells you the range of improvements (or regressions) consistent with the data - more informative than a bare p-value.
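As a rough sketch, a CI for a difference in conversion rates can use the same normal approximation, adding the two variances. The visitor and conversion counts below are hypothetical, and the two groups are assumed independent.

```python
import math

def diff_in_proportions_ci(k_a, n_a, k_b, n_b, z=1.96):
    """Normal-approximation CI for p_b - p_a, assuming independent groups."""
    p_a, p_b = k_a / n_a, k_b / n_b
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    diff = p_b - p_a
    return diff - z * se, diff + z * se

# Hypothetical test: control converts 200 of 2000, variant converts 240 of 2000.
low, high = diff_in_proportions_ci(k_a=200, n_a=2000, k_b=240, n_b=2000)
print(f"lift: [{low:+.2%}, {high:+.2%}]")
```

Here the interval excludes zero only barely, which a bare "significant" verdict would hide: the data are consistent with anything from a negligible lift to one of nearly four percentage points.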

Model evaluation. Your model achieves 87% accuracy on a test set of $n = 500$ examples. The CI for accuracy is $\hat{p} \pm 1.96\sqrt{\hat{p}(1-\hat{p})/n} \approx 87\% \pm 3\%$. That ±3% is not negligible: a different test set of the same size might give 84% or 90%.

Hyperparameter sensitivity. Bootstrap CIs for evaluation metrics can reveal whether one hyperparameter setting is significantly better than another, or whether the difference is within noise.

Paired evaluation. When comparing two models on the same test set, the correlated structure of errors means a paired t-test (and its CI) is more powerful than treating the two sets of errors as independent.
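A small sketch shows where the extra power comes from. The per-example correctness vectors are hypothetical; the point is only to compare the standard error of the paired differences with the standard error obtained by wrongly treating the two models' results as independent.

```python
import math
from statistics import mean, stdev

# Hypothetical per-example correctness (1 = correct) on the SAME 10 test items.
model_a = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]
model_b = [1, 1, 1, 1, 0, 1, 1, 1, 1, 1]

n = len(model_a)
diffs = [b - a for a, b in zip(model_a, model_b)]

# Paired: standard error of the mean per-example difference.
se_paired = stdev(diffs) / math.sqrt(n)
# Unpaired: pretends the two accuracy estimates are independent.
se_unpaired = math.sqrt(stdev(model_a) ** 2 / n + stdev(model_b) ** 2 / n)

print(f"accuracy difference: {mean(diffs):.2f}")
print(f"paired SE: {se_paired:.3f}, unpaired SE: {se_unpaired:.3f}")
```

Because both models tend to fail on the same hard examples, the differences vary less than the individual scores, so the paired standard error is smaller and the resulting CI is tighter.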

Summary

Concept	Definition
95% CI	Interval produced by a procedure that covers $\mu$ 95% of the time
Correct interpretation	95% of such intervals (over repeated experiments) contain $\mu$
Incorrect interpretation	95% probability that $\mu$ is in this specific interval
Margin of error for proportion	$1.96\sqrt{\hat{p}(1-\hat{p})/n}$, widest at $\hat{p} = 0.5$
t-interval	Replace $z_{0.025} = 1.96$ with $t_{n-1, 0.025}$; wider for small $n$
Width scales as	$1/\sqrt{n}$: halving the width requires $4\times$ the data
Bootstrap CI	Resample with replacement; use empirical quantiles; no normality assumption, but needs a representative sample
CI vs. hypothesis test	CI contains all $\mu_0$ not rejected at level $\alpha$

The confidence interval is one of statistics' most useful and most misunderstood tools. Its correct interpretation - a statement about the procedure, not about this specific interval - is awkward precisely because it refuses to say the thing you want to say: “I am 95% sure $\mu$ is in here.” That statement requires a prior and produces a credible interval, not a confidence interval.

Both tools exist. Both are useful. Use the one whose assumptions and interpretation match your question.
