You have collected 1000 measurements of some physical quantity - say, the boiling point of a compound in your lab. You compute an average: 98.3 degrees. But how confident are you that the true boiling point is near 98.3? What if you had only taken 10 measurements? Would you be equally confident? And when you say “the true boiling point,” what exactly do you mean - is there a single true value, and if so, can you ever know it?

These are the questions statistics exists to answer. Not just “what does the data say?” but “how much should we trust what the data says, and why?”


The Population-Sample Gap

Statistics lives in the tension between what we have and what we want to know.

The population is the complete set of all units we care about - every patient with a certain condition, every atom in a sample, every outcome of a repeated experiment. The population distribution is the true underlying distribution over this entire population, described by some parameter $\theta$ (possibly a vector: a mean, a variance, a set of regression coefficients).

The sample is what we actually observe: $X_1, X_2, \ldots, X_n$, a collection of $n$ measurements drawn from the population. We assume they are independent and identically distributed (i.i.d.), each following the population distribution.

The inferential gap: we want to say something about $\theta$ - a property of the entire population - but we only have the sample. The sample is finite, noisy, and never perfectly representative. Everything in statistics is about managing this gap honestly.

To navigate it, we distinguish two kinds of quantities:

  • A parameter is a fixed (unknown) feature of the population: the true mean $\mu$, the true variance $\sigma^2$, the true proportion $p$. It does not change - we just do not know it.
  • A statistic is any function of the data: $\bar{X}$, $s^2$, the sample maximum. It varies across different samples drawn from the same population.

When we use a statistic to estimate a parameter, we call it an estimator. The sample mean $\bar{X} = \frac{1}{n}\sum_{i=1}^n X_i$ is the natural estimator of the population mean $\mu$.


What Makes a Good Estimator?

An estimator $\hat{\theta} = T(X_1, \ldots, X_n)$ is itself a random variable - it takes different values for different samples. How do we judge it?

Bias

The bias of an estimator is how far its expected value is from the true parameter:

$$\text{Bias}(\hat{\theta}) = E[\hat{\theta}] - \theta.$$

An estimator is unbiased if $E[\hat{\theta}] = \theta$ for all possible values of $\theta$. Unbiasedness means that if you repeated the experiment infinitely many times and averaged all your estimates, you would land on the truth. It does not mean any single estimate is correct.

Variance

The variance of an estimator measures how spread out it is across different samples:

$$\text{Var}(\hat{\theta}) = E\left[(\hat{\theta} - E[\hat{\theta}])^2\right].$$

Low variance means the estimator gives consistent answers across different samples, even if those answers might be systematically off-target.

Mean Squared Error and the Bias-Variance Tradeoff

Neither bias nor variance alone determines an estimator’s quality. The mean squared error combines both:

$$\text{MSE}(\hat{\theta}) = E[(\hat{\theta} - \theta)^2] = \text{Bias}(\hat{\theta})^2 + \text{Var}(\hat{\theta}).$$

To see why this decomposition holds, let $b = \text{Bias}(\hat{\theta})$ and write:

$$E[(\hat{\theta} - \theta)^2] = E[(\hat{\theta} - E[\hat{\theta}] + b)^2] = E[(\hat{\theta} - E[\hat{\theta}])^2] + 2b \cdot \underbrace{E[\hat{\theta} - E[\hat{\theta}]]}_{=0} + b^2 = \text{Var}(\hat{\theta}) + b^2.$$

The bias-variance tradeoff: sometimes a slightly biased estimator with much lower variance beats an unbiased estimator with high variance, in terms of total MSE. This is the statistical version of the machine learning bias-variance tradeoff you encounter in regularization and model complexity.
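
The decomposition and the tradeoff can both be checked by simulation. The sketch below is illustrative rather than definitive: it assumes normally distributed data with made-up values of `mu`, `sigma`, and `n`, and it compares the sample mean with a hypothetical shrunken estimator `0.8 * xbar` that does not appear in the discussion above. The helper name `mse_decomposition` is likewise an assumption of this example.

```python
import random
import statistics

def mse_decomposition(estimates, theta):
    """Return (mse, bias, variance) computed from a list of simulated estimates."""
    mean_est = statistics.fmean(estimates)
    bias = mean_est - theta
    # Divide by N (not N - 1) so that mse == bias**2 + variance holds exactly.
    variance = statistics.pvariance(estimates, mu=mean_est)
    mse = statistics.fmean([(t - theta) ** 2 for t in estimates])
    return mse, bias, variance

rng = random.Random(0)
mu, sigma, n, reps = 2.0, 3.0, 10, 20_000  # illustrative values, not from the text

plain, shrunk = [], []
for _ in range(reps):
    sample = [rng.gauss(mu, sigma) for _ in range(n)]
    xbar = statistics.fmean(sample)
    plain.append(xbar)         # unbiased sample mean
    shrunk.append(0.8 * xbar)  # hypothetical biased estimator with lower variance

for name, estimates in [("sample mean", plain), ("0.8 * sample mean", shrunk)]:
    mse, bias, var = mse_decomposition(estimates, mu)
    print(f"{name}: bias={bias:.3f} var={var:.3f} "
          f"mse={mse:.3f} bias^2+var={bias ** 2 + var:.3f}")
```

With these particular settings the shrunken estimator is biased toward zero yet should end up with the smaller MSE, because its variance drops by more than its squared bias adds. That ranking depends on the chosen `mu` and `sigma`; for a true mean far from zero the same shrinkage would hurt.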

Consistency

An estimator $\hat{\theta}_n$ is consistent if it converges in probability to the true parameter as the sample grows:

$$\hat{\theta}_n \xrightarrow{P} \theta \quad \text{as } n \to \infty.$$

Consistency is a minimal requirement: an estimator that stays wrong no matter how much data you collect is useless. A sufficient condition for consistency is $\text{MSE}(\hat{\theta}_n) \to 0$ as $n \to \infty$.
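
Consistency and unbiasedness are different properties, and a small simulation can make that concrete. The sketch below uses an assumed setup that is not part of the text above: data drawn from a Uniform(0, $\theta$) distribution, with the sample maximum as the estimator of $\theta$. The maximum always undershoots $\theta$, so it is biased, yet the probability that it misses by more than a tolerance `eps` falls toward zero as $n$ grows. The function name `prob_error_exceeds` and all numeric settings are hypothetical choices for illustration.

```python
import random

def prob_error_exceeds(n, theta=1.0, eps=0.05, reps=2_000, seed=0):
    """Monte Carlo estimate of P(|max(X_1..X_n) - theta| > eps) for Uniform(0, theta) data."""
    rng = random.Random(seed)
    misses = 0
    for _ in range(reps):
        estimate = max(rng.uniform(0.0, theta) for _ in range(n))
        if abs(estimate - theta) > eps:
            misses += 1
    return misses / reps

for n in (5, 20, 80, 320):
    # For this specific setup the probability has the closed form ((theta - eps) / theta) ** n.
    exact = (1 - 0.05) ** n
    print(f"n={n:4d}  simulated={prob_error_exceeds(n):.3f}  exact={exact:.3f}")
```

The simulated frequencies should track the closed-form values and shrink toward zero, which is what convergence in probability asserts for each fixed tolerance.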


The Sample Mean

The sample mean $\bar{X} = \frac{1}{n}\sum_{i=1}^n X_i$ estimates the population mean $\mu = E[X]$.

It is unbiased: $E[\bar{X}] = \frac{1}{n}\sum_{i=1}^n E[X_i] = \mu$.

Its variance is $\text{Var}(\bar{X}) = \frac{\sigma^2}{n}$, where $\sigma^2 = \text{Var}(X_i)$. This follows from independence: the variance of a sum of independent variables equals the sum of variances, and the $\frac{1}{n^2}$ from squaring the $\frac{1}{n}$ combines with the sum of $n$ identical variances to give $\frac{\sigma^2}{n}$.
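
Spelled out symbol by symbol, the argument in the previous paragraph reads as follows. The second equality is where independence is used, and the third uses the fact that every $X_i$ has the same variance $\sigma^2$:

$$\text{Var}(\bar{X}) = \text{Var}\left(\frac{1}{n}\sum_{i=1}^n X_i\right) = \frac{1}{n^2}\sum_{i=1}^n \text{Var}(X_i) = \frac{1}{n^2} \cdot n\sigma^2 = \frac{\sigma^2}{n}.$$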

Its MSE is $\text{Bias}^2 + \text{Var} = 0 + \frac{\sigma^2}{n} = \frac{\sigma^2}{n} \to 0$, so the sample mean is consistent.

Two observations worth sitting with. First, the variance of $\bar{X}$ shrinks like $1/n$, so its standard deviation shrinks only like $1/\sqrt{n}$ - halving the uncertainty, measured on the scale of the data, requires quadrupling the sample size. Second, the formula $\sigma^2/n$ depends on the true population variance $\sigma^2$, which we usually do not know. Estimating it is the next problem.
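
The $\sigma^2/n$ formula is easy to check empirically. The following sketch simulates many samples of each size and measures how spread out the resulting sample means are. It assumes normally distributed measurements; the mean of 98.3 echoes the boiling-point example, while `sigma = 2.0`, the sample sizes, and the function name `sd_of_sample_means` are arbitrary choices made for this illustration.

```python
import random
import statistics

def sd_of_sample_means(n, mu=98.3, sigma=2.0, reps=4_000, seed=1):
    """Empirical standard deviation of the sample mean over many simulated samples of size n."""
    rng = random.Random(seed)
    means = [
        statistics.fmean([rng.gauss(mu, sigma) for _ in range(n)])
        for _ in range(reps)
    ]
    return statistics.pstdev(means)

for n in (10, 40, 160):
    simulated = sd_of_sample_means(n)
    theoretical = 2.0 / n ** 0.5  # sigma / sqrt(n) with the assumed sigma = 2.0
    print(f"n={n:3d}  simulated SD of mean={simulated:.3f}  sigma/sqrt(n)={theoretical:.3f}")
```

Each step in the loop multiplies the sample size by four, so the spread of the sample mean should roughly halve from one row to the next.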


The Sample Variance and Bessel’s Correction

The natural guess for estimating $\sigma^2 = E[(X - \mu)^2]$ would be $\frac{1}{n}\sum_{i=1}^n (X_i - \mu)^2$. But we do not know $\mu$ - we only have $\bar{X}$.

When we plug in $\bar{X}$ and use $\frac{1}{n}$, the result is systematically too small. Intuitively: $\bar{X}$ was estimated from the same data, so $(X_i - \bar{X})$ tends to be smaller than $(X_i - \mu)$. The sample mean is the value that minimizes the sum of squared deviations from the data, so plugging it in instead of the true mean artificially deflates the variance estimate.

The fix is Bessel’s correction: use $n-1$ in the denominator instead of $n$:

$$s^2 = \frac{1}{n-1}\sum_{i=1}^n (X_i - \bar{X})^2.$$

Theorem. $E[s^2] = \sigma^2$ - the sample variance with $n-1$ is unbiased.

Proof. Expand $\sum(X_i - \bar{X})^2 = \sum X_i^2 - n\bar{X}^2$. Taking expectations:

$$E\left[\sum X_i^2\right] = n(\sigma^2 + \mu^2), \qquad E[n\bar{X}^2] = n\left(\frac{\sigma^2}{n} + \mu^2\right) = \sigma^2 + n\mu^2.$$

Therefore $E\left[\sum(X_i - \bar{X})^2\right] = n(\sigma^2 + \mu^2) - (\sigma^2 + n\mu^2) = (n-1)\sigma^2$, so $E[s^2] = \sigma^2$. $\square$

The $n-1$ in the denominator reflects the loss of one degree of freedom: once you know $\bar{X}$ and the first $n-1$ values of $X_i$, the last value is determined. The deviations $(X_i - \bar{X})$ are constrained to sum to zero, so they carry only $n-1$ independent pieces of information about spread.
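
The theorem can also be seen numerically. The sketch below repeatedly draws small samples, computes the sum of squared deviations from $\bar{X}$, and averages the two candidate estimators over many repetitions. The normal distribution, the values `sigma = 2.0` and `n = 5`, and the variable names are assumptions made for this example only.

```python
import random

rng = random.Random(2)
mu, sigma, n, reps = 0.0, 2.0, 5, 20_000  # true variance is sigma**2 = 4.0

total_div_n = 0.0          # running sum of the 1/n estimator
total_div_n_minus_1 = 0.0  # running sum of the 1/(n-1) estimator (Bessel's correction)

for _ in range(reps):
    sample = [rng.gauss(mu, sigma) for _ in range(n)]
    xbar = sum(sample) / n
    ss = sum((x - xbar) ** 2 for x in sample)  # sum of squared deviations from xbar
    total_div_n += ss / n
    total_div_n_minus_1 += ss / (n - 1)

avg_div_n = total_div_n / reps
avg_div_n_minus_1 = total_div_n_minus_1 / reps
print(f"average of 1/n estimator:     {avg_div_n:.3f}")
print(f"average of 1/(n-1) estimator: {avg_div_n_minus_1:.3f}")
```

By the proof above, the $1/n$ version has expectation $\frac{n-1}{n}\sigma^2$, which is $3.2$ under these settings, while the corrected version has expectation $\sigma^2 = 4$. The simulated averages should land close to those two values.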


Sampling Distributions

An estimator is a random variable. Its probability distribution - across all possible samples of size $n$ from the population - is called its sampling distribution.

For the sample mean, the Central Limit Theorem gives us a good approximation to the sampling distribution: regardless of the shape of the population distribution (provided it has finite variance), for large $n$:

$$\bar{X} \sim \text{approximately } \mathcal{N}\left(\mu, \frac{\sigma^2}{n}\right).$$

The standard error is the standard deviation of the sampling distribution of $\bar{X}$:

$$\text{SE}(\bar{X}) = \frac{\sigma}{\sqrt{n}}.$$

Discomfort check. “Standard error” and “standard deviation” measure different things and must not be confused. The standard deviation $\sigma$ describes how spread out individual observations are around the population mean. Its estimate $s$ converges to the true $\sigma$ as $n$ grows, but it does not shrink - individual measurements stay just as variable. The standard error $\sigma/\sqrt{n}$ describes how spread out the sample mean is around the true mean. It shrinks as $n$ grows, reflecting that larger samples produce more reliable averages. If you see a published result reporting a mean and you want to know how precise that estimate is, you want the standard error, not the standard deviation.
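
A quick numerical illustration of the difference, as a sketch under assumed values: the code draws one sample at each of several sizes from a normal distribution (with `mu = 98.3` borrowed from the boiling-point example and `sigma = 2.0` chosen arbitrarily) and reports both the sample standard deviation $s$ and the estimated standard error $s/\sqrt{n}$.

```python
import math
import random
import statistics

rng = random.Random(3)
mu, sigma = 98.3, 2.0  # illustrative "true" values, assumed for this example

results = {}
for n in (10, 100, 1000, 10_000):
    sample = [rng.gauss(mu, sigma) for _ in range(n)]
    s = statistics.stdev(sample)  # sample standard deviation (n - 1 denominator)
    se = s / math.sqrt(n)         # estimated standard error of the mean
    results[n] = (s, se)
    print(f"n={n:6d}  s={s:.3f}  SE={se:.4f}")
```

The `s` column should hover around the assumed $\sigma = 2$ at every sample size, while the `SE` column keeps falling as $n$ increases.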


The t-Distribution: When $\sigma$ is Unknown

The CLT says $\bar{X}$ is approximately normal with standard deviation $\sigma/\sqrt{n}$. But $\sigma$ is unknown. The natural move: replace $\sigma$ with $s$, the sample standard deviation.

The standardized quantity

$$T = \frac{\bar{X} - \mu}{s / \sqrt{n}}$$

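does not follow a standard normal distribution. Instead, when the observations themselves are normally distributed, it follows a t-distribution with $n-1$ degrees of freedom, written $T \sim t_{n-1}$. For non-normal data this holds only approximately, with the approximation improving as $n$ grows.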

The t-distribution looks like a normal distribution but with heavier tails - it assigns more probability to extreme values. This captures the additional uncertainty from estimating $\sigma$ with $s$. When $n$ is small, $s$ can be quite far from $\sigma$, and the t-distribution’s fat tails honestly represent the resulting uncertainty.

As $n \to \infty$, the t-distribution converges to the standard normal: with large samples, $s \approx \sigma$ and the extra uncertainty becomes negligible.

The degrees of freedom parameter $n-1$ arises for the same reason as in Bessel’s correction: estimating $\mu$ with $\bar{X}$ costs one degree of freedom, leaving $n-1$ to estimate spread.
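
The heavier tails show up directly in simulation. The sketch below generates many normal samples of a given size, computes $T$ for each, and counts how often $|T|$ exceeds 1.96, the cutoff that a standard normal exceeds about 5% of the time. The function name `tail_fraction`, the sample sizes, and the repetition count are assumptions of this example.

```python
import math
import random

def tail_fraction(n, cutoff=1.96, reps=10_000, seed=4):
    """Fraction of simulated T statistics with |T| > cutoff, for normal samples of size n."""
    rng = random.Random(seed)
    mu, sigma = 0.0, 1.0
    extreme = 0
    for _ in range(reps):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        xbar = sum(sample) / n
        s = math.sqrt(sum((x - xbar) ** 2 for x in sample) / (n - 1))
        t = (xbar - mu) / (s / math.sqrt(n))
        if abs(t) > cutoff:
            extreme += 1
    return extreme / reps

for n in (3, 10, 50):
    print(f"n={n:2d}  fraction with |T| > 1.96: {tail_fraction(n):.3f}")
```

For the smallest sample size the fraction should sit well above 0.05, and it should drift down toward 0.05 as $n$ increases, matching the convergence of $t_{n-1}$ to the standard normal.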


Confidence Intervals: a Glimpse

The sampling distribution of $\bar{X}$ tells us how far the sample mean typically falls from $\mu$. Inverting this statement gives a confidence interval: a random interval, constructed from the data, that contains $\mu$ with a specified probability (the coverage probability, typically 95%).

For large samples where the normal approximation holds, a 95% confidence interval for $\mu$ is:

$$\bar{X} \pm 1.96 \cdot \frac{s}{\sqrt{n}}.$$

For small samples, replace 1.96 with the appropriate quantile of the $t_{n-1}$ distribution (which is larger than 1.96, reflecting the heavier tails).

Discomfort check. The correct interpretation of a 95% confidence interval is subtle. It does not mean “there is a 95% probability that $\mu$ lies in this interval” - $\mu$ is a fixed number, not a random variable (in the frequentist framework), so probability statements about $\mu$ do not make sense. The correct statement: “if we repeated this procedure on many different samples, 95% of the resulting intervals would contain $\mu$.” The randomness is in the interval, not in the parameter. This distinction matters in practice: a single computed interval either contains $\mu$ or it does not - we just do not know which.
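
The repeated-sampling interpretation can be acted out in code. In the sketch below, the true mean is known to the simulation (which is never the case in practice), so we can check how often the interval $\bar{X} \pm 1.96 \cdot s/\sqrt{n}$ actually captures it. The normal data, the parameter values, and the function name `coverage` are assumptions made for illustration.

```python
import math
import random

def coverage(n, mu=98.3, sigma=2.0, z=1.96, reps=5_000, seed=5):
    """Fraction of intervals xbar +/- z * s / sqrt(n) that contain the true mean mu."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        xbar = sum(sample) / n
        s = math.sqrt(sum((x - xbar) ** 2 for x in sample) / (n - 1))
        half_width = z * s / math.sqrt(n)
        if xbar - half_width <= mu <= xbar + half_width:
            hits += 1
    return hits / reps

print(f"n=100: fraction of intervals containing mu = {coverage(100):.3f}")
print(f"n=5:   fraction of intervals containing mu = {coverage(5):.3f}")
```

With $n = 100$ the fraction should be close to 0.95. With $n = 5$ the same 1.96 multiplier should cover noticeably less often, which is why small samples call for the larger $t_{n-1}$ quantile.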


Hypothesis Testing: a Glimpse

Hypothesis testing formalizes the question “is this effect real, or could it have arisen by chance?” The logic:

  1. State a null hypothesis $H_0$ (e.g., $\mu = 0$ - no effect).
  2. Compute a test statistic (e.g., $T = \bar{X}/(s/\sqrt{n})$) whose distribution under $H_0$ is known.
  3. Compute the p-value: the probability, assuming $H_0$ is true, of observing a test statistic at least as extreme as the one we got.
  4. If the p-value is very small, the data would be surprising if $H_0$ were true - evidence against $H_0$.

The threshold for “very small” (the significance level $\alpha$, typically 0.05) is set in advance. Both confidence intervals and hypothesis testing rest on the same foundation: the sampling distribution. The read-next posts treat each in full detail.
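
The four steps can be sketched in a few lines. This is a simplified illustration, not a full treatment: it assumes a sample large enough that the standard normal can stand in for the $t_{n-1}$ reference distribution (the Python standard library has no t-distribution), it uses a two-sided notion of "extreme", and the function name `z_test_p_value` and the simulated data sets are hypothetical.

```python
import math
import random
import statistics

def z_test_p_value(sample, mu0=0.0):
    """Two-sided p-value for H0: mu = mu0, using the large-sample normal approximation."""
    n = len(sample)
    xbar = statistics.fmean(sample)
    s = statistics.stdev(sample)
    t = (xbar - mu0) / (s / math.sqrt(n))  # test statistic from step 2
    # Step 3: P(|Z| >= |t|) for a standard normal Z.
    p = 2 * (1 - statistics.NormalDist().cdf(abs(t)))
    return t, p

rng = random.Random(6)
no_effect = [rng.gauss(0.0, 1.0) for _ in range(200)]    # H0 is true here
real_effect = [rng.gauss(0.5, 1.0) for _ in range(200)]  # H0 is false here

for name, data in [("mu = 0.0", no_effect), ("mu = 0.5", real_effect)]:
    t, p = z_test_p_value(data)
    print(f"true {name}: T={t:.2f}  p-value={p:.4f}")
```

For the data generated with a true mean of 0.5 the p-value should be tiny. For the data generated under $H_0$ it can land anywhere between 0 and 1, and it will fall below 0.05 in roughly 5% of such experiments by construction.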


Summary

| Concept | Formula / Statement |
| --- | --- |
| Bias | $\text{Bias}(\hat\theta) = E[\hat\theta] - \theta$ |
| MSE decomposition | $\text{MSE} = \text{Bias}^2 + \text{Var}$ |
| Consistency | $\hat\theta_n \xrightarrow{P} \theta$ as $n \to \infty$ |
| Sample mean | $\bar X = \frac{1}{n}\sum X_i$; unbiased, variance $\sigma^2/n$ |
| Bessel’s correction | $s^2 = \frac{1}{n-1}\sum(X_i - \bar X)^2$; unbiased for $\sigma^2$ |
| Standard error | $\text{SE}(\bar X) = \sigma/\sqrt n$ (shrinks with sample size) |
| t-statistic | $T = (\bar X - \mu)/(s/\sqrt n) \sim t_{n-1}$ when $\sigma$ unknown |

Statistics is applied uncertainty quantification. The sample gives you a number; the theory tells you how much to trust it, how it would vary across different experiments, and what you can honestly claim about the world that generated it.


Read next: