Hypothesis Testing & p-Values
Hypothesis testing is the formal procedure by which statistics translates data into decisions. Every A/B test, every clinical trial result, every model comparison ultimately rests on the machinery developed here.
The Setup: Null and Alternative Hypotheses
A statistical hypothesis is a claim about a population parameter $\theta$. We formulate two competing hypotheses:
- Null hypothesis $H_0$: the default or “no-effect” claim, e.g. $\theta = \theta_0$
- Alternative hypothesis $H_1$: what we hope to detect, e.g. $\theta \neq \theta_0$ (two-sided) or $\theta > \theta_0$ (one-sided)
The philosophy is asymmetric: we assume $H_0$ is true and ask whether the data are inconsistent enough with that assumption to warrant rejection. We never “accept” $H_0$; we either reject it or fail to reject it.
Test Statistic and Rejection Region
A test statistic $T = T(X_1, \ldots, X_n)$ is a function of the data whose distribution under $H_0$ is known. The rejection region $\mathcal{R}$ is the set of values of $T$ for which we reject $H_0$.
The significance level $\alpha \in (0,1)$ is the probability of rejecting $H_0$ when it is true:
$$\alpha = P(T \in \mathcal{R} \mid H_0)$$
Typical choices are $\alpha = 0.05$ or $\alpha = 0.01$. For a continuous test statistic the rejection region can be chosen so that this probability equals $\alpha$ exactly; for a discrete statistic we take the largest region whose rejection probability is at most $\alpha$.
Type I and Type II Errors
| | $H_0$ true | $H_0$ false |
|---|---|---|
| Reject $H_0$ | Type I error (prob $= \alpha$) | Correct (prob $= 1-\beta$) |
| Fail to reject $H_0$ | Correct (prob $= 1-\alpha$) | Type II error (prob $= \beta$) |
- Type I error (false positive): rejecting a true $H_0$. Controlled directly by $\alpha$.
- Type II error (false negative): failing to reject a false $H_0$. Its probability $\beta$ depends on the true value of $\theta$ and the sample size $n$.
- Power $= 1 - \beta$: the probability of correctly detecting a true effect. For a specific alternative $\theta_1$, power $= P(T \in \mathcal{R} \mid \theta = \theta_1)$.
There is an inherent trade-off: decreasing $\alpha$ increases $\beta$ for fixed $n$. The only way to reduce both simultaneously is to increase $n$.
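This trade-off is easy to see numerically. The sketch below (with made-up parameter values) computes the power of the two-sided z-test for a normal mean with known $\sigma$, the test developed formally in the next section:

```python
from scipy.stats import norm

def ztest_power(mu0, mu1, sigma, n, alpha):
    """Power of the two-sided z-test of H0: mu = mu0 at the alternative mu1."""
    z_crit = norm.ppf(1 - alpha / 2)            # rejection threshold for |Z|
    shift = (mu1 - mu0) / (sigma / n ** 0.5)    # mean of Z under the alternative
    # Power = P(|Z| > z_crit) when Z ~ N(shift, 1)
    return norm.cdf(-z_crit - shift) + 1 - norm.cdf(z_crit - shift)

# Tightening alpha lowers power (raises beta) for fixed n ...
p1 = ztest_power(mu0=0, mu1=0.5, sigma=1, n=25, alpha=0.05)
p2 = ztest_power(mu0=0, mu1=0.5, sigma=1, n=25, alpha=0.01)
# ... while increasing n recovers it.
p3 = ztest_power(mu0=0, mu1=0.5, sigma=1, n=50, alpha=0.01)
print(round(p1, 3), round(p2, 3), round(p3, 3))
```

Here the power at $\alpha = 0.05$, $n = 25$ is about 0.71; dropping to $\alpha = 0.01$ cuts it to roughly 0.47, and doubling $n$ brings it back above 0.80.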
p-Values
The p-value is
$$p = P(T \geq t_{\text{obs}} \mid H_0)$$
where $t_{\text{obs}}$ is the observed value of the test statistic, and "$\geq$" is read as "at least as extreme as." For a two-sided test, $p = P(|T| \geq |t_{\text{obs}}| \mid H_0)$.
We reject $H_0$ at level $\alpha$ if and only if $p \leq \alpha$.
Critical misinterpretation warning. The p-value is not the probability that $H_0$ is true. It is the probability of observing data at least this extreme assuming $H_0$ is true. A p-value of 0.03 does not mean there is a 3% chance that the null hypothesis holds - it means that data at least this extreme would arise only 3% of the time if $H_0$ were true.
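The equivalence "reject iff $p \leq \alpha$ iff $T \in \mathcal{R}$" can be checked directly. A minimal sketch for a two-sided test whose null distribution is standard normal (the observed values are made up):

```python
from scipy.stats import norm

alpha = 0.05
z_crit = norm.ppf(1 - alpha / 2)   # two-sided critical value, about 1.96

for t_obs in (0.8, 1.7, 2.3, 3.1):
    p = 2 * (1 - norm.cdf(abs(t_obs)))   # P(|T| >= |t_obs|) under H0
    # p <= alpha exactly when |t_obs| exceeds the critical value
    assert (p <= alpha) == (abs(t_obs) > z_crit)
    print(f"t_obs={t_obs:+.1f}  p={p:.4f}  reject={p <= alpha}")
```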
One-Sample Tests
z-Test (Known Variance)
Let $X_1, \ldots, X_n \overset{iid}{\sim} N(\mu, \sigma^2)$ with $\sigma^2$ known. Test $H_0: \mu = \mu_0$ vs $H_1: \mu \neq \mu_0$.
The test statistic is
$$Z = \frac{\bar{X} - \mu_0}{\sigma / \sqrt{n}} \overset{H_0}{\sim} N(0,1)$$
Reject $H_0$ when $|Z| > z_{\alpha/2}$, where $z_{\alpha/2}$ is the upper $\alpha/2$ quantile of $N(0,1)$.
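As a sketch with made-up measurements (testing $H_0: \mu = 10$ with $\sigma = 2$ assumed known), the whole procedure is a few lines:

```python
import numpy as np
from scipy.stats import norm

# Hypothetical measurements; H0: mu = 10, known sigma = 2.
x = np.array([11.2, 9.8, 12.1, 10.5, 11.7, 10.9, 12.4, 9.6, 11.1, 10.8])
mu0, sigma = 10.0, 2.0
n = len(x)

z = (x.mean() - mu0) / (sigma / np.sqrt(n))   # test statistic
p = 2 * (1 - norm.cdf(abs(z)))                # two-sided p-value
print(f"z = {z:.2f}, p = {p:.4f}")
```

For this sample $z \approx 1.60$ and $p \approx 0.11$, so at $\alpha = 0.05$ we fail to reject $H_0$.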
t-Test (Unknown Variance)
When $\sigma^2$ is unknown, we replace it with the sample variance $S^2 = \frac{1}{n-1}\sum_{i=1}^n (X_i - \bar{X})^2$:
$$T = \frac{\bar{X} - \mu_0}{S / \sqrt{n}} \overset{H_0}{\sim} t_{n-1}$$
The resulting distribution is Student’s t-distribution with $n-1$ degrees of freedom. The t-distribution has heavier tails than the normal, reflecting the additional uncertainty from estimating $\sigma$. As $n \to \infty$, $t_{n-1} \to N(0,1)$.
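SciPy's `ttest_1samp` implements this statistic directly; a sketch on a made-up sample, with the manual computation alongside to confirm they agree:

```python
import numpy as np
from scipy import stats

# Hypothetical measurements; H0: mu = 10, sigma unknown.
x = np.array([11.2, 9.8, 12.1, 10.5, 11.7, 10.9, 12.4, 9.6, 11.1, 10.8])

# Manual computation with the sample standard deviation S (ddof=1)
t_manual = (x.mean() - 10.0) / (x.std(ddof=1) / np.sqrt(len(x)))

# Library version: same statistic, p-value from t with n-1 df
res = stats.ttest_1samp(x, popmean=10.0)
assert abs(t_manual - res.statistic) < 1e-12
print(f"t = {res.statistic:.3f}, p = {res.pvalue:.4f}")
```

Note that the t-test uses the spread actually observed in the sample, so its conclusion can differ sharply from a z-test that assumes some external value of $\sigma$.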
Two-Sample t-Test
Let $X_1, \ldots, X_m \sim N(\mu_X, \sigma^2)$ and $Y_1, \ldots, Y_n \sim N(\mu_Y, \sigma^2)$ independently (equal-variance assumption). Test $H_0: \mu_X = \mu_Y$.
The pooled variance estimator is
$$S_p^2 = \frac{(m-1)S_X^2 + (n-1)S_Y^2}{m+n-2}$$
and the test statistic is
$$T = \frac{\bar{X} - \bar{Y}}{S_p\sqrt{1/m + 1/n}} \overset{H_0}{\sim} t_{m+n-2}$$
When equal variance cannot be assumed, Welch’s t-test uses an approximate degrees of freedom given by the Satterthwaite formula.
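Both variants are available through `scipy.stats.ttest_ind`, toggled by `equal_var`; a sketch on simulated groups, checking the pooled formula above by hand:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
x = rng.normal(5.0, 1.0, size=30)   # group X, simulated
y = rng.normal(5.5, 1.0, size=25)   # group Y, simulated

pooled = stats.ttest_ind(x, y, equal_var=True)    # pooled-variance t-test
welch = stats.ttest_ind(x, y, equal_var=False)    # Welch's t-test

# Verify the pooled statistic against the formula S_p^2, T above
m, n = len(x), len(y)
sp2 = ((m - 1) * x.var(ddof=1) + (n - 1) * y.var(ddof=1)) / (m + n - 2)
t_manual = (x.mean() - y.mean()) / (np.sqrt(sp2) * np.sqrt(1 / m + 1 / n))
assert abs(t_manual - pooled.statistic) < 1e-9

print(f"pooled: t={pooled.statistic:.3f} p={pooled.pvalue:.4f}")
print(f"welch : t={welch.statistic:.3f} p={welch.pvalue:.4f}")
```

With roughly equal group variances the two p-values are close; they diverge when the variances (or sample sizes) are badly imbalanced, which is why Welch's test is often the safer default.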
The Neyman-Pearson Lemma
Lemma (Neyman-Pearson, 1933). For testing a simple null $H_0: \theta = \theta_0$ against a simple alternative $H_1: \theta = \theta_1$, the most powerful test of size $\alpha$ rejects $H_0$ when
$$\Lambda = \frac{L(\theta_1 \mid \mathbf{X})}{L(\theta_0 \mid \mathbf{X})} > k$$
where $k$ is chosen so that $P(\Lambda > k \mid H_0) = \alpha$.
This likelihood ratio test (LRT) achieves the maximum power among all tests of size $\alpha$. For composite hypotheses, the generalized LRT statistic $-2 \log \Lambda \overset{d}{\to} \chi^2_r$ under $H_0$, where $r$ is the difference in the number of free parameters - this is Wilks' theorem.
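Wilks' theorem can be sketched on a toy problem: testing a Poisson rate $H_0: \lambda = \lambda_0$ against an unrestricted alternative, so the alternative frees one parameter and $r = 1$. The counts below are made up:

```python
import numpy as np
from scipy import stats

counts = np.array([3, 5, 2, 4, 6, 3, 4, 5, 2, 4])   # hypothetical event counts
lam0 = 3.0                                           # H0: lambda = 3
lam_hat = counts.mean()                              # MLE under the alternative

# -2 log Lambda from constrained vs unconstrained log-likelihoods
ll0 = stats.poisson.logpmf(counts, lam0).sum()
ll1 = stats.poisson.logpmf(counts, lam_hat).sum()
wilks = -2 * (ll0 - ll1)

p = stats.chi2.sf(wilks, df=1)   # asymptotic chi^2_1 tail probability
print(f"-2 log Lambda = {wilks:.3f}, p = {p:.4f}")
```

Here $-2\log\Lambda \approx 1.97$, well below the $\chi^2_1$ critical value of 3.84, so these counts are compatible with $\lambda = 3$.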
Multiple Testing
When performing $m$ independent hypothesis tests, all with true nulls, the probability of at least one spurious rejection is
$$P(\text{at least one Type I error}) = 1 - (1-\alpha)^m \to 1 \text{ as } m \to \infty$$
This is the family-wise error rate (FWER) problem.
Bonferroni correction controls FWER at level $\alpha$ by testing each hypothesis at level $\alpha/m$. It is valid under arbitrary dependence between the tests, but conservative, especially when they are positively correlated.
False discovery rate (FDR), introduced by Benjamini and Hochberg (1995), controls the expected proportion of false rejections among all rejections. The Benjamini-Hochberg procedure: sort p-values $p_{(1)} \leq \cdots \leq p_{(m)}$; reject all $H_{(i)}$ with $i \leq \hat{k}$ where
$$\hat{k} = \max\left\{i : p_{(i)} \leq \frac{i}{m} \cdot \alpha\right\}$$
Under independence, this controls $\text{FDR} \leq \alpha$.
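The procedure is short enough to write out directly; a sketch on made-up p-values (statsmodels offers the same thing via `multipletests(..., method='fdr_bh')`):

```python
import numpy as np

def benjamini_hochberg(pvals, alpha=0.05):
    """Boolean mask of hypotheses rejected by BH-FDR at level alpha."""
    p = np.asarray(pvals)
    m = len(p)
    order = np.argsort(p)                        # ranks p-values ascending
    thresholds = alpha * np.arange(1, m + 1) / m # i/m * alpha for i = 1..m
    below = p[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])         # largest i with p_(i) <= i*alpha/m
        reject[order[: k + 1]] = True            # reject all hypotheses up to rank k
    return reject

pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.21, 0.9]
rejected = benjamini_hochberg(pvals, alpha=0.05)
print(rejected)
```

On these ten p-values BH rejects the first two, while Bonferroni ($p < 0.05/10 = 0.005$) would reject only the first - illustrating that FDR control buys extra discoveries at the cost of a weaker error guarantee.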
Chi-Squared Test for Independence
Given a contingency table of counts $\{O_{ij}\}$ for two categorical variables, the test of independence uses
$$\chi^2 = \sum_{i,j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$$
where $E_{ij} = (\text{row } i \text{ total}) \times (\text{col } j \text{ total}) / n$ are expected counts under independence. Under $H_0$, $\chi^2 \overset{d}{\to} \chi^2_{(r-1)(c-1)}$ where $r, c$ are the number of rows and columns.
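`scipy.stats.chi2_contingency` computes the statistic, the expected counts, and the degrees of freedom in one call; a sketch on a hypothetical $2 \times 3$ table:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical 2x3 table: rows = group, columns = response category
obs = np.array([[30, 45, 25],
                [20, 30, 50]])

chi2, p, dof, expected = chi2_contingency(obs)
print(f"chi2 = {chi2:.3f}, dof = {dof}, p = {p:.4f}")
```

Here $\text{dof} = (2-1)(3-1) = 2$ and the statistic is about 13.3, so independence is firmly rejected for this table.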
Examples
A/B Testing. A product team shows variant A to $n_A = 5000$ users and variant B to $n_B = 5000$ users. Conversion rates are $\hat{p}_A = 0.042$, $\hat{p}_B = 0.051$. A two-proportion z-test gives $p \approx 0.03 < 0.05$, so the team rejects the null of equal conversion rates and ships variant B.
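This example can be reproduced from the conversion counts implied by the rates; a sketch of the two-proportion z-test with a pooled standard error:

```python
import numpy as np
from scipy.stats import norm

n_a, n_b = 5000, 5000
x_a, x_b = 210, 255            # conversions: 0.042 * 5000 and 0.051 * 5000
p_a, p_b = x_a / n_a, x_b / n_b

# Pooled conversion rate under H0: p_A = p_B
p_pool = (x_a + x_b) / (n_a + n_b)
se = np.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))

z = (p_b - p_a) / se
p_value = 2 * (1 - norm.cdf(abs(z)))   # two-sided p-value
print(f"z = {z:.3f}, p = {p_value:.4f}")
```

The computation gives $z \approx 2.14$ and $p \approx 0.033$, below the 0.05 threshold.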
Clinical Trials. A drug trial with $n = 200$ patients per arm uses a two-sample t-test on a continuous endpoint. The pre-specified significance level is $\alpha = 0.05$ with power $0.80$ at a clinically meaningful effect size, which determined $n$ during the design phase.
Multiple Comparisons in ML Evaluation. When comparing 20 models on a benchmark, Bonferroni correction requires $p < 0.05/20 = 0.0025$ for any individual comparison to be declared significant. Alternatively, BH-FDR at 10% allows more discoveries while bounding the proportion of false positives.