Hypothesis Testing & p-values - The Formal Language of Probably Not a Coincidence // Megha Bose

A pharmaceutical company tests a new drug. They enroll 200 patients, give half the drug and half a placebo. After eight weeks, the treated group improves by an average of 2.3 points on a symptom scale; the control group improves by 0.8 points. The drug looks better - but is it actually better, or could this difference have arisen by random chance even if the drug did nothing?

This is the central question of hypothesis testing. Answering it incorrectly - declaring a useless drug effective, or dismissing an effective drug as noise - can mean millions of people are harmed or denied help. The framework for answering it is subtle, widely misunderstood, and deeply consequential.

The Setup

Hypothesis testing frames the problem as a choice between two explanations.

The null hypothesis $H_0$: the boring explanation. The default, the status quo, the claim that nothing interesting is happening. In the drug trial: $H_0: \mu_\text{treatment} = \mu_\text{control}$ - the drug has no effect; both groups improve at the same rate.

The alternative hypothesis $H_1$: what you would conclude if $H_0$ were false. Usually $H_1: \mu_\text{treatment} \neq \mu_\text{control}$ (two-sided), or $H_1: \mu_\text{treatment} > \mu_\text{control}$ (one-sided, if you have prior reason to believe the drug only helps).

A critical point: you never “prove” $H_1$. You ask whether the data are improbable under $H_0$. If yes - if the data would be very surprising in a world where $H_0$ is true - you reject $H_0$ in favor of $H_1$. If the data are consistent with $H_0$, you fail to reject $H_0$. You do not accept $H_0$; absence of evidence is not evidence of absence.

The asymmetry is intentional. It mirrors the burden of proof: we require evidence to overturn the null, not evidence to support it.

The p-value

The p-value is the probability of observing data at least as extreme as what you actually saw, assuming $H_0$ is true:

$$p = P(\text{test statistic} \geq t_\text{observed} \mid H_0).$$

“At least as extreme” means: in the direction that would count as evidence against $H_0$. For a two-sided test, it means the absolute value of the statistic is at least as large as what you observed.

A small p-value means: the data would be very unlikely if $H_0$ were true. This is evidence against $H_0$ - not proof that $H_0$ is false, but surprise at what you saw under the assumption it is true.
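To make the definition concrete, here is a minimal sketch using a hypothetical example that is separate from the drug trial: a coin is flipped 20 times and lands heads 15 times, and $H_0$ says the coin is fair. The null distribution is then Binomial(20, 0.5), so the p-value can be computed exactly by summing the probabilities of all outcomes at least as extreme as the observed one. The function name and the numbers are illustrative assumptions.

```python
from math import comb

def binomial_upper_tail(n: int, k: int, p: float = 0.5) -> float:
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

# Hypothetical data: 15 heads in 20 flips; H0 says the coin is fair (p = 0.5).
n_flips, n_heads = 20, 15

# One-sided p-value: outcomes with at least as many heads as observed.
p_one_sided = binomial_upper_tail(n_flips, n_heads)

# Two-sided p-value: under a fair coin the null distribution is symmetric,
# so counting both directions doubles the tail probability.
p_two_sided = 2 * p_one_sided

print(f"one-sided p = {p_one_sided:.4f}")
print(f"two-sided p = {p_two_sided:.4f}")
```

Both numbers are statements about how surprising 15 heads would be if the coin were fair. Neither is the probability that the coin is fair.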

What a p-value is NOT

This is where most confusion lives.

The p-value is not $P(H_0 \text{ is true} \mid \text{data})$. That quantity - the probability of the null hypothesis given the data - is a posterior probability. Computing it requires a prior over hypotheses, which is a Bayesian operation (see Bayesian Inference - Updating Your Beliefs the Mathematically Correct Way). The p-value is frequentist and makes no statement about the probability that $H_0$ is true.

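The p-value is not the probability that your result was due to chance. This is the same misinterpretation in different words: the p-value is computed under the assumption that chance alone ($H_0$) is at work, so it cannot also tell you how likely that assumption is.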

The p-value is not a measure of effect size. A tiny, clinically meaningless effect can produce $p < 0.001$ with a large enough sample. Statistical significance and practical significance are different things.

The p-value is a measure of how surprised you would be, given the null hypothesis. Nothing more.

The Test Statistic and Its Null Distribution

To compute a p-value, you need two things: a test statistic and its distribution under $H_0$.

One-sample z-test

You observe a sample $x_1, \ldots, x_n$ and want to test whether the population mean $\mu$ equals some specific value $\mu_0$. If $\sigma$ (the population standard deviation) is known, the test statistic is:

$$Z = \frac{\bar{x} - \mu_0}{\sigma / \sqrt{n}}.$$

Under $H_0$, $Z \sim \mathcal{N}(0,1)$: exactly if the data are Normal, and approximately for large $n$ by the Central Limit Theorem. A value of $Z = 2$ means the sample mean is 2 standard errors above $\mu_0$ - a value at least that large occurs with probability about 2.3% under $H_0$ (one-sided), or about 4.6% counting both tails (two-sided).
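The tail probabilities quoted for $Z = 2$ can be checked with a few lines of Python. This is a minimal sketch using only the standard library; the sample values (mean 10.5, $\mu_0 = 10$, $\sigma = 2$, $n = 64$) are hypothetical and chosen so that $Z$ comes out to exactly 2.

```python
from math import erfc, sqrt

def z_statistic(sample_mean: float, mu0: float, sigma: float, n: int) -> float:
    """Z = (x_bar - mu0) / (sigma / sqrt(n)), with sigma assumed known."""
    return (sample_mean - mu0) / (sigma / sqrt(n))

def normal_upper_tail(z: float) -> float:
    """P(Z >= z) for Z ~ N(0, 1), via the complementary error function."""
    return 0.5 * erfc(z / sqrt(2))

# Hypothetical numbers, chosen so the statistic equals 2.
z = z_statistic(sample_mean=10.5, mu0=10.0, sigma=2.0, n=64)

p_one_sided = normal_upper_tail(z)            # H1: mu > mu0
p_two_sided = 2 * normal_upper_tail(abs(z))   # H1: mu != mu0

print(f"Z = {z:.2f}, one-sided p = {p_one_sided:.4f}, two-sided p = {p_two_sided:.4f}")
```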

One-sample t-test

In practice, $\sigma$ is usually unknown. You replace it with the sample standard deviation $s = \sqrt{\frac{1}{n-1}\sum(x_i - \bar{x})^2}$:

$$T = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}.$$

Under $H_0$, $T$ follows a t-distribution with $n-1$ degrees of freedom, written $t_{n-1}$. The t-distribution has heavier tails than the normal - it accounts for the additional uncertainty in estimating $\sigma$. As $n \to \infty$, $t_{n-1} \to \mathcal{N}(0,1)$.

Worked example: drug trial

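Suppose in our drug trial we observe $n = 100$ treated patients with sample mean improvement $\bar{x} = 2.3$, sample standard deviation $s = 4.0$, and the null hypothesis says the population mean is $\mu_0 = 0$ (no effect). This is a simplification that looks only at the treated group; a full analysis would compare it against the placebo group with a two-sample test. The t-statistic is: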

$$T = \frac{2.3 - 0}{4.0 / \sqrt{100}} = \frac{2.3}{0.4} = 5.75.$$

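Under $H_0$, this follows $t_{99}$, which is close to $\mathcal{N}(0,1)$. A value of 5.75 is far out in the tail: the two-sided p-value is $P(|T| \geq 5.75) \approx 10^{-7}$ (the Normal approximation would give about $9 \times 10^{-9}$; the two distributions differ noticeably this far out, but the conclusion is the same). That is extremely unlikely under the null. We reject $H_0$.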
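The worked example can be reproduced without a statistics package. The sketch below computes the t-statistic from the summary numbers above and approximates the tail of the t-distribution by numerically integrating its density with Simpson's rule. The helper names, the integration width, and the step count are assumptions made for illustration, and the finite integration range is only adequate when the degrees of freedom are reasonably large, as they are here. Because the tail is integrated numerically, the printed p-value is an approximation; what matters is that it is far below any conventional threshold.

```python
from math import exp, lgamma, log, pi, sqrt

def t_pdf(x: float, df: int) -> float:
    """Density of Student's t-distribution with df degrees of freedom."""
    log_c = lgamma((df + 1) / 2) - lgamma(df / 2) - 0.5 * log(df * pi)
    return exp(log_c - (df + 1) / 2 * log(1 + x * x / df))

def t_upper_tail(t: float, df: int, width: float = 60.0, steps: int = 20000) -> float:
    """Approximate P(T >= t) by Simpson's rule on [t, t + width].

    Assumes df is large enough that the mass beyond t + width is negligible.
    `steps` must be even.
    """
    h = width / steps
    total = t_pdf(t, df) + t_pdf(t + width, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(t + i * h, df)
    return total * h / 3

# Summary statistics from the worked example.
n, x_bar, s, mu0 = 100, 2.3, 4.0, 0.0

t_obs = (x_bar - mu0) / (s / sqrt(n))
p_two_sided = 2 * t_upper_tail(abs(t_obs), df=n - 1)

print(f"T = {t_obs:.2f}, two-sided p = {p_two_sided:.2e}")
```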

Significance Level and the Decision Rule

Before seeing any data, you commit to a significance level $\alpha$: the threshold below which you will reject $H_0$. If $p < \alpha$, reject; if $p \geq \alpha$, fail to reject.

The conventional choice is $\alpha = 0.05$. This is an arbitrary tradition, not a mathematical law. Ronald Fisher chose it in the 1920s as a round number that seemed reasonable; it has no deep justification. Different fields use different values: particle physics uses $\alpha \approx 3 \times 10^{-7}$ (five sigma), genetics often uses $\alpha = 5 \times 10^{-8}$ for genome-wide studies.

The significance level is chosen to control a specific error.

Two Kinds of Error

There are two ways to be wrong in hypothesis testing.

Type I error (false positive): $H_0$ is true, but you reject it. You conclude there is an effect when there is none. By construction, $P(\text{reject } H_0 \mid H_0 \text{ is true}) = \alpha$. When you set $\alpha = 0.05$, you accept a 5% chance of rejecting the null in a world where it is true.

Type II error (false negative): $H_0$ is false, but you fail to reject it. The drug works, but your study failed to detect it. The probability of a Type II error is denoted $\beta$ - not to be confused with the parameter $\beta$ in Bayesian analysis.

Power $= 1 - \beta$: the probability of correctly rejecting a false null. For a given amount of noise in the data, power depends on three things:

Effect size: a larger true effect is easier to detect.
Sample size $n$: more data reduces noise and increases power.
Significance level $\alpha$: a looser threshold (larger $\alpha$) makes rejection easier, increasing power but also increasing Type I error.

In the drug trial: if the true effect is only 0.5 points (small) and $s = 4$, you would need a very large sample to detect it reliably. Power analysis - computing the sample size needed to achieve 80% or 90% power for a given effect size - is done before running the experiment.
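A rough power analysis for this scenario can be sketched as follows. The code assumes a two-sided one-sample test and uses the Normal approximation (treating the standard deviation of 4 as known), so the result is a ballpark figure rather than an exact requirement. The function names are hypothetical.

```python
from math import ceil, sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard Normal, mean 0 and sd 1

def required_sample_size(effect: float, sd: float,
                         alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate n for a two-sided one-sample z-test to reach the given power."""
    z_alpha = STD_NORMAL.inv_cdf(1 - alpha / 2)
    z_power = STD_NORMAL.inv_cdf(power)
    return ceil(((z_alpha + z_power) * sd / effect) ** 2)

def approximate_power(effect: float, sd: float, n: int, alpha: float = 0.05) -> float:
    """Approximate power of the same test (ignores the negligible opposite tail)."""
    z_alpha = STD_NORMAL.inv_cdf(1 - alpha / 2)
    shift = effect * sqrt(n) / sd  # expected value of the z-statistic
    return 1 - STD_NORMAL.cdf(z_alpha - shift)

n_needed = required_sample_size(effect=0.5, sd=4.0)
power_at_100 = approximate_power(effect=0.5, sd=4.0, n=100)
power_at_needed = approximate_power(effect=0.5, sd=4.0, n=n_needed)

print(f"n for 80% power: {n_needed}")
print(f"power with n = 100: {power_at_100:.2f}")
print(f"power with n = {n_needed}: {power_at_needed:.2f}")
```

Under these assumptions, a study of 100 patients would detect a true 0.5-point effect only about a quarter of the time, while roughly 500 patients are needed to reach 80% power.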

Discomfort check. These two error types trade off: for a fixed sample size and effect size, decreasing $\alpha$ to reduce false positives also increases the probability of false negatives. There is no setting that simultaneously minimizes both. The conventional $\alpha = 0.05$ and 80% power (i.e., $\beta = 0.20$) is a historical compromise, not an optimal solution. Many researchers argue for lower $\alpha$ in fields where false positives are costly, or for rethinking the framework entirely.

The Problem with p < 0.05 as a Threshold

The significance threshold has become a cargo cult. Journals publish studies with $p < 0.05$ and reject those with $p > 0.05$, regardless of effect size, sample size, or scientific merit. This creates several serious problems.

Effect size is invisible. A study with $n = 1{,}000{,}000$ and a true effect of only 0.01 standard deviations will almost certainly have $p < 0.05$, because the expected z-statistic is $0.01 \times \sqrt{10^6} = 10$. That effect may be too small to matter clinically or practically. Reporting $p < 0.05$ without reporting the effect size and confidence interval is uninformative.
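The dependence of the p-value on sample size alone can be illustrated with a short calculation. The sketch below assumes the effect is expressed in standard-deviation units (a hypothetical standardized effect of $d = 0.01$) and evaluates the two-sided p-value at the expected z-statistic $d\sqrt{n}$. It is a back-of-the-envelope illustration, not a simulation of real studies.

```python
from math import erfc, sqrt

def p_value_at_expected_z(d: float, n: int) -> float:
    """Two-sided Normal p-value evaluated at the expected z-statistic d * sqrt(n)."""
    z = d * sqrt(n)
    return erfc(z / sqrt(2))  # equals 2 * P(Z >= z)

d = 0.01  # hypothetical effect: one hundredth of a standard deviation
for n in (100, 10_000, 1_000_000):
    print(f"n = {n:>9,}: expected z = {d * sqrt(n):5.1f}, p = {p_value_at_expected_z(d, n):.3g}")
```

The effect is identical in every row; only the sample size changes, yet the p-value goes from unremarkable to astronomically small.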

Publication bias. Studies with $p < 0.05$ get published; studies with $p > 0.05$ often do not. This means the published literature is a biased sample of all experiments run. A body of published studies with $p \approx 0.04$ provides weaker evidence than it appears, because the corresponding null results are invisible.

Optional stopping. If you check the p-value repeatedly as data arrive and stop as soon as $p < 0.05$, then given unlimited data you will cross the threshold with probability 1 even if $H_0$ is true: the running test statistic fluctuates enough (by the law of the iterated logarithm) to exceed any fixed critical value sooner or later. Collecting data until you get significance is a form of p-hacking and dramatically inflates the true Type I error rate.
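The inflation caused by optional stopping is easy to see in simulation. The sketch below generates data for which $H_0$ is true (standard Normal observations with mean 0 and known $\sigma = 1$), runs a z-test after every 10 observations up to 500, and compares two policies: stop at the first significant look, or test once at the end. All settings (number of simulations, look schedule, seed) are arbitrary choices for illustration.

```python
import random
from math import sqrt

def run_experiment(rng: random.Random, max_n: int = 500,
                   look_every: int = 10, z_crit: float = 1.96):
    """Simulate one experiment under H0.

    Returns (rejected_at_any_look, rejected_at_final_look).
    max_n is assumed to be a multiple of look_every.
    """
    total = 0.0
    rejected_any = False
    significant = False
    for n in range(1, max_n + 1):
        total += rng.gauss(0.0, 1.0)   # H0 is true: the mean really is 0
        if n % look_every == 0:
            z = total / sqrt(n)        # z-statistic with known sigma = 1
            significant = abs(z) >= z_crit
            rejected_any = rejected_any or significant
    return rejected_any, significant   # `significant` now refers to n = max_n

rng = random.Random(0)
n_sims = 2000
results = [run_experiment(rng) for _ in range(n_sims)]

peeking_rate = sum(any_look for any_look, _ in results) / n_sims
fixed_n_rate = sum(final for _, final in results) / n_sims

print(f"Type I error, single test at n = 500: {fixed_n_rate:.3f}")
print(f"Type I error, stop at first p < 0.05: {peeking_rate:.3f}")
```

The single pre-planned test rejects a true null at roughly the nominal 5% rate, while the peeking policy rejects several times as often.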

The Multiple Comparisons Problem

Suppose you run 20 independent hypothesis tests, each at $\alpha = 0.05$, and all 20 null hypotheses are actually true. The probability of at least one false positive is:

$$1 - (1 - 0.05)^{20} = 1 - 0.95^{20} \approx 0.64.$$

You have a 64% chance of incorrectly rejecting at least one true null just by running 20 tests. This is the multiple comparisons problem. It is endemic in exploratory data analysis, neuroimaging studies (where you test thousands of brain voxels), genetic association studies (millions of SNPs), and any setting where many hypotheses are tested simultaneously.

Bonferroni correction

The simplest fix: if you run $m$ tests and want the family-wise error rate (probability of any false positive) to be at most $\alpha$, use significance level $\alpha/m$ for each individual test.

$$\alpha_\text{per test} = \frac{\alpha}{m}.$$

For $m = 20$ and $\alpha = 0.05$: use $\alpha_\text{per test} = 0.0025$. This is conservative, especially when the tests are positively correlated, but it follows from the union bound and therefore holds no matter how the tests depend on one another.
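The arithmetic for 20 tests can be checked directly. This small sketch computes the family-wise error rate with and without the Bonferroni correction, under the same assumption used in the 64% calculation above: the tests are independent and every null is true.

```python
def family_wise_error_rate(alpha_per_test: float, m: int) -> float:
    """P(at least one false positive) across m independent tests of true nulls."""
    return 1 - (1 - alpha_per_test) ** m

alpha, m = 0.05, 20

uncorrected = family_wise_error_rate(alpha, m)
bonferroni_alpha = alpha / m
corrected = family_wise_error_rate(bonferroni_alpha, m)

print(f"per-test alpha = {alpha}:  FWER = {uncorrected:.3f}")
print(f"per-test alpha = {bonferroni_alpha}: FWER = {corrected:.3f}")
```

With the corrected threshold the family-wise error rate lands just under 5%.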

Other corrections (Benjamini-Hochberg, which controls the false discovery rate rather than the family-wise error rate) are less conservative and more appropriate when you expect some true effects among the many tests.
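To show how the two approaches differ in practice, here is a minimal sketch of the Benjamini-Hochberg step-up procedure next to Bonferroni, applied to a made-up list of ten p-values. The p-values and function names are hypothetical. The procedure sorts the p-values, finds the largest rank $k$ with $p_{(k)} \leq \frac{k}{m} q$, and rejects the $k$ smallest.

```python
def bonferroni_reject(p_values, alpha=0.05):
    """Reject each null whose p-value is at most alpha / m."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]

def benjamini_hochberg_reject(p_values, q=0.05):
    """Benjamini-Hochberg step-up procedure at false discovery rate q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending p
    largest_k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * q:
            largest_k = rank            # keep the largest rank that passes
    reject = [False] * m
    for idx in order[:largest_k]:       # reject the largest_k smallest p-values
        reject[idx] = True
    return reject

# Hypothetical p-values from ten tests.
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205, 0.212, 0.216]

bonf = bonferroni_reject(p_values)
bh = benjamini_hochberg_reject(p_values)

print("Bonferroni rejections:        ", sum(bonf))
print("Benjamini-Hochberg rejections:", sum(bh))
```

On this list Benjamini-Hochberg rejects one more hypothesis than Bonferroni. The price is a weaker guarantee: it controls the expected fraction of false discoveries rather than the chance of any false discovery.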

Two-sided vs. One-sided Tests

A two-sided test has $H_1: \mu \neq \mu_0$. The p-value counts probability in both tails: $P(|T| \geq |t_\text{obs}|)$. Use this when you have no prior reason to specify the direction of the effect.

A one-sided test has $H_1: \mu > \mu_0$ (or $< \mu_0$). The p-value counts probability in only one tail. Use this only when you would not care about an effect in the other direction - for example, testing whether a new drug is better (not merely different) than placebo.

One-sided tests are more powerful (lower p-value for the same effect in the predicted direction) but dangerous if misapplied: a drug that is strongly worse than placebo would not be flagged by a one-sided test in the “better” direction. In practice, two-sided tests are standard unless there is a compelling pre-specified reason for one-sided.
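The trade-off can be seen numerically. The sketch below uses a Normal test statistic for simplicity and two hypothetical observed values: a moderate effect in the predicted direction ($z = 1.8$) and a strong effect in the opposite direction ($z = -3$). The one-sided test is for $H_1: \mu > \mu_0$.

```python
from math import erfc, sqrt

def normal_upper_tail(z: float) -> float:
    """P(Z >= z) for Z ~ N(0, 1)."""
    return 0.5 * erfc(z / sqrt(2))

def one_sided_p(z: float) -> float:
    """p-value for H1: mu > mu0 (only the upper tail counts)."""
    return normal_upper_tail(z)

def two_sided_p(z: float) -> float:
    """p-value for H1: mu != mu0 (both tails count)."""
    return 2 * normal_upper_tail(abs(z))

for z in (1.8, -3.0):  # hypothetical observed statistics
    print(f"z = {z:+.1f}: one-sided p = {one_sided_p(z):.4f}, two-sided p = {two_sided_p(z):.4f}")
```

At $z = 1.8$ the one-sided test rejects at $\alpha = 0.05$ while the two-sided test does not. At $z = -3$ the one-sided p-value is close to 1, so a strongly harmful effect goes unflagged, while the two-sided test rejects comfortably.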

Non-parametric Tests

The z-test and t-test assume the data (or, by the CLT, the sample means) are approximately Normally distributed. When this fails - heavily skewed data, outliers, small samples from non-Normal populations - non-parametric tests are available.

The Mann-Whitney U test (also called the Wilcoxon rank-sum test) tests whether one group tends to have larger values than another, without assuming Normality. It works on ranks of the data rather than raw values, making it robust to outliers and skew.

The Wilcoxon signed-rank test is the non-parametric analogue of the one-sample (or paired) t-test.

Non-parametric tests are less powerful than their parametric counterparts when the parametric assumptions hold, but more reliable when they do not.
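As an illustration of a rank-based test, here is a minimal sketch of the Mann-Whitney U statistic with a Normal approximation for the p-value. The data are hypothetical, and the approximation used here omits tie and continuity corrections, so for samples this small an exact method would be preferable in real work. The point is to show that the statistic depends only on the ordering of the values.

```python
from math import erfc, sqrt

def mann_whitney_u(x, y) -> float:
    """U for sample x: number of (x_i, y_j) pairs with x_i > y_j; ties count 0.5."""
    return sum(1.0 if xi > yj else 0.5 if xi == yj else 0.0
               for xi in x for yj in y)

def mann_whitney_p_two_sided(x, y) -> float:
    """Two-sided p-value via the Normal approximation (no tie/continuity correction)."""
    n1, n2 = len(x), len(y)
    u = mann_whitney_u(x, y)
    mean_u = n1 * n2 / 2
    sd_u = sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mean_u) / sd_u
    return erfc(abs(z) / sqrt(2))

# Hypothetical data; the treated group contains one extreme outlier (30.0).
treated = [3.1, 4.5, 5.2, 6.8, 7.4, 30.0]
control = [1.0, 1.9, 2.2, 2.8, 3.5, 4.0]

u = mann_whitney_u(treated, control)
p = mann_whitney_p_two_sided(treated, control)
print(f"U = {u:.0f} out of {len(treated) * len(control)} pairs, p = {p:.4f}")

# Shrinking the outlier does not change the ranks, so U is unchanged.
tamed = treated[:-1] + [8.0]
print("U with outlier replaced by 8.0:", mann_whitney_u(tamed, control))
```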

Connection to Machine Learning

Hypothesis testing permeates machine learning practice, often without being named as such.

A/B testing is hypothesis testing. You want to know whether version B of a feature produces higher click-through rates than version A. The null hypothesis is that both versions have the same rate. The pitfalls - optional stopping, multiple comparisons from testing many metrics simultaneously, ignoring effect size - are the same.
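A common way to run such a comparison is a two-proportion z-test with a pooled standard error. The sketch below uses hypothetical traffic numbers (5,000 users per arm, 200 clicks for A and 250 for B) and reports the observed lift alongside the p-value, since the p-value alone says nothing about effect size.

```python
from math import erfc, sqrt

def two_proportion_z_test(clicks_a: int, n_a: int, clicks_b: int, n_b: int):
    """Two-sided z-test of H0: rate_A == rate_B, using the pooled rate under H0.

    Returns (z, p_value, observed_lift) where lift = rate_B - rate_A.
    """
    rate_a, rate_b = clicks_a / n_a, clicks_b / n_b
    pooled = (clicks_a + clicks_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / se
    p_value = erfc(abs(z) / sqrt(2))
    return z, p_value, rate_b - rate_a

# Hypothetical experiment: 4.0% vs 5.0% click-through.
z, p_value, lift = two_proportion_z_test(clicks_a=200, n_a=5000,
                                         clicks_b=250, n_b=5000)
print(f"lift = {lift:.3%}, z = {z:.2f}, two-sided p = {p_value:.4f}")
```

The test is only valid if the sample size was fixed in advance and this is the single pre-specified metric. Checking it daily, or across twenty metrics, brings back the problems described above.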

Model comparison is hypothesis testing. “Is model A significantly better than model B?” can be framed as a test on evaluation metrics. The paired t-test (or Wilcoxon signed-rank test) on per-instance performance differences is a common approach.
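A paired comparison can be sketched as follows. The differences are hypothetical per-fold score gaps between model A and model B, stored as integers in tenths of a percentage point so that the arithmetic is exact. The code computes the paired t-statistic and then obtains a p-value with an exact sign-flip permutation test. That is a different technique from the t-test or Wilcoxon test named above, swapped in here because it needs nothing beyond the standard library. Under $H_0$ each difference is equally likely to be positive or negative, so the test enumerates all sign assignments.

```python
from itertools import product
from math import sqrt
from statistics import mean, stdev

def paired_t_statistic(diffs) -> float:
    """t = mean(d) / (sd(d) / sqrt(n)) for paired differences d."""
    return mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))

def sign_flip_p_value(diffs) -> float:
    """Exact two-sided sign-flip permutation p-value for H0: differences are symmetric about 0."""
    observed = abs(sum(diffs))
    count = total = 0
    for signs in product((1, -1), repeat=len(diffs)):  # all 2^n sign assignments
        total += 1
        if abs(sum(s * d for s, d in zip(signs, diffs))) >= observed:
            count += 1
    return count / total

# Hypothetical per-fold differences (model A minus model B),
# in tenths of a percentage point of accuracy.
diffs = [20, 10, 30, -10, 20, 15, 0, 25]

t_stat = paired_t_statistic(diffs)
p_value = sign_flip_p_value(diffs)
print(f"mean difference = {mean(diffs) / 10:.2f} points, t = {t_stat:.2f}, "
      f"sign-flip p = {p_value:.4f}")
```

Enumerating all sign patterns is feasible only for a small number of pairs; for larger evaluations one would sample random sign flips instead.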

p-hacking in ML. Running many experiments and reporting the best result without correction is the ML equivalent of p-hacking. Hyperparameter search over a large space is a form of multiple comparisons. Random seeds that happen to give good results inflate reported performance.

The core tension in ML evaluation is the same as in hypothesis testing: you want to know whether an observed improvement is real or is noise. The same mathematical framework applies.

Summary

| Concept | Definition |
|---|---|
| Null hypothesis $H_0$ | Default claim; no effect |
| Alternative hypothesis $H_1$ | What you conclude if $H_0$ is rejected |
| p-value | $P(\text{data this extreme} \mid H_0)$ |
| Significance level $\alpha$ | Threshold for rejection; conventionally 0.05 |
| Type I error | Reject $H_0$ when it is true; rate $= \alpha$ |
| Type II error | Fail to reject $H_0$ when false; rate $= \beta$ |
| Power | $1 - \beta$; probability of detecting a real effect |
| t-statistic | $(\bar{x} - \mu_0) / (s/\sqrt{n})$; follows $t_{n-1}$ under $H_0$ |
| Multiple comparisons | Use Bonferroni: $\alpha_\text{per test} = \alpha / m$ |
| Two-sided vs. one-sided | Two-sided is default; one-sided requires pre-specified direction |

The p-value is a measure of surprise, not a measure of truth. A small p-value means the data would be unusual if the null were true - it is evidence against the null, but not proof of the alternative, and not a statement about effect size, importance, or replicability.

Use hypothesis testing for what it is: a disciplined framework for quantifying how much the data challenges a specific null claim. Be honest about multiple comparisons, preregister your hypotheses when possible, and always report effect sizes alongside p-values.
