Causal Inference - Correlation Isn't Wrong, You Just Asked the Wrong Question
Helpful context:
- Statistics - Turning Data Into Defensible Claims
- Conditional Probability - What You Already Know Changes Everything
Ice cream sales and drowning deaths follow the same seasonal curve. They peak together in July; they trough together in January. The correlation is real, measurable, and statistically significant.
Should cities ban ice cream to reduce drowning?
Obviously not. Both rise in summer because hot weather drives people to both buy ice cream and swim. Hot weather is the common cause. Ice cream has nothing to do with drowning.
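A small simulation makes the common-cause story concrete. This is a sketch on made-up numbers (the temperature range, slopes, and noise levels are all assumptions, not real data): temperature drives both series, and neither series affects the other.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

rng = random.Random(0)  # fixed seed so the run is repeatable
days = []
for _ in range(20_000):
    temp = rng.uniform(0, 35)                    # daily temperature (assumed range)
    ice_cream = 10 * temp + rng.gauss(0, 40)     # sales depend on temperature only
    drownings = 0.2 * temp + rng.gauss(0, 1.5)   # drowning index depends on temperature only
    days.append((temp, ice_cream, drownings))

# Across all days the two series move together.
overall_corr = pearson([d[1] for d in days], [d[2] for d in days])

# Hold the common cause (nearly) fixed: keep only days between 29 and 30 degrees.
hot_days = [d for d in days if 29 <= d[0] < 30]
within_corr = pearson([d[1] for d in hot_days], [d[2] for d in hot_days])

print(f"all days: {overall_corr:.2f}   29-30 degree days: {within_corr:.2f}")
```

Across all days the correlation should be strongly positive. Among days with nearly the same temperature it should be close to zero, because the only link between the two series has been held fixed.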
This kind of mistake sounds easy to avoid - but variants of it are everywhere in real data analysis. Smoking and lung cancer: maybe sick people smoke to feel better (reverse causation). Education and wages: maybe smarter people both get more education and earn more (confounding). Hospitals and death: death rates are higher in hospitals than at home - does that mean hospitals are dangerous? No: people go to hospitals because they are already seriously ill (selection).
Correlation is not causation. You have heard this a thousand times. But “not causation” leaves unanswered the harder question: then how do you establish causation? Causal inference is the science of answering that question rigorously - of going beyond patterns in data to understand what actually happens when you intervene.
The Fundamental Problem
Here is why causation is hard: to know whether a drug works for a patient, you would need to observe what happens when that patient takes the drug and what happens when the same patient, in the same state, at the same time, does not take the drug. You need both outcomes simultaneously for the same individual.
You can only observe one.
The patient either takes the drug or does not. The outcome you did not observe is called a counterfactual: what would have happened under the alternative. This is the fundamental problem of causal inference, formalized by Holland (1986): individual causal effects are inherently unobservable.
This is not a problem of data collection or technology. It is a logical impossibility. No amount of additional data will let you observe a person in two states at once. Causal inference is the discipline of making rigorous inferences despite this impossibility.
The Potential Outcomes Framework
The modern probabilistic treatment of causation, developed by Donald Rubin and others, begins with potential outcomes (also called the Rubin Causal Model).
For each individual $i$ and binary treatment $T \in \{0, 1\}$, define:
- $Y_i(1)$: the outcome for individual $i$ if they receive treatment.
- $Y_i(0)$: the outcome for individual $i$ if they do not receive treatment.
Both are defined for every individual, but you observe only the one that corresponds to the treatment actually received:
$$Y_i^{\text{obs}} = T_i \cdot Y_i(1) + (1 - T_i) \cdot Y_i(0).$$
The individual treatment effect for person $i$ is:
$$\tau_i = Y_i(1) - Y_i(0).$$
This quantity is never observed for any individual. The problem of causal inference is the problem of making statements about $\tau_i$ (or summaries of it) without ever being able to observe it directly.
The quantity we can hope to estimate at the population level is the Average Treatment Effect (ATE):
$$\tau = \mathbb{E}[Y(1) - Y(0)] = \mathbb{E}[Y(1)] - \mathbb{E}[Y(0)].$$
This is the average, across the population, of what the treatment does. Not what it does to any individual - that’s unknowable - but what it does on average.
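To make the notation concrete, here is a toy table of five people with both potential outcomes written down. The numbers are invented for illustration; in real data one entry of each pair is always missing.

```python
# Hypothetical potential outcomes for five people, as (Y_i(0), Y_i(1)) pairs.
potential = [(3, 5), (4, 4), (2, 6), (5, 4), (1, 3)]
treatment = [1, 0, 1, 0, 0]  # T_i: who actually received the treatment

def observed(t, y0, y1):
    """Y_obs = T * Y(1) + (1 - T) * Y(0): the only outcome the analyst sees."""
    return t * y1 + (1 - t) * y0

y_obs = [observed(t, y0, y1) for t, (y0, y1) in zip(treatment, potential)]

# Individual effects tau_i need both columns, so they exist only in this toy table.
effects = [y1 - y0 for y0, y1 in potential]
ate = sum(effects) / len(effects)

print("observed outcomes:", y_obs)
print("individual effects (never observable in practice):", effects)
print("ATE:", ate)
```

The `effects` list can be computed here only because the example invents both columns. An analyst holding `treatment` and `y_obs` alone could not reconstruct it.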
Randomization: The Gold Standard
In a Randomized Controlled Trial (RCT), treatment assignment $T$ is determined by a coin flip, independent of everything else about the individual. This is the key: independence of treatment from potential outcomes.
Formally: $T \perp\perp (Y(0), Y(1))$.
When this holds, the observed difference in outcomes between treated and control groups equals the ATE:
$$\mathbb{E}[Y^{\text{obs}} \mid T=1] - \mathbb{E}[Y^{\text{obs}} \mid T=0] = \mathbb{E}[Y(1) \mid T=1] - \mathbb{E}[Y(0) \mid T=0] = \mathbb{E}[Y(1)] - \mathbb{E}[Y(0)] = \tau.$$
The first equality holds because the treated group reveals $Y(1)$ and the control group reveals $Y(0)$. The second equality uses the independence $T \perp\perp Y(1)$ and $T \perp\perp Y(0)$: the people assigned to treatment have the same distribution of potential outcomes as the people assigned to control. The groups are comparable in expectation, on all observed and unobserved characteristics, because treatment was assigned randomly.
This is why RCTs are the gold standard. Randomization “solves” the fundamental problem of causal inference - not by making counterfactuals observable, but by making the two groups statistically equivalent so that the observed difference estimates the causal effect.
Discomfort check. Even a perfect RCT only tells you about the average treatment effect. Individual effects $\tau_i$ remain unobservable. The drug might help 80% of people and harm 20%, with the ATE still being positive. You might approve a drug based on a positive ATE and harm a fifth of the people who take it. Heterogeneous treatment effects - how effects vary across subgroups - are a separate, harder estimation problem. RCTs give population-level averages, not personalized medicine.
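Both points can be checked in a simulated trial: randomization recovers the ATE, and a positive ATE can hide a harmed minority. This is a sketch under assumed numbers (an effect of +2 for 80% of people and -3 for the other 20%); the function name `simulate_rct` is made up for the example.

```python
import random

def simulate_rct(n=100_000, seed=1):
    """Simulate a coin-flip trial; return (true ATE, estimate, share harmed)."""
    rng = random.Random(seed)
    treated, control, effects = [], [], []
    for _ in range(n):
        y0 = rng.gauss(10, 2)                          # outcome without the drug
        tau_i = 2.0 if rng.random() < 0.8 else -3.0    # helps 80%, harms 20% (assumed)
        y1 = y0 + tau_i                                # outcome with the drug
        effects.append(tau_i)
        if rng.random() < 0.5:                         # coin flip, independent of y0 and y1
            treated.append(y1)                         # only Y(1) is recorded
        else:
            control.append(y0)                         # only Y(0) is recorded
    true_ate = sum(effects) / n
    estimate = sum(treated) / len(treated) - sum(control) / len(control)
    share_harmed = sum(e < 0 for e in effects) / n
    return true_ate, estimate, share_harmed

true_ate, estimate, share_harmed = simulate_rct()
print(f"true ATE {true_ate:.2f}, difference in means {estimate:.2f}, "
      f"share harmed {share_harmed:.2f}")
```

The difference in means should land close to the true ATE of about 1.0, yet nothing in the trial data reveals that roughly a fifth of the simulated patients are worse off with the drug.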
Confounding: The Enemy of Observational Studies
Most studies are not RCTs. Ethical constraints (you cannot randomize people to smoke), practical constraints (you cannot randomize countries to different economic policies), cost, and time all push us toward observational data: we watch what people naturally choose to do, and try to learn from it.
The problem: in observational data, treatment assignment is not random. People who receive a treatment may be systematically different from those who do not. These systematic differences - confounders - corrupt the simple comparison.
Example: suppose you want to know whether a job training program increases earnings. You compare the earnings of people who enrolled in the program versus those who did not. But people who enroll may be more motivated, or have more social support, or be less severely unemployed. These factors affect earnings regardless of the program. The observed difference in earnings reflects both the program effect and these pre-existing differences.
Formally, in observational data, $T \not\perp\perp (Y(0), Y(1))$. The groups are not comparable. The quantity $\mathbb{E}[Y^{\text{obs}} \mid T=1] - \mathbb{E}[Y^{\text{obs}} \mid T=0]$ is not the ATE; it is contaminated by confounding.
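The contamination can be written out explicitly. The following is a standard decomposition, added here as a worked step using only the definitions above: add and subtract $\mathbb{E}[Y(0) \mid T=1]$, the untreated outcome the treated group would have had.

$$\mathbb{E}[Y^{\text{obs}} \mid T=1] - \mathbb{E}[Y^{\text{obs}} \mid T=0] = \underbrace{\mathbb{E}[Y(1) - Y(0) \mid T=1]}_{\text{effect on the treated}} + \underbrace{\mathbb{E}[Y(0) \mid T=1] - \mathbb{E}[Y(0) \mid T=0]}_{\text{selection bias}}.$$

In the job training example, the second term is the earnings gap that would exist between enrollees and non-enrollees even if nobody had taken the program. Under randomization, $T \perp\perp Y(0)$ makes that term zero.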
Causal Graphs: Pearl’s Framework
Judea Pearl developed a complementary framework to potential outcomes using Directed Acyclic Graphs (DAGs). A DAG has:
- Nodes: variables (treatment $T$, outcome $Y$, covariates $X$, confounders $Z$, …).
- Directed edges: $A \to B$ means $A$ directly causes $B$.
- No cycles: you cannot follow arrows and return to where you started.
The DAG encodes the causal structure of the data-generating process. In the ice cream/drowning example: Hot Weather $\to$ Ice Cream Sales, Hot Weather $\to$ Drownings. There is no arrow from Ice Cream to Drownings.
A confounder $Z$ is a common cause of both $T$ and $Y$: $Z \to T$ and $Z \to Y$. Confounders create spurious correlations - correlation between $T$ and $Y$ that does not reflect a causal relationship. In the DAG, confounders create backdoor paths: non-causal paths from $T$ to $Y$ that run through common causes.
The backdoor criterion: if you can identify and condition on a set of variables $Z$ that blocks all backdoor paths from $T$ to $Y$ and contains no descendants of $T$ (so that no causal path is blocked), then you can identify the causal effect from observational data:
$$\mathbb{E}[Y(1) - Y(0)] = \sum_z \big[\mathbb{E}[Y \mid T=1, Z=z] - \mathbb{E}[Y \mid T=0, Z=z]\big] \cdot P(Z=z).$$
This is the adjustment formula. Stratify by $Z$, compare treated and control within each stratum, then average across strata. Within each stratum, conditional on $Z$, the treatment assignment is as good as random (you have controlled for the confounder).
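The stratify-compare-average recipe can be sketched in a few lines. The data below are simulated under assumptions chosen for illustration: a single binary confounder $Z$, a true effect of 2.0, and treatment that is four times as likely when $Z = 1$. The function names are hypothetical.

```python
import random

def mean(values):
    return sum(values) / len(values)

def simulate(n=100_000, seed=2):
    """Return rows of (z, t, y) where z confounds the effect of t on y."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        z = int(rng.random() < 0.5)                      # confounder, e.g. motivation
        t = int(rng.random() < (0.8 if z else 0.2))      # z makes treatment more likely
        y = 2.0 * t + 5.0 * z + rng.gauss(0, 1)          # true effect of t is 2.0
        rows.append((z, t, y))
    return rows

def naive_difference(rows):
    """E[Y | T=1] - E[Y | T=0], ignoring the confounder."""
    return (mean([y for _, t, y in rows if t == 1])
            - mean([y for _, t, y in rows if t == 0]))

def backdoor_adjusted(rows):
    """Sum over z of (E[Y | T=1, z] - E[Y | T=0, z]) * P(Z=z)."""
    total = 0.0
    for z_val in sorted({z for z, _, _ in rows}):
        stratum = [(t, y) for z, t, y in rows if z == z_val]
        diff = (mean([y for t, y in stratum if t == 1])
                - mean([y for t, y in stratum if t == 0]))
        total += diff * len(stratum) / len(rows)         # weight by P(Z = z)
    return total

rows = simulate()
print(f"naive: {naive_difference(rows):.2f}   adjusted: {backdoor_adjusted(rows):.2f}")
```

With these settings the naive comparison should come out near 5, more than double the true effect, while the adjusted estimate should sit close to 2.0.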
When Controlling Doesn’t Work
The adjustment formula only works if:
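- You can identify which variables to control for (the correct adjustment set).
- You can measure all of them, and you have that data for every unit in the study.
- Every stratum of $Z$ contains both treated and untreated units (overlap), so the within-stratum comparison exists.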
If there are unobserved confounders - common causes of $T$ and $Y$ that you cannot measure - controlling for observed variables is not enough. The backdoor path through the unobserved confounder remains open. No amount of statistical adjustment on observed variables can close it.
This is the central challenge of observational causal inference. You can always worry that there is some unmeasured variable driving both who gets treated and what outcome they experience. Sensitivity analysis quantifies how strong an unobserved confounder would need to be to overturn your conclusions. But it cannot rule out confounding entirely.
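The failure is easy to reproduce in simulation. The sketch below assumes two binary confounders, a recorded one (`x`) and an unrecorded one (`u`), with a true treatment effect of 1.0. All names and coefficients are invented for the example.

```python
import random

def mean(values):
    return sum(values) / len(values)

def simulate(n=100_000, seed=3):
    """Rows of (x, u, t, y); x is recorded, u is the confounder nobody measured."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = int(rng.random() < 0.5)
        u = int(rng.random() < 0.5)
        t = int(rng.random() < 0.1 + 0.4 * x + 0.4 * u)    # both push people into treatment
        y = 1.0 * t + 2.0 * x + 3.0 * u + rng.gauss(0, 1)  # true effect of t is 1.0
        rows.append((x, u, t, y))
    return rows

def adjusted_effect(rows, stratum_of):
    """Adjustment formula, stratifying on whatever stratum_of(row) returns."""
    strata = {}
    for row in rows:
        strata.setdefault(stratum_of(row), []).append(row)
    total = 0.0
    for members in strata.values():
        treated = [r[3] for r in members if r[2] == 1]
        control = [r[3] for r in members if r[2] == 0]
        total += (mean(treated) - mean(control)) * len(members) / len(rows)
    return total

rows = simulate()
observed_only = adjusted_effect(rows, lambda r: r[0])    # what the analyst can compute
oracle = adjusted_effect(rows, lambda r: (r[0], r[1]))   # would need the unmeasured u
print(f"adjusting for x only: {observed_only:.2f}   adjusting for x and u: {oracle:.2f}")
```

Adjusting for `x` alone should leave an estimate well above 2, more than twice the true effect. Only the oracle version, which uses a variable the analyst does not have, should recover a value near 1.0.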
Instrumental Variables
One approach to unobserved confounding is the instrumental variable (IV) design. An instrument $Z$ satisfies:
- Relevance: $Z$ affects treatment $T$. ($Z$ is correlated with $T$.)
- Exclusion restriction: $Z$ affects outcome $Y$ only through $T$. (No direct path $Z \to Y$, no path $Z \to \text{confounder} \to Y$.)
- Independence: $Z$ is independent of the unobserved confounder.
If such a variable exists, you can estimate the causal effect even in the presence of unobserved confounding. The IV estimator is:
$$\hat{\tau}_{IV} = \frac{\text{Cov}(Z, Y)}{\text{Cov}(Z, T)}.$$
This is the reduced-form effect of $Z$ on $Y$ divided by the first-stage effect of $Z$ on $T$. Intuitively: $Z$ creates variation in $T$ that is unrelated to the confounders; you exploit only this clean variation to identify the causal effect.
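The ratio can be computed directly from sample covariances. The following sketch simulates data that satisfy the three conditions by construction (a coin-flip instrument, an unobserved confounder `u`, and a true effect of 1.5); the coefficients are assumptions made for the example.

```python
import random

def cov(xs, ys):
    """Sample covariance of two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n

rng = random.Random(4)
z, t, y = [], [], []
for _ in range(100_000):
    u = rng.gauss(0, 1)                            # unobserved confounder
    z_i = int(rng.random() < 0.5)                  # instrument: independent of u
    t_i = 1.0 * z_i + u + rng.gauss(0, 0.5)        # relevance: z shifts t
    y_i = 1.5 * t_i + 2.0 * u + rng.gauss(0, 1)    # exclusion: z enters only through t
    z.append(z_i)
    t.append(t_i)
    y.append(y_i)

ols_slope = cov(t, y) / cov(t, t)     # regression of y on t, biased by u
iv_estimate = cov(z, y) / cov(z, t)   # the IV ratio from the formula above

print(f"OLS slope: {ols_slope:.2f}   IV estimate: {iv_estimate:.2f}   truth: 1.50")
```

The regression slope absorbs the influence of `u` and should overshoot badly, while the IV ratio should land close to 1.5 without ever using `u`.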
Classic instruments:
- Distance to college: Affects whether someone attends college (relevance); affects earnings mainly through education (exclusion).
- Vietnam draft lottery number: Randomly assigned draft eligibility, which shifted who served in the military (relevance, independence); affects wages through military service (exclusion).
- Quarter of birth: Affects educational attainment through compulsory schooling laws (relevance); affects wages through education (exclusion).
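When effects differ across people, IV estimates a specific quantity: the Local Average Treatment Effect (LATE) - the causal effect for compliers, people whose treatment status is changed by the instrument. This interpretation also requires monotonicity: the instrument never pushes anyone away from treatment. The LATE is not the same as the ATE. If your instrument only moves a small subset of the population, the LATE may not generalize to everyone else.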
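The gap between LATE and ATE shows up clearly when the population is simulated by type. The sketch assumes 40% compliers (effect 3.0), 30% always-takers (effect 1.0), and 30% never-takers (effect 0.0), with no one who does the opposite of the instrument. These shares and effects are invented for illustration.

```python
import random

def mean(values):
    return sum(values) / len(values)

rng = random.Random(5)
rows, effects = [], []
for _ in range(100_000):
    r = rng.random()
    if r < 0.4:
        kind, effect = "complier", 3.0
    elif r < 0.7:
        kind, effect = "always-taker", 1.0
    else:
        kind, effect = "never-taker", 0.0
    z = int(rng.random() < 0.5)                 # randomized instrument
    # Compliers follow the instrument; the other two types ignore it.
    t = z if kind == "complier" else int(kind == "always-taker")
    y = rng.gauss(0, 1) + effect * t            # Y(0), plus the effect if treated
    rows.append((z, t, y))
    effects.append(effect)

def arm_mean(z_val, col):
    """Mean of column col (1 = treatment, 2 = outcome) among units with Z = z_val."""
    return mean([row[col] for row in rows if row[0] == z_val])

reduced_form = arm_mean(1, 2) - arm_mean(0, 2)  # effect of Z on Y
first_stage = arm_mean(1, 1) - arm_mean(0, 1)   # effect of Z on T (share of compliers)
wald = reduced_form / first_stage               # IV estimate with a binary instrument
ate = mean(effects)                             # known only because this is a simulation

print(f"IV estimate: {wald:.2f}   population ATE: {ate:.2f}")
```

The IV estimate should come out near 3.0, the compliers' effect, while the population ATE is about 1.5. Neither number is wrong; they answer different questions.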
Natural Experiments
When a true randomized experiment is impossible, researchers look for natural experiments: situations where some external event or rule creates variation in treatment that is, for practical purposes, random.
Examples:
- Minimum wage: Card and Krueger (1994) compared fast-food employment in New Jersey (which raised its minimum wage) to neighboring Pennsylvania (which did not), exploiting the state border as a natural experiment.
- Regression discontinuity: People just above and just below a test score cutoff for a scholarship are, on average, similar in every respect except whether they received the scholarship. Comparing these near-marginal groups identifies the effect of the scholarship for people close to the cutoff.
- Twins: Comparing outcomes between identical twins controls for genetic confounders; differences between the twins cannot be genetic.
- Lotteries: School choice lotteries randomly assign access to charter schools; winners and losers are comparable before the lottery.
Natural experiments are valuable because they often provide cleaner identification than IV methods, with more plausible independence assumptions. But they are opportunistic - you can only study questions for which a natural experiment exists.
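The regression discontinuity idea from the list above can be sketched on simulated scholarship data. Everything here is an assumption for illustration: a cutoff of 50, a true scholarship effect of 2.0, and an outcome that also rises smoothly with the test score.

```python
import random

def mean(values):
    return sum(values) / len(values)

CUTOFF = 50.0
rng = random.Random(6)
students = []
for _ in range(100_000):
    score = rng.uniform(0, 100)
    scholarship = int(score >= CUTOFF)               # deterministic rule, not a choice
    # Outcome rises smoothly with score and jumps by 2.0 with the scholarship.
    outcome = 0.05 * score + 2.0 * scholarship + rng.gauss(0, 1)
    students.append((score, scholarship, outcome))

def difference(rows):
    """Mean outcome with the scholarship minus mean outcome without it."""
    return (mean([o for _, s, o in rows if s == 1])
            - mean([o for _, s, o in rows if s == 0]))

naive = difference(students)                          # mixes the jump with the score trend
near_cutoff = [row for row in students if abs(row[0] - CUTOFF) < 2.0]
rd_estimate = difference(near_cutoff)                 # compares near-marginal students only

print(f"all students: {naive:.2f}   within 2 points of cutoff: {rd_estimate:.2f}")
```

The full-sample comparison should be far too large because high scorers do better anyway, while the narrow-window comparison should be close to 2.0. A plain difference in means inside the window still carries a small bias from the score trend; applied work typically fits a separate regression line on each side of the cutoff instead.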
Causal Inference and Machine Learning
Standard machine learning is built for prediction: given features $X$, predict outcome $Y$ as accurately as possible. It exploits correlation: it uses whichever features predict $Y$ well, regardless of whether they cause $Y$.
This creates serious problems when predictions are used to make decisions.
A recommendation system trained to maximize clicks may discover that showing emotionally provocative content increases engagement. It learns a correlation. When it acts on this by surfacing more provocative content, it may change user behavior and increase polarization - an effect that was not present in the training data. The system was not modeling causation; it was modeling a snapshot of a world it would then change.
More generally: if you use a predictive model to choose an intervention (which drug to give, which ad to show, which policy to enact), you are using observational correlation to make a causal decision. This conflation is one of the main failure modes of deployed ML systems.
Causal ML asks not just “what predicts $Y$?” but “if I change $X$, what happens to $Y$?” Methods like Double Machine Learning (Chernozhukov et al.) and Causal Forests (Wager & Athey) combine the flexibility of machine learning with the rigor of causal identification, estimating heterogeneous treatment effects while controlling for high-dimensional confounders. These methods are increasingly central to applied economics, medicine, and tech industry experimentation.
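The difference between the two questions can be seen in a residual-on-residual regression, the partialling-out idea that Double Machine Learning builds on. The sketch below is not Double ML itself: it uses straight-line fits in place of flexible learners, skips cross-fitting, and assumes a single observed confounder `x` with a true effect of 1.0. All names and coefficients are invented for the example.

```python
import random

def mean(values):
    return sum(values) / len(values)

def fit_line(xs, ys):
    """Least-squares intercept and slope for y = a + b * x."""
    mean_x, mean_y = mean(xs), mean(ys)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return mean_y - slope * mean_x, slope

def residuals(xs, ys):
    """The part of ys that a straight line in xs does not explain."""
    intercept, slope = fit_line(xs, ys)
    return [y - (intercept + slope * x) for x, y in zip(xs, ys)]

rng = random.Random(7)
x, t, y = [], [], []
for _ in range(50_000):
    x_i = rng.gauss(0, 1)                           # observed confounder
    t_i = 0.7 * x_i + rng.gauss(0, 1)               # x influences the treatment
    y_i = 1.0 * t_i + 2.0 * x_i + rng.gauss(0, 1)   # true effect of t is 1.0
    x.append(x_i)
    t.append(t_i)
    y.append(y_i)

# "What predicts y?": the slope of y on t alone, inflated by x.
_, predictive_slope = fit_line(t, y)

# "If I change t, what happens to y?": remove x from both, then regress.
_, causal_slope = fit_line(residuals(x, t), residuals(x, y))

print(f"predictive slope: {predictive_slope:.2f}   partialled-out slope: {causal_slope:.2f}")
```

The predictive slope should be close to 1.9, which is the right number for forecasting $Y$ from $T$ and the wrong one for deciding what an intervention on $T$ would do. The partialled-out slope should sit near 1.0, but only because `x` was observed and correctly included.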
Summary
| Concept | Definition |
|---|---|
| Potential outcomes $Y_i(0), Y_i(1)$ | What would happen to person $i$ with and without treatment |
| Fundamental problem | Individual treatment effects are unobservable; only one outcome per person |
| ATE | $\mathbb{E}[Y(1) - Y(0)]$: population-average causal effect |
| RCT | Randomization makes $T \perp\perp (Y(0), Y(1))$; observed difference = ATE |
| Confounder | Common cause of $T$ and $Y$; creates spurious correlation |
| DAG | Directed Acyclic Graph encoding causal structure |
| Backdoor path | Non-causal path creating spurious correlation; blocked by conditioning on confounders |
| Adjustment formula | Stratify by confounders, compare within strata, average across strata |
| Instrumental variable | Variable affecting $T$ but affecting $Y$ only through $T$; handles unobserved confounding |
| LATE | Effect of treatment for compliers - those whose treatment changes with the instrument |
| Natural experiment | External variation that mimics randomization |
The fundamental insight of causal inference is that causal questions cannot be answered by statistical association alone. No matter how large the dataset, correlation does not become causation with more data. You need additional assumptions - randomization, conditional ignorability, exclusion restrictions - to identify causal effects. These assumptions are not statistical; they are claims about the world. The discipline of causal inference is the discipline of making those assumptions explicit, checking them where possible, and drawing conclusions that are honest about what the data can and cannot tell you.