Causal Inference
Statistical association is not causation. This post develops the mathematical frameworks that let us move beyond correlation and reason about what would happen if we intervened in a system.
Correlation vs Causation
Two variables $X$ and $Y$ are correlated if $\text{Cov}(X, Y) \neq 0$. Correlation can arise from three sources:
- $X$ causes $Y$
- $Y$ causes $X$
- A common cause $Z$ (confounder) drives both
Standard regression of $Y$ on $X$ estimates the association $E[Y \mid X = x]$, which conflates all three. Causal inference asks: what is $E[Y \mid \text{do}(X = x)]$ - the expectation under an intervention that sets $X$ to $x$, rather than a passive observation?
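The conflation is easy to see in simulation. Below is a minimal numpy sketch (all data synthetic) where a confounder $Z$ drives both $X$ and $Y$, so $X$ has no causal effect on $Y$ at all, yet the naive regression slope is far from zero; adjusting for $Z$ recovers the true effect:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Synthetic data-generating process: Z confounds X and Y.
# X has NO causal effect on Y; they are associated only through Z.
Z = rng.normal(size=n)
X = Z + rng.normal(size=n)
Y = 2 * Z + rng.normal(size=n)

# Naive regression slope of Y on X: Cov(X, Y) / Var(X)
naive_slope = np.cov(X, Y)[0, 1] / np.var(X)

# Adjusting for the confounder: regress Y on [1, X, Z] by least squares.
A = np.column_stack([np.ones(n), X, Z])
coef, *_ = np.linalg.lstsq(A, Y, rcond=None)
adjusted_slope = coef[1]

print(f"naive slope:    {naive_slope:.3f}")    # ~1.0, pure confounding
print(f"adjusted slope: {adjusted_slope:.3f}") # ~0.0, the true effect
```

Here the naive slope is approximately $\text{Cov}(X,Y)/\text{Var}(X) = 2\text{Var}(Z)/2\text{Var}(Z) = 1$, entirely an artifact of $Z$.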
The Potential Outcomes Framework (Rubin)
Each unit $i$ has two potential outcomes:
- $Y_i(0)$: the outcome unit $i$ would experience under control ($T_i = 0$)
- $Y_i(1)$: the outcome unit $i$ would experience under treatment ($T_i = 1$)
The individual treatment effect is $\tau_i = Y_i(1) - Y_i(0)$.
The average treatment effect (ATE) is
$$\tau = E[Y_i(1) - Y_i(0)] = E[Y_i(1)] - E[Y_i(0)]$$
The fundamental problem of causal inference: we observe only $Y_i^{\text{obs}} = T_i Y_i(1) + (1-T_i)Y_i(0)$ - one potential outcome per unit, never both simultaneously. $\tau_i$ is therefore never directly observed; only population-level quantities like ATE can be estimated under assumptions.
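The missing-data structure is concrete in a small simulation. Both potential outcomes are visible below only because the data is synthetic; a real dataset hands us exactly one per unit:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 8

# Synthetic potential outcomes for 8 units (both columns are known
# only because we simulated them; in reality one is always missing).
Y0 = rng.normal(10, 2, size=n)
tau_i = rng.normal(3, 1, size=n)   # individual treatment effects
Y1 = Y0 + tau_i

T = rng.integers(0, 2, size=n)     # treatment assignment
Y_obs = T * Y1 + (1 - T) * Y0      # the only outcome we ever see

# Each unit reveals exactly one potential outcome, so
# tau_i = Y1 - Y0 is never directly observed for any unit.
for i in range(n):
    missing = "Y(0)" if T[i] == 1 else "Y(1)"
    print(f"unit {i}: T={T[i]}, Y_obs={Y_obs[i]:.1f}, unobserved: {missing}")
```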
Randomized Controlled Trials
In a randomized controlled trial (RCT), treatment assignment $T_i$ is independent of potential outcomes: $T_i \perp (Y_i(0), Y_i(1))$.
Under randomization, the naive difference-in-means estimator is unbiased for ATE:
$$E[\bar{Y}_1 - \bar{Y}_0] = E[Y_i(1) \mid T_i = 1] - E[Y_i(0) \mid T_i = 0] = E[Y_i(1)] - E[Y_i(0)] = \tau$$
The second equality follows directly from independence: $E[Y_i(1) \mid T_i = 1] = E[Y_i(1)]$ because randomization severs any systematic relationship between who gets treated and what their outcomes would have been. This is why randomization is the gold standard.
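A quick sanity check in simulation, with synthetic potential outcomes and a known ATE of 2.0. Note the individual effects are heterogeneous, yet the difference in means still recovers the average:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000

# Synthetic potential outcomes with heterogeneous effects:
# units with high Y(0) gain more from treatment.
Y0 = rng.normal(0, 1, size=n)
Y1 = Y0 + 2.0 + 0.5 * Y0           # true ATE = E[Y1 - Y0] = 2.0

# Randomized assignment: T independent of (Y0, Y1).
T = rng.integers(0, 2, size=n)
Y = np.where(T == 1, Y1, Y0)

diff_in_means = Y[T == 1].mean() - Y[T == 0].mean()
print(f"difference in means: {diff_in_means:.3f}")  # ~2.0
```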
Observational Studies: Confounding and Selection Bias
In observational data, $T_i$ may be correlated with covariates $X_i$ that also affect $Y_i$. Such variables are confounders.
The naive estimator $E[Y \mid T=1] - E[Y \mid T=0]$ estimates:
$$E[Y_i(1) \mid T_i=1] - E[Y_i(0) \mid T_i=0]$$
which equals $\tau$ only if there is no confounding. It decomposes as the ATT plus a selection-bias term $E[Y_i(0) \mid T_i=1] - E[Y_i(0) \mid T_i=0]$ - the difference in baseline outcomes between those who do and do not get treated - which contaminates the estimate.
Unconfoundedness (ignorability) assumption: conditional on observed covariates $X_i$,
$$T_i \perp (Y_i(0), Y_i(1)) \mid X_i$$
Under unconfoundedness, together with overlap ($0 < P(T_i = 1 \mid X_i = x) < 1$ for all $x$), we can identify ATE from observational data. Unconfoundedness itself is a strong, untestable assumption - we must argue it from domain knowledge.
The Backdoor Criterion and do-Calculus (Pearl)
Pearl’s structural causal model (SCM) framework represents causal relationships as a directed acyclic graph (DAG) where nodes are variables and edges encode direct causation.
The do-operator $P(Y \mid \text{do}(X=x))$ represents the distribution of $Y$ after intervening to set $X = x$, which generally differs from $P(Y \mid X=x)$.
Backdoor criterion. A set of variables $Z$ satisfies the backdoor criterion relative to $(X, Y)$ in DAG $G$ if:
- No node in $Z$ is a descendant of $X$
- $Z$ blocks every path between $X$ and $Y$ that has an arrow into $X$ (a “backdoor path”)
Backdoor adjustment theorem. If $Z$ satisfies the backdoor criterion:
$$P(Y \mid \text{do}(X=x)) = \sum_z P(Y \mid X=x, Z=z)\,P(Z=z)$$
This is the causal identification formula: intervening on $X$ is equivalent to conditioning on $X$ and marginalizing over $Z$ according to its observational distribution. do-calculus provides three rules that, together, can identify any identifiable causal query from a DAG and observational data.
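For a fully discrete model the adjustment formula can be evaluated exactly. The sketch below uses a hypothetical three-variable SCM ($Z \to X$, $Z \to Y$, $X \to Y$, all binary, with made-up conditional probabilities) and shows that conditioning and intervening give different answers:

```python
# Hypothetical discrete SCM: Z -> X, Z -> Y, X -> Y (all binary).
p_z = 0.5                                  # P(Z=1)
p_x_given_z = {0: 0.2, 1: 0.8}             # P(X=1 | Z=z)
p_y_given_xz = {(0, 0): 0.1, (0, 1): 0.5,  # P(Y=1 | X=x, Z=z)
                (1, 0): 0.3, (1, 1): 0.7}

def p_of_z(z):
    return p_z if z == 1 else 1 - p_z

# Observational conditional P(Y=1 | X=1) via Bayes:
# P(Z=z | X=1) is proportional to P(X=1|z) P(z).
w = {z: p_x_given_z[z] * p_of_z(z) for z in (0, 1)}
norm = w[0] + w[1]
p_y_cond = sum(p_y_given_xz[(1, z)] * w[z] / norm for z in (0, 1))

# Interventional P(Y=1 | do(X=1)) via backdoor adjustment:
# sum over z of P(Y=1 | X=1, Z=z) P(Z=z) -- Z keeps its
# observational marginal because the intervention only cuts Z -> X.
p_y_do = sum(p_y_given_xz[(1, z)] * p_of_z(z) for z in (0, 1))

print(f"P(Y=1 | X=1)     = {p_y_cond:.3f}")   # 0.620
print(f"P(Y=1 | do(X=1)) = {p_y_do:.3f}")     # 0.500
```

Conditioning on $X=1$ tilts the distribution of $Z$ toward $Z=1$ (since $Z=1$ makes $X=1$ likely), inflating the conditional probability; the do-expression keeps $Z$ at its marginal.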
Propensity Scores
The propensity score is $e(x) = P(T=1 \mid X=x)$ - the probability of treatment given covariates.
Theorem (Rosenbaum and Rubin, 1983). If unconfoundedness holds given $X$, it also holds given $e(X)$:
$$T \perp (Y(0), Y(1)) \mid e(X)$$
This reduces a high-dimensional balancing problem to one dimension. The propensity score is typically estimated via logistic regression.
Inverse probability weighting (IPW) uses the propensity score to construct an unbiased ATE estimator:
$$\hat{\tau}_{\text{IPW}} = \frac{1}{n}\sum_{i=1}^n \left(\frac{T_i Y_i}{e(X_i)} - \frac{(1-T_i)Y_i}{1 - e(X_i)}\right)$$
The weights $1/e(X_i)$ upweight treated units with low treatment probability (who resemble controls) and vice versa, effectively creating a pseudo-population in which treatment is independent of covariates. The doubly robust estimator combines IPW with outcome regression and is consistent if either model is correctly specified.
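A minimal sketch of IPW on synthetic confounded data. For clarity it uses the true propensity score from the data-generating process; in practice $e(X)$ would be estimated, e.g. by logistic regression:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200_000

# Synthetic confounded data: X raises both the treatment
# probability and the outcome, biasing the naive estimate upward.
X = rng.normal(size=n)
e = 1 / (1 + np.exp(-X))           # true propensity score P(T=1|X)
T = (rng.uniform(size=n) < e).astype(float)
Y = 2.0 * T + 3.0 * X + rng.normal(size=n)   # true ATE = 2.0

naive = Y[T == 1].mean() - Y[T == 0].mean()  # badly biased

# IPW estimator, weighting each unit by the inverse probability
# of the treatment arm it actually received.
tau_ipw = np.mean(T * Y / e - (1 - T) * Y / (1 - e))

print(f"naive: {naive:.3f}, IPW: {tau_ipw:.3f}")  # IPW ~2.0
```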
Difference-in-Differences
Difference-in-differences (DiD) uses panel data with pre- and post-treatment periods. Define
$$\hat{\tau}_{\text{DiD}} = (\bar{Y}^{\text{post}}_{\text{treat}} - \bar{Y}^{\text{pre}}_{\text{treat}}) - (\bar{Y}^{\text{post}}_{\text{control}} - \bar{Y}^{\text{pre}}_{\text{control}})$$
Under the parallel trends assumption - that treated and control groups would have followed the same trend in the absence of treatment - DiD identifies the average treatment effect on the treated (ATT). DiD controls for time-invariant confounders and common time trends without requiring unconfoundedness.
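The two subtractions are easy to trace on synthetic panel data with a fixed group gap, a common time trend, and a known ATT of 1.5:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 50_000   # units per group and period

tau = 1.5        # true ATT
trend = 2.0      # common time trend
group_gap = 4.0  # time-invariant difference between groups

# Synthetic group/period means obeying parallel trends.
pre_c  = rng.normal(0, 1, n)
post_c = rng.normal(trend, 1, n)
pre_t  = rng.normal(group_gap, 1, n)
post_t = rng.normal(group_gap + trend + tau, 1, n)

# Naive post-period comparison is biased by the group gap.
naive = post_t.mean() - post_c.mean()       # ~tau + group_gap

# DiD differences out both the group gap and the common trend.
did = (post_t.mean() - pre_t.mean()) - (post_c.mean() - pre_c.mean())

print(f"naive: {naive:.2f}, DiD: {did:.2f}")  # DiD ~1.5
```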
Regression Discontinuity
In a regression discontinuity (RD) design, treatment is assigned based on whether a running variable $V_i$ exceeds a cutoff $c$. The local average treatment effect at the cutoff is identified by comparing outcomes just above and just below $c$:
$$\tau_{\text{RD}} = \lim_{v \downarrow c} E[Y \mid V=v] - \lim_{v \uparrow c} E[Y \mid V=v]$$
RD is credible when units cannot precisely manipulate the running variable. The main threats are discontinuities in the density of $V$ at $c$ (McCrary test) and discontinuities in other covariates at $c$.
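A crude sketch of the RD comparison on synthetic data with a jump of 2.0 at the cutoff. It simply averages within a small bandwidth on each side; real RD practice fits local linear regressions, which removes the $O(h)$ bias from the slope of the running variable:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 100_000
c = 0.0                                    # cutoff

# Running variable and a smooth outcome with a discontinuity at c.
V = rng.uniform(-1, 1, size=n)
T = (V >= c).astype(float)                 # sharp assignment at cutoff
Y = 1.0 + 0.8 * V + 2.0 * T + rng.normal(0, 0.5, size=n)

# Compare mean outcomes in a narrow bandwidth h around the cutoff.
h = 0.05
above = Y[(V >= c) & (V < c + h)].mean()
below = Y[(V < c) & (V >= c - h)].mean()
tau_rd = above - below

print(f"RD estimate: {tau_rd:.2f}")        # ~2.0, plus O(h) bias
```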
Instrumental Variables
An instrument $Z$ is a variable that is correlated with treatment (relevance, $\text{Cov}(T, Z) \neq 0$), affects $Y$ only through $T$ (exclusion restriction), and is independent of unmeasured confounders (exogeneity). The instrumental variables (IV) estimator is
$$\hat{\tau}_{\text{IV}} = \frac{\text{Cov}(Y, Z)}{\text{Cov}(T, Z)}$$
Under monotonicity, IV identifies the local average treatment effect (LATE): the ATE among “compliers” - units whose treatment status is changed by the instrument.
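The Wald ratio above can be checked on synthetic data with an unmeasured confounder $U$ and a randomized binary instrument (a homogeneous effect of 2.0 is assumed, so LATE coincides with it):

```python
import numpy as np

rng = np.random.default_rng(6)
n = 200_000

# Unmeasured confounder U affects both T and Y; Z is a valid
# instrument: randomized, and affects Y only through T.
U = rng.normal(size=n)
Z = rng.integers(0, 2, size=n).astype(float)
T = (0.5 * Z + 0.5 * U + rng.normal(0, 0.5, size=n) > 0.25).astype(float)
Y = 2.0 * T + 3.0 * U + rng.normal(size=n)   # true effect of T is 2.0

naive = Y[T == 1].mean() - Y[T == 0].mean()  # biased upward by U

# IV (Wald) estimator: Cov(Y, Z) / Cov(T, Z).
tau_iv = np.cov(Y, Z)[0, 1] / np.cov(T, Z)[0, 1]

print(f"naive: {naive:.2f}, IV: {tau_iv:.2f}")  # IV ~2.0
```

The ratio works because $\text{Cov}(Y, Z) = 2\,\text{Cov}(T, Z) + 3\,\text{Cov}(U, Z)$ and the second term vanishes by exogeneity.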
Examples
Uplift Modeling in ML. Recommender systems want to target users who would convert because of the intervention, not those who would convert anyway (always-takers) or never convert (never-takers). Uplift models estimate $\tau_i = Y_i(1) - Y_i(0)$ at the individual level using causal forests or meta-learners (S-learner, T-learner, X-learner), trained on data from a past RCT.
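A minimal T-learner sketch on a synthetic RCT, using plain linear outcome models in place of the forests or boosted trees a production uplift system would use: fit one model per arm, then predict the difference at any covariate value.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 100_000

# Synthetic RCT with heterogeneous effects: uplift grows with x.
x = rng.normal(size=n)
T = rng.integers(0, 2, size=n)
tau_x = 1.0 + 0.5 * x                      # true CATE at covariate x
Y = 2.0 * x + tau_x * T + rng.normal(size=n)

def fit_linear(xs, ys):
    """Least-squares fit of ys ~ 1 + xs; returns (intercept, slope)."""
    A = np.column_stack([np.ones_like(xs), xs])
    coef, *_ = np.linalg.lstsq(A, ys, rcond=None)
    return coef

# T-learner: one outcome model per arm.
b1 = fit_linear(x[T == 1], Y[T == 1])      # model for mu_1(x)
b0 = fit_linear(x[T == 0], Y[T == 0])      # model for mu_0(x)

def uplift(x_new):
    """Estimated CATE: mu_1(x_new) - mu_0(x_new)."""
    return (b1[0] - b0[0]) + (b1[1] - b0[1]) * x_new

print(f"estimated uplift at x=2: {uplift(2.0):.2f}")  # true value 2.0
```

Targeting then amounts to treating the units with the largest predicted uplift, not the largest predicted outcome.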
Causal Graphs for Feature Selection. In a prediction pipeline, including a descendant of the target $Y$ as a feature induces collider bias and can reduce out-of-distribution generalization. Drawing the causal DAG of the data-generating process reveals which features are confounders (include), mediators (may exclude), and colliders (exclude) for the estimand of interest.