Experimental Design & A/B Testing
Experimental design is the discipline of planning data collection so that causal questions can be answered efficiently and reliably. A/B testing is its most common instantiation in technology, but the underlying principles - controlling error rates, maximizing power, avoiding bias - apply universally.
Goals of Experimental Design
A well-designed experiment must simultaneously:
- Control the Type I error rate at the pre-specified $\alpha$
- Achieve target power $1 - \beta$ for a meaningful effect size $\delta$
- Eliminate systematic bias so that the estimated effect is attributable to the treatment alone
These goals are in tension: more stringent error control (lower $\alpha$) and higher power (lower $\beta$) both require larger samples, increasing cost.
Key Concepts: Factors, Blocks, Replication
- Factor: a controlled variable whose effect is under study (e.g., UI variant, drug dose)
- Level: a specific value of a factor (e.g., blue button vs red button)
- Block: a group of experimental units that are homogeneous with respect to a nuisance variable. Blocking on a covariate reduces within-group variance and increases power.
- Replication: assigning multiple units to each treatment combination, allowing estimation of within-treatment variability
The completely randomized design (CRD) assigns units to treatments purely at random with no blocking. The linear model is $Y_{ij} = \mu + \tau_j + \varepsilon_{ij}$ where $\tau_j$ is the effect of treatment $j$ and $\varepsilon_{ij} \overset{iid}{\sim} N(0, \sigma^2)$.
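As a quick illustration, the sketch below simulates a CRD under this model with three treatments and tests the null of equal treatment effects with a one-way ANOVA. The effect sizes, $\sigma$, and group size are made-up values for the example.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical CRD: 3 treatments, 50 units each, model Y_ij = mu + tau_j + eps_ij
mu, sigma = 10.0, 2.0
tau = [0.0, 0.5, 1.0]  # assumed treatment effects (illustrative only)
groups = [mu + t + rng.normal(0, sigma, size=50) for t in tau]

# One-way ANOVA tests H0: all treatment effects are equal
f_stat, p_value = stats.f_oneway(*groups)
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")
```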
Factorial Designs and Interactions
A factorial design varies multiple factors simultaneously. With factors $A$ (at $a$ levels) and $B$ (at $b$ levels), a full factorial requires $a \times b$ cells. The model is
$$Y_{ijk} = \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij} + \varepsilon_{ijk}$$
The interaction term $(\alpha\beta)_{ij}$ captures whether the effect of $A$ differs across levels of $B$. Testing for interactions is essential: if a strong interaction is present, a main-effects-only interpretation is misleading. A $2^k$ factorial design with $k$ binary factors is especially common in tech experiments, where multiple UI changes are tested jointly.
Fractional factorial designs run only a fraction $2^{k-p}$ of cells, deliberately confounding high-order interactions to reduce cost while preserving estimation of main effects and two-factor interactions.
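For a $2 \times 2$ factorial, the interaction can be read directly off the four cell means: it is the difference-in-differences of the effect of $A$ across levels of $B$. A minimal numpy sketch on simulated data (the effect sizes are assumptions for the example):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200  # units per cell

# Simulate a 2x2 factorial with a built-in interaction (all effects are assumed)
def cell(a, b):
    effect = 0.3 * a + 0.2 * b + 0.4 * a * b  # the 0.4*a*b term is the interaction
    return rng.normal(effect, 1.0, size=n)

y00, y01, y10, y11 = cell(0, 0), cell(0, 1), cell(1, 0), cell(1, 1)

# Interaction contrast: (effect of A when B=1) - (effect of A when B=0)
interaction = (y11.mean() - y01.mean()) - (y10.mean() - y00.mean())
se = np.sqrt(sum(g.var(ddof=1) / n for g in (y00, y01, y10, y11)))
print(f"interaction estimate = {interaction:.3f}, z = {interaction / se:.2f}")
```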
Power Analysis and Sample Size
Before running an experiment, determine the required sample size $n$ per arm. For a two-sample z-test comparing means with common variance $\sigma^2$, minimum detectable effect $\delta = \mu_1 - \mu_0$, Type I rate $\alpha$, and power $1-\beta$:
$$n = \frac{(z_{\alpha/2} + z_\beta)^2 \cdot 2\sigma^2}{\delta^2}$$
Here $z_{\alpha/2}$ and $z_\beta$ are the standard normal quantiles corresponding to the two-tailed significance level and the desired power. For $\alpha = 0.05$ and $1-\beta = 0.80$, $(z_{0.025} + z_{0.20})^2 \approx (1.96 + 0.84)^2 \approx 7.85$.
Key observations:
- $n \propto 1/\delta^2$: halving the minimum detectable effect quadruples the required $n$
- $n \propto \sigma^2$: reducing metric variance (e.g., via CUPED variance reduction) cuts the required sample size
- $n \propto (z_{\alpha/2} + z_\beta)^2$: moving from 80% to 90% power increases $n$ by about 34%
For binary outcomes (conversion rates), replace $\sigma^2$ with $p(1-p)$ pooled under the null and re-solve. Power calculators automate this, but understanding the formula is essential for principled metric choice.
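A minimal calculation following the formula above; the specific $\delta$, $\sigma$, and baseline rates are placeholders.

```python
from scipy.stats import norm

def n_per_arm_means(delta, sigma, alpha=0.05, power=0.80):
    """Two-sample z-test on means: n per arm from the formula above."""
    z_a = norm.ppf(1 - alpha / 2)
    z_b = norm.ppf(power)
    return (z_a + z_b) ** 2 * 2 * sigma ** 2 / delta ** 2

def n_per_arm_proportions(p0, p1, alpha=0.05, power=0.80):
    """Binary outcome: replace sigma^2 with p(1-p) at the pooled rate."""
    p_bar = (p0 + p1) / 2
    return n_per_arm_means(p1 - p0, (p_bar * (1 - p_bar)) ** 0.5, alpha, power)

print(round(n_per_arm_means(delta=0.5, sigma=2.0)))     # ~251 per arm
print(round(n_per_arm_proportions(p0=0.10, p1=0.12)))   # ~3842 per arm
```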
A/B Testing Pipeline
Randomization Unit
The randomization unit should be the unit of analysis. Common choices are user, session, or device. Mismatching the randomization unit and analysis unit (e.g., randomizing by user but analyzing by page view) inflates false positive rates due to within-unit correlation - a form of pseudoreplication.
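Deterministic hash-based assignment is the standard way to keep the randomization unit stable across sessions: the same user always maps to the same arm without storing any state. A sketch, with an illustrative salt and a 50/50 split:

```python
import hashlib

def assign_arm(user_id: str, experiment_salt: str, arms=("control", "treatment")):
    """Deterministically map a user to an arm via a hash of (salt, user_id)."""
    digest = hashlib.sha256(f"{experiment_salt}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 1000              # 1000 fine-grained buckets
    return arms[0] if bucket < 500 else arms[1]  # 50/50 split

print(assign_arm("user_42", "checkout_button_v2"))
```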
Metric Choice
Every experiment should have:
- Primary metric: the one metric the experiment is powered to detect an effect on (e.g., 7-day retention)
- Guardrail metrics: metrics that must not degrade (e.g., crash rate, latency)
Choosing a metric that is too noisy requires a large sample; too sensitive a metric risks shipping changes that improve a proxy but not the true objective.
The Peeking Problem
Repeatedly checking a p-value as data accumulate - peeking - inflates the Type I error rate far above $\alpha$. If you check at $k$ equally-spaced interim analyses using a fixed threshold $\alpha = 0.05$, the actual FWER can exceed 20% for even moderate $k$.
Formally, the issue is that the sequence of p-values $\{p_t\}$ is driven by a random walk of the test statistic, and the boundary $p_t \leq 0.05$ is crossed with higher probability when it is checked at multiple times.
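A quick Monte Carlo makes the inflation concrete: simulate A/A data (no true effect), run a two-sample z-test at several interim looks with a fixed threshold, and count how often the null is ever rejected. The look schedule and sample sizes below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(2)
n_sims, n_max, looks = 2000, 1000, (200, 400, 600, 800, 1000)

rejections = 0
for _ in range(n_sims):
    a = rng.normal(0, 1, n_max)  # A/A test: both arms from the same distribution
    b = rng.normal(0, 1, n_max)
    for n in looks:
        z = (a[:n].mean() - b[:n].mean()) / np.sqrt(2 / n)
        if abs(z) > 1.96:        # naive fixed threshold at every look
            rejections += 1
            break

print(f"empirical Type I error with peeking: {rejections / n_sims:.3f}")  # well above 0.05
```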
Sequential Testing and Alpha Spending
Sequential testing controls the overall Type I error by using a pre-specified alpha spending function $\alpha(t)$ that distributes the error budget over the information fraction $t \in [0,1]$. Two standard boundaries, approximated by the spending functions sketched below, are:
- O’Brien-Fleming: very conservative early (rarely stops early on noise), spends most of $\alpha$ near the end
- Pocock: spends $\alpha$ uniformly, easier to stop early but uses a stricter final threshold
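The Lan-DeMets spending-function approximations to these two boundaries can be written down directly; the sketch below prints the cumulative $\alpha$ spent at a few information fractions, assuming two-sided $\alpha = 0.05$.

```python
import numpy as np
from scipy.stats import norm

alpha = 0.05
z = norm.ppf(1 - alpha / 2)

def obrien_fleming_spent(t):
    """Lan-DeMets O'Brien-Fleming-type spending: tiny early, most near t = 1."""
    return 2 * (1 - norm.cdf(z / np.sqrt(t)))

def pocock_spent(t):
    """Lan-DeMets Pocock-type spending: roughly uniform over information time."""
    return alpha * np.log(1 + (np.e - 1) * t)

for t in (0.25, 0.5, 0.75, 1.0):
    print(f"t={t:.2f}  OBF={obrien_fleming_spent(t):.4f}  Pocock={pocock_spent(t):.4f}")
```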
The always-valid p-value (or e-value) framework provides an alternative: construct a nonnegative test martingale $M_t$ with $E[M_t] \leq 1$ under $H_0$, and reject when $M_t \geq 1/\alpha$. By Ville's inequality this permits continuous monitoring with Type I error controlled at $\alpha$.
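A minimal sketch of the idea for Bernoulli data: the likelihood-ratio process against a fixed alternative is a nonnegative martingale with expectation 1 under $H_0$, so stopping the first time it exceeds $1/\alpha$ keeps the Type I error at $\alpha$ regardless of when you look. The alternative $p_1$ below is a made-up choice; practical systems typically mix over alternatives instead of fixing one.

```python
import numpy as np

rng = np.random.default_rng(3)
alpha, p0, p1 = 0.05, 0.5, 0.6  # null rate and a fixed (assumed) alternative

def run_until_rejection(true_p, max_n=20_000):
    """Track the likelihood-ratio martingale M_t; reject when M_t >= 1/alpha."""
    m = 1.0
    for t in range(1, max_n + 1):
        x = rng.binomial(1, true_p)
        m *= (p1 ** x * (1 - p1) ** (1 - x)) / (p0 ** x * (1 - p0) ** (1 - x))
        if m >= 1 / alpha:
            return t          # rejected H0 at time t
    return None               # never rejected

print("under H1 (p=0.6), rejected at n =", run_until_rejection(0.6))
print("under H0 (p=0.5), rejected at n =", run_until_rejection(0.5))  # usually None
```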
Multi-Armed Bandits vs A/B Tests
A/B testing is pure exploration: it collects data uniformly across arms without adapting, then makes a decision at the end. This ignores the opportunity to reduce assignment to inferior arms during the experiment.
Multi-armed bandit (MAB) algorithms trade off exploration (learning about arms) and exploitation (assigning to the best known arm). Common strategies:
- $\varepsilon$-greedy: with probability $1-\varepsilon$ exploit the current best arm; with probability $\varepsilon$ explore uniformly
- Thompson sampling: maintain a posterior over arm parameters; sample from the posterior and pull the arm with the highest sample
- UCB (Upper Confidence Bound): pull the arm maximizing $\hat{\mu}_a + c\sqrt{\log t / n_a}$, which adds an optimism bonus for underexplored arms
MABs reduce regret (cumulative opportunity cost) relative to fixed A/B tests but complicate post-experiment inference: the adaptive sampling induces correlation between arms and time, violating the i.i.d. assumption of standard tests.
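A minimal Thompson sampling sketch for Bernoulli rewards with Beta(1, 1) priors; the true conversion rates are invented for the simulation.

```python
import numpy as np

rng = np.random.default_rng(4)
true_rates = np.array([0.10, 0.11, 0.13])  # hypothetical arm conversion rates
successes = np.ones(3)                      # Beta(1, 1) priors
failures = np.ones(3)

for t in range(10_000):
    # Sample a rate from each arm's posterior and pull the best-looking arm
    samples = rng.beta(successes, failures)
    arm = int(np.argmax(samples))
    reward = rng.binomial(1, true_rates[arm])
    successes[arm] += reward
    failures[arm] += 1 - reward

pulls = successes + failures - 2            # subtract the prior pseudo-counts
print("pulls per arm:", pulls.astype(int))
print("posterior means:", np.round(successes / (successes + failures), 3))
```

Traffic concentrates on the highest-rate arm over time, which is exactly the regret reduction (and the post-hoc inference complication) described above.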
Common Pitfalls
Novelty effect. Users may respond to any change positively simply because it is new. Metrics inflated by novelty will decay over time. Mitigate by looking at behavior of long-tenured users, or by running the experiment long enough to observe stabilization.
Network effects and SUTVA violation. The Stable Unit Treatment Value Assumption (SUTVA) requires that one unit’s outcome depends only on that unit’s treatment, not on others'. Social platforms violate SUTVA when treated users interact with control users (interference). Solutions include cluster randomization (randomize by community or geographic region) or graph-cluster randomization.
Simpson’s paradox. A treatment may appear beneficial in aggregate but harmful in every subgroup, or vice versa, due to confounding by a variable correlated with group size and outcome. Always stratify by key covariates and examine heterogeneous treatment effects.
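A tiny made-up example of the reversal: treatment wins in both segments but loses in aggregate because it was disproportionately assigned to the harder segment.

```python
# Hypothetical conversion counts: (conversions, users) per segment and arm
treatment = {"easy": (95, 100), "hard": (400, 900)}
control   = {"easy": (850, 900), "hard": (40, 100)}

def rate(conv, n):
    return conv / n

for seg in ("easy", "hard"):
    t, c = treatment[seg], control[seg]
    print(seg, f"treatment {rate(*t):.1%} vs control {rate(*c):.1%}")  # treatment wins

t_all = tuple(map(sum, zip(*treatment.values())))
c_all = tuple(map(sum, zip(*control.values())))
print("overall", f"treatment {rate(*t_all):.1%} vs control {rate(*c_all):.1%}")  # control wins
```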
Examples
Tech Company Experimentation Platforms. Companies like Netflix, Airbnb, and LinkedIn run thousands of A/B tests simultaneously. They use shared infrastructure for randomization (hash-based assignment), metric computation (usually CUPED-adjusted to reduce variance), and sequential testing dashboards. Guardrail metrics are monitored continuously with conservative alpha-spending to avoid shipping regressions.
Online Learning with Bandit Algorithms. News feed ranking, ad selection, and recommendation systems use MAB algorithms where the “arms” are content items or ranking policies. Thompson sampling over a Bayesian model of click-through rates allows the system to adapt in real time, concentrating traffic on high-performing items while still exploring new ones - achieving lower cumulative regret than a fixed holdout A/B test.