Experimental Design & A/B Testing - Making Sure Your Data Answers Your Question // Megha Bose

Helpful context:

You work at a tech company. Your team changes the signup button from blue to green. You run an experiment for two weeks. Sign-ups increase by 2%. The p-value is 0.03. You ship the change. Three months later, sign-ups have returned to baseline - and you still have a green button.

What happened? Maybe the effect was never real: even when a change does nothing, about 3% of experiments produce a result at least this extreme, and yours may have been one of them. Maybe it was real but temporary: users noticed something new and interacted with it, a phenomenon called the novelty effect. Maybe you ran the experiment during an unusual period - a marketing campaign was running, traffic mix shifted, a competitor went down. Maybe you checked the results daily and stopped when you crossed the threshold, inflating your false positive rate well above 5%.

Any of these. All of these. Naive A/B testing fails in practice more often than practitioners admit - not because the statistics are hard, but because the assumptions are routinely violated without anyone noticing. Experimental design is the discipline of running experiments whose results you can actually trust.

What a Controlled Experiment Actually Does

A randomized controlled experiment has one defining property: treatment assignment is independent of everything else about the unit being assigned.

When you randomly assign users to see the blue button or the green button, the two groups are - in expectation - identical on every dimension: age, location, browsing history, income, day of week they signed up, how tired they were, everything. You did not measure any of these things. You did not need to. Randomization handles them all at once.

This is the miracle of randomization, and it is why experiments are the foundation of causal inference. Because the groups are otherwise identical in expectation, any observed difference in outcomes between them is due either to the treatment or to chance, and the p-value quantifies how plausible chance alone is. You do not need to control for confounders. You do not need to worry about which variables you forgot to measure. Randomization buys you all of this.

The experimental estimate of the treatment effect is therefore:

$$\hat{\tau} = \bar{Y}_{\text{treatment}} - \bar{Y}_{\text{control}},$$

and under the null hypothesis of no effect, this follows a known distribution - the basis for the p-value.
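As a minimal sketch of this estimate and its p-value, the snippet below computes the difference in means and a two-sided p-value from a large-sample normal approximation. The choice of a z-test, the function name `difference_in_means`, and the sign-up counts are assumptions made for illustration, not something the text prescribes.

```python
from math import sqrt
from statistics import NormalDist, mean, variance

def difference_in_means(treatment, control):
    """Return (tau_hat, z, two_sided_p) for two independent samples."""
    tau_hat = mean(treatment) - mean(control)
    # Standard error of a difference between two independent sample means.
    se = sqrt(variance(treatment) / len(treatment)
              + variance(control) / len(control))
    z = tau_hat / se
    # Two-sided p-value under the null, using the normal approximation.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return tau_hat, z, p_value

# Hypothetical data: 1 = signed up, 0 = did not.
control = [1] * 100 + [0] * 900     # 10% sign-up rate
treatment = [1] * 130 + [0] * 870   # 13% sign-up rate

tau_hat, z, p_value = difference_in_means(treatment, control)
print(f"tau_hat={tau_hat:.3f}  z={z:.2f}  p={p_value:.4f}")
```

The normal approximation is reasonable here only because both groups are large; for small samples a t-test or a permutation test would be the safer choice.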

Setting Up the A/B Test

Before running a single experiment, you need to make several design decisions. Making them after seeing data is p-hacking.

Unit of Randomization

What entity do you randomize? Options: individual users, user sessions, page views, geographic regions, devices.

A critical principle: the unit of randomization should match the unit of analysis. If you randomize by user but analyze by session, sessions from the same user are correlated - your statistical test assumes independence, which is violated. You will underestimate variance and overstate significance.

If the treatment is something a user experiences repeatedly (like a UI change), randomize by user. If users only have one relevant interaction (like a one-time signup flow), session randomization may work.

Randomization Mechanism

The simplest mechanism: hash the user ID modulo 2. Users with even hashes go to control; odd hashes go to treatment. This is deterministic (the same user always gets the same assignment), reproducible, and avoids the problem of users switching groups between sessions.
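A sketch of hash-based assignment might look like the following. The function name `assign_variant`, the default experiment name, and the idea of salting the hash with the experiment name are assumptions added for illustration; the salt keeps the same users from always landing in the same arm across different experiments.

```python
import hashlib

def assign_variant(user_id: str, experiment: str = "signup_button_color") -> str:
    """Deterministically map a user to 'control' or 'treatment'."""
    # Salting with the experiment name is an assumption of this sketch:
    # it decorrelates assignments across experiments.
    key = f"{experiment}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    bucket = int(digest, 16) % 2
    # Even hash -> control, odd hash -> treatment.
    return "control" if bucket == 0 else "treatment"

# The same user always receives the same assignment.
print(assign_variant("user-12345"))
print(assign_variant("user-12345"))
```

A stable hash such as SHA-256 is used instead of Python's built-in `hash()`, because string hashes from `hash()` are randomized per process by default and would not be reproducible.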

More sophisticated designs use stratified randomization: ensure that the treatment and control groups are balanced on key covariates (age group, country, device type) by randomizing separately within each stratum. This reduces variance in the estimator and protects against unlucky randomizations where one group happens to be systematically different.
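Stratified randomization can be sketched as below, assuming the full list of users and their strata is known before assignment (which is not always true for live traffic). The helper `stratified_assign` and the example users are hypothetical.

```python
import random
from collections import defaultdict

def stratified_assign(users, stratum_of, seed=0):
    """Split each stratum as evenly as possible between the two arms."""
    rng = random.Random(seed)
    by_stratum = defaultdict(list)
    for user in users:
        by_stratum[stratum_of(user)].append(user)

    assignment = {}
    for members in by_stratum.values():
        rng.shuffle(members)          # random order within the stratum
        half = len(members) // 2      # odd strata give the extra user to control
        for i, user in enumerate(members):
            assignment[user] = "treatment" if i < half else "control"
    return assignment

# Hypothetical users as (user_id, device_type) pairs.
users = [(f"user{i}", "desktop" if i % 3 == 0 else "mobile") for i in range(30)]
assignment = stratified_assign(users, stratum_of=lambda u: u[1])
```

Within each device type the two arms end up the same size (up to one user), so device type cannot be imbalanced between arms by bad luck.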

Metrics

Specify your primary metric before running the experiment. This is the one metric whose movement decides whether you ship. Common choices: sign-up rate, 7-day retention, revenue per user.

Also specify guardrail metrics: things that must not get worse. You might improve sign-ups by making the checkout so aggressive that you destroy trust and hurt long-term retention. Guardrails catch this.

If your primary metric is a rate (number of sign-ups per visitor), make sure the numerator and denominator count the same units. If you compute “sign-ups per session” but randomize by user, a user who has many sessions counts many times - the sessions are not independent observations, so treating them as such overstates your effective sample size and understates variance.

Sample Size: Decide Before You Look

This is the most commonly skipped step. You must decide how many observations to collect before running the experiment, rather than collecting data until $p < 0.05$ happens to appear.

The required sample size per group for a two-sample test is approximately:

$$n = \frac{2(z_{\alpha/2} + z_\beta)^2 \sigma^2}{\delta^2},$$

where:

$\alpha$ is the false positive rate (Type I error), typically $0.05$. Then $z_{\alpha/2} = 1.96$.
$\beta$ is the false negative rate (Type II error). Power $= 1 - \beta$, typically $80\%$, so $\beta = 0.20$ and $z_\beta = 0.84$.
$\sigma^2$ is the variance of the outcome metric (estimated from historical data).
$\delta$ is the minimum detectable effect (MDE): the smallest true effect you care about detecting.

The key insight is the $\delta^2$ in the denominator. Detecting an effect half as large requires four times the sample size. Detecting a 0.5% improvement in sign-up rate instead of a 2% improvement requires 16 times more data. Experiments to detect tiny effects on large platforms require months of data.

Power analysis makes this concrete. Before running your experiment:

Estimate $\sigma^2$ from historical baseline data.
Decide the MDE: what is the smallest effect that would change your decision?
Plug into the formula. The answer is your required sample size.
Estimate how long it takes to collect that many users. If it is two years, reconsider whether this experiment is worth running, or whether you can find a more sensitive metric.
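The steps above can be sketched in a few lines. The baseline rate, the daily traffic figure, and the function name `sample_size_per_group` are hypothetical, and the MDE is treated here as an absolute difference in sign-up rate (percentage points), which is an assumption of this sketch.

```python
from math import ceil
from statistics import NormalDist

def sample_size_per_group(sigma2, delta, alpha=0.05, power=0.80):
    """n = 2 (z_{alpha/2} + z_beta)^2 sigma^2 / delta^2, rounded up."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # about 1.96
    z_beta = NormalDist().inv_cdf(power)            # about 0.84
    return ceil(2 * (z_alpha + z_beta) ** 2 * sigma2 / delta ** 2)

# Step 1: estimate the variance from a (hypothetical) 10% baseline rate.
baseline = 0.10
sigma2 = baseline * (1 - baseline)   # variance of a 0/1 outcome

# Steps 2-3: pick an MDE and plug it in.
n_2pt = sample_size_per_group(sigma2, delta=0.02)       # 2-point MDE
n_half_pt = sample_size_per_group(sigma2, delta=0.005)  # 0.5-point MDE

# Step 4: translate into calendar time (hypothetical traffic).
daily_users_per_group = 2_000
days_needed = ceil(n_half_pt / daily_users_per_group)

print(n_2pt, n_half_pt, days_needed)
```

With these made-up inputs the 2-point MDE needs roughly 3,500 users per group, while the 0.5-point MDE needs about 16 times as many, which at 2,000 users per group per day is about a month of traffic.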

Underpowered experiments - run with too few observations to detect the effect you care about - are one of the most common failures in practice. They produce inconclusive results that get misinterpreted as evidence of no effect.

The Peeking Problem

Here is a trap that catches even experienced experimenters.

Your experiment is running. You check the dashboard daily. On day 8, the p-value crosses 0.05. You stop the experiment and declare a winner.

The problem: you were running 14 planned tests (one per day). Even if the null hypothesis is true - no real effect - the probability of ever seeing $p < 0.05$ over 14 daily checks is not 5%. It is much higher: roughly 20-25% for 14 equally spaced looks, and it keeps climbing the more often you check.

This is called optional stopping or the peeking problem. By stopping when you like based on the current p-value, you are effectively running multiple tests without correcting for it. Your false positive rate is far above the nominal $\alpha$.

The correct approach: pre-specify the sample size. Collect it. Look once.
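A small simulation makes the inflation visible. This is a sketch under simplifying assumptions: the null hypothesis is true, each day contributes an equal batch of data, and the daily result is summarized as a standard normal increment, so the cumulative z-statistic after $k$ days is the running sum divided by $\sqrt{k}$. The function name is hypothetical.

```python
import random
from math import sqrt

def peeking_false_positive_rate(n_looks=14, n_sims=20_000, z_crit=1.96, seed=0):
    """Share of null experiments that ever cross |z| > z_crit across the looks."""
    rng = random.Random(seed)
    false_positives = 0
    for _ in range(n_sims):
        running_sum = 0.0
        for k in range(1, n_looks + 1):
            # One day's standardized contribution under the null (no effect).
            running_sum += rng.gauss(0.0, 1.0)
            z = running_sum / sqrt(k)   # cumulative z-statistic after k days
            if abs(z) > z_crit:
                false_positives += 1    # stop and "declare a winner"
                break
    return false_positives / n_sims

rate_single_look = peeking_false_positive_rate(n_looks=1)
rate_daily_peeks = peeking_false_positive_rate(n_looks=14)
print(rate_single_look, rate_daily_peeks)
```

With a single look the rate sits near the nominal 5%; with 14 daily looks and stop-at-first-crossing it comes out several times higher.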

If you need to monitor experiments in real time (business pressures are real), use sequential testing methods that are designed for continuous monitoring:

Alpha spending functions (Lan-DeMets): allocate your total $\alpha$ budget across planned interim analyses. Each interim test uses less than $\alpha$; the final test uses whatever remains. The overall false positive rate is controlled.
Always-valid inference: methods based on the sequential probability ratio test (SPRT) or e-values that provide valid inference at any stopping time. Used by companies like Netflix and Spotify.

These methods are more complex but allow you to stop early and maintain your false positive rate.
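To give a feel for how an alpha spending function allocates the budget, here is a sketch of two commonly used Lan-DeMets spending functions, the O'Brien-Fleming type and the Pocock type, evaluated at four equally spaced interim looks. The look schedule and function names are assumptions, and the sketch only shows the cumulative $\alpha$ spent; converting that into critical values for each look requires the joint distribution of the interim statistics and is best left to a dedicated library.

```python
from math import exp, log, sqrt
from statistics import NormalDist

def obrien_fleming_spend(t, alpha=0.05):
    """Cumulative alpha spent at information fraction t (O'Brien-Fleming type)."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return 2 * (1 - NormalDist().cdf(z / sqrt(t)))

def pocock_spend(t, alpha=0.05):
    """Cumulative alpha spent at information fraction t (Pocock type)."""
    return alpha * log(1 + (exp(1) - 1) * t)

# Hypothetical schedule: look after 25%, 50%, 75%, and 100% of the data.
looks = [0.25, 0.50, 0.75, 1.00]
for t in looks:
    print(f"t={t:.2f}  OBF={obrien_fleming_spend(t):.5f}  Pocock={pocock_spend(t):.5f}")
```

Both functions reach the full $\alpha$ at $t = 1$. The O'Brien-Fleming type spends almost nothing early, so early stopping needs overwhelming evidence; the Pocock type spends more evenly across looks.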

Multiple Testing

Your experiment measures 20 metrics. Even if the treatment has zero effect on any of them, you expect about one to show $p < 0.05$ by chance. Running many tests without correction inflates your Type I error rate.

The simplest correction is Bonferroni: divide your significance threshold by the number of tests. If you test 20 metrics at $\alpha = 0.05$, require $p < 0.05/20 = 0.0025$ for each individual test. This controls the family-wise error rate (FWER) - the probability of any false positive.

Bonferroni is conservative. When the tests are correlated (which they often are - related metrics tend to move together), it over-corrects. A less conservative alternative is the Benjamini-Hochberg procedure, which controls the false discovery rate (FDR): the expected proportion of rejected null hypotheses that are false positives. FDR control is more powerful when you have many tests and expect some to be truly different.
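Both corrections are short enough to sketch directly. The p-values below are made up for illustration, and the function names are hypothetical.

```python
def bonferroni_reject(p_values, alpha=0.05):
    """Reject each hypothesis whose p-value is at most alpha / m."""
    threshold = alpha / len(p_values)
    return [p <= threshold for p in p_values]

def benjamini_hochberg_reject(p_values, alpha=0.05):
    """Benjamini-Hochberg step-up procedure controlling the FDR at alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k such that p_(k) <= (k / m) * alpha.
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            max_rank = rank
    # Reject every hypothesis ranked at or below that k.
    reject = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= max_rank:
            reject[idx] = True
    return reject

# Hypothetical p-values for five metrics.
p_values = [0.001, 0.008, 0.012, 0.041, 0.30]
bonf = bonferroni_reject(p_values)
bh = benjamini_hochberg_reject(p_values)
print(bonf)
print(bh)
```

On this example Bonferroni rejects the two smallest p-values, while Benjamini-Hochberg also rejects the third, which illustrates the extra power that comes from controlling the FDR instead of the FWER.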

In practice: choose your primary metric in advance and apply no correction to it. Treat secondary metrics and guardrails as exploratory. Report them honestly but do not treat each as a separate confirmatory test.

SUTVA and Network Effects

The standard analysis of an A/B test relies on the Stable Unit Treatment Value Assumption (SUTVA):

The outcome for unit $i$ depends only on $i$’s own treatment assignment, not on the assignments of other units.

SUTVA fails on social networks. If you give user $A$ a new feature that changes how they interact with user $B$, and $B$ is in the control group, then $B$’s outcome is affected by $A$’s treatment. Your control group is contaminated. The treatment and control groups are no longer independent.

This is interference or spillover, and it is endemic to any platform where users interact. Typical examples:

Showing a user a new messaging feature changes how they interact with their contacts, some of whom are in control.
Testing a two-sided marketplace feature affects supply and demand for all users.
Testing a newsfeed algorithm for some users changes the content the algorithm has to work with for all users.

Solutions include:

Cluster randomization: randomize at the cluster level (e.g., all users in a geographic region get the same assignment). Spillover within clusters is allowed; you only worry about between-cluster spillover.
Ego network randomization: randomize at the level of social network neighborhoods.
Graph cluster randomization: partition the social graph into clusters using community detection, randomize at the cluster level.

All of these increase the effective unit of randomization, which reduces statistical power. There is no free lunch: handling interference requires larger experiments.
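One common way to quantify this loss of power is the design effect, $1 + (m - 1)\rho$, where $m$ is the cluster size and $\rho$ is the intraclass correlation of the outcome within clusters. The sketch below assumes equal-sized clusters; the function names and numbers are illustrative assumptions.

```python
def design_effect(cluster_size, icc):
    """Variance inflation from cluster randomization (equal cluster sizes)."""
    return 1 + (cluster_size - 1) * icc

def effective_sample_size(n_users, cluster_size, icc):
    """Number of independent users the clustered sample is 'worth'."""
    return n_users / design_effect(cluster_size, icc)

# Hypothetical: 100,000 users in clusters of 500, intraclass correlation 0.02.
deff = design_effect(cluster_size=500, icc=0.02)
n_eff = effective_sample_size(100_000, cluster_size=500, icc=0.02)
print(deff, round(n_eff))
```

Even a small within-cluster correlation is costly when clusters are large: in this made-up case the sample behaves like one roughly a tenth of its nominal size.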

Common Pitfalls

Novelty effects. Users behave differently toward new things simply because they are new. A new UI might increase engagement for two weeks, then return to baseline as users habituate. Experiments that run for only a week capture novelty, not the long-run effect. Run experiments long enough for the novelty to wear off.

Simpson’s paradox. The aggregate data says one thing; the stratified data says the opposite. Example: overall sign-up rate is higher in treatment than in control. But when you break down by device type, treatment is lower for both mobile users and desktop users. The aggregate increase is driven by a difference in traffic mix between the arms - the treatment group ended up with a larger share of desktop traffic (which has higher base conversion), for example because allocation percentages changed partway through the experiment - not by the treatment working.

Always check for Simpson’s paradox by stratifying on major covariates. If the direction of effect flips in subgroups, your aggregate estimate is suspect.
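The stratification check can be sketched with a made-up table in which the treatment arm has a much larger share of desktop traffic than control. All counts are hypothetical, stored as (sign-ups, visitors) per device and arm.

```python
# Hypothetical (sign-ups, visitors) by device type and arm.
data = {
    "mobile":  {"control": (400, 8000), "treatment": (90, 2000)},
    "desktop": {"control": (400, 2000), "treatment": (1520, 8000)},
}

def rate(signups, visitors):
    return signups / visitors

def aggregate_rate(data, arm):
    """Pooled sign-up rate for one arm, ignoring device type."""
    signups = sum(cells[arm][0] for cells in data.values())
    visitors = sum(cells[arm][1] for cells in data.values())
    return signups / visitors

aggregate_lift = aggregate_rate(data, "treatment") - aggregate_rate(data, "control")
stratum_lifts = {
    device: rate(*cells["treatment"]) - rate(*cells["control"])
    for device, cells in data.items()
}
# Flag the paradox: positive overall, negative in every stratum.
sign_flip = aggregate_lift > 0 and all(lift < 0 for lift in stratum_lifts.values())

print(f"aggregate lift: {aggregate_lift:+.3f}")
for device, lift in stratum_lifts.items():
    print(f"{device} lift: {lift:+.3f}")
print("sign flip:", sign_flip)
```

An imbalance this large between arms would itself be a warning sign that randomization or logging is broken, so a flipped sign is a reason to investigate the assignment mechanism as well as the estimate.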

Leakage. Users assigned to control see elements of the treatment (e.g., they interact with a treated user’s behavior). The control group is no longer a clean counterfactual.

Survivorship bias. You measure only users who are still active at the end of the experiment. Users who churned are excluded. If the treatment affects churn, your metric is measured on a biased sample.

Underpowered tests. You run an experiment for one week with 1,000 users to detect a 0.1% improvement. The experiment cannot possibly detect this effect. You get $p = 0.4$, conclude “no effect”, and kill a feature that would have worked. This is a false negative, not evidence against the treatment.

Discomfort check. “We saw p = 0.04. We ship it.” But ask: what is the effect size? Is a 0.3% improvement in sign-up rate worth the engineering cost and maintenance burden? Is the test adequately powered - or is the confidence interval so wide that the true effect might be negative? Was the metric pre-specified, or did you choose it after seeing the results? Was the test pre-registered? The ritual of $p < 0.05$ without engaging with these questions is cargo-cult statistics. It produces a lot of shipped features and very little understanding of what actually works.

Beyond Binary A/B

Multi-armed bandits. Instead of fixed allocation, dynamically shift traffic toward the better-performing arm as the experiment runs. This exploits what you’ve learned while still exploring. Used when rapid adaptation matters more than precise causal identification, and when experiments have direct business cost (showing users an inferior experience while you gather data).
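As one illustration of dynamic allocation, here is a minimal sketch of Thompson sampling for two arms with 0/1 outcomes. Thompson sampling is only one of several bandit algorithms and is chosen here as an assumption; the true conversion rates are made up and would be unknown in practice.

```python
import random

def thompson_sampling(true_rates, n_rounds=5_000, seed=0):
    """Simulate Beta-Bernoulli Thompson sampling; return pulls per arm."""
    rng = random.Random(seed)
    n_arms = len(true_rates)
    successes = [0] * n_arms
    failures = [0] * n_arms
    pulls = [0] * n_arms
    for _ in range(n_rounds):
        # Draw a plausible conversion rate for each arm from its posterior
        # (uniform Beta(1, 1) prior updated with observed outcomes).
        samples = [rng.betavariate(successes[a] + 1, failures[a] + 1)
                   for a in range(n_arms)]
        arm = max(range(n_arms), key=lambda a: samples[a])
        pulls[arm] += 1
        # Simulated user response; in production this is the real outcome.
        if rng.random() < true_rates[arm]:
            successes[arm] += 1
        else:
            failures[arm] += 1
    return pulls

pulls = thompson_sampling(true_rates=[0.05, 0.20])
print(pulls)
```

Traffic drifts toward the better arm as evidence accumulates. The cost is that the arms end up with very unequal sample sizes, which is why the resulting effect estimate is less precise than one from a fixed 50/50 split.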

Factorial designs. Test multiple factors simultaneously. A $2 \times 2$ design tests two binary treatments in all four combinations. You can estimate main effects and interactions. If the two treatments interact (the effect of one depends on the other), factorial designs detect this; sequential A/B tests do not.
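With the four cell means of a $2 \times 2$ design in hand, main effects and the interaction are simple contrasts. The cell means below are hypothetical, and the interaction is defined here as the difference between the effect of A with B on and the effect of A with B off (some texts divide this by two).

```python
# Hypothetical mean sign-up rate in each cell, keyed by (A on?, B on?).
cell_means = {
    (False, False): 0.100,
    (True,  False): 0.120,
    (False, True):  0.110,
    (True,  True):  0.150,
}

# Effect of A at each level of B.
effect_a_b_off = cell_means[(True, False)] - cell_means[(False, False)]
effect_a_b_on = cell_means[(True, True)] - cell_means[(False, True)]

# Effect of B at each level of A.
effect_b_a_off = cell_means[(False, True)] - cell_means[(False, False)]
effect_b_a_on = cell_means[(True, True)] - cell_means[(True, False)]

# Main effects average over the other factor.
main_effect_a = (effect_a_b_off + effect_a_b_on) / 2
main_effect_b = (effect_b_a_off + effect_b_a_on) / 2

# Interaction: how much the effect of A changes when B is switched on.
interaction = effect_a_b_on - effect_a_b_off

print(main_effect_a, main_effect_b, interaction)
```

In this made-up example A adds 2 points alone but 4 points when B is also on. Two separate A/B tests, each run with the other factor off, would report 2 points for A and 1 point for B and miss the extra 2 points from combining them.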

Adaptive experiments. More sophisticated designs update the allocation rule based on accumulating data. Response-adaptive randomization can concentrate data where statistical power is needed most.

Holdouts. For features that are difficult to turn off (infrastructure changes, algorithm retrains), a holdout group is held back from the launch and maintained as a control for weeks or months after the experiment. This measures long-run effects that a short experiment cannot.

Experimental Design in Machine Learning

Evaluating a machine learning model is itself an experimental design problem.

The train/validation/test split is a form of experimental design. The test set plays the role of the holdout sample in an RCT: it is untouched until the very end, providing an unbiased estimate of generalization performance. If you tune your model on the test set - even implicitly, by choosing the model that does best on it - you have peeked, and your final performance estimate is optimistic.

The replication crisis in ML - papers that don’t replicate, leaderboards that don’t translate to real-world performance - traces largely to poor experimental design: test set reuse, insufficient holdout samples, and multiple comparisons across many hyperparameter settings and model variants.

Proper ML evaluation:

Fix the test set at the start. Never look at it until you have a final model.
Use cross-validation on training data to choose hyperparameters.
Correct for multiple comparisons when comparing many model variants.
Report confidence intervals around performance metrics, not just point estimates.
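The last item can be sketched with a percentile bootstrap over per-example correctness on the test set. The function name, the number of resamples, and the 85%-accurate hypothetical model are assumptions for illustration; other interval methods would also be reasonable.

```python
import random

def bootstrap_accuracy_ci(correct, n_boot=2_000, level=0.95, seed=0):
    """Percentile bootstrap interval for accuracy from 0/1 correctness flags."""
    rng = random.Random(seed)
    n = len(correct)
    accuracies = []
    for _ in range(n_boot):
        # Resample test examples with replacement and recompute accuracy.
        resample = rng.choices(correct, k=n)
        accuracies.append(sum(resample) / n)
    accuracies.sort()
    lower_idx = int((1 - level) / 2 * n_boot)
    upper_idx = int((1 + level) / 2 * n_boot) - 1
    return accuracies[lower_idx], accuracies[upper_idx]

# Hypothetical test set: 1,000 examples, 850 predicted correctly.
correct = [1] * 850 + [0] * 150
lower, upper = bootstrap_accuracy_ci(correct)
print(f"accuracy = {sum(correct) / len(correct):.3f}, 95% CI = ({lower:.3f}, {upper:.3f})")
```

An interval like this makes it obvious when two models whose point estimates differ by a fraction of a point are not distinguishable on a test set of this size.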

These practices are identical in spirit to the controls that make A/B tests trustworthy. The statistics do not care whether the “treatment” is a green button or a new neural network architecture.

Summary

Concept	Description
Unit of randomization	Must match unit of analysis to avoid correlated errors
Sample size formula	$n = 2(z_{\alpha/2} + z_\beta)^2 \sigma^2 / \delta^2$; decide before running
Minimum detectable effect	Half the MDE requires 4x the data
Power	Typically 80%; probability of detecting a true effect
Peeking problem	Checking p-value daily inflates false positive rate; use sequential testing
Multiple testing	20 metrics → expect ~1 false positive; use Bonferroni or Benjamini-Hochberg
SUTVA	Assumes no interference between units; fails on social networks
Cluster randomization	Randomize at cluster level to handle interference; reduces power
Novelty effect	New things get extra engagement; run long enough for it to decay
Simpson’s paradox	Aggregate effect can reverse sign within subgroups
Multi-armed bandit	Dynamic allocation; trades off exploration and exploitation
Pre-registration	Commit to primary metric and stopping rule before seeing data

Experimental design is not a technicality you add after designing the experiment. It is the experiment. Getting the unit of randomization wrong, failing to pre-specify the sample size, peeking at the data as it comes in, testing twenty metrics and reporting the one that worked - these are not minor issues. They are the difference between learning what is true and accumulating a large, expensive collection of convincing-looking noise.

Run fewer experiments. Run them correctly.
