Variance Reduction - Getting More Signal From Every Sample
Helpful context:
- Probability as a Language - The Grammar of Uncertainty
- Concentration Inequalities - How Far Can a Random Variable Stray From Its Mean?
- Reinforcement Learning Theory - Learning to Act Without Being Told What Is Right
Every stochastic algorithm is a bet on the law of large numbers: average enough noisy estimates and the noise cancels out, leaving the signal. But how noisy your estimates are determines how many samples you need before the noise actually cancels. In stochastic gradient descent, a single mini-batch gradient is a noisy estimate of the true gradient. High variance forces you to use a small learning rate - large learning rates cause the optimizer to overshoot in the high-variance directions - and small learning rates mean slow convergence. In reinforcement learning, the situation is worse: the reward signal depends on an entire trajectory, which is a product of many random choices, and its variance grows combinatorially with the trajectory length. You might wait a thousand episodes before seeing enough variation in rewards to estimate which action was actually better.
Variance reduction is not about getting unbiased estimates - stochastic gradient descent is already unbiased. It is about getting estimates with smaller variance using the same number of samples. The techniques range from elementary (mini-batches) to sophisticated (GAE, control variates), but they all share the same underlying logic: find something correlated with your estimator whose expectation you already know, and use it to subtract out the fluctuations. The value of a $10%$ variance reduction compounds over thousands of training iterations: the effective sample size grows, learning rates can be increased, and convergence to a better optimum becomes achievable.
These methods sit at the intersection of statistics, optimization, and reinforcement learning theory. Understanding them requires thinking carefully about what “variance” means in a learning algorithm, which quantities you have control over, and what information you can extract from the problem structure before taking any samples at all. A clever variance reduction technique is not a trick; it is a way of encoding knowledge you already have about the problem into the estimator itself.
Importance Sampling
Suppose you want to compute $\mathbb{E}_p[f(X)]$ but sampling from $p$ is difficult or expensive. If you can sample from a different distribution $q$ (called the proposal), importance sampling rewrites the expectation:
$$\mathbb{E}_p[f(X)] = \int f(x) p(x) dx = \int f(x) \frac{p(x)}{q(x)} q(x) dx = \mathbb{E}_q\left[f(X) \frac{p(X)}{q(X)}\right].$$
The estimator $\hat{\mu}{\text{IS}} = \frac{1}{n}\sum{i=1}^n f(x_i) \frac{p(x_i)}{q(x_i)}$ with $x_i \sim q$ is unbiased: $\mathbb{E}[\hat{\mu}_{\text{IS}}] = \mathbb{E}_p[f(X)]$.
The ratio $w(x) = p(x)/q(x)$ is the importance weight. The variance of the estimator is:
$$\text{Var}(\hat{\mu}_{\text{IS}}) = \frac{1}{n}\text{Var}_q\left[f(X) w(X)\right] = \frac{1}{n}\left(\mathbb{E}_q[f(X)^2 w(X)^2] - \mu^2\right).$$
This can be dramatically larger than the variance under $p$. If $q$ puts very little mass on a region where $p$ has high mass and $f$ is large, then the importance weight $w(x) = p(x)/q(x)$ in that region is enormous, causing a few samples to have extreme values. The variance of the IS estimator is finite only if $\mathbb{E}_q[w(X)^2] < \infty$, i.e., if $\chi^2(p | q) < \infty$. When $p$ and $q$ have heavy tails in different places, this condition fails.
In off-policy RL, importance sampling is used to reuse experience collected under an old policy $\mu$ to estimate quantities under the current policy $\pi$:
$$\mathbb{E}\pi[G_t] = \mathbb{E}\mu\left[G_t \cdot \prod_{k=t}^{T} \frac{\pi(a_k | s_k)}{\mu(a_k | s_k)}\right].$$
The product of ratios along a trajectory can be exponentially large or small, making off-policy IS estimators extremely high-variance for long trajectories. Truncated IS (capping importance weights at some maximum value) introduces bias but controls variance.
Control Variates
The control variate technique is the most widely applicable variance reduction method. Suppose you want to estimate $\mu = \mathbb{E}[f(X)]$. If you know a function $g(X)$ whose expectation $\mu_g = \mathbb{E}[g(X)]$ is known analytically, then for any constant $c$:
$$\mathbb{E}[f(X) - c(g(X) - \mu_g)] = \mathbb{E}[f(X)] - c \cdot 0 = \mu.$$
The estimator $\hat{\mu}_c = \frac{1}{n}\sum_i (f(x_i) - c(g(x_i) - \mu_g))$ is unbiased for any $c$. Its variance is:
$$\text{Var}(\hat{\mu}_c) = \text{Var}(f) - 2c \text{Cov}(f, g) + c^2 \text{Var}(g).$$
This is minimized over $c$ by:
$$c^* = \frac{\text{Cov}(f, g)}{\text{Var}(g)}.$$
With optimal $c^*$, the variance reduction factor is:
$$\frac{\text{Var}(\hat{\mu}{c^*})}{\text{Var}(f/n)} = 1 - \rho{fg}^2$$
where $\rho_{fg} = \text{Corr}(f, g)$. A control variate with correlation $0.9$ reduces variance by $81%$, equivalent to multiplying the sample size by $5.26$.
The art is finding a $g$ that is highly correlated with $f$ and has a known expectation. For Monte Carlo integration, common choices include simpler functions that approximate $f$, or exact analytic results for nearby problems. For policy gradient methods, the natural choice is the value function - which brings us to baselines.
The Policy Gradient Baseline
The REINFORCE estimator for the policy gradient is:
$$\nabla_\theta J(\theta) = \mathbb{E}\pi\left[G_t \nabla\theta \log \pi_\theta(a_t | s_t)\right]$$
where $G_t = \sum_{k \geq t} \gamma^{k-t} r_k$ is the return from time $t$. This estimator has high variance because $G_t$ fluctuates across episodes even when the policy is the same.
The key identity: for any function $b(s_t)$ that does not depend on the action $a_t$:
$$\mathbb{E}\pi\left[b(s_t) \nabla\theta \log \pi_\theta(a_t | s_t)\right] = b(s_t) \mathbb{E}\pi\left[\nabla\theta \log \pi_\theta(a_t | s_t)\right] = b(s_t) \cdot 0 = 0$$
because $\mathbb{E}{a \sim \pi}[\nabla\theta \log \pi_\theta(a|s)] = \nabla_\theta \sum_a \pi_\theta(a|s) = \nabla_\theta 1 = 0$. This means:
$$\mathbb{E}\pi\left[(G_t - b(s_t)) \nabla\theta \log \pi_\theta(a_t | s_t)\right] = \mathbb{E}\pi\left[G_t \nabla\theta \log \pi_\theta(a_t | s_t)\right].$$
Adding a state-dependent baseline does not change the expected gradient - it is a control variate with zero-mean correction. But it does change the variance. The optimal baseline (minimizing gradient variance) is approximately $b^*(s_t) = V^\pi(s_t) = \mathbb{E}\pi[G_t | s_t]$, the expected return from state $s_t$. With this baseline, the estimator becomes $(G_t - V^\pi(s_t)) \nabla\theta \log \pi_\theta(a_t|s_t)$.
When the return is below its expected value, the gradient term is downweighted (the action performed worse than average); when above, it is upweighted. This centering dramatically reduces variance because $G_t - V^\pi(s_t)$ has much smaller magnitude than $G_t$ alone.
The Advantage Function
The quantity $A^\pi(s, a) = Q^\pi(s, a) - V^\pi(s)$ is the advantage function. It measures how much better action $a$ is compared to the average action in state $s$ under policy $\pi$.
Using the advantage in policy gradient:
$$\nabla_\theta J(\theta) = \mathbb{E}\pi\left[A^\pi(s_t, a_t) \nabla\theta \log \pi_\theta(a_t | s_t)\right].$$
This is the control variate interpretation: $V^\pi(s_t)$ is the control variate $g$, the action-selection score $\nabla_\theta \log \pi$ is uncorrelated with the state baseline by the identity above, and $A^\pi = G_t - V^\pi(s_t)$ is the centered estimand.
Intuitively: the advantage is zero-centered around the average. Positive advantage means “this action was better than average, increase its probability.” Negative advantage means “this action was worse than average, decrease its probability.” The gradient update is proportional to the deviation from average performance, not the absolute level of performance. This is why the baseline matters: if rewards are always large and positive (e.g., scores in the range 1000-1100), the gradient direction is the right, but the variance is enormous due to the scale. Subtracting the baseline centers the signal at zero, eliminating this scale-induced variance.
GAE: Generalized Advantage Estimation
How should we estimate $A^\pi(s_t, a_t)$ in practice? There is a fundamental bias-variance tradeoff.
Define the TD residual at time $t$:
$$\delta_t = r_t + \gamma V^\pi(s_{t+1}) - V^\pi(s_t).$$
This is the one-step TD error: the difference between the TD target $r_t + \gamma V^\pi(s_{t+1})$ and the current value estimate $V^\pi(s_t)$. Note $\mathbb{E}[\delta_t | s_t, a_t] = A^\pi(s_t, a_t)$ (exactly, when $V^\pi$ is exact).
One-step estimator ($k=1$): $\hat{A}_t^{(1)} = \delta_t$. Low variance (uses only one step of reward), but high bias (relies heavily on the accuracy of $V^\pi$).
Monte Carlo estimator ($k \to \infty$): $\hat{A}t^{(\infty)} = G_t - V^\pi(s_t) = \sum{k=0}^\infty \gamma^k r_{t+k} - V^\pi(s_t)$. Unbiased but high variance (the return fluctuates across trajectories).
$k$-step estimator: $\hat{A}t^{(k)} = \sum{l=0}^{k-1} \gamma^l \delta_{t+l}$. Bias decreases as $k$ increases; variance increases. The bias comes from using $V^\pi$ as a bootstrap value at the end of the $k$-step return; if $V^\pi$ is wrong, this introduces bias. Monte Carlo uses no bootstrap and is unbiased.
GAE (Schulman et al., 2015) takes an exponential average over all $k$-step estimators:
$$\hat{A}t^{\text{GAE}(\gamma, \lambda)} = \sum{k=0}^\infty (\gamma\lambda)^k \delta_{t+k} = \delta_t + (\gamma\lambda)\delta_{t+1} + (\gamma\lambda)^2 \delta_{t+2} + \cdots$$
where $\lambda \in [0, 1]$ is the GAE parameter. When $\lambda = 0$: $\hat{A}_t = \delta_t$ (one-step TD, low variance, high bias). When $\lambda = 1$: $\hat{A}_t = G_t - V^\pi(s_t)$ (Monte Carlo, unbiased, high variance). Intermediate $\lambda$ interpolates between these extremes.
In practice (PPO, A3C, etc.), $\lambda = 0.95$ and $\gamma = 0.99$ are common. The discounting by $(\gamma\lambda)^k$ naturally downweights distant TD errors, both reducing variance and giving recent rewards more influence. GAE is now a standard component of nearly all policy gradient methods.
Mini-Batches and Shuffling
The simplest variance reduction technique is averaging: if each sample has variance $\sigma^2$, the average of $B$ iid samples has variance $\sigma^2 / B$. Mini-batch SGD with batch size $B$ reduces gradient variance by a factor of $B$ relative to single-sample SGD.
But samples in a mini-batch are not iid - they are drawn from the same dataset. If the dataset has structure (similar examples appear in adjacent rows), then samples within a mini-batch are correlated, and the variance reduction is less than $1/B$. This is why shuffling the dataset before each epoch is important: it breaks correlations between successive mini-batches and ensures that the batches are closer to iid samples from the data distribution.
The relationship between batch size and variance also has implications for the effective learning rate. The optimal learning rate scales approximately as $\eta \propto \sqrt{B}$ (square root scaling) or $\eta \propto B$ (linear scaling) depending on the optimizer and the noise structure. Doubling the batch size roughly halves the variance, allowing the learning rate to increase proportionally to maintain the same signal-to-noise ratio.
Large batch training faces a wall: beyond a critical batch size, the variance of the stochastic gradient is dominated by curvature (how much the gradient changes from point to point), not noise. Additional batch size no longer reduces effective variance, and the computational cost is wasted.
Antithetic Variates and Stratified Sampling
Two classical Monte Carlo techniques deserve mention for their elegance and their appearance in modern ML.
Antithetic variates. Suppose you want to estimate $\mu = \mathbb{E}[f(U)]$ where $U \sim \text{Uniform}(0,1)$. Instead of drawing $n$ iid samples $U_1, \ldots, U_n$, draw $n/2$ samples and use pairs $(U_i, 1 - U_i)$. The estimator is:
$$\hat{\mu} = \frac{1}{n}\sum_{i=1}^{n/2}\left(f(U_i) + f(1 - U_i)\right).$$
The variance is:
$$\text{Var}(\hat{\mu}) = \frac{\text{Var}(f(U)) + \text{Cov}(f(U), f(1-U))}{n/2}.$$
If $f$ is monotone, then $\text{Cov}(f(U), f(1-U)) < 0$ (when $U$ is large, $1-U$ is small, and a monotone $f$ maps them to opposite ends of its range). Negative covariance reduces the variance below $\text{Var}(f)/n$ - better than iid sampling. In RL, this appears in paired rollouts: run two episodes with opposite random seeds to ensure that the variability in one is counterbalanced by the other.
Stratified sampling. Divide the sample space into $K$ strata (non-overlapping regions). Draw $n_k$ samples from each stratum (with $\sum_k n_k = n$). The estimator is $\hat{\mu} = \sum_k w_k \hat{\mu}_k$ where $w_k$ is the probability of stratum $k$ and $\hat{\mu}_k$ is the stratum-level average.
Stratified sampling always has variance at most as large as iid sampling, and strictly less if the stratum means differ. Optimal allocation (Neyman allocation) puts more samples in strata with higher variance: $n_k \propto w_k \sigma_k$. In practice, stratified sampling is used in importance-weighted experience replay and curriculum learning to ensure that different types of examples contribute proportionally to each gradient step.
Clipping as Variance Control
A practical variance reduction technique that appears throughout modern RL is clipping importance weights.
In PPO (Proximal Policy Optimization), the policy gradient objective is:
$$\mathcal{L}^{\text{CLIP}}(\theta) = \mathbb{E}_t\left[\min\left(r_t(\theta) \hat{A}_t,; \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon) \hat{A}_t\right)\right]$$
where $r_t(\theta) = \pi_\theta(a_t|s_t) / \pi_{\theta_\text{old}}(a_t|s_t)$ is the importance ratio. The clip prevents the ratio from straying far from 1, which prevents large importance weights from causing catastrophic updates. The clipping introduces a small bias (the gradient is zeroed out when the ratio is outside $[1-\epsilon, 1+\epsilon]$) but dramatically reduces the variance of the policy gradient estimator when policies differ.
In RLHF (Reinforcement Learning from Human Feedback), reward clipping is similarly used: cap the reward at some maximum value to prevent extreme rewards from dominating the gradient. This is a form of Huber-loss-style robustness, trading a small amount of bias for a large reduction in the influence of reward outliers.
The tension in all clipping approaches is the bias-variance tradeoff made explicit: tighter clipping reduces variance but introduces more bias (the gradient estimator is no longer unbiased for the true objective). Choosing $\epsilon = 0.2$ in PPO is an empirical choice that works well across many tasks - not a theoretically derived optimum.
Summary
| Technique | Variance reduction mechanism | Tradeoff |
|---|---|---|
| Mini-batches | Average over $B$ samples: variance $\propto 1/B$ | Computational cost scales with $B$ |
| Importance sampling | Reuse off-policy data; unbiased | Variance blows up if $p/q$ ratio is large |
| Control variates | Subtract correlated term with known mean | Need to find $g$ correlated with $f$ |
| Baseline in policy gradient | Subtract $V(s)$; gradient unchanged | Need accurate value function estimate |
| Advantage function | $A(s,a) = Q - V$; centered around zero | Same requirement as baseline |
| GAE ($\lambda$) | Exponential average of $k$-step returns | $\lambda$ trades bias vs. variance |
| Antithetic variates | Negative correlation between paired samples | Only works for monotone functions |
| Clipping (PPO) | Limit importance ratio to $[1-\epsilon,1+\epsilon]$ | Introduces bias; removes worst-case variance |
| Concept | Definition |
|---|---|
| Importance weight | $w(x) = p(x)/q(x)$; reweights samples from $q$ to estimate expectations under $p$ |
| Control variate | Function $g$ with known mean $\mu_g$; use $f - c(g - \mu_g)$ as unbiased estimator |
| Optimal $c$ | $c^* = \text{Cov}(f,g)/\text{Var}(g)$; minimizes estimator variance |
| Variance reduction factor | $1 - \rho^2_{fg}$; fraction of variance remaining after optimal control variate |
| TD residual $\delta_t$ | $r_t + \gamma V(s_{t+1}) - V(s_t)$; one-step temporal difference error |
| GAE | $\hat{A}t = \sum_k (\gamma\lambda)^k \delta{t+k}$; $\lambda=0$ gives TD; $\lambda=1$ gives MC |
| REINFORCE baseline | Subtracting $b(s)$ from $G_t$ reduces variance without changing expected gradient |
| PPO clipping | Caps ratio $\pi_\theta/\pi_{\text{old}}$ at $1\pm\epsilon$; limits off-policy variance |
Read next: