Modern Optimizer Zoo
Optimization algorithms are the engines of deep learning. The choice of optimizer, its hyperparameters, and its interaction with the loss landscape determine whether a model trains at all, how fast, and how well it generalizes. This post covers the main optimizer families - from vanilla gradient descent to Adam and its variants - with formal convergence guarantees where they exist and honest discussion of where the theory breaks down.
Gradient Descent and SGD
Gradient Descent (GD) computes the full gradient over the entire dataset at each step:
$$x_{t+1} = x_t - \eta \nabla f(x_t),$$
where $\eta > 0$ is the learning rate. For a dataset of $n$ examples with loss $f(x) = \frac{1}{n}\sum_{i=1}^n \ell_i(x)$, each step costs $O(n)$ gradient evaluations.
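As a concrete sketch, here is plain GD in numpy on a toy quadratic (the function name and setup are illustrative, not from any particular library):

```python
import numpy as np

def gradient_descent(grad_f, x0, eta, steps):
    """Plain GD: x_{t+1} = x_t - eta * grad_f(x_t)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(steps):
        x = x - eta * grad_f(x)
    return x

# Toy example: f(x) = 0.5 * ||x||^2 has gradient x, so GD contracts toward 0.
x_final = gradient_descent(lambda x: x, np.ones(3), eta=0.1, steps=100)
```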
Theorem (GD convergence - convex). If $f$ is convex with $L$-Lipschitz gradient and $\eta = 1/L$:
$$f(x_T) - f^\ast \leq \frac{L\|x_0 - x^\ast\|^2}{2T}.$$
Rate: $O(1/T)$ iterations.
Theorem (GD convergence - strongly convex). If $f$ is $\mu$-strongly convex with $L$-Lipschitz gradient and $\eta = 2/(\mu + L)$:
$$\|x_T - x^\ast\|^2 \leq \left(\frac{\kappa - 1}{\kappa + 1}\right)^{2T} \|x_0 - x^\ast\|^2,$$
where $\kappa = L/\mu$. This is linear convergence - the log-error decreases by $\Theta(1/\kappa)$ per step.
Stochastic Gradient Descent (SGD) uses a single randomly sampled gradient per step:
$$x_{t+1} = x_t - \eta_t \nabla \ell_{i_t}(x_t), \quad i_t \sim \text{Uniform}\{1, \ldots, n\}.$$
Each step is $O(1)$ rather than $O(n)$. The stochastic gradient $\nabla \ell_{i_t}$ is an unbiased estimator of $\nabla f(x_t)$ but has variance. With diminishing step sizes $\eta_t = c/\sqrt{t}$, SGD achieves $O(1/\sqrt{T})$ convergence for convex problems - slower per iteration than GD, but often faster per gradient evaluation, since each step needs a factor of $n$ fewer gradient computations.
Mini-batch SGD averages gradients over $B$ samples per step, reducing the gradient variance by a factor of $B$ while keeping the cost at $O(B)$. Batch size $B = 1$ is pure SGD; $B = n$ is GD. In practice, $B \in \{32, 128, 512\}$ balances hardware utilization (GPU parallelism) and gradient quality.
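A minimal mini-batch SGD loop might look like the following (numpy sketch; the `grad_i` interface and sampling with replacement are illustrative assumptions):

```python
import numpy as np

def minibatch_sgd(grad_i, n, x0, eta, steps, batch_size=32, seed=0):
    """Mini-batch SGD: average per-example gradients over a random batch.

    grad_i(x, i) should return the gradient of example i's loss at x.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    for _ in range(steps):
        batch = rng.integers(0, n, size=batch_size)   # sample with replacement
        g = np.mean([grad_i(x, i) for i in batch], axis=0)
        x = x - eta * g
    return x
```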
```
GD: full gradient, smooth descent

f(x) | *
     |   *
     |     *
     |       *  x*
     +--------------> t

SGD: noisy, bounces around, but cheaper per step

f(x) | *    *
     |   *    *   *
     |      *   *   *
     |        *  near x*
     +--------------> t
```
The noise in SGD is not always a bug: it acts as implicit regularization and can escape sharp local minima, which may explain part of its generalization advantage over full-batch GD in deep learning.
Momentum
Polyak Momentum (classical momentum) accumulates a velocity vector:
$$m_{t+1} = \beta m_t + \nabla f(x_t), \quad x_{t+1} = x_t - \eta m_{t+1},$$
where $\beta \in [0, 1)$ is the momentum coefficient (typically $0.9$). Equivalently:
$$x_{t+1} = x_t - \eta \nabla f(x_t) + \beta(x_t - x_{t-1}),$$
where $\beta(x_t - x_{t-1})$ carries over the previous step's direction (the "heavy-ball" form).
The update is a weighted sum of all past gradients with exponentially decaying weights: $m_{t+1} = \sum_{s=0}^{t} \beta^{t-s} \nabla f(x_s)$. In directions where gradients consistently point the same way, the sum grows large (fast progress); in oscillating directions, gradients cancel (damping).
For a quadratic $f(x) = \frac{1}{2}x^T A x$ with condition number $\kappa$, optimal momentum with $\beta = \left(\frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1}\right)^2$ achieves convergence in $O(\sqrt{\kappa})$ steps versus $O(\kappa)$ for gradient descent - a square-root improvement.
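In code, the heavy-ball update is two lines (a hypothetical helper, shown for clarity; the state `m` is carried between calls):

```python
def momentum_step(x, m, grad, eta=0.01, beta=0.9):
    """One Polyak (heavy-ball) update: m <- beta*m + g, then x <- x - eta*m."""
    m = beta * m + grad
    x = x - eta * m
    return x, m
```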
Nesterov Accelerated Gradient
Nesterov’s accelerated gradient (NAG) is sometimes called the “look-ahead” method because it evaluates the gradient at a predicted future point:
$$y_{t+1} = x_t + \beta(x_t - x_{t-1})$$ $$x_{t+1} = y_{t+1} - \eta \nabla f(y_{t+1}).$$
The extrapolated point $y_{t+1}$ is where plain momentum would move; NAG corrects from there.
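A sketch of one NAG iteration, tracking the previous iterate explicitly (names are illustrative):

```python
def nag_step(x, x_prev, grad_f, eta=0.01, beta=0.9):
    """One NAG update: extrapolate to y, then take the gradient step from y."""
    y = x + beta * (x - x_prev)       # look-ahead (where momentum would land)
    x_next = y - eta * grad_f(y)      # correct using the gradient at y
    return x_next, x                  # new iterate and the new "previous" point
```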
Theorem (Nesterov, 1983). For convex $f$ with $L$-Lipschitz gradient, NAG achieves
$$f(x_T) - f^\ast \leq \frac{2L\|x_0 - x^\ast\|^2}{(T+1)^2}.$$
Rate: $O(1/T^2)$ versus $O(1/T)$ for plain gradient descent. This is the optimal rate for first-order methods on smooth convex functions - no algorithm that only uses gradients can do better (Nesterov’s lower bound).
For $\mu$-strongly convex $f$, NAG achieves linear convergence with rate $\rho = 1 - 1/\sqrt{\kappa}$, compared to $\rho = 1 - 1/\kappa$ for GD. The improvement $\kappa \to \sqrt{\kappa}$ is fundamental.
In deep learning, momentum is nearly universal but Nesterov’s exact scheme is less common. The intuition carries over: look ahead before correcting.
AdaGrad
All methods so far use a single global learning rate $\eta$. AdaGrad (Duchi et al., 2011) adapts the learning rate per parameter based on cumulative gradient history:
$$G_{t,i} = \sum_{s=1}^t g_{s,i}^2, \quad x_{t+1,i} = x_{t,i} - \frac{\eta}{\sqrt{G_{t,i} + \epsilon}} g_{t,i},$$
where $g_{t,i} = \partial f(x_t)/\partial x_i$ and $\epsilon$ is a small constant for numerical stability (typically $10^{-8}$).
The effective learning rate for parameter $i$ is $\eta / \sqrt{G_{t,i}}$: parameters that receive large gradients frequently get smaller steps; parameters with rare or small gradients get larger steps. This is particularly effective for sparse gradient problems (e.g., word embeddings where most coordinates are zero most of the time - only the relevant vocabulary entries get large gradient updates).
The drawback: $G_{t,i}$ only grows, so the effective learning rate monotonically decreases. Training can stall before convergence, especially in the non-convex deep learning setting.
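A per-coordinate AdaGrad step, as a numpy sketch (the accumulator `G` is carried between calls and, as noted above, only ever grows):

```python
import numpy as np

def adagrad_step(x, G, grad, eta=0.01, eps=1e-8):
    """One AdaGrad update: per-coordinate learning rate eta / sqrt(G + eps)."""
    G = G + grad**2                        # G only grows, so steps only shrink
    x = x - eta / np.sqrt(G + eps) * grad
    return x, G
```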
RMSprop
RMSprop (Hinton, unpublished, 2012) fixes AdaGrad’s monotone decay by using an exponential moving average of squared gradients:
$$v_t = \gamma v_{t-1} + (1 - \gamma) g_t^2, \quad x_{t+1} = x_t - \frac{\eta}{\sqrt{v_t + \epsilon}} g_t,$$
where $\gamma \in (0, 1)$ is the decay rate (typically $0.9$). The accumulator $v_t$ tracks recent gradient magnitudes rather than the full history, allowing the effective learning rate to recover if gradients become small.
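The corresponding one-step sketch (only the accumulator line differs from AdaGrad):

```python
import numpy as np

def rmsprop_step(x, v, grad, eta=1e-3, gamma=0.9, eps=1e-8):
    """One RMSprop update: an EMA of squared gradients replaces AdaGrad's sum."""
    v = gamma * v + (1 - gamma) * grad**2
    x = x - eta / np.sqrt(v + eps) * grad
    return x, v
```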
RMSprop works well for recurrent networks and non-stationary objectives, where the gradient scale can change dramatically during training.
Adam
Adam (Kingma & Ba, 2014) combines momentum (first moment) with RMSprop (second moment) and adds bias correction:
$$m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t \quad \text{(first moment estimate)}$$
$$v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2 \quad \text{(second moment estimate)}$$
$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1 - \beta_2^t} \quad \text{(bias correction)}$$
$$x_{t+1} = x_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon}\, \hat{m}_t.$$
Default hyperparameters: $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$, $\eta = 10^{-3}$.
Bias correction is necessary because $m_0 = v_0 = 0$: early in training, $m_t$ and $v_t$ are biased toward zero. Dividing by $1 - \beta_1^t$ (which is close to zero early on) inflates the estimates back to their true scale.
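Putting the four equations together, one Adam step might be implemented as follows (numpy sketch with the default hyperparameters above):

```python
import numpy as np

def adam_step(x, m, v, grad, t, eta=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; t is the 1-indexed step count."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad**2
    m_hat = m / (1 - beta1**t)   # bias correction: undo the pull toward m_0 = 0
    v_hat = v / (1 - beta2**t)
    x = x - eta * m_hat / (np.sqrt(v_hat) + eps)
    return x, m, v
```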
Why Adam works so well in practice:
- Per-parameter learning rates adapt to the gradient scale - parameters with consistently large gradients take smaller steps, those with small gradients take larger ones. This makes training robust to a poor choice of the global $\eta$.
- The first moment provides momentum, smoothing noisy gradients.
- The effective step size $\eta \hat{m}_t / (\sqrt{\hat{v}_t} + \epsilon)$ has a bounded magnitude - roughly $\eta$ - regardless of gradient scale. This prevents runaway updates.
Adam is the default optimizer for transformers, diffusion models, and most deep learning tasks.
AdamW
AdamW (Loshchilov & Hutter, 2019) decouples weight decay from the gradient update. Naive Adam with L2 regularization modifies the gradient: $g_t \leftarrow g_t + \lambda x_t$. This interacts badly with the adaptive scaling - the weight decay contribution gets divided by $\sqrt{v_t}$, which is not the intended behavior.
AdamW instead applies weight decay directly to the parameters:
$$x_{t+1} = x_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \hat{m}_t - \eta \lambda x_t.$$
The weight decay step $\eta \lambda x_t$ is independent of the gradient history. This is the standard practice in transformer training (BERT, GPT, etc.) and typically improves generalization.
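The only change from the Adam sketch above is the final decay term, which bypasses the $\sqrt{\hat{v}_t}$ scaling entirely (illustrative sketch):

```python
import numpy as np

def adamw_step(x, m, v, grad, t, eta=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    """One AdamW update: decay is applied to x directly, outside the scaling."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad**2
    m_hat = m / (1 - beta1**t)
    v_hat = v / (1 - beta2**t)
    x = x - eta * m_hat / (np.sqrt(v_hat) + eps) - eta * weight_decay * x
    return x, m, v
```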
The Adam Convergence Controversy
Despite its empirical success, Adam does not always converge, even on convex problems.
Reddi et al. (2018) exhibited a simple convex example where Adam fails to converge. The issue is that the adaptive step size can oscillate: in directions where gradients occasionally spike, $v_t$ grows large (reducing the step), but when the spike is rare, $v_t$ decays back (increasing the step before the next spike). The effective learning rate can fail to diminish appropriately.
AMSGrad (Reddi et al., 2018) fixes this by using a monotonically non-decreasing second moment: $\hat{v}_t = \max(\hat{v}_{t-1}, v_t)$. This guarantees convergence in the convex setting. In practice, AMSGrad is rarely used - Adam still works better empirically even if its theoretical convergence is not guaranteed.
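For concreteness, AMSGrad changes a single line relative to Adam - the running max (sketch below omits bias correction, following the original paper's presentation):

```python
import numpy as np

def amsgrad_step(x, m, v, v_max, grad, eta=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam with a monotone second moment: v_max never decreases."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad**2
    v_max = np.maximum(v_max, v)              # the AMSGrad fix
    x = x - eta * m / (np.sqrt(v_max) + eps)
    return x, m, v, v_max
```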
The gap between Adam’s theory and practice is a major open problem in optimization for deep learning.
LAMB and LARS
LARS (You et al., 2017) and LAMB (You et al., 2020) are designed for large-batch training - using $B \in \{8192, 32768\}$ or more samples per step to exploit distributed hardware.
The key insight: large batches reduce gradient noise, so the learning rate can be scaled up with the batch size - proportionally to $B$ under the linear scaling rule, or to $\sqrt{B}$ under square-root scaling. But this creates instability for some layers (especially normalization layers). LARS and LAMB apply layer-wise learning rate adaptation: each layer gets a learning rate proportional to the ratio of its parameter norm to its gradient norm:
$$\eta_\ell = \eta \cdot \frac{\|w_\ell\|}{\|\nabla_\ell f\|}.$$
LAMB combines this with Adam’s moment estimates. It was used to train BERT in 76 minutes (instead of days) with batch size 32768.
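The layer-wise trust ratio itself is a one-liner (sketch; LAMB applies the same kind of scaling to Adam's update direction rather than to the raw gradient):

```python
import numpy as np

def layerwise_lr(eta, w, grad, eps=1e-8):
    """LARS-style trust ratio: per-layer LR proportional to ||w|| / ||grad||."""
    return eta * np.linalg.norm(w) / (np.linalg.norm(grad) + eps)
```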
Learning Rate Schedules
The learning rate $\eta$ is often not constant - it follows a schedule.
Cosine annealing: $\eta_t = \eta_{\min} + \frac{1}{2}(\eta_{\max} - \eta_{\min})\left(1 + \cos\left(\frac{t}{T}\pi\right)\right)$. The learning rate starts at $\eta_{\max}$, smoothly decreases to $\eta_{\min}$, optionally restarting (cosine with restarts / SGDR). Very common in practice.
Linear warmup: For the first $T_w$ steps, increase $\eta$ linearly from 0 to $\eta_{\max}$. Warmup prevents instability early in training when parameters are far from the optimum and gradient estimates are noisy. Transformer training universally uses warmup + cosine decay.
Cyclical learning rates (CLR): Cycle $\eta$ between $\eta_{\min}$ and $\eta_{\max}$ on a fixed schedule. Smith (2017) showed this can improve final accuracy - the cyclic exploration helps escape sharp minima.
Step decay / polynomial decay: Simple schedules that reduce $\eta$ by a fixed factor every $k$ steps or follow $\eta_t = \eta_0 / t^\alpha$.
```
Cosine annealing:                    Linear warmup + cosine:

eta | * .                            eta |      * .
    |      * .                           |    *     * .
    |          * .                       |  *           * .
    |              * .                   | *                * .
    +-----------------T--> t             +-Tw----------------T--> t
```
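Both schedules fit in a few lines. Here is a minimal sketch of warmup + cosine as a function of the step count (the function name and the `max(1, ...)` guard are illustrative choices):

```python
import math

def warmup_cosine_lr(t, T, T_w, eta_max, eta_min=0.0):
    """Linear warmup for T_w steps, then cosine decay from eta_max to eta_min at T."""
    if t < T_w:
        return eta_max * t / T_w
    progress = (t - T_w) / max(1, T - T_w)   # fraction of the decay phase, in [0, 1]
    return eta_min + 0.5 * (eta_max - eta_min) * (1 + math.cos(math.pi * progress))
```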
Summary: Choosing an Optimizer
| Optimizer | Best for | Weakness |
|---|---|---|
| SGD + Nesterov | Convex problems, CNNs with tuned LR | Sensitive to learning rate |
| Adam | Default for transformers, RL | May not converge (Reddi 2018) |
| AdamW | Transformers with weight decay | Needs warmup |
| AdaGrad | Sparse gradients (NLP, embeddings) | LR monotone decay |
| RMSprop | RNNs, non-stationary | No bias correction |
| LAMB | Large-batch training | Extra hyperparameter |
In practice: start with AdamW and a warmup + cosine schedule. If training is convex or nearly convex (linear models, small networks), SGD with momentum may generalize better. If memory is the constraint (very large models), use 8-bit Adam or other quantized variants.
The optimizer choice interacts with batch size, architecture, initialization, and regularization - there is no universally best answer, and the theory for non-convex neural network optimization remains far behind practice.