

Here is the entire gradient descent algorithm:

$$\mathbf{x}_{t+1} = \mathbf{x}_t - \alpha \nabla f(\mathbf{x}_t).$$

Compute the gradient. Scale it by the learning rate $\alpha$. Subtract. Repeat.

That is all. No special structure. No adaptive logic. No memory of previous steps.
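As a minimal sketch, the whole rule fits in a few lines of plain Python. The quadratic test function, the starting point, and the names `gradient_descent` and `grad_f` are illustrative assumptions, not part of any library.

```python
def gradient_descent(grad_f, x0, alpha, steps):
    """Repeat x <- x - alpha * grad_f(x) for a fixed number of steps."""
    x = list(x0)
    for _ in range(steps):
        g = grad_f(x)
        x = [xi - alpha * gi for xi, gi in zip(x, g)]
    return x

# Illustrative objective: f(x) = 0.5 * (x0^2 + 10 * x1^2), minimized at the origin.
def grad_f(x):
    return [x[0], 10.0 * x[1]]

x_final = gradient_descent(grad_f, x0=[1.0, 1.0], alpha=0.1, steps=200)
```

Note that the loop carries no state other than `x` itself, which is exactly the "no memory" property described above.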

And yet, nobody uses this algorithm to train modern neural networks. The researchers building the largest language models, the engineers deploying production image classifiers, the scientists training protein structure predictors - none of them use vanilla gradient descent. They use Adam, AdamW, RMSProp, SGD with momentum, or variants thereof. These are all elaborations of the simple rule above. They all compute gradients and subtract them. But they add machinery that makes training dramatically more practical.

What is that machinery, and why is it needed?


Section 1: Three Problems with Vanilla Gradient Descent

There are three distinct problems that the modern optimizer zoo addresses. Understanding each of them explains the design of the algorithms.

Problem 1: Speed. For a convex $L$-smooth function, gradient descent converges at $O(1/t)$: after $t$ steps, the function value is within $O(1/t)$ of the optimum. To halve the error, you double the number of steps. This is a sublinear rate - manageable for moderate precision, agonizing for high precision.

For non-convex deep networks, there is no comparable convergence guarantee. The theoretical tools from convex optimization do not apply directly. But in practice, training a neural network requires the loss to drop by many orders of magnitude. With a fixed step size, the $O(1/t)$ intuition suggests that the early phases of training go fast (the loss drops rapidly) while the later phases are slow (loss plateaus are hard to escape). Vanilla gradient descent often stalls in the later phases.

Problem 2: Heterogeneous gradients. In a deep neural network, different parameters receive wildly different gradient magnitudes. A weight in the last fully-connected layer might receive gradients of order $1$ or $0.1$. A weight in the first convolutional layer, receiving gradients through many layers of backpropagation, might receive gradients of order $10^{-6}$.

With a single global learning rate $\alpha$, you face an impossible tradeoff: $\alpha$ large enough to make progress on the small-gradient parameters is dangerously large for the large-gradient parameters (causing oscillations and divergence). $\alpha$ small enough to be safe for the large-gradient parameters leaves the small-gradient parameters essentially frozen. You need different effective step sizes for different parameters.

Problem 3: Gradient noise. Computing the full gradient requires summing over the entire training set:

$$\nabla f(\mathbf{x}) = \frac{1}{n}\sum_{i=1}^n \nabla \ell(\mathbf{x}; z_i).$$

For $n = 10^8$ training examples, each gradient computation requires a full pass over the data. This is expensive - impractically so when you need thousands of gradient steps. The solution is to estimate the gradient from a random mini-batch:

$$\hat{\nabla} f(\mathbf{x}) = \frac{1}{B}\sum_{i \in \mathcal{B}} \nabla \ell(\mathbf{x}; z_i), \quad |\mathcal{B}| = B.$$

This is $n/B$ times cheaper per step, but the estimate has variance proportional to $1/B$. The noise in the gradient estimate must be managed.
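The $1/B$ scaling of the variance can be checked empirically. The sketch below uses synthetic per-example "gradients" for a single scalar parameter (an assumption made purely for illustration) and compares the spread of the mini-batch estimate at two batch sizes.

```python
import random
import statistics

random.seed(0)

# Hypothetical per-example gradients for a 1-D parameter: mean 1.0, std 2.0.
per_example_grads = [random.gauss(1.0, 2.0) for _ in range(10_000)]
full_gradient = statistics.fmean(per_example_grads)

def minibatch_gradient(batch_size):
    """Average the gradient over a uniformly sampled mini-batch."""
    batch = random.sample(per_example_grads, batch_size)
    return statistics.fmean(batch)

def estimator_variance(batch_size, trials=2_000):
    """Empirical variance of the mini-batch estimate across many draws."""
    estimates = [minibatch_gradient(batch_size) for _ in range(trials)]
    return statistics.pvariance(estimates)

var_small = estimator_variance(batch_size=4)    # roughly sigma^2 / 4
var_large = estimator_variance(batch_size=64)   # roughly sigma^2 / 64
```

Going from $B = 4$ to $B = 64$ should shrink the variance by a factor close to 16, at 16 times the cost per step.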

Each of the optimizers we discuss addresses some combination of these three problems.


Section 2: Stochastic Gradient Descent

SGD replaces the full gradient with a mini-batch estimate at each step:

$$\mathbf{x}_{t+1} = \mathbf{x}_t - \alpha \hat{\nabla} f(\mathbf{x}_t).$$

The update is cheap: $O(B)$ instead of $O(n)$ gradient evaluations. With $B = 256$ and $n = 10^8$, each step is roughly $4 \times 10^5$ times cheaper than a full-batch step. This enables orders of magnitude more steps per unit wall-clock time, and empirically, many cheap noisy steps converge faster than few expensive exact steps.

Convergence of SGD. The mini-batch gradient is an unbiased estimate of the true gradient: $\mathbb{E}[\hat{\nabla} f(\mathbf{x})] = \nabla f(\mathbf{x})$. On average, each step moves in the right direction. The noise adds variance around the correct mean.

With a constant learning rate, SGD does not converge to the exact minimum but oscillates in a neighborhood. The radius of oscillation is proportional to $\alpha$ times the gradient noise. With a decaying learning rate schedule $\alpha_t \to 0$ satisfying $\sum_t \alpha_t = \infty$ and $\sum_t \alpha_t^2 < \infty$ (e.g., $\alpha_t = c/t$; the slower decay $c/\sqrt{t}$ violates the second condition), SGD converges almost surely for convex objectives.
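A small experiment makes the difference visible. This is a sketch under stated assumptions: a one-dimensional quadratic $f(x) = \tfrac{1}{2}x^2$, additive Gaussian gradient noise, and a $1/t$ schedule (which satisfies both summability conditions). The helper `run_sgd` is hypothetical.

```python
import random

def run_sgd(step_size, steps=5_000, x0=5.0, noise_std=1.0, seed=0):
    """SGD on f(x) = 0.5 * x^2 with additive Gaussian gradient noise.

    Returns the mean of x^2 over the last 1,000 iterates as a rough measure
    of how far the iterates stay from the minimizer x = 0.
    """
    rng = random.Random(seed)
    x = x0
    tail = []
    for t in range(1, steps + 1):
        noisy_grad = x + rng.gauss(0.0, noise_std)
        x -= step_size(t) * noisy_grad
        if t > steps - 1_000:
            tail.append(x * x)
    return sum(tail) / len(tail)

tail_constant = run_sgd(lambda t: 0.1)       # constant learning rate
tail_decaying = run_sgd(lambda t: 1.0 / t)   # sum diverges, sum of squares converges
```

With the constant rate the iterates keep jittering around the minimum; with the decaying rate the jitter shrinks as training proceeds, so `tail_decaying` ends up far smaller than `tail_constant`.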

Discomfort check. If the gradient estimate is noisy, how does SGD make reliable progress? The gradient could point in the wrong direction.

For a single step: the gradient estimate could indeed point in a bad direction, especially with small batch sizes. But in expectation, it points correctly. Over many steps, the noise averages out (random directions cancel), and the signal (the true gradient direction) accumulates. This is the law of large numbers applied to the sequence of gradient estimates.

More precisely: define the gradient noise as $\xi_t = \hat{\nabla} f(\mathbf{x}_t) - \nabla f(\mathbf{x}_t)$. This has mean zero and bounded variance (under standard assumptions). The SGD iterate satisfies the same descent lemma as gradient descent, plus a noise term that averages to zero. When the learning rate decays, the noise term vanishes, and convergence follows.

The noise as a feature. SGD noise does more than average away. It actively helps in one important way: it prevents the optimizer from settling into sharp, narrow minima. A sharp minimum is one where the loss increases rapidly in some direction - a tall thin valley. The noise in SGD steps has enough variance to “kick” the optimizer out of these tall thin valleys, preferentially leaving it in wide, flat minima.

Flat minima generalize better. Intuitively: a flat minimum is one where small perturbations to the weights change the loss only slightly. A sharp minimum is sensitive to perturbations. Since test data is always somewhat different from training data (distribution shift), the model evaluated on test data is, in a sense, a perturbed version of the model at training. Flat minima degrade gracefully; sharp minima can fail catastrophically.

This is one reason SGD with momentum sometimes generalizes better than adaptive methods like Adam on image classification benchmarks, despite Adam converging to lower training loss faster.


Section 3: Momentum

SGD with a constant learning rate makes slow progress through ravines. A ravine is an elongated valley: the loss decreases steeply in one direction (the walls of the ravine) but very gently in another (the floor, pointing toward the minimum). In the steep direction, large gradients cause oscillations - the step overshoots, the gradient reverses, the next step overshoots back. In the gentle direction, small gradients mean slow progress.

The cumulative effect: much of the gradient computation is wasted on the oscillating direction, and progress toward the minimum is slow.

Momentum adds a velocity to the optimizer. Instead of using only the current gradient, the update accumulates an exponentially weighted moving average of past gradients:

$$\mathbf{v}_t = \beta \mathbf{v}_{t-1} + (1-\beta)\nabla f(\mathbf{x}_t),$$

$$\mathbf{x}_{t+1} = \mathbf{x}_t - \alpha \mathbf{v}_t.$$

The hyperparameter $\beta \in [0, 1)$ controls the memory length. With $\beta = 0$, $\mathbf{v}_t = \nabla f(\mathbf{x}_t)$ - no memory, just the current gradient. With $\beta = 0.9$, the velocity is an exponentially decaying average of approximately $1/(1-0.9) = 10$ past gradients.

Why momentum helps in ravines.

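In the oscillating direction (the ravine walls): the gradient alternates sign each step - positive, then negative, then positive. Unrolling the recursion above, the velocity in this direction is:

$$v = (1-\beta)\left(g - \beta g + \beta^2 g - \cdots\right) \approx \frac{1-\beta}{1+\beta}\, g.$$

With $\beta = 0.9$: the effective gradient magnitude in the oscillating direction is roughly $0.1\,g/1.9 \approx g/19$. The oscillations are strongly damped.

In the consistent direction (along the ravine floor): the gradient always points the same way. The velocity accumulates:

$$v = (1-\beta)\left(g + \beta g + \beta^2 g + \cdots\right) = (1-\beta)\frac{g}{1-\beta} = g.$$

With $\beta = 0.9$: the consistent direction keeps its full gradient $g$, so progress along the floor is amplified by a factor of $(1+\beta)/(1-\beta) = 19$ relative to the oscillating direction. In the unnormalized form $\mathbf{v}_t = \beta \mathbf{v}_{t-1} + \nabla f(\mathbf{x}_t)$ used by many frameworks, both velocities are $1/(1-\beta) = 10$ times larger: about $g/1.9$ across the walls and $10g$ along the floor.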

Momentum simultaneously damps oscillations and amplifies consistent progress. The ball rolls through flat spots (it has built up velocity), shoots down consistent slopes (velocity accumulates), and does not oscillate as wildly in steep directions (opposite-sign gradients cancel in the velocity).
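The two regimes can be checked numerically by feeding the velocity recursion a sign-alternating gradient and a constant one. This sketch uses the $(1-\beta)$-normalized update written above, for which the limits are $(1-\beta)g/(1+\beta)$ and $g$; dropping the $(1-\beta)$ factor multiplies both by $1/(1-\beta)$. The function name `ema_velocity` is an illustrative assumption.

```python
def ema_velocity(gradients, beta=0.9):
    """Run v <- beta * v + (1 - beta) * g over a gradient sequence; return final v."""
    v = 0.0
    for g in gradients:
        v = beta * v + (1.0 - beta) * g
    return v

g = 1.0
steps = 400
alternating = [g if t % 2 == 0 else -g for t in range(steps)]  # ravine walls
consistent = [g] * steps                                       # ravine floor

v_wall = abs(ema_velocity(alternating))   # approaches (1 - beta) / (1 + beta) * g
v_floor = ema_velocity(consistent)        # approaches g
ratio = v_floor / v_wall                  # approaches (1 + beta) / (1 - beta) = 19
```

The ratio is what matters for the ravine: for a given learning rate, the floor direction moves about 19 times further per step than the wall direction.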


Section 4: Nesterov Momentum

Nesterov’s modification looks small - computing the gradient at a “lookahead” position instead of the current position - but it has meaningful consequences.

Standard momentum: compute gradient at current position $\mathbf{x}_t$, then update.

Nesterov momentum: compute gradient at the anticipated position $\mathbf{x}_t - \alpha\beta\mathbf{v}_{t-1}$ (where the carried-over velocity alone would take you, given that each step moves by $\alpha\mathbf{v}$), then correct:

$$\mathbf{v}_t = \beta \mathbf{v}_{t-1} + (1-\beta)\nabla f(\mathbf{x}_t - \alpha\beta\mathbf{v}_{t-1}),$$

$$\mathbf{x}_{t+1} = \mathbf{x}_t - \alpha \mathbf{v}_t.$$

Physical intuition. Standard momentum is like a ball that first looks at where it is now, then commits to its velocity. Nesterov momentum is like a ball that projects where its current velocity will take it, looks at the gradient there, and incorporates that information before committing. If the lookahead position is on the downslope of a hill that standard momentum is about to overshoot, Nesterov can see this and slow down in advance.
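A minimal scalar sketch of the lookahead step follows. It measures the lookahead in parameter units, evaluating the gradient at $x - \alpha\beta v$, the point the carried-over velocity alone would reach; that scaling, the quadratic objective, and the name `nesterov_step` are assumptions of this sketch.

```python
def nesterov_step(x, v, grad_f, alpha, beta):
    """One Nesterov momentum step for a scalar parameter.

    The gradient is evaluated at the lookahead point rather than at x.
    """
    lookahead = x - alpha * beta * v
    g = grad_f(lookahead)
    v_new = beta * v + (1.0 - beta) * g
    x_new = x - alpha * v_new
    return x_new, v_new

# Illustrative objective f(x) = 0.5 * x^2, so grad f(x) = x.
x, v = 5.0, 0.0
for _ in range(500):
    x, v = nesterov_step(x, v, grad_f=lambda z: z, alpha=0.5, beta=0.9)
```

The only change from standard momentum is the argument passed to `grad_f`; replacing `lookahead` with `x` recovers the previous section's update.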

Theoretical consequence. For smooth convex objectives, Nesterov momentum achieves $O(1/t^2)$ convergence - the same as Nesterov accelerated gradient descent from the previous post. Standard gradient descent with momentum achieves only $O(1/t)$. The Nesterov modification is what creates the acceleration.

In practice, for deep learning, the difference between standard and Nesterov momentum is often small. Nesterov is available as an opt-in setting in common frameworks (e.g., the nesterov=True option in PyTorch’s SGD, which is off by default). The theoretical advantage matters more for convex problems than for the non-convex neural network case.


Section 5: AdaGrad

Momentum addresses the speed and oscillation problems but not the heterogeneous gradient problem. Every coordinate still uses the same learning rate $\alpha$.

AdaGrad adapts the learning rate per parameter based on its gradient history.

Let $g_{t,j}$ denote the gradient with respect to parameter $j$ at step $t$. AdaGrad maintains a running sum of squared gradients for each parameter:

$$G_{t,j} = G_{t-1,j} + g_{t,j}^2.$$

The update for parameter $j$ uses a per-parameter learning rate:

$$x_{t+1,j} = x_{t,j} - \frac{\alpha}{\sqrt{G_{t,j} + \epsilon}} g_{t,j}.$$

The denominator $\sqrt{G_{t,j}}$ is large for parameters that have historically received large gradients (either in magnitude or frequency), reducing their effective learning rate. For parameters with small or infrequent gradients, $\sqrt{G_{t,j}}$ remains small and the effective learning rate remains large.

Why this helps with heterogeneous gradients. A parameter deep in the network that receives small gradients has a small $G_{t,j}$, so its effective learning rate is large - it takes big steps when it does get a gradient signal. A parameter in the output layer that receives large, frequent gradients has a large $G_{t,j}$, so its effective learning rate is small - it moves cautiously.

AdaGrad is particularly effective for sparse gradients. In a text model with a large vocabulary, the embedding for a rare word has a nonzero gradient only when that word appears in the batch. Most steps give it a zero gradient (no update). When the word does appear, AdaGrad gives it a relatively large effective learning rate (since its $G_{t,j}$ has accumulated little). AdaGrad learns quickly from rare events.

The problem: dying learning rates. $G_{t,j}$ is a cumulative sum of squared gradients. It only increases, never decreasing. After enough steps, $G_{t,j}$ becomes large for all parameters, and all effective learning rates decay toward zero. The optimizer eventually stops making progress.

For online learning with a fixed, finite training set (or when early stopping is used), this dying behavior can actually be desirable - the learning rate naturally decays as the model converges. But for neural networks trained over many epochs, this is a serious problem.
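The decay of the effective learning rate is easy to see in a scalar sketch. The constant unit gradient and the helper name `adagrad_step` are illustrative assumptions; the update itself follows the two formulas above.

```python
import math

def adagrad_step(x, g, G, alpha=0.1, eps=1e-8):
    """One AdaGrad step for a scalar parameter; returns (x_new, G_new, effective_lr)."""
    G_new = G + g * g
    effective_lr = alpha / math.sqrt(G_new + eps)
    return x - effective_lr * g, G_new, effective_lr

# Constant unit gradient: the accumulator grows linearly, so the
# effective learning rate shrinks like alpha / sqrt(t).
x, G = 0.0, 0.0
effective_lrs = []
for _ in range(100):
    x, G, lr = adagrad_step(x, g=1.0, G=G)
    effective_lrs.append(lr)
```

After 100 steps the effective rate has fallen from about $\alpha$ to about $\alpha/10$, and it can only keep falling.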


Section 6: RMSProp

RMSProp fixes AdaGrad’s dying learning rate by replacing the cumulative sum of squared gradients with an exponentially decaying moving average:

$$G_{t,j} = \beta G_{t-1,j} + (1-\beta) g_{t,j}^2.$$

With the same update rule as AdaGrad:

$$x_{t+1,j} = x_{t,j} - \frac{\alpha}{\sqrt{G_{t,j} + \epsilon}} g_{t,j}.$$

The parameter $\beta$ (typically $0.9$, $0.99$, or $0.999$) controls the memory length: the moving average forgets old gradient history at a rate $(1-\beta)$, so the effective memory is approximately $1/(1-\beta)$ steps.

Key difference from AdaGrad. $G_{t,j}$ is now bounded: it tracks a running average, not an unbounded cumulative sum. If the gradients for parameter $j$ are consistently small for many steps, $G_{t,j}$ decays, and the effective learning rate increases. The optimizer can recover its step size after a period of small gradients.

If the gradient for parameter $j$ suddenly becomes large, $G_{t,j}$ increases rapidly, reducing the step size quickly. The adaptation is local in time: only recent gradient history matters.

RMSProp normalizes each gradient by the root-mean-square of its recent values. The effective update is roughly:

$$\frac{g_{t,j}}{\sqrt{G_{t,j}}} \approx \frac{g_{t,j}}{\text{RMS}(g_{\cdot,j})}.$$

This quantity is scale-free: if you multiply all gradients for parameter $j$ by a constant factor, the effective update is unchanged (numerator and denominator scale equally). RMSProp is robust to the scale of the gradients.
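The scale-free property can be verified directly. In this sketch the gradient sequence is made up for illustration and `rmsprop_updates` is a hypothetical helper; $\epsilon$ is kept tiny so that it does not mask the effect.

```python
import math

def rmsprop_updates(gradients, alpha=0.01, beta=0.9, eps=1e-12):
    """Return the sequence of RMSProp parameter updates for a scalar parameter."""
    G = 0.0
    updates = []
    for g in gradients:
        G = beta * G + (1.0 - beta) * g * g
        updates.append(-alpha * g / math.sqrt(G + eps))
    return updates

grads = [0.3, -0.1, 0.4, 0.2, -0.5, 0.1]
base = rmsprop_updates(grads)
scaled = rmsprop_updates([1000.0 * g for g in grads])  # same gradients, 1000x larger
```

The two update sequences agree up to the small contribution of $\epsilon$, even though the raw gradients differ by three orders of magnitude.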

RMSProp was proposed by Geoff Hinton in a Coursera lecture (never formally published, which is unusual for such an influential algorithm). It became the standard optimizer for recurrent neural networks before Adam arrived.


Section 7: Adam

Adam combines the ideas of momentum (first moment estimation) and RMSProp (second moment estimation) into a single optimizer, and adds bias correction to handle the initialization.

The algorithm.

Initialize $m_0 = 0$ (first moment), $v_0 = 0$ (second moment), and step counter $t = 0$.

At each step:

Compute the mini-batch gradient $g_t = \hat{\nabla} f(\mathbf{x}_t)$.

Update the first moment (exponential moving average of gradients):

$$m_t = \beta_1 m_{t-1} + (1-\beta_1) g_t.$$

Update the second moment (exponential moving average of squared gradients):

$$v_t = \beta_2 v_{t-1} + (1-\beta_2) g_t^2.$$

Apply bias correction:

$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}.$$

Update the parameters:

$$\mathbf{x}_{t+1} = \mathbf{x}_t - \alpha \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}.$$

The default hyperparameters: $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$, $\alpha = 10^{-3}$.
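Put together, one Adam step for a single scalar parameter might look like the sketch below. The function names are illustrative assumptions; the arithmetic follows the five steps listed above, with `t` as the 1-based step index.

```python
import math

def adam_step(x, g, m, v, t, alpha=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step for a scalar parameter; t is the 1-based step index."""
    m = beta1 * m + (1.0 - beta1) * g
    v = beta2 * v + (1.0 - beta2) * g * g
    m_hat = m / (1.0 - beta1 ** t)
    v_hat = v / (1.0 - beta2 ** t)
    x = x - alpha * m_hat / (math.sqrt(v_hat) + eps)
    return x, m, v

def first_step_size(gradient):
    """Magnitude of the very first Adam update for a given raw gradient."""
    x_new, _, _ = adam_step(x=0.0, g=gradient, m=0.0, v=0.0, t=1)
    return abs(x_new)

step_small = first_step_size(1e-3)
step_large = first_step_size(1e+3)
```

Both first steps have magnitude close to $\alpha$, although the raw gradients differ by a factor of a million.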

Discomfort check. Why bias correction? The formulas with $\hat{m}_t$ and $\hat{v}_t$ add complexity - what are they fixing?

At initialization, $m_0 = 0$ and $v_0 = 0$. After the first step with gradient $g_1$:

$m_1 = \beta_1 \cdot 0 + (1-\beta_1) g_1 = (1-\beta_1) g_1$.

But the true first moment (the mean of all gradients so far, which is just $g_1$) is $g_1$. The estimate $m_1 = (1-\beta_1) g_1 = 0.1 g_1$ is biased toward zero by a factor of $(1-\beta_1) = 0.1$. Similarly for the second moment.

Dividing by $(1-\beta_1^t)$ corrects this. At $t = 1$: $\hat{m}_1 = m_1/(1-\beta_1) = g_1$. At $t = 10$: $\beta_1^{10} = 0.9^{10} \approx 0.35$, so the correction factor is $1/0.65 \approx 1.54$ - still significant. At $t = 100$: $\beta_1^{100} = 0.9^{100} \approx 2.7 \times 10^{-5}$, the correction is negligible.
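The same arithmetic for a few step counts, as a short sketch (the helper name `bias_correction` is an assumption):

```python
def bias_correction(beta, t):
    """Factor 1 / (1 - beta^t) applied to an EMA initialized at zero."""
    return 1.0 / (1.0 - beta ** t)

# Print the correction factor for the first moment (beta = 0.9)
# and the second moment (beta = 0.999) at several step counts.
for t in (1, 10, 100, 1000):
    print(t, round(bias_correction(0.9, t), 4), round(bias_correction(0.999, t), 4))
```

The second-moment estimate, with $\beta_2 = 0.999$, warms up far more slowly: after 1,000 steps its correction factor is still about $1.58$.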

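The bias correction matters most in the early steps of training, when the moving averages have not yet accumulated enough history to be accurate. Without it, both moments are shrunk toward zero, but by different amounts: at $t = 1$, $m_1 = 0.1\,g_1$ while $\sqrt{v_1} = \sqrt{0.001}\,|g_1| \approx 0.032\,|g_1|$, so the uncorrected ratio $m_1/\sqrt{v_1}$ has magnitude about $3.2$ instead of $1$. The first steps come out mis-scaled (here, roughly three times too large) - poor behavior in the critical early phase.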

Why Adam works across diverse problems.

The effective update per parameter is:

$$\Delta x_{t,j} = -\alpha \frac{\hat{m}_{t,j}}{\sqrt{\hat{v}_{t,j}} + \epsilon}.$$

The numerator $\hat{m}_{t,j}$ is the momentum - the trend of the gradient. The denominator $\sqrt{\hat{v}_{t,j}}$ is the root-mean-square of recent gradients (an uncentered estimate of their typical magnitude). Their ratio behaves like a signal-to-noise ratio: how large is the trend relative to the typical gradient size?

Parameters whose gradient is consistently large (low variability) get a step of roughly $\alpha$ regardless of the raw gradient magnitude - the large gradient is divided by the large denominator. Parameters whose gradient oscillates wildly (high variability) get a smaller step. The update magnitudes are automatically bounded to be roughly $\alpha$.

This scale-independence is what makes Adam work with a single learning rate across diverse architectures, datasets, and tasks. The same $\alpha = 10^{-3}$ often works for language models, image classifiers, and reinforcement learning - not because the problems have similar gradient scales, but because Adam normalizes them all to a similar effective step size.


Section 8: AdamW and Decoupled Weight Decay

When you add L2 regularization to a loss function optimized with Adam, you might expect the behavior to match gradient descent with L2 regularization. It does not.

Adam with L2 regularization folds the penalty into the gradient, adding $\lambda \mathbf{x}_t$ before the Adam step:

$$g_t \leftarrow g_t + \lambda \mathbf{x}_t.$$

This modifies the gradient that enters the adaptive scaling. The weight decay term gets divided by $\sqrt{\hat{v}_t} + \epsilon$ just like the gradient, and $\hat{v}_t$ is usually dominated by the task gradient rather than the regularization term. The effective regularization strength therefore varies per parameter based on gradient history - parameters with large historical gradients receive less regularization. This is not what L2 regularization is supposed to do.

AdamW (decoupled weight decay) fixes this by applying weight decay directly to the weights, after the adaptive gradient step:

$$\mathbf{x}_{t+1} = \mathbf{x}_t - \alpha \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} - \alpha\lambda \mathbf{x}_t.$$

The weight decay term $-\alpha\lambda\mathbf{x}_t$ is not passed through the adaptive denominator. Every parameter is shrunk toward zero by the same multiplicative factor $(1 - \alpha\lambda)$, regardless of gradient history. This restores the uniform shrinkage that L2 regularization produces under plain gradient descent, which is usually what is intended when weight decay is added to Adam.
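The difference shows up most clearly when the task gradient is zero, so that only the regularizer acts. The sketch below is a scalar illustration; `adam_like_step` and its `decoupled` flag are hypothetical names, not a library API.

```python
import math

def adam_like_step(x, g, m, v, t, alpha=1e-3, beta1=0.9, beta2=0.999,
                   eps=1e-8, weight_decay=0.0, decoupled=True):
    """One scalar step of Adam with either coupled (L2) or decoupled weight decay."""
    if not decoupled:
        g = g + weight_decay * x           # L2 term enters the adaptive statistics
    m = beta1 * m + (1.0 - beta1) * g
    v = beta2 * v + (1.0 - beta2) * g * g
    m_hat = m / (1.0 - beta1 ** t)
    v_hat = v / (1.0 - beta2 ** t)
    x_new = x - alpha * m_hat / (math.sqrt(v_hat) + eps)
    if decoupled:
        x_new -= alpha * weight_decay * x  # AdamW: shrink the weight directly
    return x_new, m, v

# Zero task gradient isolates the regularizer's effect on a weight x = 2.0.
x0, lam = 2.0, 0.01
x_l2, _, _ = adam_like_step(x0, 0.0, 0.0, 0.0, t=1, weight_decay=lam, decoupled=False)
x_w, _, _ = adam_like_step(x0, 0.0, 0.0, 0.0, t=1, weight_decay=lam, decoupled=True)
```

In the coupled case the adaptive normalization rescales the penalty to a step of about $\alpha$, so the value of $\lambda$ barely matters. In the decoupled case the weight moves by $\alpha\lambda x$, as intended.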

For large language models - where proper regularization is critical for generalization - this distinction matters substantially. GPT, BERT, and most other modern transformer models are trained with AdamW, not Adam. The decoupled weight decay is not a minor detail; it is part of the recipe.


Section 9: Muon - Matrix-Level Optimization

AdamW adapts learning rates per scalar parameter, treating each weight independently. Muon (MomentUm Orthogonalized by Newton-Schulz) treats each weight matrix as a single object and updates it via a matrix-level operation.

The key operation is the Newton-Schulz iteration, which approximates the orthogonal polar factor of a matrix (the matrix analogue of the sign function). For a gradient matrix $G$ (in practice, the momentum-averaged gradient), define the update:

$$W_{t+1} = W_t - \eta \cdot \text{NS}(G_t)$$

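where $\text{NS}(G)$ starts from the normalized matrix $X = G/\|G\|_F$ and iteratively applies $X \leftarrow 1.5X - 0.5 X X^T X$ until it approximately converges (practical implementations run a small fixed number of iterations). The normalization matters: the iteration only converges when the singular values of the starting matrix lie in $(0, \sqrt{3})$. This has the effect of orthogonalizing the gradient - replacing $G = U \Sigma V^T$ with $U V^T$, a matrix that has the same singular vectors but uniform singular values. Geometrically, Muon takes steps in the direction of the gradient but removes the scale distortion from large vs small singular values.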
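A small sketch of the iteration on a $2 \times 2$ matrix, in plain Python. Two assumptions are made here: the input is first divided by its Frobenius norm so that all singular values start at or below 1 (the iteration needs them in a bounded range to converge), and a fixed iteration count replaces a convergence test. The helper names and the example matrix are illustrative.

```python
def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def newton_schulz(G, iterations=20):
    """Approximately orthogonalize G via X <- 1.5 X - 0.5 X X^T X.

    G is first divided by its Frobenius norm so every singular value is at
    most 1, which keeps the iteration inside its convergence region.
    """
    fro = sum(g * g for row in G for g in row) ** 0.5
    X = [[g / fro for g in row] for row in G]
    for _ in range(iterations):
        XXtX = matmul(matmul(X, transpose(X)), X)
        X = [[1.5 * x - 0.5 * y for x, y in zip(rx, ry)] for rx, ry in zip(X, XXtX)]
    return X

G = [[3.0, 1.0], [0.0, 2.0]]           # illustrative gradient matrix
Q = newton_schulz(G)
QQt = matmul(Q, transpose(Q))          # close to the 2x2 identity
```

Each singular value $\sigma$ evolves as $\sigma \leftarrow 1.5\sigma - 0.5\sigma^3$, which has a stable fixed point at $\sigma = 1$; that is why the result has uniform singular values.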

Why this helps. AdamW’s per-parameter adaptivity can create an axis-aligned bias - it preferentially updates parameters along coordinate axes, suppressing updates in directions that individual parameters don’t vary much. Muon’s matrix-level normalization removes this bias, allowing the optimizer to explore directions that AdamW would suppress. Empirically, Muon is more sample-efficient than AdamW at large batch sizes, where AdamW’s gradient variance reduction becomes less important.

Practical implementation. Muon requires access to the full gradient matrix before the optimizer step, which conflicts with FSDP’s sharded gradient representation. Two approaches: (1) all-to-all collectives that temporarily reassemble full gradient matrices on each rank, apply Muon, then re-shard; (2) overlapping round-robin where each rank owns a subset of matrices. The all-to-all approach requires fewer collectives and scales better.

Hybrid Muon. Some frontier models (e.g., Arcee’s Trinity Large) use Muon for hidden layers only, keeping AdamW for embeddings and the output projection. Embeddings benefit from per-parameter adaptivity (different tokens have very different update frequencies). Hidden layers capture more benefit from Muon’s matrix-level structure awareness.


Section 10: Learning Rate Schedules

The learning rate $\alpha$ is not a static choice. Different phases of training call for different values.

Warmup. Start training with a very small learning rate and ramp it up over the first few hundred or thousand steps. The motivation: at initialization, the weights are random and the gradient estimates (especially via momentum/Adam’s first moment) are poorly calibrated. Taking large steps with random gradients in the very first steps can destabilize training - large, poorly directed steps move the model to a regime where later gradients are also bad, compounding the problem.

Warmup solves this by letting the moving averages build up reliable estimates before the learning rate becomes large. Linear warmup (ramp from $\alpha_{\min}$ to $\alpha_{\max}$ over the first $T_\text{warm}$ steps) is standard for transformers. BERT uses 10,000 warmup steps; GPT-style models use 1% to 5% of total training.

Cosine annealing. After warmup, decay the learning rate following a cosine curve:

$$\alpha_t = \alpha_{\min} + \frac{1}{2}(\alpha_{\max} - \alpha_{\min})\left(1 + \cos\left(\frac{\pi (t - T_\text{warm})}{T - T_\text{warm}}\right)\right),$$

where $T$ is the total number of training steps. The learning rate decreases smoothly from $\alpha_{\max}$ to $\alpha_{\min}$ following the shape of a cosine half-period.
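Combined with linear warmup, the schedule can be sketched as a single function of the step index. The function name and the example values (1,000 total steps, 100 warmup steps) are illustrative assumptions.

```python
import math

def warmup_cosine_lr(step, total_steps, warmup_steps, lr_max, lr_min):
    """Linear warmup from lr_min to lr_max, then cosine decay back to lr_min."""
    if step < warmup_steps:
        return lr_min + (lr_max - lr_min) * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))

schedule = [warmup_cosine_lr(s, total_steps=1000, warmup_steps=100,
                             lr_max=1e-3, lr_min=1e-5) for s in range(1001)]
```

The schedule peaks at the end of warmup (`schedule[100]`) and returns to `lr_min` at the final step.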

Cosine annealing is often preferred over step decay (sudden drops) or linear decay: it is smooth, has no discontinuities or kinks, and flattens out as it approaches $\alpha_{\min}$ at the end of training, allowing fine-grained convergence. It is the default schedule for most large-scale training runs.

Warmup-Stable-Decay (WSD). Cosine annealing requires knowing the total training duration upfront - the decay phase is tied to the end of training. WSD decouples them: a warmup phase (1-5% of steps), a stable phase at peak learning rate, then a short decay phase (10-20% of steps) at the very end. This is more flexible: you can extend training by extending the stable phase without changing the decay schedule.

WSD tends to underperform cosine annealing during the stable phase (when the learning rate is constant, not decaying), but once it enters decay, it catches up rapidly and matches cosine annealing performance by the end. The practical advantage: if you’re running ablations at multiple token budgets, you only need to retrain the short decay phase for each budget, not the entire run.
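A WSD schedule can be sketched in the same style. The linear shape of the final decay and the default fractions are assumptions of this sketch (other decay shapes are possible), and `wsd_lr` is a hypothetical helper.

```python
def wsd_lr(step, total_steps, lr_max, lr_min=0.0, warmup_frac=0.02, decay_frac=0.1):
    """Warmup-Stable-Decay: linear warmup, constant plateau, linear final decay."""
    warmup_steps = round(warmup_frac * total_steps)
    decay_steps = max(1, round(decay_frac * total_steps))
    decay_start = total_steps - decay_steps
    if step < warmup_steps:
        return lr_max * step / warmup_steps
    if step < decay_start:
        return lr_max
    remaining = (total_steps - step) / decay_steps
    return lr_min + (lr_max - lr_min) * remaining

short_run = [wsd_lr(s, total_steps=1000, lr_max=3e-4) for s in range(1001)]
long_run = [wsd_lr(s, total_steps=2000, lr_max=3e-4) for s in range(2001)]
```

In this sketch, step 950 is mid-decay in the short run but still on the plateau in the long run: extending training only lengthens the stable phase.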

Multi-step schedule. Discrete learning rate drops (e.g., divide by 10 at 80% and 90% of training). DeepSeek-V3 combined cosine annealing within stages with sharp drops between stages. Simpler to implement than WSD but less smooth.

Batch size warmup. Start training with a small batch size and gradually increase it. Early in training, the model makes large updates and benefits from the regularization effect of noisy small-batch gradients. As training stabilizes, larger batches provide more efficient gradient estimates. The critical batch size (the point beyond which larger batches give diminishing returns in training speed) grows throughout training. Batch warmup exploits this: small batches early, large batches late, matching the evolving critical batch size.

Reduce on plateau. Monitor a validation metric. When it stops improving for $k$ consecutive evaluations, multiply the learning rate by a factor (typically $0.1$). This is adaptive: training proceeds at a high learning rate while progress is fast, and automatically slows when progress stalls. Common in computer vision training (ResNet on ImageNet often uses 3 decays of $\times 0.1$).
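The logic is simple enough to sketch by hand. The class below is a hypothetical minimal version for a metric that should decrease (such as validation loss); frameworks ship more complete variants, for example PyTorch's `ReduceLROnPlateau` scheduler.

```python
class ReduceOnPlateau:
    """Multiply the learning rate by `factor` after `patience` evaluations
    without improvement in a metric that should decrease (e.g. validation loss)."""

    def __init__(self, lr, factor=0.1, patience=3):
        self.lr = lr
        self.factor = factor
        self.patience = patience
        self.best = float("inf")
        self.bad_evals = 0

    def step(self, metric):
        if metric < self.best:
            self.best = metric
            self.bad_evals = 0
        else:
            self.bad_evals += 1
            if self.bad_evals >= self.patience:
                self.lr *= self.factor
                self.bad_evals = 0
        return self.lr

scheduler = ReduceOnPlateau(lr=0.1, factor=0.1, patience=3)
val_losses = [1.0, 0.8, 0.7, 0.71, 0.70, 0.72, 0.69]
lrs = [scheduler.step(loss) for loss in val_losses]
```

In this made-up trace, the loss fails to beat its best value of 0.7 on three consecutive evaluations, so the rate drops from 0.1 to 0.01 at the sixth evaluation.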

Discomfort check. The learning rate is described as a single number, but it interacts with everything else in training. Is there a “correct” learning rate?

No. The optimal learning rate depends on: batch size (doubling batch size roughly doubles the appropriate learning rate, since gradient variance halves); model architecture (the curvature of the loss landscape changes); dataset and task; the specific optimizer; and even the random seed (different random initializations have different loss landscape curvatures). This is why automatic learning rate finding methods exist: the “warmup then sweep” approach, the cyclic learning rate finder (run for a few batches while increasing $\alpha$ logarithmically; the loss starts increasing when $\alpha$ is too large, and the right value is just below that).

Community defaults exist as starting points: Adam $\alpha = 3 \times 10^{-4}$ (sometimes called the “Karpathy constant” for its ubiquity in language model training), SGD $\alpha = 0.1$ for vision. These are not magic numbers - they are empirically validated starting points that should be tuned for each specific problem.


Section 11: SGD vs Adam - the Generalization Debate

A surprising empirical fact emerged around 2018 and has not been fully explained theoretically: carefully tuned SGD with momentum often achieves better test error than Adam on image classification tasks, even when Adam achieves lower training loss faster.

Why Adam converges faster. The adaptive step sizes of Adam allow it to make progress even when the loss landscape is ill-conditioned. For different parameters, Adam effectively uses a different step size, exploiting the geometry of the loss landscape more efficiently than SGD’s uniform step size.

Why SGD sometimes generalizes better. Adam’s adaptive step sizes mean that it can take very large steps for some parameters (those with small $\hat{v}_t$). This allows Adam to find solutions in narrow, sharp valleys of the loss landscape where ordinary SGD would not venture. Sharp minima tend to be more sensitive to perturbations - the loss increases rapidly when weights change slightly. Since test data differs from training data (distribution shift), the model’s test loss is essentially the training loss at perturbed weights. Flat minima degrade gracefully; sharp minima can fail.

SGD with momentum, with its uniform step size, is less likely to enter narrow valleys. It tends to settle in wider, flatter regions of the loss landscape. Flat minima generalize better.

The practical consensus (as of 2024):

  • Transformers (GPT, BERT, T5, ViT, and other attention-based architectures): use AdamW. SGD often fails to converge reliably on attention mechanisms; in practice the architecture needs adaptive step sizes.

  • ConvNets for image classification (ResNet, EfficientNet): carefully tuned SGD with momentum and cosine annealing often gives $0.5$-$1.5\%$ better top-1 accuracy than Adam. AdamW is close and requires less tuning.

  • Fine-tuning pretrained models: use AdamW with small learning rate ($10^{-5}$ range) and warmup. The pretrained weights need gentle, consistent updates.

  • Convex problems (logistic regression, linear models): SGD with momentum converges to the true optimum with theoretical guarantees. Adam can also work, but its convergence guarantees are weaker, even for convex problems.

There is no universal winner. The optimizer is part of the model design decision, not an afterthought.


Section 12: Practical Starting Points

For practitioners who need defaults without extensive tuning:

Adam for deep learning:

$$\beta_1 = 0.9, \quad \beta_2 = 0.999, \quad \epsilon = 10^{-8}, \quad \alpha = 10^{-3}.$$

Use linear warmup for the first 5-10% of training steps, then cosine annealing to $\alpha_{\min} = 10^{-5}$.

AdamW for transformer fine-tuning:

$$\beta_1 = 0.9, \quad \beta_2 = 0.999, \quad \text{weight decay} = 0.01, \quad \alpha = 2\text{-}5 \times 10^{-5}.$$

Warmup for 100-500 steps, then linear or cosine decay.

SGD with momentum for vision:

$$\beta = 0.9, \quad \text{weight decay} = 10^{-4}, \quad \alpha = 0.1.$$

Cosine annealing from $0.1$ to $0$. Batch size $256$; scale $\alpha$ linearly if batch size changes.

Gradient clipping. For recurrent networks and transformers, clip the gradient norm before updating: if $\|\nabla f\| > c$, scale the gradient to have norm $c$. Typical $c = 1.0$ or $c = 5.0$. Gradient clipping prevents the occasional very large gradient (from a bad batch) from destroying the model.
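Norm clipping amounts to one rescaling. A minimal sketch over a flat list of gradient components (the function name is an assumption; frameworks provide their own utilities for this):

```python
import math

def clip_by_global_norm(grads, max_norm):
    """Rescale a list of gradient components so their L2 norm is at most max_norm.

    Returns the (possibly rescaled) gradients and the original norm.
    """
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads), norm
    scale = max_norm / norm
    return [g * scale for g in grads], norm

clipped, original_norm = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
```

Here the gradient `[3.0, 4.0]` has norm 5, so it is scaled by $1/5$: the direction is preserved and only the length changes. Logging the returned pre-clip norm is a cheap way to spot the bad batches mentioned above.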

Diagnosing training problems:

  • Loss spikes: reduce $\alpha$, add warmup, check for bad batches, add gradient clipping.
  • Loss plateaus immediately: $\alpha$ too small, or incorrect implementation of gradients.
  • Loss decreases then stalls: try learning rate schedule, check for vanishing gradients in deep layers.
  • Train loss low, validation loss high: increase weight decay, reduce model size, add dropout.
  • Optimizer diverges: $\alpha$ too large, or numerical issues (try float32 instead of float16 for debugging).

For frontier-scale training: consider WSD over cosine annealing for flexibility in extending runs; consider Muon for hidden layers if you have the infrastructure to handle all-to-all gradient collectives; use batch size warmup if training stability is a concern in early steps. These are not defaults for all settings - they require careful ablation to confirm benefit in your specific setup.

The modern optimizer zoo exists because no single optimizer is optimal for all problems. Adam is the safe default. SGD with momentum is often better for vision with careful tuning. AdamW is the standard for transformers. Understanding why each optimizer behaves as it does - what problem it was designed to solve - is what allows you to debug failures and make principled choices rather than cargo-culting hyperparameters.

