Helpful context:


You are blindfolded on a hilly landscape. You want to reach the lowest point. You can feel the slope under your feet.

Your strategy: take a small step in the direction of steepest descent. Then do it again. And again. Each step makes you a little lower than the last. Eventually - if the landscape is well-behaved and your steps aren’t too large - you reach the bottom.

This is gradient descent. It is the algorithm that trains every neural network ever built. The “landscape” is the loss surface: a function from weights to loss value. The “lowest point” is the weights that minimize the loss. The “slope under your feet” is the gradient.

Understanding gradient descent means understanding not just the algorithm but its subtleties: when does it work, why it often works when theory says it shouldn’t, and what modern optimizers do to make it work better.


The Core Update Rule

To minimize a function $L(w)$ (the loss as a function of weights $w$), gradient descent performs the update:

$$w \leftarrow w - \eta \cdot \nabla L(w)$$

Three quantities:

  • $w$: the current weights (a vector in parameter space).
  • $\nabla L(w)$: the gradient of the loss at $w$. This is a vector pointing in the direction of steepest increase of $L$.
  • $\eta > 0$: the learning rate, also called step size. Controls how far we move.

We subtract the gradient because we want to decrease $L$, not increase it. The gradient points uphill; moving opposite to it moves downhill.

Why does subtracting the gradient decrease the loss? By the first-order Taylor expansion:

$$L(w - \eta \nabla L(w)) \approx L(w) - \eta |\nabla L(w)|^2$$

For small enough $\eta$ and nonzero gradient, the right side is strictly less than $L(w)$. Every gradient descent step decreases the loss - at least locally and for sufficiently small step size.

The approximation becomes exact as $\eta \to 0$, which is why very small learning rates are guaranteed to decrease the loss at every step. Very small steps, however, means very slow convergence.


Why the Gradient Points Uphill

The gradient $\nabla L(w)$ at a point $w$ is the vector of partial derivatives:

$$\nabla L(w) = \left(\frac{\partial L}{\partial w_1}, \frac{\partial L}{\partial w_2}, \ldots, \frac{\partial L}{\partial w_d}\right)$$

Its key geometric property: among all unit vectors $v$, the directional derivative $\nabla L(w) \cdot v$ is maximized when $v = \nabla L(w) / |\nabla L(w)|$. The gradient points in the direction of maximum increase of $L$.

Proof: by Cauchy-Schwarz, $\nabla L(w) \cdot v \leq |\nabla L(w)| |v| = |\nabla L(w)|$ with equality when $v \propto \nabla L(w)$.

Conversely, $-\nabla L(w)$ is the direction of maximum decrease. Moving in this direction gives the largest reduction in $L$ per unit distance, at first order. That’s why gradient descent moves opposite the gradient - it’s the locally optimal direction to walk.


The Learning Rate: The Goldilocks Problem

The learning rate $\eta$ is the most important hyperparameter in gradient descent. Getting it wrong can mean slow convergence or no convergence at all.

Too small: Each step reduces the loss by a tiny amount. Convergence happens but takes an impractical number of iterations. With $\eta = 10^{-6}$, a problem that should train in hours might take years.

Too large: The step overshoots the minimum. Imagine standing in a bowl-shaped valley: if you step too far downhill, you end up on the other side of the bowl, possibly higher than where you started. The loss can oscillate, diverge, or zigzag without converging.

Just right: Fast convergence to the minimum (or a good local minimum, for non-convex problems).

Theoretically, for a function with an $L$-Lipschitz gradient (meaning $|\nabla L(w_1) - \nabla L(w_2)| \leq L |w_1 - w_2|$), the optimal constant step size is $\eta = 1/L$. The Lipschitz constant $L$ is related to the largest eigenvalue of the Hessian - how sharply curved the loss surface is. Steeper curvature requires a smaller step.

In practice, $L$ is unknown and varies across the loss surface. Common approaches:

Constant learning rate: Pick one value and hope. Works for simple problems.

Step decay: Reduce $\eta$ by a factor (say 10) every fixed number of iterations.

Cosine annealing: $\eta_t = \frac{\eta_0}{2}(1 + \cos(\pi t / T))$. The learning rate follows a cosine curve from $\eta_0$ down to 0. Widely used in modern training runs.

Warm-up: Start with a very small learning rate, ramp up to the target over the first few thousand iterations, then decay. This avoids instability early in training when the weights are random and the loss landscape is steep. Essential for large transformers.


Convergence for Convex Functions

For convex functions, gradient descent has clean theoretical guarantees.

Definition. A function $L$ is convex if for any two points $w_1, w_2$ and $\lambda \in [0,1]$:

$$L(\lambda w_1 + (1-\lambda)w_2) \leq \lambda L(w_1) + (1-\lambda)L(w_2)$$

The function curves upward. Any local minimum is also a global minimum. Gradient descent cannot get stuck.

For a $\mu$-strongly convex function with $L$-Lipschitz gradient (both the linear regression MSE loss and the logistic regression cross-entropy loss satisfy this), gradient descent with learning rate $\eta = 1/L$ converges at rate:

$$L(w_t) - L(w^) \leq \left(1 - \frac{\mu}{L}\right)^t \left(L(w_0) - L(w^)\right)$$

This is geometric (linear) convergence: each step reduces the error by a constant factor $(1 - \mu/L) < 1$. After $t$ steps, the error is exponentially small in $t$. The number of steps to reach error $\varepsilon$ is $O((L/\mu) \log(1/\varepsilon))$.

The ratio $\kappa = L/\mu$ is the condition number. Poorly conditioned functions (large $\kappa$) converge slowly. This is why feature normalization matters: it reduces the condition number of the Hessian (makes eigenvalues more uniform), which speeds up convergence.


Non-Convex Functions: The Real World

Neural network loss surfaces are not convex. At all. They have:

Local minima: Points where the gradient is zero and all directions go uphill, but not the global minimum. Gradient descent might converge here rather than to the best possible solution.

Saddle points: Points where the gradient is zero, some directions go uphill, some go downhill. In high dimensions, most critical points are saddle points, not local minima. A saddle point in 2D looks like a mountain pass: you’re at a local minimum along one direction, but a local maximum along another.

Flat regions: Large regions where the gradient is nearly zero. Progress is very slow.

Cliffs and spikes: Regions of very sharp curvature where large gradients can cause huge, destabilizing updates. A problem especially with recurrent networks.

So why does gradient descent work in practice for training neural networks?

The honest answer: we don’t fully know. The theory is incomplete. But here are the leading explanations:

Many good local minima. In very high-dimensional spaces, local minima that aren’t global minima are rare and tend to have similar loss values to the global minimum. Empirically, different runs of gradient descent on the same network often converge to different weight configurations but similar loss values. For practical purposes, many local minima are equally good.

Saddle points are not the problem. In high dimensions, the probability of a critical point being a local minimum (all positive curvature directions) decreases exponentially with dimension. Most critical points are saddle points. And gradient descent with noise (from mini-batches) naturally escapes saddle points - because noise perturbs the trajectory in random directions, and at least one of those directions is downhill.

Overparameterization helps. Modern neural networks have far more parameters than training examples. In this regime, the loss surface has large flat regions where many different weight configurations achieve near-zero training loss. Gradient descent, even with noise, can find these regions relatively easily.

Implicit regularization. The specific path that gradient descent takes (not just the endpoint) shapes the solution. SGD implicitly prefers flatter minima with better generalization properties. This is an active research area.

Discomfort check. If the theory doesn’t fully explain gradient descent’s success on non-convex neural network loss surfaces, should you trust it? Yes - empirically, it works remarkably well across an enormous range of architectures and tasks. The lack of complete theory means we can’t predict exactly how many steps it will take or guarantee finding the global minimum, but the practical track record is overwhelming. Physics used the theory of quantum electrodynamics for decades before anyone proved it was mathematically well-defined. Sometimes what matters is that it works.


Stochastic Gradient Descent

Batch gradient descent computes the gradient over the entire training set at each step:

$$\nabla L(w) = \frac{1}{m} \sum_{i=1}^m \nabla L_i(w)$$

For a training set of $m = 10^6$ examples, one gradient computation requires $10^6$ forward and backward passes. If you need 10,000 steps to converge, that’s $10^{10}$ example-passes. Too slow.

Stochastic gradient descent (SGD) estimates the gradient using a single randomly chosen example:

$$\nabla L(w) \approx \nabla L_i(w) \quad \text{for a random } i$$

Each step is $m$ times cheaper. The estimate is unbiased (its expectation equals the true gradient), but very noisy. The noise has a useful property: it helps escape shallow local minima and saddle points that batch gradient descent might get stuck in.

SGD converges, but noisily. The loss doesn’t decrease monotonically - it bounces around as different examples give different gradient estimates. But in expectation, over many steps, the weights move toward the minimum.

Mini-batch SGD is the practical compromise: compute the gradient over a random batch of $B$ examples.

$$\nabla L(w) \approx \frac{1}{B} \sum_{i \in \text{batch}} \nabla L_i(w)$$

Typical batch sizes: $B = 32, 64, 128, 256$, up to $4096$ or more for large models on many GPUs. Larger batches give more accurate gradient estimates but require more compute per step. Smaller batches add noise but allow more steps per unit of compute.

The batch size also determines hardware utilization. GPUs are most efficient when doing large matrix multiplications in parallel - a large batch processes many examples simultaneously, maximizing GPU throughput. This is why batch size is also partly a hardware engineering choice.


Momentum

A classic problem with gradient descent on ill-conditioned problems: the loss surface has a long, narrow valley. Gradient descent zigzags across the valley rather than running along it. Each step goes in the direction perpendicular to the valley (where the gradient is large) rather than down the valley (where progress is needed).

Momentum fixes this by accumulating velocity in consistent gradient directions:

$$v_t = \beta v_{t-1} + \nabla L(w_t)$$ $$w_{t+1} = w_t - \eta v_t$$

The velocity $v_t$ is a running average of past gradients. With $\beta = 0.9$ (typical), $v_t$ is a weighted average of the last $\sim 10$ gradient estimates. Consistent gradient directions accumulate; inconsistent directions cancel. The result: the optimizer builds up speed along the valley and dampens zigzag oscillations.

Another view: momentum is like rolling a ball down a hill. The ball doesn’t stop at each step - it accumulates velocity. This makes it traverse flat regions faster and overshoot hills, which can help escape shallow local minima.


Adam: The Default Optimizer

The Adam (Adaptive Moment Estimation) optimizer, introduced by Kingma and Ba in 2014, combines momentum with per-parameter learning rates.

It maintains two moving averages:

$$m_t = \beta_1 m_{t-1} + (1-\beta_1) \nabla L(w_t) \quad \text{(first moment: mean of gradients)}$$ $$v_t = \beta_2 v_{t-1} + (1-\beta_2) (\nabla L(w_t))^2 \quad \text{(second moment: mean of squared gradients)}$$

After bias correction ($\hat{m}_t = m_t/(1-\beta_1^t)$ and $\hat{v}_t = v_t/(1-\beta_2^t)$ to account for initialization at 0), the update is:

$$w_{t+1} = w_t - \eta \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \varepsilon}$$

The update for each weight is the gradient estimate ($\hat{m}_t$) divided by the root-mean-square of recent gradients ($\sqrt{\hat{v}_t}$). Parameters with consistently large gradients get a small effective step size; parameters with small, noisy gradients get a larger effective step size.

This adaptive learning rate means Adam is much less sensitive to the choice of $\eta$ than vanilla SGD. Typical defaults ($\eta = 10^{-3}$, $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\varepsilon = 10^{-8}$) work well across a very wide range of architectures and tasks.

Adam is the practical choice for most deep learning work. You start with Adam and $\eta = 10^{-3}$, and that is often already close to optimal.


Learning Rate Schedules in Practice

Even with Adam, the learning rate schedule matters significantly.

Warm-up: The first few thousand steps use a small learning rate that linearly ramps up to the target $\eta$. Why? Early in training, the weights are random and gradients are unreliable. A large initial step can push the model into a bad region of parameter space that’s hard to escape. Warm-up provides stability.

Decay: After warm-up, the learning rate decreases. Common schedules:

Cosine decay: $\eta_t = \frac{\eta_{\max}}{2}\left(1 + \cos\left(\frac{\pi t}{T}\right)\right)$ from $t = t_{\text{warm-up}}$ to $T$. The learning rate smoothly decreases from $\eta_{\max}$ to 0. Widely used for language model training.

Step decay: reduce $\eta$ by a fixed factor (e.g., multiply by 0.1) at fixed milestones.

Cyclic learning rates: periodically increase and decrease $\eta$. Can help escape local minima.

The warm-up + cosine decay schedule is standard in large-scale training of transformers (BERT, GPT, and their descendants all use variations of this).


Gradient Clipping

In recurrent networks (and sometimes transformers), gradients can become very large - not because the function has high curvature everywhere, but because the loss is extremely sensitive to certain weights in certain rare configurations. An update of $w \leftarrow w - \eta \nabla L$ can take a catastrophically large step.

Gradient clipping limits the maximum step size. Before each update, if $|\nabla L| > C$ for some threshold $C$, scale the gradient down: $\nabla L \leftarrow C \cdot \nabla L / |\nabla L|$.

This caps the magnitude of the update while preserving its direction. It’s a simple fix for the “exploding gradient” problem that doesn’t require changing the model architecture or loss function.


Summary

Method Update Rule When to Use
Batch GD $w \leftarrow w - \eta \nabla L(w)$ Small datasets; convex problems
SGD $w \leftarrow w - \eta \nabla L_i(w)$ Online learning; very large datasets
Mini-batch SGD Average gradient over batch $B$ Default for most practical work
Momentum Add velocity term $\beta v_{t-1}$ Ill-conditioned problems; helps escape saddle points
Adam Adaptive per-parameter $\eta$ Default for deep learning; rarely needs tuning
Concept Key fact
Why $-\nabla L$? Gradient points uphill; subtract to go downhill
Learning rate Too small = slow; too large = diverges; schedule matters
Convex convergence Geometric rate $(1 - \mu/L)^t$
Non-convex reality Theory incomplete; SGD works empirically
Batch size $B$ Variance-compute trade-off; hardware-dependent
Gradient clipping Cap $|\nabla L|$ to prevent explosive updates

Gradient descent is the engine of deep learning. Its mathematics is simple: move opposite the gradient, scaled by a step size. Its practice is more nuanced: choosing the right optimizer, learning rate schedule, and batch size makes the difference between a model that trains in hours and one that never converges.

The next question is: how do we compute the gradient efficiently? For a network with billions of parameters, computing $\nabla L$ by hand is impossible. That’s where automatic differentiation comes in.


Read next: