The Core Idea

Most machine learning reduces to minimizing a loss function $L(\theta)$ over parameters $\theta \in \mathbb{R}^p$. When $L$ is differentiable, the gradient $\nabla_\theta L(\theta)$ points in the direction of steepest ascent, so moving opposite to it decreases the loss.

Claim. The gradient $\nabla_\theta L(\theta)$ is the direction of steepest ascent at $\theta$.

Proof. The directional derivative of $L$ at $\theta$ in direction $\mathbf{v}$ (with $|\mathbf{v}| = 1$) is $D_\mathbf{v} L(\theta) = \nabla_\theta L(\theta)^T \mathbf{v}$. By Cauchy-Schwarz, $\nabla L^T \mathbf{v} \leq |\nabla L| |\mathbf{v}| = |\nabla L|$, with equality when $\mathbf{v} = \nabla L / |\nabla L|$. So the unit vector maximizing the directional derivative is the normalized gradient. $\square$

Definition (Gradient descent update). Given step size (learning rate) $\eta > 0$:

$$\theta_{t+1} \leftarrow \theta_t - \eta \nabla_\theta L(\theta_t).$$
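As a minimal sketch, the update rule can be run on an illustrative quadratic bowl (the matrix `A`, step size, and iteration count below are arbitrary choices, not prescriptions):

```python
import numpy as np

# Gradient descent on L(theta) = 0.5 * theta^T A theta,
# whose gradient is simply A @ theta and whose minimizer is the origin.
A = np.array([[3.0, 0.0],
              [0.0, 1.0]])

def grad(theta):
    return A @ theta

theta = np.array([4.0, -2.0])
eta = 0.1
for _ in range(200):
    theta = theta - eta * grad(theta)  # theta_{t+1} = theta_t - eta * grad

print(theta)  # approaches the minimizer at the origin
```

Each iteration shrinks the component along each eigenvector of `A` by a factor $(1 - \eta\lambda_i)$, which previews the step-size discussion below.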

Learning Rate: A Delicate Choice

The learning rate $\eta$ controls how far we step in the negative gradient direction.

  • Too large: The update overshoots the minimum, and the loss can oscillate or diverge. For a quadratic $L(\theta) = \frac{1}{2}\lambda \theta^2$, the update gives $\theta_{t+1} = (1 - \eta\lambda)\theta_t$. Divergence occurs when $|1 - \eta\lambda| > 1$, i.e., $\eta > 2/\lambda$.
  • Too small: Convergence is guaranteed but slow; many iterations are needed to make meaningful progress.
  • Just right: Convergence at a rate dictated by the curvature of $L$.
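The three regimes can be checked numerically on the 1-D quadratic from the first bullet ($\lambda$ and the step counts below are illustrative):

```python
# On L(theta) = 0.5 * lam * theta^2 the update is
# theta <- (1 - eta * lam) * theta, which diverges when eta > 2 / lam.
lam = 2.0

def run(eta, steps=50, theta0=1.0):
    theta = theta0
    for _ in range(steps):
        theta = (1 - eta * lam) * theta
    return abs(theta)

small = run(eta=0.01)  # converges, but slowly
good = run(eta=0.5)    # eta = 1/lam: jumps to the minimum in one step here
large = run(eta=1.5)   # eta > 2/lam = 1.0, so |1 - eta*lam| = 2: diverges
print(small, good, large)
```

The middle case is special to an exact quadratic; in general "just right" means a rate set by the curvature, not one-step convergence.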

For smooth functions, i.e., those with a Lipschitz gradient ($|\nabla L(\theta) - \nabla L(\phi)| \leq K|\theta - \phi|$ for all $\theta, \phi$ and some constant $K > 0$), setting $\eta = 1/K$ gives descent with guaranteed progress at each step. (This constant is conventionally called $L$, written $K$ here to avoid a clash with the loss.)

Batch, Stochastic, and Mini-Batch Gradient Descent

The empirical risk is a sum over $n$ examples: $L(\theta) = \frac{1}{n}\sum_{i=1}^n \ell_i(\theta)$. This leads to three variants:

Batch gradient descent computes the full gradient using all $n$ examples: $$\nabla L = \frac{1}{n}\sum_{i=1}^n \nabla \ell_i.$$ Expensive per step ($O(n)$ gradient evaluations), but stable and exact.

Stochastic gradient descent (SGD) samples a single example $i$ uniformly at random and uses $\nabla \ell_i$ as the gradient estimate. Cheap per step ($O(1)$), but the estimate is noisy. SGD is unbiased: $\mathbb{E}[\nabla \ell_i] = \nabla L$.

Mini-batch gradient descent averages over a batch of $B$ examples: $$g_t = \frac{1}{B}\sum_{i \in \mathcal{B}_t} \nabla \ell_i.$$ The variance of $g_t$ is $\text{Var}(\nabla \ell_i)/B$, decreasing linearly with batch size. Mini-batching balances the noise-per-step against computation, and is the de facto standard for neural network training.
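A minimal mini-batch SGD sketch, using noiseless least squares as an illustrative choice of per-example loss $\ell_i(\theta) = \frac{1}{2}(x_i^T\theta - y_i)^2$ (all sizes and the step size are arbitrary):

```python
import numpy as np

# Mini-batch SGD on least squares: grad l_i = x_i * (x_i . theta - y_i).
rng = np.random.default_rng(0)
n, p, B = 1000, 5, 32
true_theta = rng.normal(size=p)
X = rng.normal(size=(n, p))
y = X @ true_theta                # noiseless targets, so the optimum is exact

theta = np.zeros(p)
eta = 0.05
for step in range(2000):
    idx = rng.choice(n, size=B, replace=False)  # sample the batch B_t
    Xb, yb = X[idx], y[idx]
    g = Xb.T @ (Xb @ theta - yb) / B            # averaged batch gradient g_t
    theta -= eta * g

print(np.linalg.norm(theta - true_theta))  # near zero: theta recovers true_theta
```

Swapping `B` between 1 and `n` interpolates between SGD and batch gradient descent with no other code changes.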

Implicit regularization of SGD. SGD’s gradient noise has an important side effect: it tends to find flat minima (minima where the loss landscape has low curvature) rather than sharp ones. Flat minima often correspond to better-generalizing solutions, as small perturbations to $\theta$ don’t increase the loss much. This gives SGD a built-in regularization effect that batch GD lacks.

Momentum

Plain gradient descent can oscillate in directions with high curvature and be slow in directions with low curvature. Momentum adds a velocity term that accumulates gradients across steps:

$$v_{t+1} = \beta v_t + \nabla L(\theta_t), \quad \theta_{t+1} = \theta_t - \eta v_{t+1},$$

where $\beta \in [0, 1)$ is the momentum coefficient (typically $0.9$). With $v_0 = 0$, expanding the recursion gives:

$$v_{t+1} = \sum_{k=0}^{t} \beta^{t-k} \nabla L(\theta_k).$$

The velocity is thus an exponentially weighted sum of past gradients (there is no $(1-\beta)$ normalization, so its magnitude can grow to roughly $1/(1-\beta)$ times a typical gradient). In directions where gradients consistently point the same way, velocity builds up; in oscillating directions, successive gradients cancel.
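The effect shows up clearly on an ill-conditioned quadratic, sketched below with illustrative numbers: curvature is high along the first axis and low along the second, and momentum makes faster progress in the slow direction.

```python
import numpy as np

# Heavy-ball momentum vs. plain GD on L(theta) = 0.5 * theta^T A theta
# with curvatures 10 and 1: GD crawls along the low-curvature axis,
# while accumulated velocity speeds it up.
A = np.diag([10.0, 1.0])

def grad(theta):
    return A @ theta

def run(beta, eta=0.01, steps=300):
    theta = np.array([1.0, 1.0])
    v = np.zeros(2)
    for _ in range(steps):
        v = beta * v + grad(theta)   # velocity: weighted sum of past gradients
        theta = theta - eta * v
    return float(np.linalg.norm(theta))

print(run(beta=0.0), run(beta=0.9))  # momentum ends much closer to the optimum
```

Setting `beta=0.0` recovers plain gradient descent, so the comparison isolates the velocity term.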

Nesterov momentum evaluates the gradient at a “lookahead” position:

$$v_{t+1} = \beta v_t + \nabla L(\theta_t - \eta \beta v_t), \quad \theta_{t+1} = \theta_t - \eta v_{t+1}.$$

The lookahead gradient is more informative, since it accounts for where momentum is about to carry the parameters. With suitable parameter choices this recovers Nesterov's accelerated method, which improves the convergence rate on smooth convex problems from $O(1/t)$ to $O(1/t^2)$.
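The only change from plain momentum is where the gradient is evaluated, as this sketch shows (the quadratic and hyperparameters are illustrative):

```python
import numpy as np

# Nesterov momentum: evaluate the gradient at the lookahead point
# theta - eta * beta * v instead of at theta itself.
A = np.diag([10.0, 1.0])
theta = np.array([1.0, 1.0])
v = np.zeros(2)
eta, beta = 0.01, 0.9
for _ in range(300):
    lookahead = theta - eta * beta * v   # where momentum is about to carry us
    v = beta * v + A @ lookahead         # gradient at the lookahead point
    theta = theta - eta * v

print(np.linalg.norm(theta))  # near zero
```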

Adam

Adam (Adaptive Moment Estimation) maintains per-parameter learning rates based on first and second moment estimates of the gradient.

Let $g_t = \nabla L(\theta_t)$. Adam updates:

$$m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t \quad \text{(first moment / mean)}$$

$$v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2 \quad \text{(second moment / uncentered variance)}$$

Here $g_t^2$ denotes the elementwise square; all operations in the update are elementwise.

Since $m_0 = v_0 = 0$, the estimates are biased toward zero early in training. Bias correction fixes this:

$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}.$$

The parameter update is then:

$$\theta_{t+1} = \theta_t - \eta \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon},$$

where $\epsilon \approx 10^{-8}$ prevents division by zero. Typical defaults: $\beta_1 = 0.9$, $\beta_2 = 0.999$.

Adam effectively normalizes each gradient by an estimate of its standard deviation, giving larger updates to parameters with historically small gradients and smaller updates to those with large gradients.
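The update equations translate directly into code. This is a minimal sketch on an illustrative quadratic with wildly different curvatures, where per-parameter normalization pays off (the test function and hyperparameters are arbitrary choices):

```python
import numpy as np

def adam(grad, theta0, eta=0.005, beta1=0.9, beta2=0.999,
         eps=1e-8, steps=3000):
    theta = theta0.astype(float)
    m = np.zeros_like(theta)
    v = np.zeros_like(theta)
    for t in range(1, steps + 1):
        g = grad(theta)
        m = beta1 * m + (1 - beta1) * g        # first-moment EWMA
        v = beta2 * v + (1 - beta2) * g**2     # second-moment EWMA (elementwise)
        m_hat = m / (1 - beta1**t)             # bias correction
        v_hat = v / (1 - beta2**t)
        theta = theta - eta * m_hat / (np.sqrt(v_hat) + eps)
    return theta

# Curvatures of 100 and 0.01: raw gradients differ by four orders of
# magnitude, yet Adam takes comparable steps along both axes.
A = np.diag([100.0, 0.01])
theta = adam(lambda th: A @ th, np.array([1.0, 1.0]))
print(theta)  # both coordinates driven toward zero
```

Note that with a constant $\eta$, Adam tends to hover in a small neighborhood of the minimum rather than converge exactly, which is one reason it is usually paired with a decaying schedule.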

Learning Rate Schedules

A fixed learning rate is rarely optimal. Common schedules:

  • Warm-up: Start with a small $\eta$ (e.g., $\eta_0 / 100$), linearly increase to $\eta_0$ over the first few epochs. Prevents early large steps from destabilizing a randomly initialized network.
  • Cosine decay: $\eta_t = \eta_{\min} + \frac{1}{2}(\eta_0 - \eta_{\min})(1 + \cos(\pi t / T))$. Smoothly decays $\eta$ to $\eta_{\min}$ over $T$ steps. Standard for training transformers.
  • Cyclical learning rates: Oscillate $\eta$ between $\eta_{\min}$ and $\eta_{\max}$ in cycles. Helps escape saddle points and explore flatter minima.
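Warm-up and cosine decay are often composed into a single schedule. A minimal sketch, with illustrative parameter names and defaults:

```python
import math

# Linear warm-up to eta0 over warmup_steps, then cosine decay to eta_min
# over the remaining steps.
def lr_at(step, total_steps, eta0, eta_min=0.0, warmup_steps=100):
    if step < warmup_steps:
        return eta0 * (step + 1) / warmup_steps          # linear warm-up
    t = step - warmup_steps
    T = total_steps - warmup_steps
    return eta_min + 0.5 * (eta0 - eta_min) * (1 + math.cos(math.pi * t / T))

print(lr_at(0, 1000, 0.1))     # small at the start of warm-up
print(lr_at(99, 1000, 0.1))    # reaches eta0 at the end of warm-up
print(lr_at(1000, 1000, 0.1))  # decays to eta_min by the final step
```

In a training loop, the returned value would simply replace the fixed `eta` in any of the updates above.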

Convergence Theory

Convex case. For a convex, smooth function, SGD with step size $\eta_t = c/\sqrt{t}$ achieves $\mathbb{E}[L(\bar{\theta}_T)] - L(\theta^\ast) = O(1/\sqrt{T})$, where $\bar{\theta}_T$ is the average of the iterates. This rate is tight in the worst case for stochastic convex optimization.

Non-convex case. For non-convex $L$ (the typical case for neural networks), convergence to a global minimum is not guaranteed. Under standard smoothness assumptions, however, gradient descent reaches an approximate critical point, a point with $|\nabla L(\theta)| < \varepsilon$, within $O(1/\varepsilon^2)$ steps. In practice this is often sufficient: empirically, the critical points reached in neural network landscapes tend to have low loss, and saddle points, while numerous, are typically escaped with the help of small perturbations.

Examples

Loss landscapes. A key insight about deep networks is that their loss landscapes, while non-convex, tend to be well-behaved for gradient descent. There are exponentially many saddle points (where gradient is zero but the Hessian has both positive and negative eigenvalues), but empirically they are easy to escape. True local minima that are significantly worse than the global minimum appear to be rare in overparameterized networks.

Saddle points in neural networks. Near a saddle point, gradient descent slows dramatically because the gradient is small. Momentum and noise from SGD help escape: the stochastic gradient has components pointing away from the saddle, and these are amplified by momentum over successive steps.
