The gradient tells you which direction to move; the Hessian tells you how the landscape is curved. Second-order information is what distinguishes a flat ridge from a narrow valley, and what allows Newton’s method to find minima in a fraction of the steps that gradient descent requires. Understanding Hessians also reveals why training deep networks is geometrically difficult.

The Hessian Matrix

For $f: \mathbb{R}^n \to \mathbb{R}$ that is twice continuously differentiable ($C^2$), the Hessian is the $n \times n$ matrix of second-order partial derivatives:

$$H_{ij}(x) = \frac{\partial^2 f}{\partial x_i \partial x_j}(x)$$

By Clairaut’s theorem, $H_{ij} = H_{ji}$ whenever $f \in C^2$, so the Hessian is always symmetric. Its eigenvalues are therefore real, and it admits an orthogonal diagonalization $H = Q \Lambda Q^T$.
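
As a quick check of these properties, here is a minimal sketch (assuming NumPy and a toy two-variable function chosen only for illustration) that approximates the Hessian by central finite differences and confirms it is symmetric with real eigenvalues:

```python
import numpy as np

def f(x):
    # Toy function for illustration: f(x, y) = x^2 + 3xy + sin(y)
    return x[0]**2 + 3 * x[0] * x[1] + np.sin(x[1])

def numerical_hessian(f, x, eps=1e-4):
    """Central-difference approximation of H_ij = d^2 f / (dx_i dx_j)."""
    n = x.size
    H = np.zeros((n, n))
    I = np.eye(n)
    for i in range(n):
        for j in range(n):
            H[i, j] = (f(x + eps*I[i] + eps*I[j]) - f(x + eps*I[i] - eps*I[j])
                       - f(x - eps*I[i] + eps*I[j]) + f(x - eps*I[i] - eps*I[j])) / (4 * eps**2)
    return H

x = np.array([1.0, 2.0])
H = numerical_hessian(f, x)
print(np.allclose(H, H.T, atol=1e-3))   # symmetric, as Clairaut's theorem guarantees
print(np.linalg.eigvalsh(H))            # real eigenvalues of the symmetric Hessian
```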

Second-Order Taylor Expansion

The second-order Taylor expansion of $f$ around $x$ in direction $d \in \mathbb{R}^n$ is

$$f(x + d) = f(x) + \nabla f(x)^T d + \frac{1}{2} d^T H(x) d + O(|d|^3)$$

The quadratic form $Q(d) = d^T H d = \sum_{i,j} H_{ij} d_i d_j$ is the Hessian’s action on $d$. It measures how much the function curves in the direction $d$: if $Q(d) > 0$ the surface curves up (like a bowl), if $Q(d) < 0$ it curves down (like an inverted bowl), and if $Q$ changes sign the surface is saddle-shaped.

Quadratic bowl (positive definite H):        Saddle (indefinite H):

    f                                             f
    |    .  .  .                                  |  .       .
    |  .      .                                   | .    +    .
    |  .  * * * . (elliptical contours)           |. -    - .
    |  .      .                                   |.    ---   .
    |    .  .  .                                  | .       .
    +--------->                                   +--------->
         x                                             x

    min at center (H > 0)                   saddle at center (H indef.)
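
Numerically, the sign pattern of $Q(d) = d^T H d$ over random directions distinguishes the two pictures. A small sketch (NumPy, with two arbitrary example matrices):

```python
import numpy as np

H_bowl   = np.array([[2.0, 0.0], [0.0, 1.0]])    # both eigenvalues positive
H_saddle = np.array([[1.0, 0.0], [0.0, -1.0]])   # one positive, one negative eigenvalue

rng = np.random.default_rng(0)
for name, H in [("bowl", H_bowl), ("saddle", H_saddle)]:
    directions = rng.normal(size=(1000, 2))
    signs = {int(np.sign(d @ H @ d)) for d in directions}
    print(name, signs)   # bowl -> {1}; saddle -> {-1, 1}
```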

Quadratic Forms and Definiteness

A symmetric matrix $H$ is called:

  • Positive definite ($H \succ 0$): $d^T H d > 0$ for all $d \neq 0$. Equivalently, all eigenvalues $\lambda_i > 0$.
  • Positive semidefinite ($H \succeq 0$): $d^T H d \geq 0$ for all $d$. All eigenvalues $\geq 0$.
  • Negative definite ($H \prec 0$): $d^T H d < 0$ for all $d \neq 0$. All eigenvalues $< 0$.
  • Negative semidefinite ($H \preceq 0$): $d^T H d \leq 0$ for all $d$. All eigenvalues $\leq 0$.
  • Indefinite: has both positive and negative eigenvalues.

Sylvester’s Criterion. $H \succ 0$ if and only if all leading principal minors are positive, i.e., $\det(H_{1:k, 1:k}) > 0$ for $k = 1, 2, \ldots, n$. This gives an algebraic test that avoids computing eigenvalues explicitly.
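
The sketch below (NumPy, with an arbitrary $2 \times 2$ test matrix) checks positive definiteness three equivalent ways: the eigenvalue test, Sylvester's leading-minor test, and a Cholesky factorization, which succeeds exactly when the matrix is positive definite.

```python
import numpy as np

H = np.array([[4.0, 1.0],
              [1.0, 3.0]])   # arbitrary symmetric test matrix

# 1. Eigenvalue test: all eigenvalues strictly positive.
print(np.all(np.linalg.eigvalsh(H) > 0))

# 2. Sylvester's criterion: all leading principal minors strictly positive.
print(all(np.linalg.det(H[:k, :k]) > 0 for k in range(1, H.shape[0] + 1)))

# 3. Cholesky factorization: exists iff H is positive definite.
try:
    np.linalg.cholesky(H)
    print(True)
except np.linalg.LinAlgError:
    print(False)
```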

Second-Order Optimality Conditions

Theorem (Necessary conditions). If $x^\ast$ is a local minimum of $f \in C^2$, then:

  1. $\nabla f(x^\ast) = 0$ (stationarity)
  2. $H(x^\ast) \succeq 0$ (positive semidefinite)

Theorem (Sufficient conditions). If $\nabla f(x^\ast) = 0$ and $H(x^\ast) \succ 0$, then $x^\ast$ is a strict local minimum.

Proof sketch. By the second-order Taylor expansion, $f(x^\ast + d) = f(x^\ast) + \frac{1}{2} d^T H(x^\ast) d + O(|d|^3)$. Since $H(x^\ast) \succ 0$, we have $d^T H(x^\ast) d \geq \lambda_{\min}|d|^2 > 0$ for $d \neq 0$, where $\lambda_{\min} > 0$ is the smallest eigenvalue of $H(x^\ast)$. For sufficiently small $|d|$ this quadratic term dominates the cubic remainder, giving $f(x^\ast + d) > f(x^\ast)$.

Classification at a critical point ($\nabla f(x^\ast) = 0$):

    Hessian at $x^\ast$                            Type of critical point
    $H \succ 0$                                    Strict local minimum
    $H \prec 0$                                    Strict local maximum
    $H$ indefinite                                 Saddle point
    $H \succeq 0$ or $H \preceq 0$ (singular)      Inconclusive; higher-order analysis needed
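
This table translates directly into a small eigenvalue-based classifier. A minimal sketch (NumPy), where the tolerance for treating an eigenvalue as zero is an arbitrary choice:

```python
import numpy as np

def classify_critical_point(H, tol=1e-10):
    """Classify a critical point (where the gradient vanishes) from its Hessian."""
    lam = np.linalg.eigvalsh(H)
    if np.all(lam > tol):
        return "strict local minimum"
    if np.all(lam < -tol):
        return "strict local maximum"
    if np.any(lam > tol) and np.any(lam < -tol):
        return "saddle point"
    return "inconclusive (singular Hessian): higher-order analysis needed"

print(classify_critical_point(np.diag([2.0, 5.0])))    # strict local minimum
print(classify_critical_point(np.diag([2.0, -5.0])))   # saddle point
```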

Newton’s Method

Newton’s method for minimizing $f$ is derived by minimizing the second-order Taylor approximation at each step. Setting the gradient of $f(x_k) + \nabla f(x_k)^T d + \frac{1}{2} d^T H(x_k) d$ to zero gives $H(x_k) d = -\nabla f(x_k)$, so the Newton step is

$$x_{k+1} = x_k - H(x_k)^{-1} \nabla f(x_k)$$

Theorem (Quadratic convergence of Newton’s method). Let $f \in C^3$ with a local minimum at $x^\ast$ where $H(x^\ast) \succ 0$. If $x_0$ is sufficiently close to $x^\ast$, then Newton’s method converges and

$$|x_{k+1} - x^\ast| \leq C |x_k - x^\ast|^2$$

for some constant $C > 0$ depending on the Lipschitz constant of the Hessian. This is quadratic convergence: the number of correct decimal digits roughly doubles at each step, compared to the linear convergence of gradient descent.
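
The digit-doubling behavior is easy to observe numerically. Below is a minimal sketch of the pure Newton iteration on the Rosenbrock function (minimum at $(1, 1)$), using SciPy's built-in `rosen`, `rosen_der`, and `rosen_hess`; the starting point is an arbitrary choice close enough to the minimum for local convergence.

```python
import numpy as np
from scipy.optimize import rosen, rosen_der, rosen_hess

x = np.array([0.95, 0.90])          # start near the known minimum (1, 1)
x_star = np.array([1.0, 1.0])
for k in range(6):
    # Newton step: solve H(x) d = -grad f(x)
    d = np.linalg.solve(rosen_hess(x), -rosen_der(x))
    x = x + d
    print(k, rosen(x), np.linalg.norm(x - x_star))   # error roughly squares each iteration
```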

Why Newton beats gradient descent near the optimum. Gradient descent’s step size $\eta$ must satisfy roughly $\eta < 2/\lambda_{\max}(H)$ to avoid divergence, so progress along directions with small eigenvalues is slow; the condition number $\lambda_{\max}/\lambda_{\min}$ governs its linear convergence rate. Newton’s method automatically rescales by $H^{-1}$, making it invariant to linear reparametrizations: if you change coordinates by an invertible linear map $A$, Newton’s iterates (expressed in the original coordinates) are unchanged. Gradient descent is not invariant.

Quasi-Newton Methods: BFGS and L-BFGS

Computing and inverting $H(x_k)$ costs $O(n^2)$ memory and $O(n^3)$ time, making exact Newton infeasible for large $n$. Quasi-Newton methods build a sequence of approximations $B_k \approx H(x_k)^{-1}$ using only gradient information.

The BFGS update (Broyden–Fletcher–Goldfarb–Shanno) maintains the approximation via

$$B_{k+1} = \left(I - \frac{s_k y_k^T}{y_k^T s_k}\right) B_k \left(I - \frac{y_k s_k^T}{y_k^T s_k}\right) + \frac{s_k s_k^T}{y_k^T s_k}$$

where $s_k = x_{k+1} - x_k$ and $y_k = \nabla f(x_{k+1}) - \nabla f(x_k)$. This is the unique symmetric rank-2 update that satisfies the secant condition $B_{k+1} y_k = s_k$ (the quasi-Newton condition) and is closest to $B_k$ in a certain matrix norm.

L-BFGS (Limited-memory BFGS) stores only the last $m$ pairs $(s_k, y_k)$ (typically $m \in [5, 20]$) and computes the product $B_k \nabla f(x_k)$ implicitly in $O(mn)$ time via the two-loop recursion, without ever forming $B_k$. This is the standard optimizer for large-scale smooth optimization and is widely used in scientific computing and ML (e.g., second-stage fine-tuning).
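
In practice one rarely codes BFGS by hand. The sketch below uses SciPy's L-BFGS-B implementation on a 100-dimensional Rosenbrock problem; the test function, dimension, starting point, and memory setting are arbitrary illustrative choices.

```python
import numpy as np
from scipy.optimize import minimize, rosen, rosen_der

x0 = np.zeros(100)                            # 100-dimensional Rosenbrock problem
res = minimize(rosen, x0, jac=rosen_der, method="L-BFGS-B",
               options={"maxcor": 10})        # maxcor = m, number of stored (s_k, y_k) pairs
print(res.success, res.nit, res.fun)          # status, iteration count, final objective
```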

Gauss-Newton for Nonlinear Least Squares

For problems of the form $\min_x \frac{1}{2}|r(x)|^2$ with residual $r: \mathbb{R}^n \to \mathbb{R}^m$ and Jacobian $J_r \in \mathbb{R}^{m \times n}$, the exact Hessian is

$$H = J_r^T J_r + \sum_i r_i(x) H_{r_i}$$

where $H_{r_i}$ denotes the Hessian of the $i$-th residual. The Gauss-Newton approximation drops the second-order term (valid when residuals are small) to get $H \approx J_r^T J_r \succeq 0$. The resulting update $x_{k+1} = x_k - (J_r^T J_r)^{-1} J_r^T r(x_k)$ is exactly a least-squares normal-equations step. Gauss-Newton achieves superlinear convergence when the residuals at the solution are small, at lower cost than full Newton.
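
A minimal Gauss-Newton sketch (NumPy) on a synthetic exponential curve fit; the model $y \approx a e^{bt}$, the data, and the starting point are illustrative assumptions.

```python
import numpy as np

# Synthetic data from y = 2 * exp(-1.5 t) plus small noise (illustrative).
rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 20)
y = 2.0 * np.exp(-1.5 * t) + 0.01 * rng.normal(size=t.size)

def residual(x):
    a, b = x
    return a * np.exp(b * t) - y          # r_i(x) = model(t_i; x) - y_i

def jacobian(x):
    a, b = x
    return np.column_stack([np.exp(b * t),            # dr_i/da
                            a * t * np.exp(b * t)])   # dr_i/db

x = np.array([1.0, -1.0])                 # starting guess
for _ in range(10):
    J, r = jacobian(x), residual(x)
    x = x - np.linalg.solve(J.T @ J, J.T @ r)   # Gauss-Newton step
print(x)   # approaches (2, -1.5), since the residuals at the solution are small
```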

Examples

Why second-order methods are expensive. For a network with $n$ parameters, the Hessian has $n^2$ entries ($n$ can be $10^8$ or more). A single Hessian-vector product $Hv$ can still be computed at roughly the cost of one extra gradient evaluation (via forward-over-reverse or reverse-over-forward AD), which is acceptable, but solving $Hd = -\nabla f$ requires $O(n^2)$ memory and $O(n^3)$ time for exact methods.

Loss landscape geometry. Empirically, the loss surface of deep networks is dominated by saddle points near initialization: for random parameter vectors, the Hessian typically has many negative eigenvalues. True local minima (where $H \succ 0$) become more common as training progresses. This explains why SGD with noise is useful: stochastic gradient noise helps escape saddle points by providing random perturbations in all directions.

Sharpness and generalization. The sharpness of a minimum is often measured by $\lambda_{\max}(H)$: flat minima (small $\lambda_{\max}$) tend to generalize better. Sharpness-aware minimization (SAM) explicitly seeks flat minima by minimizing the perturbed loss $\max_{|r| \leq \rho} f(x + r)$: each step updates $x$ using the gradient $\nabla f(x + r^\ast)$ evaluated at the approximate worst-case perturbation $r^\ast = \arg\max_{|r| \leq \rho} f(x+r)$.
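
A rough sketch of a single SAM-style update (NumPy), using the standard first-order approximation $r^\ast \approx \rho\, \nabla f(x)/|\nabla f(x)|$; the toy objective, step size, and $\rho$ are illustrative assumptions.

```python
import numpy as np

def grad(x):
    # Gradient of the toy objective f(x) = sum(x^4) + sum(x^2) (illustrative).
    return 4 * x**3 + 2 * x

def sam_step(x, lr=0.1, rho=0.05):
    g = grad(x)
    r_star = rho * g / (np.linalg.norm(g) + 1e-12)   # approximate worst-case perturbation
    return x - lr * grad(x + r_star)                 # descend using the perturbed gradient

x = np.array([1.0, -2.0])
for _ in range(50):
    x = sam_step(x)
print(x)   # approaches the minimum at the origin
```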

Hessian-free optimization. Some practical second-order methods (e.g., Martens’s Hessian-free optimization) solve $H d = -\nabla f$ approximately using conjugate gradients, computing $H v$ products via autodiff without ever forming $H$ explicitly. This brings second-order methods within reach for networks with millions of parameters.
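
A minimal sketch of this idea (NumPy): solve $Hd = -\nabla f$ with conjugate gradients, where each Hessian-vector product is approximated by a finite difference of gradients instead of forming $H$ (real Hessian-free optimizers obtain exact products via autodiff); the quadratic test problem is an illustrative assumption.

```python
import numpy as np

A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([1.0, -1.0])

def grad(x):
    # Gradient of the quadratic f(x) = 0.5 x^T A x - b^T x (illustrative test problem).
    return A @ x - b

def hvp(x, v, eps=1e-6):
    """Hessian-vector product H(x) v, approximated by a finite difference of gradients."""
    return (grad(x + eps * v) - grad(x)) / eps

def cg_newton_step(x, iters=25, tol=1e-10):
    """Approximately solve H d = -grad f(x) by conjugate gradients, using only H*v products."""
    rhs = -grad(x)
    d = np.zeros_like(rhs)
    r = rhs - hvp(x, d)
    p = r.copy()
    for _ in range(iters):
        Hp = hvp(x, p)
        alpha = (r @ r) / (p @ Hp)
        d = d + alpha * p
        r_new = r - alpha * Hp
        if np.linalg.norm(r_new) < tol:
            break
        p = r_new + ((r_new @ r_new) / (r @ r)) * p
        r = r_new
    return d

x = np.array([5.0, -3.0])
print(x + cg_newton_step(x))   # for a quadratic, this is the exact minimizer A^{-1} b
```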

