Hessians & Second-Order Methods - Curvature, Not Just Slope
Helpful context:
- Gradients & Partial Derivatives - Slopes in Every Direction at Once
- Eigenvalues & Eigenvectors - The Directions a Transformation Leaves Unchanged
You have been running gradient descent on a loss function, and the gradient finally dropped to zero. You are at a critical point - a place where $\nabla f = 0$. The question is: what did you find?
In single-variable calculus, you already know the answer: check the second derivative. If $f''(c) > 0$, you found a local minimum - the function curves upward, like a bowl. If $f''(c) < 0$, you found a local maximum - the function curves downward, like a hill. If $f''(c) = 0$, the test is inconclusive.
Now you are at a critical point of $f: \mathbb{R}^n \to \mathbb{R}$, a function of many variables. You want to run the same check. But “the second derivative” is now a matrix, not a number. The curvature now has a direction. You need to understand that matrix - the Hessian.
Section 1: Second Derivatives in One Variable - The Foundation
For $f: \mathbb{R} \to \mathbb{R}$, the second derivative $f''(x)$ measures how the slope is changing. If $f'$ is increasing, $f''$ is positive. If $f'$ is decreasing, $f''$ is negative.
Geometrically:
- $f''(x) > 0$: the graph curves upward (concave up). Near any point with $f'' > 0$, the function looks like the bottom of a bowl. Water poured here would pool.
- $f''(x) < 0$: the graph curves downward (concave down). Near any point with $f'' < 0$, the function looks like an inverted bowl. Water poured here would run off.
- $f''(x) = 0$: the curvature changes sign at an inflection point.
The second-derivative test for critical points: if $f'(c) = 0$ and $f''(c) > 0$, then $c$ is a local minimum. If $f'(c) = 0$ and $f''(c) < 0$, then $c$ is a local maximum.
Why does this work? Because near $c$, the Taylor expansion says:
$$f(c + h) \approx f(c) + f'(c) h + \frac{1}{2} f''(c) h^2 = f(c) + \frac{1}{2} f''(c) h^2$$
(since $f'(c) = 0$). If $f''(c) > 0$, then $f(c + h) > f(c)$ for all small $h \neq 0$, so $c$ is a minimum.
This is what we want to generalize. The question is: what replaces $f''(c)$ when $f$ has many inputs?
Section 2: From One Derivative to Many - Building the Hessian
Let $f: \mathbb{R}^2 \to \mathbb{R}$. At a critical point $(x_0, y_0)$, we have $\nabla f = 0$, meaning both $\frac{\partial f}{\partial x} = 0$ and $\frac{\partial f}{\partial y} = 0$.
To classify this critical point, we need to understand the curvature of $f$. But curvature now depends on direction.
In the $x$-direction: the curvature is $\frac{\partial^2 f}{\partial x^2}$. This is the second derivative of $f$ if you hold $y$ fixed and differentiate twice in $x$.
In the $y$-direction: the curvature is $\frac{\partial^2 f}{\partial y^2}$. This is the second derivative of $f$ if you hold $x$ fixed and differentiate twice in $y$.
But there is a third piece of information: the mixed partial $\frac{\partial^2 f}{\partial x \partial y}$. This measures the “coupling” or “twist” between the two variables - how the rate of change in $x$ changes as you vary $y$.
Collect all three into a matrix:
$$H = \begin{pmatrix} \frac{\partial^2 f}{\partial x^2} & \frac{\partial^2 f}{\partial x \partial y} \\ \frac{\partial^2 f}{\partial y \partial x} & \frac{\partial^2 f}{\partial y^2} \end{pmatrix}$$
This $2 \times 2$ matrix is the Hessian of $f$ at the critical point.
For general $f: \mathbb{R}^n \to \mathbb{R}$, the Hessian is the $n \times n$ matrix:
$$H_{ij} = \frac{\partial^2 f}{\partial x_i \partial x_j}$$
The $(i, j)$ entry is the second partial derivative: differentiate first with respect to $x_j$, then with respect to $x_i$.
The mapping to trace. In single-variable calculus, $f''(x)$ is a single number. In multivariable calculus, the Hessian $H$ is an $n \times n$ matrix. One number becomes an entire matrix. Curvature, which had one value in 1D, now has a value for every direction.
Discomfort check. In single-variable calculus, $f''(c) > 0$ directly tells you the function curves upward. So why can’t you just check the diagonal entries $f_{xx}$ and $f_{yy}$ - which measure curvature along each coordinate axis - and conclude the same thing in multiple dimensions?
Because the diagonal entries only capture curvature along the coordinate axes. The off-diagonal entries $f_{xy}$ encode coupling between variables, and this coupling can produce directions of strong curvature that the coordinate axes never reveal.
Example. Take the Hessian $$H = \begin{pmatrix} 1 & 2 \\ 2 & 1 \end{pmatrix}.$$ The diagonal says: $f_{xx} = 1 > 0$ (curves up in the $x$-direction) and $f_{yy} = 1 > 0$ (curves up in the $y$-direction). From the diagonals alone, you would conclude: positive curvature in both coordinate directions - this looks like a minimum.
But $\det(H) = 1 - 4 = -3 < 0$. The eigenvalues are $\lambda_1 = 3$ and $\lambda_2 = -1$. Moving in the direction $(1, -1)/\sqrt{2}$, the curvature is $-1$: the function curves downward there. This is a saddle point - and neither diagonal entry hinted at it.
What went wrong? The off-diagonal entry $f_{xy} = 2$ is large relative to the diagonal entries. It measures how the slope in the $x$-direction changes as you move in $y$. This coupling is strong enough to tilt the function so that the direction $(1, -1)$ - a combination of both axes - has negative curvature, even though each axis individually shows positive curvature.
Why eigenvalues solve this. An eigenvalue is not the curvature along a coordinate axis. It is the curvature along a principal direction - the direction of an eigenvector - where the Hessian acts as a pure scaling with no cross-coupling. Rotating to the eigenbasis eliminates the off-diagonal entries entirely: in that basis, the Hessian becomes diagonal with the eigenvalues on the diagonal, and each direction’s curvature is exactly its eigenvalue, with no interference from other directions.
To know whether $f$ curves upward in every direction, you need the curvature in every direction - not just the coordinate directions. The eigenvalues give you the curvature in the principal directions, and since those directions span the whole space, they settle the question completely. The single-variable case is special precisely because there is only one direction: “the diagonal entry” and “the eigenvalue” are the same thing. In multiple dimensions they diverge, and it is the eigenvalues that determine the shape.
Section 3: Schwarz’s Theorem - Why the Hessian is Symmetric
You might notice that $H$ has entries $H_{ij} = \frac{\partial^2 f}{\partial x_i \partial x_j}$ and $H_{ji} = \frac{\partial^2 f}{\partial x_j \partial x_i}$. Are these equal?
Differentiating first with respect to $x_j$ then $x_i$ might seem different from differentiating first with respect to $x_i$ then $x_j$. For most functions you will encounter, they are equal. This fact is Schwarz’s theorem (also called Clairaut’s theorem):
Theorem. If $f$ is twice continuously differentiable (its second partial derivatives exist and are continuous), then:
$$\frac{\partial^2 f}{\partial x_i \partial x_j} = \frac{\partial^2 f}{\partial x_j \partial x_i}$$
Consequence. The Hessian matrix $H$ is symmetric: $H_{ij} = H_{ji}$, or equivalently $H = H^T$.
This symmetry has profound consequences. A symmetric matrix has real eigenvalues and an orthogonal eigenbasis. The eigenvalues of $H$ will determine the type of critical point.
Discomfort check. Why should changing the order of differentiation give the same answer? The intuition: if $f$ is smooth, then the partial derivative $\frac{\partial f}{\partial x}$ is a well-behaved function of $y$, and differentiating it with respect to $y$ is the same as first understanding how $f$ changes with $y$ and then seeing how that change varies with $x$. For a sufficiently smooth function, these limits commute. The condition “continuous second partials” is the technical condition that makes this work. For pathological functions, mixed partials can differ, but such functions are rare in practice.
Example. $f(x, y) = x^2 y + xy^3$.
$$\frac{\partial f}{\partial x} = 2xy + y^3$$ $$\frac{\partial^2 f}{\partial y \partial x} = 2x + 3y^2$$
$$\frac{\partial f}{\partial y} = x^2 + 3xy^2$$ $$\frac{\partial^2 f}{\partial x \partial y} = 2x + 3y^2$$
Equal, as expected. The Hessian is:
$$H = \begin{pmatrix} 2y & 2x + 3y^2 \\ 2x + 3y^2 & 6xy \end{pmatrix}$$
It is symmetric.
Section 4: The Second-Order Taylor Expansion
In single-variable calculus, the Taylor expansion near a critical point $c$ is:
$$f(c + h) \approx f(c) + \underbrace{f'(c)}_{= 0} h + \frac{1}{2} f''(c) h^2 = f(c) + \frac{1}{2} f''(c) h^2$$
The quadratic term $\frac{1}{2} f''(c) h^2$ determines the local shape.
For $f: \mathbb{R}^n \to \mathbb{R}$, the analogous expansion near $x$ is:
$$f(x + \delta) \approx f(x) + \nabla f(x) \cdot \delta + \frac{1}{2} \delta^T H(x) \delta$$
The three terms correspond exactly to the 1D terms:
- $f(x)$: function value (matches $f(c)$)
- $\nabla f(x) \cdot \delta$: linear term using the gradient (matches $f'(c) h$)
- $\frac{1}{2} \delta^T H(x) \delta$: quadratic term using the Hessian (matches $\frac{1}{2} f''(c) h^2$)
At a critical point, $\nabla f = 0$, so:
$$f(x + \delta) \approx f(x) + \frac{1}{2} \delta^T H \delta$$
The Hessian quadratic form $\delta^T H \delta$ completely determines whether $f$ goes up or down in each direction $\delta$.
The mapping is explicit: $f''(c) h^2$ becomes $\delta^T H \delta$ - the scalar second derivative becomes a matrix quadratic form. Whether the quadratic form is positive for all $\delta \neq 0$ (positive definite) or negative for all $\delta$ (negative definite) determines the type of critical point.
Section 5: Positive Definite Matrices and the Bowl Shape
A symmetric matrix $H$ is called positive definite if $\delta^T H \delta > 0$ for all $\delta \neq 0$.
This is the multivariable analog of $f''(c) > 0$ in single-variable calculus.
Discomfort check. “Positive definite” is not the same as “all entries positive.” A matrix with all positive entries might not be positive definite. A matrix with some negative entries can be positive definite. What matters is the sign of the quadratic form $\delta^T H \delta$ for all nonzero $\delta$ - not the signs of the individual entries. A $2 \times 2$ example: the identity matrix $I = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}$ is positive definite ($\delta^T I \delta = \delta_1^2 + \delta_2^2 > 0$). The matrix $\begin{pmatrix} 2 & -3 \\ -3 & 5 \end{pmatrix}$ has a negative entry ($-3$) but is still positive definite.
Eigenvalue characterization. Since $H$ is symmetric, it has real eigenvalues $\lambda_1, \ldots, \lambda_n$ and an orthonormal eigenbasis $v_1, \ldots, v_n$ - $n$ eigenvectors that are mutually orthogonal and span all of $\mathbb{R}^n$. This is the Real Spectral Theorem; the proof that such a basis always exists for a symmetric matrix is the inductive argument in Eigenvalues & Eigenvectors . Write $\delta = c_1 v_1 + \cdots + c_n v_n$ in the eigenbasis. Then:
$$\delta^T H \delta = \sum_{i=1}^n \lambda_i c_i^2$$
This is a weighted sum of squares, with weights $\lambda_i$.
- If all $\lambda_i > 0$: the sum is always positive. $H$ is positive definite.
- If all $\lambda_i < 0$: the sum is always negative. $H$ is negative definite.
- If some $\lambda_i > 0$ and some $\lambda_i < 0$: the sum can be positive or negative depending on the direction. $H$ is indefinite.
Why this is the right characterization. In the eigenvector directions, the Hessian acts like a scalar curvature equal to the eigenvalue. If you move in the direction of eigenvector $v_i$, the Taylor expansion gives:
$$f(x + h v_i) \approx f(x) + \frac{1}{2} h^2 \lambda_i$$
So $\lambda_i$ is exactly the curvature of $f$ in the direction $v_i$. A positive eigenvalue means upward curvature (bowl) in that direction; a negative eigenvalue means downward curvature (hill) in that direction.
Discomfort check: how does the Hessian become diagonal in the eigenbasis?
A matrix representation records what a linear map does to each basis vector. In the standard basis ${e_1, e_2}$, the $(i,j)$ entry of $H$ records how much of $e_i$ appears in the output $He_j$. Off-diagonal entries are nonzero because $He_j$ is typically a mix of $e_1$ and $e_2$ - one axis’s output contaminates the other.
Switch to the eigenbasis ${v_1, \ldots, v_n}$. The $(i,j)$ entry now records how much of $v_i$ appears in $Hv_j$. By the definition of eigenvector: $$Hv_j = \lambda_j v_j.$$ The output lies entirely in the $v_j$ direction. There is no component of any other $v_i$ ($i \neq j$) present. So the off-diagonal entries vanish before any calculation is needed - it is an immediate consequence of what an eigenvector is.
The algebra. The $(i,j)$ entry in the eigenbasis is $v_i^T(Hv_j) = v_i^T(\lambda_j v_j) = \lambda_j(v_i \cdot v_j)$. Since $H$ is symmetric, its eigenvectors are orthogonal: $v_i \cdot v_j = 0$ for $i \neq j$. Off-diagonal entry: $\lambda_j \cdot 0 = 0$. Diagonal entry ($i = j$): $\lambda_i (v_i \cdot v_i) = \lambda_i \cdot 1 = \lambda_i$. The Hessian in the eigenbasis is $\text{diag}(\lambda_1, \ldots, \lambda_n)$.
Consequence for the quadratic form. Write any direction $\delta = \sum_i c_i v_i$. Expand: $$\delta^T H \delta = \Bigl(\sum_i c_i v_i\Bigr)^T H \Bigl(\sum_j c_j v_j\Bigr) = \sum_{i,j} c_i c_j v_i^T(Hv_j) = \sum_{i,j} c_i c_j \lambda_j (v_i \cdot v_j).$$ The orthogonality $v_i \cdot v_j = 0$ kills every cross term where $i \neq j$. Only $i = j$ terms survive: $$\delta^T H \delta = \sum_i c_i^2 \lambda_i.$$ Each eigenvector direction contributes independently, with eigenvalue $\lambda_i$ as the curvature weight. There is no interference between directions. This decoupling is exactly why eigenvalues are the right tool: they give the curvature in each principal direction with no cross-contamination from other directions.
Section 6: The Second-Derivative Test in 2D
For $f: \mathbb{R}^2 \to \mathbb{R}$ at a critical point $(x_0, y_0)$ with Hessian:
$$H = \begin{pmatrix} f_{xx} & f_{xy} \\ f_{xy} & f_{yy} \end{pmatrix}$$
(where $f_{xx} = \frac{\partial^2 f}{\partial x^2}$, $f_{xy} = \frac{\partial^2 f}{\partial x \partial y}$, $f_{yy} = \frac{\partial^2 f}{\partial y^2}$, and we used Schwarz’s theorem so $f_{xy} = f_{yx}$)
The test:
| Condition | Conclusion |
|---|---|
| $\det(H) > 0$ and $f_{xx} > 0$ | Local minimum |
| $\det(H) > 0$ and $f_{xx} < 0$ | Local maximum |
| $\det(H) < 0$ | Saddle point |
| $\det(H) = 0$ | Inconclusive |
Why this works. For a $2 \times 2$ symmetric matrix, the two eigenvalues satisfy $\lambda_1 \lambda_2 = \det(H)$ and $\lambda_1 + \lambda_2 = f_{xx} + f_{yy}$.
- $\det(H) > 0$: both eigenvalues have the same sign.
- $f_{xx} > 0$: trace is positive, so both eigenvalues are positive. $H$ is positive definite. Local minimum.
- $f_{xx} < 0$: trace is negative, so both eigenvalues are negative. $H$ is negative definite. Local maximum.
- $\det(H) < 0$: eigenvalues have opposite signs. $H$ is indefinite. Saddle point.
Example. $f(x, y) = x^3 - 3xy$.
$$f_x = 3x^2 - 3y, \quad f_y = -3x$$
Setting both to zero: $3x^2 - 3y = 0$ and $-3x = 0$, giving $x = 0$ and $y = 0$. The critical point is the origin.
$$f_{xx} = 6x, \quad f_{xy} = -3, \quad f_{yy} = 0$$
At $(0, 0)$:
$$H = \begin{pmatrix} 0 & -3 \\ -3 & 0 \end{pmatrix}$$
$\det(H) = 0 \cdot 0 - (-3)(-3) = -9 < 0$. Saddle point.
Example. $f(x, y) = x^2 + xy + y^2$.
$$f_x = 2x + y, \quad f_y = x + 2y$$
Setting both to zero: the only solution is $(0, 0)$.
$$H = \begin{pmatrix} 2 & 1 \\ 1 & 2 \end{pmatrix}$$
$\det(H) = 4 - 1 = 3 > 0$ and $f_{xx} = 2 > 0$. Local minimum.
The eigenvalues of $H$ are $\lambda = 1$ and $\lambda = 3$ (both positive), confirming positive definiteness.
Section 7: The General Case - Eigenvalues Determine Everything
For $f: \mathbb{R}^n \to \mathbb{R}$ at a critical point $x^\ast$ with Hessian $H$:
| Hessian condition | Type of critical point |
|---|---|
| All eigenvalues $> 0$ (positive definite) | Strict local minimum |
| All eigenvalues $< 0$ (negative definite) | Strict local maximum |
| Mixed positive and negative eigenvalues (indefinite) | Saddle point |
| All eigenvalues $\geq 0$ but some $= 0$ | Inconclusive |
Proof of sufficiency for minimum. Suppose all eigenvalues $\lambda_i > 0$. Let $\lambda_{\min} = \min_i \lambda_i > 0$. Then:
$$\delta^T H \delta = \sum_i \lambda_i c_i^2 \geq \lambda_{\min} \sum_i c_i^2 = \lambda_{\min} |\delta|^2$$
By the Taylor expansion:
$$f(x^\ast + \delta) \approx f(x^\ast) + \frac{1}{2} \delta^T H \delta \geq f(x^\ast) + \frac{\lambda_{\min}}{2} |\delta|^2 > f(x^\ast)$$
for small $\delta \neq 0$. So $x^\ast$ is a local minimum. $\square$
The same argument with $\lambda_{\max} < 0$ gives a local maximum.
The indefinite case: there exist directions $v_+$ (eigenvector of positive eigenvalue) where $f$ increases and directions $v_-$ (eigenvector of negative eigenvalue) where $f$ decreases. This is exactly a saddle - uphill in some directions, downhill in others.
Section 8: Newton’s Method - Using Curvature to Jump to the Minimum
Gradient descent uses only first-order information (the gradient). It is like navigating a hilly landscape by always walking downhill but with no sense of how far to step.
Newton’s method uses second-order information (the gradient and the Hessian). It approximates the landscape locally as a quadratic bowl and jumps directly to the bottom of that bowl.
The single-variable version. For $f: \mathbb{R} \to \mathbb{R}$, Newton’s method for finding a zero of $f'(x) = 0$:
$$x_{\text{new}} = x_{\text{old}} - \frac{f'(x_{\text{old}})}{f''(x_{\text{old}})}$$
This comes from the quadratic approximation: near $x$, $f(x + h) \approx f(x) + f'(x)h + \frac{1}{2}f''(x) h^2$. Minimizing over $h$: set the derivative to zero: $f'(x) + f''(x) h = 0$, giving $h = -f'(x)/f''(x)$. This is the Newton step.
The multivariable version. The same logic with the gradient and Hessian:
Near $x$: $f(x + \delta) \approx f(x) + \nabla f \cdot \delta + \frac{1}{2} \delta^T H \delta$.
Minimize over $\delta$: take the gradient with respect to $\delta$ and set to zero:
$$\nabla f + H\delta = 0 \implies \delta = -H^{-1} \nabla f$$
The Newton step is:
$$x_{\text{new}} = x_{\text{old}} - H^{-1} \nabla f$$
Compare to gradient descent: $x_{\text{new}} = x_{\text{old}} - \alpha \nabla f$.
In gradient descent, the step is $\alpha \nabla f$ - a scaled gradient, where $\alpha$ is chosen by hand.
In Newton’s method, the step is $H^{-1} \nabla f$ - the gradient premultiplied by the inverse Hessian. This automatically scales the gradient differently in each direction, according to the curvature.
The mapping: $\frac{f'(x)}{f''(x)}$ (scalar/scalar) becomes $H^{-1} \nabla f$ (matrix-vector product). The division by the second derivative becomes multiplication by the inverse of the Hessian matrix.
Section 9: Why Newton Converges Faster
Gradient descent near a minimum converges linearly: the error shrinks by a constant factor at each step. How fast depends on the condition number of the Hessian (discussed below). For badly conditioned problems, convergence is very slow.
Newton’s method near a minimum converges quadratically: if the current error is $\epsilon$, the error after one step is approximately $C\epsilon^2$ for some constant $C$. This means:
- Error of $0.1$: after one Newton step, error is approximately $0.01$.
- Error of $0.01$: after one Newton step, error is approximately $0.0001$.
The number of correct decimal digits doubles at each step. In practice, Newton’s method often reaches machine precision in 5-10 steps, regardless of the problem.
Why the quadratic convergence? The Taylor expansion has an error term of order $|\delta|^3$. Near the minimum, this cubic error is what remains after the Newton step eliminates the linear and quadratic terms. If the current distance to the minimum is $\epsilon$, the error after the Newton step is $O(\epsilon^3) / O(\epsilon^2) = O(\epsilon)$… actually $O(\epsilon^2)$ after careful analysis, because the cubic remainder is $O(\epsilon^3)$ and dividing by the Hessian (which is $O(1)$) gives $O(\epsilon^3) / \lambda_{\min} \sim \epsilon^2$ when $\epsilon$ is already small.
The cost. The Newton step requires solving $H\delta = -\nabla f$, which in general costs $O(n^3)$ time and $O(n^2)$ memory. For a neural network with $n = 10^8$ parameters, the Hessian is an $10^8 \times 10^8$ matrix with $10^{16}$ entries - completely infeasible to store or invert.
This is why gradient descent, despite slower convergence, is the default optimizer in deep learning: it costs $O(n)$ per step (just one backward pass), while Newton would cost $O(n^3)$.
Section 10: The Condition Number - Why Gradient Descent Can Fail
Suppose the Hessian at a minimum has eigenvalues $\lambda_{\min}$ and $\lambda_{\max}$ (the smallest and largest). The condition number is:
$$\kappa(H) = \frac{\lambda_{\max}}{\lambda_{\min}}$$
The condition number measures how “elliptical” the bowl is. If $\kappa = 1$, the bowl is perfectly round - same curvature in every direction. If $\kappa = 1000$, the bowl is extremely elongated - very steep in one direction, very flat in another.
Why a high condition number makes gradient descent slow. Gradient descent moves in the direction $-\nabla f$. The optimal learning rate is roughly $\alpha = 1/\lambda_{\max}$ (to avoid diverging in the steep direction). But with this learning rate, the gradient in the flat direction ($\lambda_{\min}$) barely moves: each step changes the flat coordinate by a factor of roughly $1 - \lambda_{\min}/\lambda_{\max} = 1 - 1/\kappa$.
For $\kappa = 1000$, gradient descent contracts the error by a factor of only $1 - 1/1000 = 0.999$ per step. To reduce the error by half requires about $\ln(2)/\ln(1/0.999) \approx 693$ steps.
Meanwhile, in the steep direction, the gradient is large and the same step moves you close to the minimum quickly - but the flat direction makes you take hundreds more steps.
Newton’s method is immune to this. The Newton step $H^{-1} \nabla f$ automatically accounts for the curvature in each direction. In the eigenbasis of $H$, the Newton step in direction $v_i$ is $g_i / \lambda_i$, where $g_i$ is the gradient component in that direction. The step is smaller in steep directions and larger in flat directions - exactly compensating for the curvature. Newton’s method converges in a bounded number of steps regardless of the condition number.
The practical implication. When you find training a neural network slow, often the culprit is a high condition number of the loss landscape. Learning rate schedules, momentum, and adaptive methods like Adam are all attempts to approximate the conditioning benefit of Newton’s method at lower cost.
Section 11: Quasi-Newton Methods - The Practical Compromise
Exact Newton is too expensive ($O(n^3)$ per step). Pure gradient descent is too slow (linear convergence, sensitive to condition number). Quasi-Newton methods are the compromise: approximate the Hessian inverse using gradient information accumulated over multiple steps.
The BFGS update maintains an approximation $B \approx H^{-1}$ that is updated after each step using the observed curvature. Given the step $s = x_{\text{new}} - x_{\text{old}}$ and gradient change $y = \nabla f(x_{\text{new}}) - \nabla f(x_{\text{old}})$, the update is a rank-2 correction to $B$:
$$B_{\text{new}} = \left(I - \frac{s y^T}{y^T s}\right) B \left(I - \frac{y s^T}{y^T s}\right) + \frac{s s^T}{y^T s}$$
The key property: the secant condition $B_{\text{new}} y = s$, which says the approximation correctly predicts the observed gradient change. BFGS converges superlinearly (faster than linear, slower than quadratic), making it much better than gradient descent for smooth problems.
L-BFGS (Limited-memory BFGS): instead of storing the full $n \times n$ matrix $B$ (impossible for large $n$), store only the last $m$ pairs $(s, y)$ (typically $m \in [5, 20]$). The matrix-vector product $B v$ can then be computed in $O(mn)$ time using these stored pairs, without ever forming $B$ explicitly.
L-BFGS is the standard optimizer for large-scale smooth optimization: scientific computing, classical machine learning (logistic regression, SVMs), and second-stage fine-tuning in some ML pipelines.
Section 12: The Loss Landscape and Saddle Points
In machine learning, understanding the geometry of the loss function matters for training.
At initialization. For a randomly initialized deep network, the Hessian of the loss typically has many negative eigenvalues. This means the initialization is usually near a saddle point - not a local minimum. Moving away from such saddle points is a key challenge early in training.
Gradient descent with noise (stochastic gradient descent, SGD) is helpful here: the random noise in the gradient estimate provides random perturbations in all directions, eventually pushing the iterate off a saddle and toward a descent direction.
At the end of training. After convergence, the loss is typically much smaller, and the Hessian is more likely to be positive semidefinite. True local minima become more common as training progresses.
Sharp minima vs. flat minima. The maximum eigenvalue $\lambda_{\max}(H)$ measures the “sharpness” of a minimum. A minimum with $\lambda_{\max} = 10^6$ is very sharp - a tiny change in parameters causes a large change in loss. A minimum with $\lambda_{\max} = 1$ is flat - parameters can vary considerably without much change.
Empirically, flat minima generalize better to new data. The intuition: a sharp minimum is highly specific to the training data, while a flat minimum is robust to small perturbations (including the statistical noise between training and test data).
Sharpness-aware minimization (SAM). SAM finds flat minima by explicitly penalizing sharpness: at each step, it first moves to the worst-case nearby point (the point of maximum loss in a small neighborhood), then takes a gradient step from there. This biases training toward flat regions of the loss landscape.
Section 13: The Hessian in 2D - Concrete Geometry
Let us work through the geometry carefully for $f: \mathbb{R}^2 \to \mathbb{R}$.
At a critical point, the Hessian is:
$$H = \begin{pmatrix} a & b \\ b & d \end{pmatrix}$$
The quadratic form is $\delta^T H \delta = a\delta_x^2 + 2b\delta_x\delta_y + d\delta_y^2$.
The level curves of this quadratic form are the contour lines of the local quadratic approximation to $f$. Their shape reveals the geometry:
Case 1: $H$ positive definite ($\det H > 0$, $a > 0$). The quadratic form is positive in all directions. The level curves are ellipses. The function is a bowl - higher in all directions from the critical point.
The eigenvectors of $H$ are the axes of the ellipse. The corresponding eigenvalues determine the “width” of the ellipse along each axis: a large eigenvalue means steep curvature (narrow axis), a small eigenvalue means gentle curvature (wide axis).
Case 2: $H$ negative definite ($\det H > 0$, $a < 0$). The quadratic form is negative in all directions. The level curves are still ellipses, but the function is a cap - lower in all directions.
Case 3: $H$ indefinite ($\det H < 0$). The quadratic form is positive in some directions and negative in others. The level curves are hyperbolas. The function is a saddle - rising in the eigenvector direction of the positive eigenvalue, falling in the eigenvector direction of the negative eigenvalue.
The word “saddle” is apt: a mountain pass rises in one direction (toward the mountains) and falls in the other (toward the valleys). The saddle point is a local minimum in one cross-section and a local maximum in another.
Section 14: Computing Hessians - Worked Examples
Example 1. $f(x, y) = x^4 + y^4 - 4xy$.
$$f_x = 4x^3 - 4y, \quad f_y = 4y^3 - 4x$$
Setting both to zero: $x^3 = y$ and $y^3 = x$. Substituting: $x^9 = x$, so $x(x^8 - 1) = 0$, giving $x = 0$ or $x = \pm 1$.
Critical points: $(0, 0)$, $(1, 1)$, $(-1, -1)$.
$$f_{xx} = 12x^2, \quad f_{xy} = -4, \quad f_{yy} = 12y^2$$
At $(0, 0)$:
$$H = \begin{pmatrix} 0 & -4 \\ -4 & 0 \end{pmatrix}, \quad \det H = -16 < 0$$
Saddle point.
At $(1, 1)$:
$$H = \begin{pmatrix} 12 & -4 \\ -4 & 12 \end{pmatrix}, \quad \det H = 144 - 16 = 128 > 0, \quad f_{xx} = 12 > 0$$
Local minimum. Eigenvalues: $12 \pm 4$, so $\lambda_1 = 8$ and $\lambda_2 = 16$. Condition number $\kappa = 16/8 = 2$ - a well-conditioned minimum.
At $(-1, -1)$: same Hessian as $(1, 1)$ (by symmetry). Local minimum.
Example 2. $f(x, y, z) = x^2 + 2y^2 + 3z^2 + xy$.
$$H = \begin{pmatrix} 2 & 1 & 0 \\ 1 & 4 & 0 \\ 0 & 0 & 6 \end{pmatrix}$$
This is block diagonal. The $z$-direction decouples. The eigenvalues of the top-left $2 \times 2$ block are $(2 + 4 \pm \sqrt{(4-2)^2 + 4})/2 = 3 \pm \sqrt{2}$, both positive. And $\lambda_3 = 6$. All eigenvalues positive: $H$ is positive definite. The only critical point (the origin) is a local minimum.
Section 15: Sylvester’s Criterion - Testing Positive Definiteness Without Eigenvalues
Computing eigenvalues requires solving a degree-$n$ polynomial - expensive for large matrices. Sylvester’s criterion gives an algebraic test.
Definition. The leading principal minors of $H$ are the determinants of the upper-left $k \times k$ submatrices for $k = 1, 2, \ldots, n$.
Theorem (Sylvester’s criterion). $H$ is positive definite if and only if all leading principal minors are positive.
For a $3 \times 3$ matrix:
$$H = \begin{pmatrix} a & b & c \\ b & e & f \\ c & f & g \end{pmatrix}$$
$H$ is positive definite iff:
- $a > 0$
- $\det \begin{pmatrix} a & b \\ b & e \end{pmatrix} = ae - b^2 > 0$
- $\det(H) > 0$
Example. Is $H = \begin{pmatrix} 4 & 2 & 1 \\ 2 & 3 & 1 \\ 1 & 1 & 2 \end{pmatrix}$ positive definite?
First minor: $4 > 0$. Second minor: $4 \cdot 3 - 2 \cdot 2 = 8 > 0$. Third minor: $\det(H) = 4(6 - 1) - 2(4 - 1) + 1(2 - 3) = 20 - 6 - 1 = 13 > 0$.
All positive: $H$ is positive definite.
The $2 \times 2$ case. The conditions $\det(H) > 0$ and $f_{xx} > 0$ used in the second-derivative test are exactly Sylvester’s criterion for $2 \times 2$ matrices. The first minor is $f_{xx}$, and the second minor is $f_{xx} f_{yy} - f_{xy}^2 = \det(H)$.
Section 16: The Gauss-Newton Matrix - A Practical Approximation
Many optimization problems in ML have the form:
$$\min_\theta \frac{1}{2} \sum_{i=1}^N r_i(\theta)^2 = \frac{1}{2} |r(\theta)|^2$$
where $r: \mathbb{R}^n \to \mathbb{R}^N$ is a vector of residuals (prediction errors). This is nonlinear least squares.
The exact Hessian of $\frac{1}{2}|r(\theta)|^2$ is:
$$H = J_r^T J_r + \sum_{i=1}^N r_i H_{r_i}$$
where $J_r$ is the Jacobian of $r$ and $H_{r_i}$ is the Hessian of the $i$-th residual.
The second term involves the Hessians of each residual - expensive and can be negative (making $H$ indefinite). The Gauss-Newton approximation drops this term:
$$H \approx J_r^T J_r$$
This approximation is:
- Always positive semidefinite (since $v^T J_r^T J_r v = |J_r v|^2 \geq 0$)
- Accurate when residuals are small (the dropped term $\sum r_i H_{r_i}$ is small when $r_i \approx 0$)
- Computable from only the Jacobian (no second derivatives needed)
The Gauss-Newton step is: $\theta \leftarrow \theta - (J_r^T J_r)^{-1} J_r^T r(\theta)$.
This is exactly the least-squares normal equations: the update minimizes $|r(\theta) + J_r \delta|^2$ over $\delta$, which is a linear least-squares problem.
For neural networks. The loss $L = \frac{1}{2N}\sum_i (f_\theta(x_i) - y_i)^2$ is nonlinear least squares. The Gauss-Newton matrix $J^T J / N$ (where $J$ is the Jacobian of predictions with respect to $\theta$) approximates the Hessian without computing any second derivatives. This is cheaper and always positive semidefinite.
The empirical Fisher information $\hat{F} = \frac{1}{N} \sum_i \nabla_\theta \log p_\theta(y_i|x_i) \nabla_\theta \log p_\theta(y_i|x_i)^T$ is another positive semidefinite approximation to the Hessian, related to Gauss-Newton for maximum likelihood estimation.
Section 17: Hessian-Vector Products Without Computing the Hessian
For a network with $n$ parameters, the Hessian has $n^2$ entries. Storing it requires $n^2$ memory. For $n = 10^6$, that is $10^{12}$ numbers - completely infeasible.
But many algorithms only need the product $Hv$ for a given vector $v$, not $H$ itself. This includes:
- Conjugate gradient iterations (for Newton-CG)
- Power iteration to find the largest eigenvalue
- Variance of the gradient in second-order methods
The trick: forward-over-reverse autodiff. The Hessian-vector product $Hv = \frac{d}{d\epsilon}\nabla f(x + \epsilon v)\big|_{\epsilon=0}$ can be computed as:
- Run reverse-mode autodiff (backpropagation) to compute $g(\epsilon) = \nabla f(x + \epsilon v)$.
- Differentiate $g$ with respect to $\epsilon$ using forward-mode autodiff.
Cost: two passes through the network (one forward, one backward). Memory: $O(n)$. The Hessian is never formed explicitly.
In JAX: jax.hessian(f) computes the full Hessian, while jax.grad(lambda x: jax.grad(f)(x) @ v) computes the Hessian-vector product more efficiently.
Power iteration for the largest eigenvalue. Repeatedly compute $v \leftarrow Hv / |Hv|$. Each iteration costs one Hessian-vector product: $O(n)$. After convergence, $v$ is the eigenvector of the largest eigenvalue. The convergence rate depends on the gap between the two largest eigenvalues - fast if the largest eigenvalue is isolated.
This is used to estimate the condition number and the sharpness $\lambda_{\max}$ of a minimum, which predicts generalization.
Summary
| Concept | Single-Variable Analog | Multivariable (Hessian) Version |
|---|---|---|
| Second derivative | $f''(x)$ - a number | $H$ - an $n \times n$ matrix |
| Positive curvature | $f''(c) > 0$ | $H$ positive definite (all eigenvalues $> 0$) |
| Negative curvature | $f''(c) < 0$ | $H$ negative definite (all eigenvalues $< 0$) |
| Mixed curvature | Cannot happen in 1D | $H$ indefinite (mixed eigenvalues): saddle point |
| Local minimum test | $f'(c) = 0$, $f''(c) > 0$ | $\nabla f = 0$, $H$ positive definite |
| Local maximum test | $f'(c) = 0$, $f''(c) < 0$ | $\nabla f = 0$, $H$ negative definite |
| Taylor expansion | $f(c+h) \approx f(c) + \frac{1}{2}f''(c) h^2$ | $f(x+\delta) \approx f(x) + \nabla f \cdot \delta + \frac{1}{2}\delta^T H \delta$ |
| Newton step | $-f'(x)/f''(x)$ | $-H^{-1} \nabla f$ |
| Condition number | $ | f'' |
The Hessian is not a complicated object - it is simply the matrix of all second partial derivatives. What makes it powerful is its eigenvalues: they tell you the curvature of $f$ in every direction simultaneously, and from the curvature, you can classify critical points, understand convergence rates, and design better optimization algorithms.
Read Next: