Prerequisite: Gradients

Overview

First-order methods like gradient descent use only $\nabla f(x)$ and can take thousands of iterations on ill-conditioned problems. Second-order methods incorporate curvature information - the Hessian $\nabla^2 f(x)$ - to take much larger, better-directed steps. BFGS (Broyden-Fletcher-Goldfarb-Shanno) is the dominant quasi-Newton method: it achieves superlinear convergence by building up an approximation to the inverse Hessian from the gradient differences observed along the optimization trajectory, without ever differentiating twice.

Newton’s Method

Newton’s method applies a quadratic approximation of $f$ around the current point $x_k$:

$$f(x) \approx f(x_k) + \nabla f(x_k)^T (x - x_k) + \frac{1}{2}(x - x_k)^T \nabla^2 f(x_k)(x - x_k)$$

Minimising this quadratic gives the Newton step:

$$x_{k+1} = x_k - \left[\nabla^2 f(x_k)\right]^{-1} \nabla f(x_k)$$

Newton’s method has quadratic convergence near a minimum: if $|x_k - x^\ast| < \epsilon$, then $|x_{k+1} - x^\ast| = O(\epsilon^2)$, so the number of correct digits roughly doubles each iteration. However, computing and inverting (or factorising) the Hessian costs $O(n^2)$ storage and $O(n^3)$ time per step - prohibitive for $n \gtrsim 10^4$.
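To make the step concrete, here is a minimal sketch of a pure (undamped) Newton iteration on the 2-D Rosenbrock function, using SciPy's bundled test function and its analytic derivatives; the function and starting point are illustrative choices, not part of the text above.

import numpy as np
from scipy.optimize import rosen, rosen_der, rosen_hess

x = np.array([-1.2, 1.0])               # classic Rosenbrock starting point
for k in range(50):
    g = rosen_der(x)
    if np.linalg.norm(g) < 1e-10:
        break
    # Solve the Newton system H p = g rather than forming the explicit inverse
    p = np.linalg.solve(rosen_hess(x), g)
    x = x - p
print(k, x)                             # reaches (1, 1) in roughly ten iterations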

Quasi-Newton and the Secant Condition

Quasi-Newton methods replace $\nabla^2 f(x_k)$ with an approximation $B_k$ (Hessian approximation) or $H_k = B_k^{-1}$ (inverse Hessian approximation), updated cheaply at each step using the gradient difference.

Define: $$s_k = x_{k+1} - x_k, \qquad y_k = \nabla f_{k+1} - \nabla f_k$$

By Taylor’s theorem, $y_k \approx \nabla^2 f(x_{k+1})\, s_k$, so a good approximation should satisfy the secant condition $B_{k+1} s_k = y_k$ (or equivalently $H_{k+1} y_k = s_k$).
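As a quick sanity check of the secant relation: for a quadratic $f(x) = \frac{1}{2}x^T A x$ the gradient difference is exactly $y_k = A s_k$, so the true Hessian satisfies the condition. A throwaway NumPy verification (matrix and points chosen arbitrarily):

import numpy as np

rng = np.random.default_rng(0)
M = rng.standard_normal((5, 5))
A = M @ M.T + 5 * np.eye(5)          # a random symmetric positive definite "Hessian"
grad = lambda x: A @ x               # gradient of f(x) = 0.5 * x^T A x

x_k  = rng.standard_normal(5)
x_k1 = rng.standard_normal(5)
s = x_k1 - x_k
y = grad(x_k1) - grad(x_k)
print(np.allclose(y, A @ s))         # True: B = A satisfies B s = y exactly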

The BFGS Update

BFGS finds the update $H_{k+1}$ that satisfies the secant condition, remains symmetric positive definite, and is closest to $H_k$ in a weighted Frobenius norm. The solution is:

$$H_{k+1} = \left(I - \rho_k s_k y_k^T\right) H_k \left(I - \rho_k y_k s_k^T\right) + \rho_k s_k s_k^T$$

where $\rho_k = \frac{1}{y_k^T s_k}$. This is a rank-2 update that can be computed in $O(n^2)$ time.
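A NumPy sketch of this update, with the matrix products multiplied out into outer products so the cost stays $O(n^2)$ (the function name and layout are illustrative):

import numpy as np

def bfgs_update_inverse_hessian(H, s, y):
    """One BFGS rank-2 update of the inverse-Hessian approximation, in O(n^2)."""
    rho = 1.0 / (y @ s)              # assumes the curvature condition y^T s > 0
    Hy = H @ y
    # Expansion of (I - rho s y^T) H (I - rho y s^T) + rho s s^T into outer products
    return (H
            - rho * (np.outer(s, Hy) + np.outer(Hy, s))
            + (rho**2 * (y @ Hy) + rho) * np.outer(s, s))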

Equivalently, the Hessian approximation update (via Sherman-Morrison-Woodbury) is:

$$B_{k+1} = B_k + \frac{y_k y_k^T}{y_k^T s_k} - \frac{B_k s_k s_k^T B_k}{s_k^T B_k s_k}$$
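One way to convince yourself the two forms are consistent: apply both updates to an inverse pair $(B, H = B^{-1})$ with the same $(s, y)$ data and check numerically that the results are still inverses of each other (all inputs below are arbitrary):

import numpy as np

rng = np.random.default_rng(1)
n = 6
M = rng.standard_normal((n, n))
B = M @ M.T + n * np.eye(n)                  # arbitrary SPD Hessian approximation
H = np.linalg.inv(B)

s = rng.standard_normal(n)
y = B @ s + 0.1 * rng.standard_normal(n)     # any y with y^T s > 0 will do
assert y @ s > 0

rho = 1.0 / (y @ s)
V = np.eye(n) - rho * np.outer(s, y)
H_new = V @ H @ V.T + rho * np.outer(s, s)
B_new = B + np.outer(y, y) / (y @ s) - (B @ np.outer(s, s) @ B) / (s @ B @ s)
print(np.allclose(H_new @ B_new, np.eye(n)))   # True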

The descent step is then $x_{k+1} = x_k - \alpha_k H_k \nabla f_k$, where $\alpha_k$ is chosen by a line search satisfying the Wolfe conditions (see below). The positive-definiteness of $H_{k+1}$ is guaranteed by the curvature condition $y_k^T s_k > 0$, which is enforced by the Wolfe conditions.

The Wolfe Conditions

A line search that exactly minimises $f(x_k + \alpha d_k)$ over $\alpha$ is too expensive in practice. The Wolfe conditions provide a practical acceptance criterion for the step length:

Sufficient decrease (Armijo): $$f(x_k + \alpha_k d_k) \leq f(x_k) + c_1 \alpha_k \nabla f_k^T d_k$$

Curvature condition: $$\nabla f(x_k + \alpha_k d_k)^T d_k \geq c_2 \nabla f_k^T d_k$$

with $0 < c_1 < c_2 < 1$ (typically $c_1 = 10^{-4}$, $c_2 = 0.9$). The curvature condition rules out steps so short that almost no curvature information is gained; for a descent direction it guarantees $y_k^T s_k > 0$, preserving positive-definiteness of $H_{k+1}$.
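Putting the update and the line search together, a compact BFGS driver; this is a sketch rather than SciPy's own implementation, and it leans on scipy.optimize.line_search, which enforces the (strong) Wolfe conditions:

import numpy as np
from scipy.optimize import line_search, rosen, rosen_der

def bfgs(f, grad, x0, tol=1e-8, max_iter=200):
    x = np.asarray(x0, dtype=float)
    n = len(x)
    H = np.eye(n)                                    # initial inverse-Hessian guess
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        d = -H @ g                                   # quasi-Newton search direction
        alpha = line_search(f, grad, x, d, gfk=g)[0] # Wolfe line search
        if alpha is None:                            # line search failed; take a tiny step
            alpha = 1e-3
        x_new = x + alpha * d
        s, y = x_new - x, grad(x_new) - g
        if y @ s > 1e-12:                            # curvature condition holds
            rho = 1.0 / (y @ s)
            V = np.eye(n) - rho * np.outer(s, y)
            H = V @ H @ V.T + rho * np.outer(s, s)   # the BFGS update from above
        x = x_new
    return x

print(bfgs(rosen, rosen_der, [-1.2, 1.0]))           # converges to ~(1, 1)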

Convergence

  • BFGS: Superlinear convergence on strongly convex functions - $|x_k - x^\ast|$ goes to zero faster than any geometric sequence, though not as fast as quadratic Newton.
  • SR1 (Symmetric Rank-1): An alternative rank-1 update that does not guarantee positive-definiteness but can approximate indefinite Hessians, which is useful when $f$ is non-convex.
  • Linear convergence: First-order methods converge at rate $(1 - 1/\kappa)^k$, where $\kappa = \lambda_{\max}/\lambda_{\min}$ is the condition number. BFGS effectively rescales the geometry, making convergence nearly independent of $\kappa$ near the optimum; the sketch below illustrates this on an ill-conditioned quadratic.
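A small experiment illustrating that last point, comparing fixed-step gradient descent with SciPy's BFGS on an ill-conditioned quadratic; the dimension, eigenvalue range, and tolerances are arbitrary choices:

import numpy as np
from scipy.optimize import minimize

eigs = np.logspace(0, 3, 50)                     # eigenvalues 1 ... 1e3, so kappa = 1e3
A = np.diag(eigs)
f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x
x0 = np.ones(50)

# Fixed-step gradient descent with the classical optimal step 2 / (L + mu)
step = 2.0 / (eigs.max() + eigs.min())
x, n_gd = x0.copy(), 0
while np.linalg.norm(grad(x)) > 1e-6 and n_gd < 100_000:
    x = x - step * grad(x)
    n_gd += 1

res = minimize(f, x0, jac=grad, method='BFGS', options={'gtol': 1e-6})
print(n_gd, res.nit)   # thousands of gradient steps vs. on the order of n BFGS iterations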

L-BFGS: Limited-Memory BFGS

Full BFGS stores the $n \times n$ matrix $H_k$, which requires $O(n^2)$ memory - infeasible for $n \sim 10^6$ (deep learning, PDE-constrained optimization). L-BFGS stores only the last $m$ curvature pairs $\{(s_i, y_i)\}_{i=k-m}^{k-1}$ and reconstructs the product $H_k \nabla f_k$ implicitly using the two-loop recursion:

q ← ∇f_k
for i = k-1, ..., k-m:
    α_i ← ρ_i · s_i^T q
    q   ← q - α_i y_i
r ← H_0 q   (typically H_0 = (s_{k-1}^T y_{k-1} / y_{k-1}^T y_{k-1}) I)
for i = k-m, ..., k-1:
    β ← ρ_i · y_i^T r
    r ← r + (α_i - β) s_i
return r   (= H_k ∇f_k)

This costs $O(mn)$ per iteration instead of $O(n^2)$, with $m$ typically 5–20. L-BFGS convergence is linear (not superlinear) in general, but in practice it converges in very few iterations compared to gradient descent on the same problem.
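A minimal NumPy transcription of the recursion above, assuming the last m pairs are kept in plain Python lists, oldest first:

import numpy as np

def two_loop_recursion(grad_k, s_list, y_list):
    """Return H_k @ grad_k computed from the m most recent (s, y) pairs."""
    rho = [1.0 / (y @ s) for s, y in zip(s_list, y_list)]
    alpha = [0.0] * len(s_list)
    q = np.array(grad_k, dtype=float)
    for i in reversed(range(len(s_list))):            # newest pair first
        alpha[i] = rho[i] * (s_list[i] @ q)
        q -= alpha[i] * y_list[i]
    # Initial scaling H_0 = gamma * I, with gamma taken from the newest pair
    gamma = (s_list[-1] @ y_list[-1]) / (y_list[-1] @ y_list[-1])
    r = gamma * q
    for i in range(len(s_list)):                      # oldest pair first
        beta = rho[i] * (y_list[i] @ r)
        r += (alpha[i] - beta) * s_list[i]
    return r                                          # the search direction is -r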

L-BFGS-B extends L-BFGS to simple bound constraints $\ell \leq x \leq u$: a gradient-projection step identifies the active bounds, and a limited-memory quasi-Newton step is taken over the remaining free variables. This makes it the workhorse for box-constrained problems in scientific computing.

Practical Comparison

Method              Cost/iter   Memory     Convergence     Use when
Gradient descent    $O(n)$      $O(n)$     Linear          $n$ huge, non-smooth
L-BFGS              $O(mn)$     $O(mn)$    Linear (fast)   $n \leq 10^7$, smooth
BFGS                $O(n^2)$    $O(n^2)$   Superlinear     $n \leq 10^4$, smooth
Newton              $O(n^3)$    $O(n^2)$   Quadratic       $n \leq 10^3$, exact Hessian

Examples

Training logistic regression and CRFs. The negative log-likelihood of a logistic regression or conditional random field is smooth and convex (strongly convex once an L2 penalty is added). L-BFGS typically converges in 20–100 iterations, largely independent of the problem dimension. sklearn’s LogisticRegression(solver='lbfgs') and the CRF++ toolkit both use L-BFGS as their default optimizer.
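For instance, a small sketch using sklearn's bundled breast-cancer dataset (the dataset and max_iter value are illustrative):

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
# solver='lbfgs' is the default; max_iter caps the number of L-BFGS iterations
clf = make_pipeline(StandardScaler(),
                    LogisticRegression(solver='lbfgs', max_iter=200))
clf.fit(X, y)
print(clf.score(X, y))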

scipy.optimize.minimize. scipy.optimize.minimize(f, x0, method='L-BFGS-B', jac=grad_f) provides a production-quality L-BFGS-B implementation. It handles bound constraints, uses a Wolfe line search, and reports convergence diagnostics. The same interface gives access to BFGS, Newton-CG, and trust-region methods for comparison.
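For example, minimising the multidimensional Rosenbrock function under box constraints; the bounds and options shown are arbitrary illustrative choices:

import numpy as np
from scipy.optimize import minimize, rosen, rosen_der

x0 = np.array([-1.2, 1.0, -0.5, 2.0])
res = minimize(rosen, x0, jac=rosen_der, method='L-BFGS-B',
               bounds=[(0.0, 1.5)] * len(x0),          # box constraint on every coordinate
               options={'maxcor': 10, 'ftol': 1e-12, 'gtol': 1e-8})
print(res.x, res.nit, res.message)                     # solution ~ (1, 1, 1, 1)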

Hyperparameter optimisation. When optimising smooth acquisition functions in Bayesian optimisation (e.g., Expected Improvement), L-BFGS with multiple random restarts is the standard inner optimizer - fast convergence matters because this optimization runs inside an outer loop.
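A schematic of that multi-start pattern; acquisition is a stand-in name for the real acquisition function, and the unit-box domain and restart count are assumptions:

import numpy as np
from scipy.optimize import minimize

def maximize_acquisition(acquisition, bounds, n_restarts=20, seed=0):
    """Multi-start L-BFGS-B: maximise by minimising the negated acquisition."""
    rng = np.random.default_rng(seed)
    lo, hi = bounds[:, 0], bounds[:, 1]
    best_x, best_val = None, np.inf
    for _ in range(n_restarts):
        x0 = rng.uniform(lo, hi)                       # random restart point
        res = minimize(lambda x: -acquisition(x), x0,
                       method='L-BFGS-B', bounds=list(map(tuple, bounds)))
        if res.fun < best_val:
            best_x, best_val = res.x, res.fun
    return best_x

# Toy usage with a smooth stand-in "acquisition"; real Expected Improvement would go here
bounds = np.array([[0.0, 1.0], [0.0, 1.0]])
acq = lambda x: np.exp(-10 * np.sum((x - 0.3) ** 2))
print(maximize_acquisition(acq, bounds))               # ~ (0.3, 0.3)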


Read Next: Modern Optimizers