BFGS - When First-Order Gradients Aren't Fast Enough // Megha Bose

Helpful context:

Gradient descent is like descending a mountain with your eyes closed - you feel the slope beneath your feet and step in the steepest downhill direction. Newton’s method opens your eyes: it sees not just the slope but the curvature - whether the valley ahead narrows sharply or widens gradually - and adjusts its step accordingly. In a quadratic bowl, Newton’s method reaches the bottom in exactly one step.

BFGS is gradient descent for people who want Newton’s intelligence but can’t afford to compute the Hessian. It approximates Newton’s method using only gradient information, building up a picture of the curvature incrementally over many steps. It is the workhorse optimizer behind almost all serious non-linear optimization in scientific computing - and L-BFGS, its memory-efficient variant, is what scipy.optimize, PyTorch’s LBFGS, and most large-scale numerical solvers use in practice.

Newton’s Method for Optimization

Newton’s method approximates $f$ near the current point $x_k$ with a quadratic:

$$f(x) \approx f(x_k) + \nabla f(x_k)^T (x - x_k) + \frac{1}{2}(x - x_k)^T \nabla^2 f(x_k)(x - x_k).$$

Minimizing this quadratic gives the Newton step:

$$x_{k+1} = x_k - \left[\nabla^2 f(x_k)\right]^{-1} \nabla f(x_k).$$

The step direction $d_k = -H^{-1} \nabla f(x_k)$ (where $H = \nabla^2 f$) rescales the gradient by the inverse Hessian, accounting for the curvature. In a long narrow valley - where gradient descent zigzags inefficiently - Newton’s method steps across the narrow direction, not along it.

Convergence. Newton’s method has quadratic convergence near the optimum: if $|x_k - x^| \leq \varepsilon$, then $|x_{k+1} - x^| \leq C \varepsilon^2$ for some constant $C$. The number of correct digits doubles at each iteration. A handful of Newton steps near the solution beats thousands of gradient descent steps.

The problem. Computing the Hessian requires $O(n^2)$ storage and $O(n^2)$ gradient evaluations. Inverting it (or solving the linear system $H d = -\nabla f$) costs $O(n^3)$. For $n = 10^4$ parameters, that’s already borderline. For $n = 10^6$, it’s completely infeasible.

Quasi-Newton and the Secant Condition

Quasi-Newton methods replace the exact Hessian with an approximation $B_k$ (or its inverse $H_k = B_k^{-1}$), updated cheaply at each step from gradient information alone.

Define the curvature pair at step $k$:

$$s_k = x_{k+1} - x_k, \qquad y_k = \nabla f(x_{k+1}) - \nabla f(x_k).$$

By the mean value theorem, $y_k \approx \nabla^2 f \cdot s_k$ - the gradient difference approximates the Hessian applied to the step direction. A good Hessian approximation should satisfy the secant condition:

$$B_{k+1} s_k = y_k.$$

This says: the new approximation should reproduce the observed gradient change exactly. There are many matrices satisfying this condition - BFGS picks the one that satisfies additional desirable properties.

The BFGS Update

BFGS finds the update $H_{k+1}$ (inverse Hessian approximation) that:

Satisfies the inverse secant condition: $H_{k+1} y_k = s_k$
Remains symmetric positive definite
Is closest to $H_k$ in a weighted Frobenius norm

The solution is the rank-2 update:

$$H_{k+1} = \left(I - \rho_k s_k y_k^T\right) H_k \left(I - \rho_k y_k s_k^T\right) + \rho_k s_k s_k^T,$$

where $\rho_k = \frac{1}{y_k^T s_k}$. This is a rank-2 update: the new matrix differs from the old by a rank-2 correction, computable in $O(n^2)$ time.

The search direction at step $k$ is $d_k = -H_k \nabla f(x_k)$, and we choose the step size $\alpha_k$ via a line search satisfying the Wolfe conditions (sufficient decrease and curvature condition). The Wolfe conditions ensure $y_k^T s_k > 0$, which guarantees $H_{k+1}$ remains positive definite.

Discomfort check. The BFGS update maintains positive definiteness via the Wolfe line search. But isn’t an approximation to $H^{-1}$ from rank-2 updates necessarily crude?

Surprisingly, no. BFGS accumulates curvature information across many steps. Each update incorporates information about the Hessian in the direction $s_k$ and $y_k$. After $n$ steps in an $n$-dimensional space, the approximation has absorbed information from $n$ independent directions and can represent the full Hessian accurately. For quadratic functions, BFGS converges in exactly $n$ steps (like conjugate gradient). For smooth nonlinear functions, BFGS achieves superlinear convergence near the optimum - faster than linear, slower than quadratic Newton.

L-BFGS: The Practical Algorithm

Full BFGS stores the $n \times n$ matrix $H_k$. For $n = 10^6$ parameters, this requires $4 \times 10^{12}$ bytes - completely infeasible. L-BFGS (Limited-memory BFGS) avoids storing $H_k$ explicitly.

Instead, L-BFGS stores only the last $m$ curvature pairs $\{(s_i, y_i)\}_{i=k-m}^{k-1}$, using $O(mn)$ memory with $m$ typically 5 - 20. It never forms $H_k$ explicitly - instead, it computes the product $H_k \nabla f(x_k)$ implicitly using a two-loop recursion:

q ← ∇f(x_k)
for i = k-1, ..., k-m:
    α_i ← ρ_i · (s_i^T q)
    q   ← q - α_i y_i
r ← H_0 q       (initial Hessian scaling: H_0 = (s^T y / y^T y) I)
for i = k-m, ..., k-1:
    β  ← ρ_i · (y_i^T r)
    r  ← r + (α_i - β) s_i
return r        (this is H_k ∇f(x_k))

Two passes through $m$ stored vectors, $O(mn)$ per iteration. This is what scipy.optimize.minimize with method='L-BFGS-B' runs, what PyTorch’s torch.optim.LBFGS runs, and what most large-scale scientific optimization uses.

The B in L-BFGS-B stands for bound-constrained: simple box constraints $\ell \leq x \leq u$ are handled via projected gradient steps. This covers a huge range of practical problems.

Discomfort check. Gradient descent works fine in deep learning. Why don’t people use L-BFGS more for neural networks?

Deep learning uses mini-batches: at each step, you compute the gradient on a random subset of the data, not the full dataset. The gradient you see is a noisy estimate of the true gradient. L-BFGS requires accurate, consistent gradient information to build a reliable Hessian approximation - the curvature pair $(s_k, y_k)$ should reflect the true function’s curvature, not the noise-corrupted mini-batch estimate. With stochastic gradients, the secant condition $B_{k+1} s_k = y_k$ incorporates noise, and the approximation degrades quickly.

L-BFGS shines for full-batch problems where you can compute the gradient exactly: fitting logistic regression or CRFs, physics simulation, image reconstruction, quantum chemistry, fine-tuning small models on fixed datasets. In these settings, L-BFGS typically converges in 20 - 100 iterations on problems that take gradient descent thousands of iterations to solve.

Convergence Summary

Near a strongly convex minimum with Lipschitz-continuous Hessian:

Gradient descent: linear convergence, rate $(1 - 1/\kappa)$ per step, where $\kappa = \lambda_{\max}/\lambda_{\min}$ is the condition number. Slow on ill-conditioned problems.
BFGS/L-BFGS: superlinear convergence (faster than any geometric rate). Effectively condition-number-independent near the optimum.
Newton: quadratic convergence. Error in digits doubles each iteration.

Farther from the optimum:

Newton can diverge if the Hessian is indefinite (non-convex regions).
BFGS with Wolfe line search is globally convergent for smooth functions.
Gradient descent is the most robust - converges from anywhere with appropriate step size.

Applications

Scientific computing. L-BFGS is the default optimizer in SciPy, MATLAB’s fminunc, and most physics simulation packages. Problems like fitting the parameters of a physical model to experimental data are typically smooth, convex or mildly non-convex, and small enough for full-batch gradients - exactly where L-BFGS excels.

Quantum chemistry. Geometry optimization of molecular structures (finding the arrangement of atoms minimizing total energy) is a standard L-BFGS application. The energy function is smooth and the gradients (atomic forces) can be computed exactly.

Image reconstruction. Denoising, super-resolution, and tomographic reconstruction often involve minimizing a smooth regularized loss over pixel values - a large but well-structured problem for L-BFGS-B.

Statistical model fitting. Logistic regression, CRFs, Gaussian process hyperparameter optimization - any smooth log-likelihood maximization problem with exact gradient computation. sklearn’s LogisticRegression(solver='lbfgs') uses L-BFGS as the default.

Summary

Method	Per-step cost	Memory	Convergence rate	Best for
Gradient descent	$O(n)$	$O(n)$	Linear ($1 - 1/\kappa$ per step)	Stochastic, very large $n$, non-smooth
L-BFGS	$O(mn)$	$O(mn)$	Superlinear (near optimum)	Full-batch, smooth, $n \leq 10^7$
BFGS	$O(n^2)$	$O(n^2)$	Superlinear	Full-batch, smooth, $n \leq 10^4$
Newton	$O(n^3)$	$O(n^2)$	Quadratic	Exact Hessian available, $n \leq 10^3$

BFGS is Newton’s method for the real world: you don’t have the Hessian, but you can build a good approximation from the gradients you’ve already computed, at a cost of $O(n^2)$ per step. L-BFGS extends this to large-scale problems by dropping the full matrix and keeping only the last $m$ curvature pairs. The result is an optimizer that converges dramatically faster than gradient descent on smooth problems, with only a modest overhead per step, and is robust enough to be the default in most scientific computing stacks.

Read Next: