Numerical Methods - When Exact Answers Cost Too Much to Compute
Helpful context:
- Gradients & Partial Derivatives - Slopes in Every Direction at Once
- Hessians & Second-Order Methods - Curvature, Not Just Slope
- Integration - Summing Infinitely Many Infinitely Small Pieces
You write code to solve a linear system $Ax = b$. You run it. The answer is wildly wrong. You check your math: it is correct. You check your code: it looks right. What happened?
Here is a specific case. Let $A$ be the $2 \times 2$ matrix:
$$A = \begin{pmatrix} 1.000 & 0.999 \\ 0.999 & 0.998 \end{pmatrix}, \quad b = \begin{pmatrix} 1.999 \\ 1.997 \end{pmatrix}$$
The exact solution is $x = (1, 1)^T$. Now perturb $b$ very slightly:
$$b' = \begin{pmatrix} 2.000 \\ 1.997 \end{pmatrix}$$
Solve again. The solution becomes $x' = (1001, -999)^T$. A perturbation of $0.001$ in $b$ caused a change of $1000$ in $x$.
Your code is correct. Your math is correct. The problem is the matrix itself. And no algorithm - regardless of how clever - can produce a reliable answer, because the relationship between $b$ and $x$ is inherently unstable.
Understanding when this happens, why it happens, and what to do about it: that is numerical methods and stability.
Section 1: Floating-Point Arithmetic
Before we can discuss what goes wrong, we need to understand the medium in which computation happens.
Computers do not compute with real numbers. They compute with a finite set of numbers that approximate the reals, called floating-point numbers. The IEEE 754 standard (used by virtually all modern hardware) represents each number in the form:
$$\pm 1.b_1 b_2 b_3 \cdots b_{52} \times 2^e$$
where $b_1, \ldots, b_{52}$ are bits (0 or 1) and $e$ is an exponent stored in 11 bits. The number $52$ is fixed. So the number of significant binary digits is always 52 (for double precision). That is a finite amount of information.
Two immediate consequences:
Consequence 1: Most real numbers cannot be represented exactly. The decimal $0.1$ cannot be represented exactly in binary floating point (just as $1/3$ cannot be written as a terminating decimal). When you write x = 0.1 in any programming language, the variable $x$ holds a number very close to $0.1$ but not equal to it.
Consequence 2: Arithmetic operations introduce rounding error. When you compute $a + b$, the true result may not be exactly representable, so the computer returns the nearest representable number. This introduces a tiny error at each operation.
The machine epsilon $\varepsilon_{\text{mach}}$ formalizes how small “tiny” is:
$$\varepsilon_{\text{mach}} = 2^{-52} \approx 2.2 \times 10^{-16} \quad \text{(double precision)}$$
This is the smallest number such that the computer distinguishes $1 + \varepsilon_{\text{mach}}$ from $1$. Any number smaller than $\varepsilon_{\text{mach}}$, when added to 1, produces 1 in floating point.
Each individual rounding error is tiny - about $10^{-16}$. The danger is that some algorithms amplify these errors enormously.
Discomfort check. You might think: errors of $10^{-16}$ are so small that they cannot matter. This intuition fails. When you chain many operations, or when an algorithm’s structure causes errors to multiply rather than cancel, tiny errors can grow to dominate the result. Floating-point arithmetic is not merely “approximately correct” - in the worst case, it can give answers that are wrong by orders of magnitude. The field of numerical analysis exists precisely to identify when this happens.
Section 2: What Floating-Point Addition Actually Does
Before catastrophic cancellation, it is worth seeing one more concrete example of how floating-point arithmetic differs from real arithmetic.
In real arithmetic: $1 + 10^{-17} \neq 1$. The sum has a definite value.
In double-precision floating point: 1.0 + 1e-17 == 1.0. The result is exactly 1. The number $10^{-17}$ is smaller than $\varepsilon_{\text{mach}} \approx 2.2 \times 10^{-16}$, so when you add it to 1, the result rounds back to 1.
This means that a sequence of operations can lose contributions entirely. Consider computing $\sum_{i=1}^{10^{20}} 10^{-20}$ by adding $10^{-20}$ repeatedly to a running sum initialized at 0. The true answer is 1. But each addition of $10^{-20}$ to a running sum that has already grown past 1 contributes nothing - $10^{-20}$ is below the representable precision at that scale.
The correct approach is pairwise summation or Kahan summation, which compensate for the lost low-order bits by tracking a “compensation” term separately. This is not exotic - it is the method used in well-implemented numerical libraries for summation.
Section 3: Catastrophic Cancellation
The most common cause of large numerical errors has a name: catastrophic cancellation. It happens when you subtract two nearly equal quantities.
Here is why it destroys precision. Suppose you compute $a - b$ where $a = 1234.5678$ and $b = 1234.5677$. Both numbers have 9 significant digits. The difference is $0.0001$ - but that has only 1 significant digit. By subtracting, you destroyed 8 significant digits.
The relative error in the result can be enormous. If $a$ and $b$ each have a relative error of $\varepsilon_{\text{mach}} \approx 10^{-16}$, the absolute error in each is about $1234 \times 10^{-16} \approx 10^{-13}$. The absolute error in $a - b$ is at most $2 \times 10^{-13}$. But $a - b = 0.0001$. The relative error in $a - b$ is:
$$\frac{2 \times 10^{-13}}{0.0001} = 2 \times 10^{-9}$$
Starting from 16 digits of precision, we ended with 9. We lost 7 digits. If $a$ and $b$ agreed to more decimal places, we could lose all of them.
A concrete example. Consider the function $f(x) = \sqrt{x + 1} - \sqrt{x}$ for large $x$.
For $x = 10^8$: $\sqrt{10^8 + 1} \approx 10000.00005$ and $\sqrt{10^8} = 10000.00000$. Both have 14 digits before the interesting part. The difference $0.00005$ has only about 4 significant digits left, even though the true answer has many more digits of precision.
The fix is algebraic. Multiply and divide by the conjugate:
$$f(x) = \sqrt{x+1} - \sqrt{x} = \frac{(\sqrt{x+1} - \sqrt{x})(\sqrt{x+1} + \sqrt{x})}{\sqrt{x+1} + \sqrt{x}} = \frac{1}{\sqrt{x+1} + \sqrt{x}}$$
The rewritten formula adds two large numbers rather than subtracting them - no cancellation occurs. For $x = 10^8$, this gives $1/20000 = 0.00005$ with full precision.
The algebraic identity is not just a mathematical curiosity. It is a numerical cure. Rewriting formulas to avoid cancellation is one of the core skills of numerical computing.
Section 4: Condition Numbers
Return to the opening example. Why does a tiny perturbation in $b$ produce a huge change in the solution to $Ax = b$?
The answer is the condition number of $A$. This is a property of the problem, not of the algorithm.
Definition. The condition number of a matrix $A$ is:
$$\kappa(A) = |A| \cdot |A^{-1}|$$
where $|\cdot|$ is the matrix norm induced by the vector norm (the largest factor by which $A$ can stretch a vector). For symmetric matrices, $\kappa(A) = |\lambda_{\max}| / |\lambda_{\min}|$ - the ratio of the largest to the smallest eigenvalue in absolute value. For general matrices, it equals $\sigma_{\max} / \sigma_{\min}$, the ratio of the largest to smallest singular value.
What the condition number measures. If you perturb $b$ by a relative amount $\varepsilon$ (so $|b' - b| / |b| = \varepsilon$), the relative change in the solution satisfies:
$$\frac{|x' - x|}{|x|} \leq \kappa(A) \cdot \varepsilon$$
The condition number amplifies the relative error. If $\kappa(A) = 10^6$ and your input has 16 significant digits, your output has at most 10 significant digits. If $\kappa(A) = 10^{16}$, you have zero reliable digits - the solution is garbage, regardless of which algorithm you use.
For the matrix in the opening example, $\kappa(A) \approx 4 \times 10^6$. A perturbation of $10^{-3}$ in $b$ can produce a change of $10^{-3} \times 4 \times 10^6 = 4000$ in $x$. This is what we observed.
Discomfort check. The condition number is a property of the problem, not of your algorithm. If a problem is ill-conditioned (large $\kappa(A)$), no numerical algorithm can reliably solve it without additional information or reformulation. You cannot “fix” an ill-conditioned problem by switching from one solver to another. The only cures are: getting better data (reducing errors in $b$), reformulating the problem (changing $A$), or using more precision (which helps only if $\kappa(A) \cdot \varepsilon_{\text{mach}}$ is still small).
The condition number of a scalar. For a scalar problem $ax = b$, the solution is $x = b/a$ and the condition number is $|a| \cdot |a^{-1}| = 1$. Scalar multiplication is perfectly conditioned. The ill-conditioning in linear systems comes from the interaction between the rows of $A$.
Section 5: Forward and Backward Error Analysis
Before discussing what makes an algorithm stable or unstable, we need a framework for measuring what “wrong” means.
Forward error. Given an algorithm that computes an approximation $\hat{x}$ to the true answer $x^$, the forward error is simply the difference $|\hat{x} - x^|$. This is what you care about directly: how far off is the answer?
The problem with forward error analysis is that bounding it requires knowing the exact answer $x^*$, which defeats the purpose.
Backward error. John Wilkinson (1919-1986) proposed a more powerful approach. Instead of asking “how far is $\hat{x}$ from the true answer?”, ask: “for what input would $\hat{x}$ be the exact answer?”
More precisely: you want to compute $f(x)$ and your algorithm produces $\hat{y}$. The backward error is the smallest perturbation $\delta x$ such that $\hat{y} = f(x + \delta x)$ exactly. If the backward error $|\delta x| / |x|$ is small - say, of order $\varepsilon_{\text{mach}}$ - then the algorithm is producing the exact answer to a very slightly different problem. This is backward stability.
An algorithm is backward stable if the computed answer is the exact answer to a problem that is close to the original problem. “Close” means “within machine precision.”
Why backward stability is the right standard. If an algorithm is backward stable and the problem is well-conditioned ($\kappa$ small), then the forward error is also small:
$$\text{forward error} \leq \kappa \times \text{backward error} \approx \kappa \varepsilon_{\text{mach}}$$
You cannot do better than this: you cannot distinguish inputs that differ by less than machine precision, so forward errors of order $\kappa \varepsilon_{\text{mach}}$ are the best any algorithm can achieve for a problem with condition number $\kappa$.
If an algorithm is not backward stable, it introduces additional errors beyond those forced by the condition number - it wastes accuracy.
Gaussian elimination with partial pivoting is backward stable. Gaussian elimination without pivoting is not (as Section 6 illustrates). Householder QR decomposition is backward stable without any pivoting. These are theorems, not empirical observations, proved by tracking how rounding errors accumulate through the algorithm.
Section 6: Stable vs. Unstable Algorithms
Here is a subtlety. Two algorithms for the same problem can have very different behavior, even if the problem is well-conditioned.
Consider Gaussian elimination - the standard method for solving $Ax = b$ by reducing $A$ to upper triangular form. The algorithm works by adding multiples of one equation to another to eliminate variables.
Gaussian elimination without pivoting. At each step, divide by the current diagonal entry (the pivot). If a pivot is small, the multiplier - the ratio used to eliminate the variable below - is large. Large multipliers amplify rounding errors in subsequent steps.
Example: solve the system with $A = \begin{pmatrix} 0.0001 & 1 \\ 1 & 1 \end{pmatrix}$, $b = \begin{pmatrix} 1 \\ 2 \end{pmatrix}$. The exact solution is approximately $x = (1.0001, 0.9999)$. Without pivoting, the first multiplier is $1 / 0.0001 = 10000$. This large multiplier can cause the accumulated errors to dominate the answer.
Gaussian elimination with partial pivoting. Before eliminating in column $j$, swap rows so that the largest entry in that column is the pivot. This keeps all multipliers at most 1 in absolute value, preventing error amplification.
The formal statement: Gaussian elimination with partial pivoting is backward stable. This means the computed solution $\hat{x}$ is the exact solution to a perturbed problem $(A + \delta A)\hat{x} = b$ where $|\delta A| / |A|$ is small (of order $\varepsilon_{\text{mach}}$ times a manageable factor). In practice, partial pivoting is stable for virtually all matrices that arise in applications.
The distinction between the condition number of a problem and the stability of an algorithm is crucial:
- Condition number: property of the problem. Determines the best accuracy any algorithm can achieve.
- Stability of algorithm: property of the method. Determines whether the algorithm achieves the best possible accuracy, or wastes it.
A well-conditioned problem solved by a stable algorithm: accurate answer. A well-conditioned problem solved by an unstable algorithm: poor answer. An ill-conditioned problem solved by any algorithm: inherently inaccurate answer.
Section 7: Numerical Differentiation
The derivative is defined as a limit:
$$f'(x) = \lim_{h \to 0} \frac{f(x+h) - f(x)}{h}$$
For a known function, you can compute this analytically. But what if you only have a black-box $f$ that you can evaluate at points? Then you can approximate the derivative numerically:
$$f'(x) \approx \frac{f(x+h) - f(x)}{h}$$
for a small (but not too small) $h$. There are two errors in tension here, and understanding them explains a fundamental limitation of numerical differentiation.
Truncation error. The formula $[f(x+h) - f(x)] / h$ is not exactly $f'(x)$. By Taylor’s theorem:
$$f(x+h) = f(x) + f'(x)h + \frac{f''(x)}{2}h^2 + \cdots$$
Rearranging:
$$\frac{f(x+h) - f(x)}{h} = f'(x) + \frac{f''(x)}{2}h + O(h^2)$$
The truncation error is $O(h)$: it shrinks linearly as $h \to 0$. Smaller $h$ means less truncation error. This suggests making $h$ very small.
Rounding error. The computation $f(x+h) - f(x)$ is the subtraction of two nearly equal numbers when $h$ is small. Catastrophic cancellation. The absolute error from rounding is about $\varepsilon_{\text{mach}} \cdot f(x)$. Dividing by $h$:
$$\text{rounding error in } \frac{f(x+h) - f(x)}{h} \approx \frac{\varepsilon_{\text{mach}} \cdot |f(x)|}{h}$$
This grows as $h \to 0$. Smaller $h$ means more rounding error.
The tradeoff. Total error $\approx \frac{f''(x)}{2} h + \frac{\varepsilon_{\text{mach}} |f(x)|}{h}$. Minimizing over $h$:
$$h^* \approx \sqrt{\varepsilon_{\text{mach}}} \approx 10^{-8} \quad \text{(double precision)}$$
At this optimal $h$, the error is about $\sqrt{\varepsilon_{\text{mach}}} \approx 10^{-8}$. Not $10^{-16}$ - only $10^{-8}$. Numerical differentiation costs you half your significant digits, no matter what you do.
The centered difference formula $[f(x+h) - f(x-h)] / (2h)$ has truncation error $O(h^2)$ instead of $O(h)$, which shifts the optimal $h$ to $h^* \approx \varepsilon_{\text{mach}}^{1/3} \approx 10^{-5}$ and gives error $\approx 10^{-11}$. Better, but still not exact.
Why automatic differentiation is superior. Automatic differentiation (AD) - the technology underlying backpropagation in deep learning - computes derivatives exactly, to machine precision, without this tradeoff. It does not use finite differences. Instead, it applies the chain rule symbolically to the actual computation, tracking how perturbations propagate. The cost is comparable to evaluating $f$ once (for reverse mode AD). No approximation is made. No half-digits are lost.
This is why numerical differentiation is essentially never used for training neural networks. AD gives exact gradients with no tuning required.
Section 8: Why Automatic Differentiation Beats Numerical Differentiation
The two-error tradeoff from Section 6 is unavoidable for any method that approximates the derivative using function evaluations. Automatic differentiation sidesteps it entirely by not approximating at all.
The idea is simple. Every arithmetic operation in a computer program is a differentiable function of its inputs. Adding two numbers, multiplying them, applying a function like $\sin$ or $\exp$ - each has a known derivative. Automatic differentiation (AD) builds a system that evaluates the derivative of the entire computation by applying the chain rule to the sequence of primitive operations.
Dual numbers. Forward mode AD is most cleanly understood through dual numbers. Define a “dual number” as a pair $(a, b)$ representing $a + b\epsilon$ where $\epsilon^2 = 0$ (but $\epsilon \neq 0$). Arithmetic on dual numbers is defined by:
$$(a + b\epsilon) + (c + d\epsilon) = (a + c) + (b + d)\epsilon$$ $$(a + b\epsilon) \cdot (c + d\epsilon) = ac + (ad + bc)\epsilon$$
(Note: $b\epsilon \cdot d\epsilon = bd\epsilon^2 = 0$.)
Now evaluate a function $f$ with a dual number input $x + 1 \cdot \epsilon$ (the primal part is $x$, the dual part is 1). By the rules of dual arithmetic:
$$f(x + \epsilon) = f(x) + f'(x)\epsilon$$
The primal part of the output is $f(x)$. The dual part is $f'(x)$. One evaluation of $f$ with a dual number input gives both the function value and its derivative simultaneously.
This is not an approximation. The dual part tracks the exact derivative via the chain rule at every step. The machine epsilon $\varepsilon_{\text{mach}}$ appears only in the finite-precision representation of $f(x)$ and $f'(x)$ - not in the differentiation process itself.
Reverse mode AD (backpropagation) is the same idea applied in the reverse direction, and is more efficient for functions $f: \mathbb{R}^N \to \mathbb{R}$ with many inputs and one output. The savings are exactly those described in the backpropagation post: one reverse pass gives all $N$ partial derivatives, versus $N$ forward passes.
Section 9: Numerical Integration
The derivative and the integral are inverses, by the fundamental theorem of calculus. But computing integrals numerically has its own challenges.
The problem: compute $\int_a^b f(x) dx$ for a function $f$ that may have no antiderivative in closed form (or whose antiderivative is inconvenient). The approach: approximate the integral as a weighted sum of function values.
The derivative and the integral are inverses, by the fundamental theorem of calculus. But computing integrals numerically has its own challenges.
Trapezoidal rule. Divide $[a, b]$ into $n$ equal subintervals of width $h = (b-a)/n$. On each subinterval, approximate $f$ by a straight line connecting its endpoints, and integrate the line (a trapezoid):
$$\int_a^b f(x) dx \approx h \left[ \frac{f(x_0) + f(x_n)}{2} + f(x_1) + f(x_2) + \cdots + f(x_{n-1}) \right]$$
The error is $O(h^2)$: halving the step size reduces the error by a factor of 4. This is second-order accuracy.
Simpson’s rule. Use quadratic rather than linear interpolation on each pair of subintervals. This matches not just the function values at the endpoints but also the curvature:
$$\int_a^b f(x) dx \approx \frac{h}{3} \left[ f(x_0) + 4f(x_1) + 2f(x_2) + 4f(x_3) + \cdots + 4f(x_{n-1}) + f(x_n) \right]$$
The error is $O(h^4)$: two orders better than the trapezoidal rule for the same number of function evaluations. Simpson’s rule is exact for polynomials of degree up to 3.
Gaussian quadrature. Rather than using equally spaced points, choose the points themselves (nodes) and the weights to maximize accuracy. With $n$ carefully chosen nodes $x_1, \ldots, x_n$ and weights $w_1, \ldots, w_n$:
$$\int_{-1}^1 f(x) dx \approx \sum_{k=1}^n w_k f(x_k)$$
The nodes and weights are chosen so that the formula is exact for all polynomials of degree up to $2n - 1$. This is optimal: no $n$-point formula can do better.
For smooth functions, Gaussian quadrature converges exponentially as $n$ increases - it achieves far more accuracy per function evaluation than any fixed-step method. The tradeoff is that the nodes are not equally spaced, which can complicate implementation.
Section 10: Solving Large Linear Systems Iteratively
For small systems, Gaussian elimination (LU decomposition) works well. Its cost is $O(n^3)$ operations for an $n \times n$ system. But for $n = 10^6$ (common in scientific computing and certain ML applications), $O(n^3) = 10^{18}$ operations is impossible.
Large systems arising in practice are usually sparse: most entries of $A$ are zero. A $10^6 \times 10^6$ matrix with only $10^7$ nonzero entries (10 per row) stores cheaply and has structure that iterative methods can exploit.
Iterative methods start with a guess $x^{(0)}$ and refine it:
$$x^{(k+1)} = T x^{(k)} + c$$
for some matrix $T$ and vector $c$ that depend on the method and the matrix $A$. They converge when the spectral radius $\rho(T) < 1$ (all eigenvalues of $T$ inside the unit disk).
The conjugate gradient method (CG). For symmetric positive definite $A$ (eigenvalues all positive), CG is optimal among iterative methods. It minimizes $\frac{1}{2} x^T A x - b^T x$ over successively larger subspaces (Krylov subspaces).
In theory, CG converges in at most $n$ steps to the exact solution (ignoring rounding). In practice, it converges much faster when $A$ has favorable eigenvalue structure.
The convergence rate depends on the condition number. After $k$ iterations:
$$\frac{|x^{(k)} - x|_A}{|x^{(0)} - x|_A} \leq 2 \left( \frac{\sqrt{\kappa(A)} - 1}{\sqrt{\kappa(A)} + 1} \right)^k$$
For $\kappa(A) = 100$, the factor is $(9/11)^k \approx 0.82^k$: you need about $\log(1/\varepsilon) / \log(1/0.82) \approx 5 \log(1/\varepsilon)$ iterations for precision $\varepsilon$. For $\kappa(A) = 10000$, the factor is $(99/101)^k \approx 0.98^k$: five times more iterations.
Condition number strikes again. The same problem that causes Gaussian elimination to give a wrong answer causes CG to converge slowly.
Preconditioning. The cure is to transform the problem. Instead of solving $Ax = b$, solve $P^{-1}Ax = P^{-1}b$ where $P$ is a matrix chosen to approximate $A$ but be easy to invert. If $P \approx A$, then $P^{-1}A \approx I$, which has condition number 1. Good preconditioners can reduce $\kappa(P^{-1}A)$ dramatically, accelerating CG by orders of magnitude.
Section 11: Why Condition Number Governs Convergence Rate
The condition number appears in three places in this post: sensitivity of linear systems, convergence of iterative solvers, and convergence of gradient descent. This is not a coincidence. It is the same phenomenon.
Consider the quadratic $f(x) = \frac{1}{2} x^T A x$ where $A$ is symmetric positive definite with eigenvalues $\lambda_1 \leq \lambda_2 \leq \cdots \leq \lambda_n$. The level sets of $f$ (the sets where $f(x) = c$ for constant $c$) are ellipsoids. The axis lengths are proportional to $1/\sqrt{\lambda_i}$.
If all eigenvalues are equal ($\lambda_1 = \lambda_2 = \cdots = \lambda_n$), the ellipsoids are spheres. Gradient descent from any starting point moves directly toward the minimum in one step. Condition number $\kappa = 1$.
If eigenvalues differ widely ($\lambda_n / \lambda_1 \gg 1$), the ellipsoids are elongated - very wide in the directions of small eigenvalues and very narrow in the directions of large eigenvalues. The gradient points toward the nearest part of the ellipsoid, not toward the center. Gradient descent zigzags: it makes progress quickly in the narrow directions but oscillates back and forth in the wide directions. Many steps are needed. Condition number $\kappa = \lambda_n / \lambda_1 \gg 1$.
This is the geometric content of the convergence bound from Section 8. The condition number measures how elongated the loss landscape is. An elongated landscape means zigzagging gradient descent, which means slow convergence.
The same geometry explains why CG converges slowly for ill-conditioned systems. CG is optimal in the sense that it makes the best possible progress at each step within the Krylov subspace. But when eigenvalues are widely spread, the space the algorithm must explore is large, and progress in the direction corresponding to the smallest eigenvalue (the widest axis of the ellipsoid) is slow.
Preconditioning solves this by transforming the problem $Ax = b$ into $P^{-1}Ax = P^{-1}b$ where $P^{-1}A$ has a much smaller condition number. Geometrically, the preconditioner reshapes the elongated ellipsoid into something closer to a sphere, so gradient descent (or CG) on the transformed problem converges quickly.
Section 12: Stability of Gradient Descent
In optimization (and training neural networks), you run gradient descent:
$$x \leftarrow x - \alpha \nabla f(x)$$
Is this stable? Does it converge? The answer depends on the learning rate $\alpha$ and the curvature of $f$.
Consider minimizing a convex quadratic $f(x) = \frac{1}{2} x^T A x - b^T x$ where $A$ is symmetric positive definite. The gradient is $\nabla f(x) = Ax - b$. The update is:
$$x^{(k+1)} = x^{(k)} - \alpha (Ax^{(k)} - b)$$
Let $e^{(k)} = x^{(k)} - x^$ be the error (where $x^ = A^{-1}b$ is the minimizer). Then:
$$e^{(k+1)} = (I - \alpha A) e^{(k)}$$
The error at step $k$ is $(I - \alpha A)^k e^{(0)}$. For this to go to zero, all eigenvalues of $I - \alpha A$ must be inside the unit disk. If $\lambda_1 \leq \lambda_2 \leq \cdots \leq \lambda_n$ are the eigenvalues of $A$, the eigenvalues of $I - \alpha A$ are $1 - \alpha \lambda_i$. We need:
$$|1 - \alpha \lambda_i| < 1 \quad \text{for all } i$$
For positive $\lambda_i$, this requires $0 < \alpha \lambda_i < 2$, or equivalently $\alpha < 2/\lambda_{\max}$.
If $\alpha < 2/\lambda_{\max}$: gradient descent converges. The convergence rate per step is governed by $\max_i |1 - \alpha \lambda_i|$, which is minimized at $\alpha^* = 2/(\lambda_{\min} + \lambda_{\max})$. At this optimal learning rate, the convergence factor is:
$$\frac{\kappa(A) - 1}{\kappa(A) + 1}$$
The larger $\kappa(A)$, the slower the convergence. For $\kappa(A) = 100$: factor is $99/101 \approx 0.98$, slow. For $\kappa(A) = 10$: factor is $9/11 \approx 0.82$, faster.
If $\alpha > 2/\lambda_{\max}$: the eigenvalue $1 - \alpha \lambda_{\max}$ has absolute value greater than 1. The error oscillates and grows. Gradient descent diverges.
Discomfort check. Why does too-large a learning rate cause oscillation and divergence rather than just slow convergence? The update $x \leftarrow x - \alpha \nabla f(x)$ uses the local gradient to predict where the minimum is. For a quadratic, this prediction is exact at $\alpha = 1/\lambda_{\max}$. At $\alpha > 1/\lambda_{\max}$, the update overshoots: you move past the minimum and end up on the other side. If $\alpha$ is large enough, each overshoot is larger than the last - the iterates oscillate with growing amplitude. This is not a failure of the gradient calculation. It is a failure of using the linear (gradient) approximation beyond its domain of validity.
Section 13: Stiff Equations and Implicit Methods
Here is a situation that arises in chemistry, biology, and control systems. You have a differential equation $y' = f(y)$ where the solution changes at two very different rates: a “fast” mode that decays quickly and a “slow” mode that evolves over a long timescale.
For example: a chemical reaction with one reactant that converts in milliseconds and another that converts in hours. If you use an explicit method (forward Euler, Runge-Kutta 4), you must take a time step small enough to capture the fast dynamics - even if you only care about the slow dynamics. The stability condition forces $h < 2/|\lambda_{\text{fast}}|$, where $\lambda_{\text{fast}}$ is the fast eigenvalue. This can require millions of tiny steps to advance through an hour of slow time.
A system with this property - many active timescales - is called stiff. The word captures the computational cost: you are “stiff” against taking large steps.
Implicit methods solve this by treating the fast dynamics implicitly. The backward Euler method:
$$y_{n+1} = y_n + h f(y_{n+1})$$
This is an equation for $y_{n+1}$, not an explicit formula. For a linear equation $y' = Ay$, it becomes $(I - hA)y_{n+1} = y_n$, which requires solving a linear system at each step. The payoff: backward Euler is unconditionally stable - it produces bounded results for any step size $h$, no matter how large $|\lambda_{\text{fast}}|$ is. The accuracy degrades with large $h$, but the solution does not blow up.
The connection to condition numbers: the matrix $(I - hA)$ that must be solved at each step of an implicit method is often well-conditioned when $A$ has eigenvalues of similar magnitude. When the eigenvalues span many orders of magnitude (the stiff case), the matrix can be ill-conditioned, and the linear solve at each step requires care.
Section 14: The Condition Number and Its Computation
How do you compute the condition number of a matrix in practice? The definition $\kappa(A) = |A| \cdot |A^{-1}|$ requires computing $A^{-1}$, which is exactly as expensive as solving the linear system and potentially numerically unstable itself.
The practical answer: the condition number is $\sigma_{\max} / \sigma_{\min}$ where $\sigma_{\max}$ and $\sigma_{\min}$ are the largest and smallest singular values. Singular value decomposition (SVD) computes all singular values and is backward stable. For an $n \times n$ matrix, SVD costs $O(n^3)$ but provides the condition number as a byproduct.
For very large matrices, computing the exact condition number is too expensive. Instead, estimate it: the power iteration method approximates the largest singular value $\sigma_{\max}$ by repeatedly multiplying by $A^T A$ and normalizing. The smallest singular value is harder - it requires inverse iteration, which involves solving a linear system with $A^T A$ at each step.
Numerical libraries (LAPACK, SciPy) provide condition number estimators that are less expensive than full SVD. They produce lower bounds on the condition number, which are sufficient for deciding whether a problem is ill-conditioned.
In practice, a matrix with $\kappa(A) > 10^{12}$ is effectively singular in double precision (since $\kappa(A) \cdot \varepsilon_{\text{mach}} > 1$). The linear system has no reliable solution. The appropriate response is to regularize the problem: add a small multiple of the identity, $A \leftarrow A + \lambda I$, to improve the condition number at the cost of introducing a regularization bias. This is Tikhonov regularization (or ridge regression in the ML context), and it is a direct numerical response to ill-conditioning.
Section 15: The Unified View
Conditioning, stability, and convergence look like different topics. They are the same phenomenon from three angles.
Condition number measures how sensitive a problem’s answer is to perturbations in the input. A large condition number means the problem is inherently hard - small input errors become large output errors.
Stability of an algorithm measures whether the algorithm achieves the best accuracy the problem allows, or whether it introduces additional errors through poor numerical practice. An unstable algorithm can destroy accuracy even when the problem is well-conditioned. A stable algorithm can do nothing better than the condition number permits.
Convergence rate of an iterative method measures how quickly the algorithm reaches the answer. For both gradient descent and the conjugate gradient method, the convergence rate deteriorates with the condition number - for exactly the same reason as before. Poor conditioning means the problem has directions that are simultaneously very sensitive and very insensitive, and any iterative method must navigate between them.
The unified response to all three is preconditioning: transform the problem into one with a smaller condition number. For linear systems, this means finding a matrix $P \approx A$ that is easy to invert and results in $P^{-1}A$ having condition number close to 1. For gradient descent in ML, this is the motivation for adaptive optimizers (Adam, RMSProp): they use estimates of curvature to effectively scale the gradient differently in each direction, reducing the effective condition number of the optimization problem.
When your computation gives a wrong answer, the first diagnostic question is: what is the condition number? If it is large, the problem is fundamentally hard. If it is small, look for an unstable step in your algorithm - likely a subtraction of nearly equal quantities. The language of condition numbers and stability gives you precise tools for the diagnosis.
A diagnostic procedure. When a numerical computation behaves unexpectedly:
- Check whether the inputs contain nearly equal quantities that are subtracted. If so, look for an algebraically equivalent form that avoids the subtraction (as in the $\sqrt{x+1} - \sqrt{x}$ example).
- Compute or estimate the condition number of any matrix involved. If $\kappa > 1/\varepsilon_{\text{mach}}$, the problem is numerically singular and regularization is required.
- Identify the algorithm. Is it backward stable? Gaussian elimination without pivoting is not. Gram-Schmidt orthogonalization without reorthogonalization is not. Householder QR and LU with partial pivoting are.
- Check whether the computation involves extreme values (large exponents, near-zero denominators). Apply the max-subtraction trick for log-sum-exp, clamp denominators away from zero, or rearrange the formula.
These four checks cover the majority of numerical instability issues encountered in practice.
Section 16: Log-Sum-Exp: the Canonical Numerical Fix
There is a numerical computation that appears so frequently in ML and statistics that it has its own name: the log-sum-exp. Understanding its stable implementation illustrates all the ideas of this post in one example.
The computation is:
$$\text{LSE}(z_1, \ldots, z_k) = \log \sum_{j=1}^k e^{z_j}$$
This appears in the log-partition function of exponential families, in the log-likelihood of a softmax model (cross-entropy loss), in computing the logsumexp of log-probabilities for numerical stability in probabilistic inference, and elsewhere.
The naive computation $\log(\sum_j e^{z_j})$ overflows if any $z_j$ is large (say, greater than 709 in double precision). It underflows to $-\infty$ if all $z_j$ are very negative (the exponentials go to zero before they can be summed).
The stable computation uses the algebraic identity: for any constant $c$,
$$\log \sum_j e^{z_j} = c + \log \sum_j e^{z_j - c}$$
Choosing $c = \max_j z_j$: the largest exponent becomes $e^0 = 1$, all others are at most 1. The sum is at least 1 and at most $k$. No overflow, no underflow.
This is catastrophic cancellation in reverse: instead of subtracting two large quantities, we shift all quantities down before they become large. The fix costs one extra pass over the vector to compute the maximum.
The log-sum-exp function has nice properties beyond numerical stability. It is convex (the log of a sum of convex functions is convex). Its gradient is the softmax vector: $\partial \text{LSE} / \partial z_j = e^{z_j} / \sum_l e^{z_l} = \hat{p}_j$. So if you need both the log-sum-exp and the softmax probabilities, you can compute them together with a single stable implementation.
Section 17: Worked Example - the Softmax in Practice
Here is a case where these ideas appear directly in ML code.
The softmax function maps a vector $z \in \mathbb{R}^k$ to a probability distribution:
$$\text{softmax}(z)i = \frac{e^{z_i}}{\sum{j=1}^k e^{z_j}}$$
If any $z_i$ is large (say, $z_i = 1000$), then $e^{z_i} = e^{1000}$ is astronomically large - far beyond the range of double precision floating point, which saturates at about $e^{709}$. The result is inf, and the division inf / inf produces nan. The computation fails entirely.
The fix is algebraic, exactly like the $\sqrt{x+1} - \sqrt{x}$ fix from Section 3. Subtract the maximum entry first:
$$\text{softmax}(z)_i = \frac{e^{z_i - c}}{\sum_j e^{z_j - c}}, \quad c = \max_j z_j$$
This is mathematically identical to the original - subtracting $c$ from every exponent multiplies numerator and denominator by $e^{-c}$, which cancels. But now the largest entry in the exponent is $z_i - c \leq 0$, so $e^{z_i - c} \leq 1$. Overflow is impossible. And the denominator has at least one term equal to $e^0 = 1$, so there is no underflow to zero either.
The numerically stable implementation is:
- Compute $c = \max_j z_j$.
- Compute $e^{z_j - c}$ for each $j$.
- Sum them: $S = \sum_j e^{z_j - c}$.
- Return $e^{z_i - c} / S$ for each $i$.
Four lines instead of one. This is the practical meaning of numerical stability: write formulas in the mathematically equivalent form that avoids cancellation, overflow, and underflow.
Now combine with cross-entropy loss. In classification, you compute the cross-entropy loss $L = -\sum_i y_i \log \hat{p}_i$ where $\hat{p}_i = \text{softmax}(z)_i$. Naively: compute softmax, take log, multiply by labels, sum. But taking the log of a number that was computed as $e^{z_i - c} / S$ reintroduces cancellation:
$$\log \hat{p}_i = \log \frac{e^{z_i - c}}{S} = (z_i - c) - \log S$$
This is exact - no cancellation occurs. But $\log S = \log \sum_j e^{z_j - c}$ is itself a log-sum-exp computation. The numerically stable cross-entropy combines the two operations:
$$L = -\sum_i y_i [(z_i - c) - \log S]$$
No exponential is ever stored as a final result; only the shifted values $z_i - c$ and their log-sum-exp appear. This is why ML frameworks provide a combined log_softmax function and a combined cross_entropy_loss that fuses the softmax, log, and cross-entropy computation into a single numerically stable operation.
The lesson: stability is not just a property of individual operations. The overall computation chain must be analyzed as a whole, and algebraically equivalent reformulations that are more stable are not optional refinements - they are often the difference between a functioning implementation and one that produces nan on common inputs.
Section 18: A Summary Table
| Concept | Definition | Consequence of getting it wrong |
|---|---|---|
| Machine epsilon $\varepsilon_{\text{mach}}$ | Smallest $\varepsilon$ with $\text{fl}(1 + \varepsilon) > 1$ | Errors smaller than $\varepsilon_{\text{mach}}$ are invisible |
| Catastrophic cancellation | Subtracting nearly equal numbers | Significant digits destroyed; relative error explodes |
| Condition number $\kappa(A)$ | $|A| \cdot |A^{-1}|$ | Amplifies input errors by factor $\kappa$; ill-conditioned problems have no reliable solutions |
| Backward stability | Computed answer is exact for nearby input | Without it, errors exceed $\kappa \varepsilon_{\text{mach}}$ even for well-conditioned problems |
| Optimal $h$ for finite differences | $h^* \approx \sqrt{\varepsilon_{\text{mach}}}$ | Loss of half the significant digits; use AD instead |
| Convergence factor for CG | $(\sqrt{\kappa} - 1)/(\sqrt{\kappa} + 1)$ per iteration | Large $\kappa$ means slow convergence |
| Stability condition for gradient descent | $\alpha < 2/\lambda_{\max}$ | Violation causes oscillation and divergence |
Read Next: