Floating-Point Arithmetic

Real computations are performed not in $\mathbb{R}$ but in the finite set $\mathbb{F}$ of floating-point numbers. The IEEE 754 standard specifies how these numbers are represented and how operations on them behave.

A double-precision float occupies 64 bits: 1 sign bit, 11 exponent bits, and 52 mantissa bits. Every nonzero number is stored as $\pm 1.b_1 b_2 \cdots b_{52} \times 2^e$, with representable magnitudes from roughly $10^{-308}$ to $10^{308}$.

Machine epsilon $\varepsilon_{\text{mach}}$ is the gap between $1$ and the next larger floating-point number; equivalently, it is the smallest power of two $\varepsilon$ for which $\text{fl}(1 + \varepsilon) > 1$:

$$\varepsilon_{\text{mach}} = 2^{-52} \approx 2.2 \times 10^{-16} \quad (\text{double precision})$$

Every floating-point operation $\text{fl}(a \circ b)$ satisfies $\text{fl}(a \circ b) = (a \circ b)(1 + \delta)$ with $|\delta| \leq \varepsilon_{\text{mach}}$. Errors accumulate across chains of operations.
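
A minimal Python check of these constants and of the rounding model (assuming NumPy is available):

    import numpy as np

    eps = np.finfo(np.float64).eps        # 2**-52, about 2.22e-16
    print(eps == 2.0**-52)                # True

    print(1.0 + eps > 1.0)                # True: eps is the gap just above 1
    print(1.0 + eps / 2 == 1.0)           # True: the halfway case rounds back to 1 (ties to even)

    # Rounding model: each operation carries a relative error of at most ~eps
    print(abs((0.1 + 0.2) - 0.3))         # ~5.6e-17, a few multiples of eps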

Catastrophic Cancellation

When two nearly equal quantities are subtracted, significant digits cancel and relative error explodes. For example, $f(x) = \sqrt{x+1} - \sqrt{x}$ suffers cancellation for large $x$. The algebraically equivalent form $f(x) = 1/(\sqrt{x+1}+\sqrt{x})$ is numerically stable.

Cancellation example (x = 10^8):

  sqrt(10^8 + 1) = 10000.00005000...
  sqrt(10^8)     = 10000.00000000...
  Difference     = 0.00005000...        <- only ~8 of the 16 significant digits survive
  Exact answer   = 0.0000499999998750...
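
A small Python sketch contrasting the two forms:

    import math

    def f_naive(x):
        # direct subtraction: the leading digits cancel for large x
        return math.sqrt(x + 1) - math.sqrt(x)

    def f_stable(x):
        # algebraically identical, but with no subtraction of nearly equal numbers
        return 1.0 / (math.sqrt(x + 1) + math.sqrt(x))

    x = 1e8
    print(f_naive(x))     # correct to only ~8 significant digits
    print(f_stable(x))    # correct to nearly full double precision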

Condition Numbers: Measuring Problem Sensitivity

Definition. The condition number of a matrix $A \in \mathbb{R}^{n \times n}$ is

$$\kappa(A) = \|A\|\,\|A^{-1}\| = \frac{\sigma_{\max}}{\sigma_{\min}} \quad (\text{in the 2-norm})$$

where $\sigma_{\max}$ and $\sigma_{\min}$ are the largest and smallest singular values of $A$.

The condition number measures how much the solution $x$ of $Ax = b$ can change relative to perturbations in $b$ or $A$.

Theorem (Perturbation Bound). If $Ax = b$ and $(A + \delta A)(x + \delta x) = b + \delta b$, then

$$\frac{\|\delta x\|}{\|x\|} \lesssim \kappa(A)\left(\frac{\|\delta A\|}{\|A\|} + \frac{\|\delta b\|}{\|b\|}\right)$$

A system with $\kappa(A) = 10^{10}$ can amplify a perturbation of size $10^{-16}$ (machine epsilon) into an error of size $10^{-6}$ - losing 10 digits of accuracy. This is a property of the problem, not the algorithm.
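
A quick numerical illustration, assuming NumPy (Hilbert matrices are a classic ill-conditioned family):

    import numpy as np

    n = 10
    A = np.array([[1.0 / (i + j + 1) for j in range(n)] for i in range(n)])

    sigma = np.linalg.svd(A, compute_uv=False)
    print(sigma[0] / sigma[-1], np.linalg.cond(A))   # both ~1e13 for the 10x10 Hilbert matrix

    # Solve Ax = b with a known solution and observe the digit loss
    x_true = np.ones(n)
    b = A @ x_true
    x_hat = np.linalg.solve(A, b)
    print(np.linalg.norm(x_hat - x_true) / np.linalg.norm(x_true))   # roughly kappa * eps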


Forward and Backward Error Analysis

Forward error analysis bounds the difference between the computed answer $\hat{x}$ and the true answer $x$: $\|\hat{x} - x\|$.

Backward error analysis (Wilkinson’s approach) asks: what is the smallest perturbation $\delta$ of the input such that $\hat{x}$ is the exact answer to the perturbed problem? If this backward error $\|\delta\|/\|\text{input}\|$ is $O(\varepsilon_{\text{mach}})$, the algorithm is backward stable.

Definition. An algorithm for computing $f(x)$ is backward stable if the computed output $\hat{y}$ satisfies $\hat{y} = f(x + \delta x)$ for some $\delta x$ with $\|\delta x\|/\|x\| = O(\varepsilon_{\text{mach}})$.

Backward stability implies that the forward error is bounded by $\kappa \cdot \varepsilon_{\text{mach}}$: the algorithm does as well as the problem allows. Gaussian elimination with partial pivoting is backward stable for most practical matrices; Householder QR factorization is backward stable without qualification.
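
For a linear system, a standard normwise backward-error estimate for a computed solution is $\|b - A\hat{x}\| / (\|A\|\,\|\hat{x}\| + \|b\|)$. A minimal check, assuming NumPy:

    import numpy as np

    rng = np.random.default_rng(0)
    n = 200
    A = rng.standard_normal((n, n))
    b = rng.standard_normal(n)

    x_hat = np.linalg.solve(A, b)          # LU with partial pivoting

    r = b - A @ x_hat
    eta = np.linalg.norm(r) / (np.linalg.norm(A) * np.linalg.norm(x_hat) + np.linalg.norm(b))
    print(eta)                             # O(eps): the computed solution solves a nearby system exactly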


Numerical Differentiation

The simplest approximation to $f'(x)$ is the forward difference:

$$f'(x) \approx \frac{f(x+h) - f(x)}{h}$$

By Taylor’s theorem, the truncation error is $O(h)$. The centered difference formula

$$f'(x) \approx \frac{f(x+h) - f(x-h)}{2h}$$

has truncation error $O(h^2)$ - far superior for the same $h$. However, as $h \to 0$, round-off error (from computing $f(x+h) - f(x)$) grows as $O(\varepsilon_{\text{mach}}/h)$. The optimal $h$ balances these: $h^\ast \approx \varepsilon_{\text{mach}}^{1/2}$ for forward differences, giving error $O(\varepsilon_{\text{mach}}^{1/2})$.
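
A short experiment showing the trade-off (assumes NumPy; the test function and step sizes are illustrative):

    import numpy as np

    f, dfdx = np.sin, np.cos               # test function with a known derivative
    x = 1.0

    for h in [1e-2, 1e-4, 1e-6, 1e-8, 1e-10, 1e-12]:
        fwd = (f(x + h) - f(x)) / h
        ctr = (f(x + h) - f(x - h)) / (2 * h)
        print(f"h={h:.0e}  forward err={abs(fwd - dfdx(x)):.1e}  "
              f"centered err={abs(ctr - dfdx(x)):.1e}")
    # the forward-difference error bottoms out near h ~ 1e-8 (~sqrt(eps));
    # for smaller h, round-off dominates and the error grows again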

Richardson Extrapolation eliminates the leading error term systematically. If $D(h) = f'(x) + c_2 h^2 + O(h^4)$, then

$$\frac{4D(h/2) - D(h)}{3} = f'(x) + O(h^4)$$

Richardson extrapolation can be applied repeatedly to achieve high-order accuracy.
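
A sketch of one level of Richardson extrapolation applied to the centered difference:

    import numpy as np

    def centered(f, x, h):
        return (f(x + h) - f(x - h)) / (2 * h)

    def richardson(f, x, h):
        # cancel the leading O(h^2) term of the centered difference
        return (4 * centered(f, x, h / 2) - centered(f, x, h)) / 3

    x, h = 1.0, 1e-2
    print(abs(centered(np.sin, x, h) - np.cos(x)))    # ~1e-5, O(h^2)
    print(abs(richardson(np.sin, x, h) - np.cos(x)))  # several orders smaller, O(h^4)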


Numerical Integration

Trapezoidal Rule

Approximate $\int_a^b f(x)\,dx$ by summing trapezoids over a uniform grid $x_k = a + kh$, $h = (b-a)/n$:

$$T_n(f) = h\left[\tfrac{1}{2}f(x_0) + f(x_1) + \cdots + f(x_{n-1}) + \tfrac{1}{2}f(x_n)\right]$$

Theorem (Error Bound). $\left|\int_a^b f(x)\,dx - T_n(f)\right| \leq \frac{(b-a)^3}{12n^2}\max_{[a,b]}|f''| = O(h^2)$.
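
A minimal implementation and convergence check (assumes NumPy; the test integrand is illustrative):

    import numpy as np

    def trapezoid(f, a, b, n):
        x = np.linspace(a, b, n + 1)
        y = f(x)
        h = (b - a) / n
        return h * (0.5 * y[0] + y[1:-1].sum() + 0.5 * y[-1])

    exact = 2.0                                  # integral of sin over [0, pi]
    for n in [10, 20, 40, 80]:
        print(n, abs(trapezoid(np.sin, 0.0, np.pi, n) - exact))
    # the error drops by roughly 4x each time n doubles, confirming O(h^2)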

Simpson’s Rule

Using quadratic interpolation over pairs of sub-intervals (so $n$ must be even):

$$S_n(f) = \frac{h}{3}\left[f(x_0) + 4f(x_1) + 2f(x_2) + 4f(x_3) + \cdots + 2f(x_{n-2}) + 4f(x_{n-1}) + f(x_n)\right]$$

Error is $O(h^4)$: two orders better than the trapezoidal rule for the same number of function evaluations.
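
The same experiment for Simpson's rule:

    import numpy as np

    def simpson(f, a, b, n):
        # n must be even
        x = np.linspace(a, b, n + 1)
        y = f(x)
        h = (b - a) / n
        return (h / 3) * (y[0] + 4 * y[1:-1:2].sum() + 2 * y[2:-1:2].sum() + y[-1])

    exact = 2.0
    for n in [10, 20, 40, 80]:
        print(n, abs(simpson(np.sin, 0.0, np.pi, n) - exact))
    # the error drops by roughly 16x each time n doubles, confirming O(h^4)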

Gaussian Quadrature

For $n$ carefully chosen nodes $x_1, \ldots, x_n$ and weights $w_1, \ldots, w_n$ on $[-1,1]$:

$$\int_{-1}^1 f(x)\,dx \approx \sum_{k=1}^n w_k f(x_k)$$

The nodes and weights are chosen to integrate polynomials of degree up to $2n-1$ exactly. This is optimal: no $n$-point rule can integrate degree-$2n$ polynomials exactly. For smooth functions, Gaussian quadrature achieves exponential convergence as $n$ increases.
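
A sketch using NumPy's Gauss-Legendre nodes and weights (numpy.polynomial.legendre.leggauss); the integrand is illustrative:

    import numpy as np

    f = lambda x: np.exp(x)                      # smooth integrand on [-1, 1]
    exact = np.exp(1.0) - np.exp(-1.0)

    for n in [2, 4, 6, 8]:
        nodes, weights = np.polynomial.legendre.leggauss(n)
        print(n, abs(np.sum(weights * f(nodes)) - exact))
    # the error decays extremely fast: a handful of nodes already reaches ~machine precision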


ODE Solvers

Consider the initial value problem $y' = F(t, y)$, $y(t_0) = y_0$.

Euler Method

The simplest integrator: $y_{n+1} = y_n + h\,F(t_n, y_n)$. This is first-order accurate: the global error over a fixed time interval is $O(h)$.

Stability. For the test equation $y' = \lambda y$ with $\lambda \in \mathbb{C}$, the Euler update gives $y_{n+1} = (1 + h\lambda)y_n$. Stability requires $|1 + h\lambda| \leq 1$, defining a disk in the $h\lambda$-plane. For $\lambda < 0$ real, this requires $h \leq 2/|\lambda|$ - a restrictive condition for stiff problems.
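
A minimal Euler integrator on the test problem $y' = -y$ (the choices of $F$ and $h$ are illustrative):

    import math

    def euler(F, t0, y0, h, n_steps):
        t, y = t0, y0
        for _ in range(n_steps):
            y = y + h * F(t, y)
            t = t + h
        return y

    F = lambda t, y: -y                          # y' = -y, exact solution e^{-t}
    for h in [0.1, 0.05, 0.025]:
        err = abs(euler(F, 0.0, 1.0, h, round(1.0 / h)) - math.exp(-1.0))
        print(h, err)                            # error roughly halves with h: first order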

Runge-Kutta 4 (RK4)

$$k_1 = hF(t_n, y_n), \quad k_2 = hF(t_n + h/2,\ y_n + k_1/2)$$

$$k_3 = hF(t_n + h/2,\ y_n + k_2/2), \quad k_4 = hF(t_n + h,\ y_n + k_3)$$

$$y_{n+1} = y_n + \frac{1}{6}(k_1 + 2k_2 + 2k_3 + k_4)$$

RK4 is fourth-order accurate: global error $O(h^4)$. Its stability region is much larger than Euler’s and contains a portion of the negative real axis up to $h|\lambda| \approx 2.8$.
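
A corresponding RK4 sketch on the same test problem:

    import math

    def rk4_step(F, t, y, h):
        k1 = h * F(t, y)
        k2 = h * F(t + h / 2, y + k1 / 2)
        k3 = h * F(t + h / 2, y + k2 / 2)
        k4 = h * F(t + h, y + k3)
        return y + (k1 + 2 * k2 + 2 * k3 + k4) / 6

    F = lambda t, y: -y
    for h in [0.1, 0.05, 0.025]:
        t, y = 0.0, 1.0
        for _ in range(round(1.0 / h)):
            y = rk4_step(F, t, y, h)
            t += h
        print(h, abs(y - math.exp(-1.0)))        # error drops ~16x per halving of h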

Stiff Equations

A system is stiff if the solution evolves on multiple timescales simultaneously (e.g., a chemical reaction with fast and slow components). Explicit methods like RK4 require very small $h$ to maintain stability, even if the solution changes slowly. Implicit methods (backward Euler, Crank-Nicolson, implicit Runge-Kutta) solve an algebraic system at each step, allowing much larger $h$ with unconditional stability.
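
A minimal comparison on the stiff test equation $y' = \lambda y$ with $\lambda = -1000$: backward Euler gives $y_{n+1} = y_n/(1 - h\lambda)$, which decays for any $h > 0$ when $\mathrm{Re}\,\lambda < 0$, while explicit Euler diverges once $h$ exceeds $2/|\lambda|$:

    lam, h = -1000.0, 0.01          # h is 5x larger than the explicit stability limit 2/|lam| = 0.002

    y_exp, y_imp = 1.0, 1.0
    for _ in range(100):
        y_exp = (1 + h * lam) * y_exp            # explicit Euler: |1 + h*lam| = 9, diverges
        y_imp = y_imp / (1 - h * lam)            # backward Euler: |1/(1 - h*lam)| < 1, decays

    print(y_exp)                                 # astronomically large
    print(y_imp)                                 # essentially 0, like the true solution e^{lam*t}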

Adaptive Step-Size Control

Modern ODE solvers (Dormand-Prince, LSODA) estimate the local truncation error by comparing two methods of different orders and adjust $h$ accordingly, keeping the error below a user-specified tolerance. This is essential for efficiency: small $h$ only where needed.
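
A usage sketch with SciPy's solve_ivp (assumes SciPy; the right-hand side is an illustrative mildly stiff example):

    import numpy as np
    from scipy.integrate import solve_ivp

    def F(t, y):
        # illustrative mildly stiff right-hand side
        return -1000.0 * y + 3000.0 - 2000.0 * np.exp(-t)

    rk45 = solve_ivp(F, (0.0, 1.0), [0.0], method="RK45", rtol=1e-6, atol=1e-9)
    bdf = solve_ivp(F, (0.0, 1.0), [0.0], method="BDF", rtol=1e-6, atol=1e-9)
    print(len(rk45.t), len(bdf.t))   # the implicit BDF solver typically needs far fewer steps here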


Iterative Methods for Linear Systems

For large sparse systems $Ax = b$, direct methods (LU, Cholesky) are often too expensive. Iterative methods generate a sequence $x^{(k)} \to x$.

Jacobi method. Update each component using the previous full iterate: $$x_i^{(k+1)} = \frac{1}{A_{ii}}\left(b_i - \sum_{j \neq i} A_{ij} x_j^{(k)}\right)$$

Converges whenever $A$ is strictly diagonally dominant (a sufficient condition, not a necessary one).

Gauss-Seidel. Same idea but use updated values immediately: $$x_i^{(k+1)} = \frac{1}{A_{ii}}\left(b_i - \sum_{j < i} A_{ij} x_j^{(k+1)} - \sum_{j > i} A_{ij} x_j^{(k)}\right)$$

For many problems (e.g., the consistently ordered matrices arising from standard PDE discretizations), the Gauss-Seidel spectral radius is the square of Jacobi's, so it converges roughly twice as fast.
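
Minimal sketches of both iterations on a small diagonally dominant system (assumes NumPy):

    import numpy as np

    def jacobi(A, b, x0, iters):
        D = np.diag(A)
        R = A - np.diagflat(D)
        x = x0.copy()
        for _ in range(iters):
            x = (b - R @ x) / D                  # every component uses the old iterate
        return x

    def gauss_seidel(A, b, x0, iters):
        x = x0.copy()
        for _ in range(iters):
            for i in range(len(b)):              # components use values already updated
                s = A[i, :i] @ x[:i] + A[i, i + 1:] @ x[i + 1:]
                x[i] = (b[i] - s) / A[i, i]
        return x

    A = np.array([[4.0, 1.0, 0.0], [1.0, 4.0, 1.0], [0.0, 1.0, 4.0]])   # diagonally dominant
    b = np.array([1.0, 2.0, 3.0])
    x_true = np.linalg.solve(A, b)
    print(np.linalg.norm(jacobi(A, b, np.zeros(3), 25) - x_true))
    print(np.linalg.norm(gauss_seidel(A, b, np.zeros(3), 25) - x_true))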

Conjugate Gradient (CG). For symmetric positive definite $A$, CG minimizes $\frac{1}{2}x^T A x - b^T x$ over Krylov subspaces:

$$\mathcal{K}_k(A, r_0) = \operatorname{span}\{r_0, Ar_0, A^2 r_0, \ldots, A^{k-1}r_0\}$$

Theorem (CG Convergence). The error after $k$ steps satisfies

$$\|x^{(k)} - x\|_A \leq 2\left(\frac{\sqrt{\kappa(A)}-1}{\sqrt{\kappa(A)}+1}\right)^k \|x^{(0)} - x\|_A$$

Convergence requires $O(\sqrt{\kappa(A)})$ iterations to reduce the error by a fixed factor. Preconditioning - replacing $Ax = b$ with $P^{-1}Ax = P^{-1}b$ for a good approximation $P \approx A$ - reduces the effective condition number and accelerates convergence.
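
A bare-bones CG implementation on a 1-D Laplacian test matrix (a sketch, assuming NumPy; production codes add preconditioning and more careful stopping tests):

    import numpy as np

    def conjugate_gradient(A, b, tol=1e-10):
        x = np.zeros_like(b)
        r = b - A @ x
        p = r.copy()
        rs_old = r @ r
        for _ in range(len(b)):
            Ap = A @ p
            alpha = rs_old / (p @ Ap)
            x = x + alpha * p
            r = r - alpha * Ap
            rs_new = r @ r
            if np.sqrt(rs_new) < tol:
                break
            p = r + (rs_new / rs_old) * p
            rs_old = rs_new
        return x

    # SPD test matrix: discrete 1-D Laplacian
    n = 100
    A = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
    b = np.ones(n)
    x = conjugate_gradient(A, b)
    print(np.linalg.norm(A @ x - b))             # residual near machine precision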


Eigenvalue Algorithms

Power Iteration. Repeatedly multiply by $A$ and normalize: $v^{(k+1)} = Av^{(k)}/\|Av^{(k)}\|$. Provided $|\lambda_1| > |\lambda_2|$, this converges to the dominant eigenvector at rate $|\lambda_2/\lambda_1|^k$.

QR Algorithm. The standard algorithm for dense eigenvalue computation. First reduce $A$ to Hessenberg form (upper triangular plus one subdiagonal) using Householder reflectors - this costs $O(n^3)$ but needs to be done only once. Then apply QR iterations: $A_k = Q_k R_k$, $A_{k+1} = R_k Q_k$. With well-chosen shifts the iteration converges rapidly (cubically in the symmetric case) to the Schur form. Total cost: $O(n^3)$.
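
Sketches of power iteration and the unshifted QR iteration (purely illustrative; practical codes use Hessenberg form plus shifts; assumes NumPy):

    import numpy as np

    def power_iteration(A, iters=200):
        v = np.random.default_rng(0).standard_normal(A.shape[0])
        for _ in range(iters):
            w = A @ v
            v = w / np.linalg.norm(w)
        return v @ A @ v, v                      # Rayleigh quotient and eigenvector estimate

    def qr_iteration(A, iters=100):
        Ak = A.copy()
        for _ in range(iters):
            Q, R = np.linalg.qr(Ak)
            Ak = R @ Q                           # similarity transform: eigenvalues preserved
        return np.sort(np.diag(Ak))

    A = np.array([[2.0, 1.0], [1.0, 3.0]])
    print(power_iteration(A)[0])                 # dominant eigenvalue, ~3.618
    print(qr_iteration(A), np.linalg.eigvalsh(A))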


Examples

float16 and bfloat16. Half-precision (float16) has $\varepsilon_{\text{mach}} \approx 10^{-3}$ and a largest finite value of about $65504$. It is used for forward passes and gradient computation to reduce memory and leverage tensor cores (which compute float16 matrix products natively at high throughput). bfloat16 (brain float 16, from Google Brain) uses 8 exponent bits vs float16’s 5, trading mantissa precision for dynamic range - avoiding the overflow problems that plague float16 in gradient computation.
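
These constants can be inspected directly (assumes PyTorch, since NumPy has no native bfloat16):

    import torch

    for dtype in (torch.float16, torch.bfloat16, torch.float32):
        info = torch.finfo(dtype)
        print(dtype, "eps =", info.eps, "max =", info.max)
    # float16:  eps ~ 9.8e-4,  max ~ 6.5e4   (narrow range -> overflow risk)
    # bfloat16: eps ~ 7.8e-3,  max ~ 3.4e38  (float32-like range, coarser precision)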

Mixed Precision Training. The NVIDIA Apex / PyTorch AMP scheme: maintain a master copy of weights in float32, compute forward and backward passes in float16, accumulate gradients in float32, and apply the optimizer update in float32. Loss scaling (multiplying the loss by a large constant before backprop) prevents float16 underflow in gradients.

Mixed precision schematic:

[float32 weights] -> cast down -> [float16 compute]
                                        |
                                   forward + backward
                                        |
                              [float16 gradients * scale]
                                        |
                                   cast up + unscale
                                        |
                         [float32 gradient accumulation]
                                        |
                              [float32 optimizer step]
                                        |
                           [float32 weights updated]
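
The same flow condensed into the standard PyTorch AMP pattern (a sketch only: model, optimizer, and loader are placeholders, and a CUDA device is assumed):

    import torch

    scaler = torch.cuda.amp.GradScaler()             # manages the loss-scaling factor

    for inputs, targets in loader:                   # loader, model, optimizer are placeholders
        optimizer.zero_grad()
        with torch.cuda.amp.autocast():              # forward pass runs in float16 where safe
            loss = torch.nn.functional.cross_entropy(model(inputs), targets)
        scaler.scale(loss).backward()                # backprop the scaled loss
        scaler.step(optimizer)                       # unscales grads; skips the step on inf/nan
        scaler.update()                              # adapts the scale factor for the next step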

Gradient Checkpointing. Storing all activations during the forward pass requires $O(L \cdot \text{batch} \cdot d)$ memory. Gradient checkpointing stores only activations at checkpoint layers (e.g., every $\sqrt{L}$ layers), recomputing intermediate activations on demand during the backward pass. Memory reduces from $O(L)$ to $O(\sqrt{L})$ at the cost of one extra forward pass per checkpoint segment.
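
A usage sketch with torch.utils.checkpoint.checkpoint (the toy Block module and sizes are illustrative; assumes a recent PyTorch):

    import torch
    from torch.utils.checkpoint import checkpoint

    class Block(torch.nn.Module):
        def __init__(self, d):
            super().__init__()
            self.ff = torch.nn.Sequential(torch.nn.Linear(d, d), torch.nn.ReLU())

        def forward(self, x):
            return x + self.ff(x)

    d, L = 64, 8
    blocks = torch.nn.ModuleList(Block(d) for _ in range(L))

    x = torch.randn(4, d, requires_grad=True)
    for blk in blocks:
        # activations inside blk are not stored; they are recomputed during backward
        x = checkpoint(blk, x, use_reentrant=False)
    x.sum().backward()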

Numerical Stability in Softmax. The softmax function $\sigma(x)_i = e^{x_i}/\sum_j e^{x_j}$ overflows for large $x_i$. The numerically stable implementation subtracts $\max_j x_j$ before exponentiation: $\sigma(x)_i = e^{x_i - c}/\sum_j e^{x_j - c}$, with $c = \max_j x_j$. This is mathematically equivalent and bounded.
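
A side-by-side sketch (assumes NumPy; the naive version emits an overflow warning and returns NaNs):

    import numpy as np

    def softmax_naive(x):
        e = np.exp(x)                  # overflows to inf for large entries
        return e / e.sum()

    def softmax_stable(x):
        e = np.exp(x - x.max())        # largest exponent is 0, so no overflow
        return e / e.sum()

    x = np.array([1000.0, 1001.0, 1002.0])
    print(softmax_naive(x))            # [nan nan nan] plus an overflow warning
    print(softmax_stable(x))           # ~[0.090, 0.245, 0.665]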

