Autodiff (Forward/Reverse)
Why Not Symbolic or Numerical Differentiation?
To train neural networks, we need $\partial L / \partial \theta$ for all parameters $\theta$. Two naive approaches both fail at scale.
Symbolic differentiation applies calculus rules algebraically to produce a closed-form expression for the derivative. For a composition of $n$ operations, the symbolic derivative can grow exponentially in size due to repeated subexpressions. This “expression swell” makes it impractical for networks with millions of operations.
Numerical differentiation uses finite differences:
$$\frac{\partial f}{\partial x_i} \approx \frac{f(x + \varepsilon e_i) - f(x)}{\varepsilon},$$
where $e_i$ is the $i$-th standard basis vector. Each partial derivative requires a separate forward pass, so computing $\nabla_\theta L$ for $p$ parameters costs $O(p)$ forward passes. With $p \sim 10^9$ in modern models, this is completely infeasible. Moreover, finite differences suffer from catastrophic cancellation at small $\varepsilon$ and large truncation error at large $\varepsilon$.
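As a concrete illustration, here is a minimal Python sketch of the forward-difference approximation; the test function and step size are arbitrary choices for the example, and the loop makes the $O(p)$ cost explicit.

```python
import numpy as np

def finite_difference_grad(f, x, eps=1e-6):
    """Approximate the gradient of a scalar function f at x with forward differences.

    One extra evaluation of f per coordinate, so cost grows linearly
    with the number of parameters.
    """
    grad = np.zeros_like(x)
    fx = f(x)
    for i in range(x.size):
        x_step = x.copy()
        x_step[i] += eps
        grad[i] = (f(x_step) - fx) / eps
    return grad

# Example: f(x) = sum(x**2), exact gradient is 2x.
x = np.array([1.0, -2.0, 3.0])
print(finite_difference_grad(lambda v: np.sum(v**2), x))  # approx [2, -4, 6]
```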
Automatic differentiation (autodiff) avoids both problems by working on the computational graph at the level of elementary operations.
Computational Graphs
Any differentiable computation $f: \mathbb{R}^n \to \mathbb{R}^m$ can be decomposed into a sequence of elementary operations (addition, multiplication, $\exp$, $\log$, etc.), forming a directed acyclic graph (DAG) where nodes are intermediate values and edges represent dependencies.
For example, $f(x, y) = (x + y) \cdot \sin(x)$ decomposes as:
- $v_1 = x + y$
- $v_2 = \sin(x)$
- $v_3 = v_1 \cdot v_2$
The Jacobian of $f$ is computed by applying the chain rule along the edges of this graph.
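Written as straight-line Python (a minimal sketch, with math.sin standing in for the elementary operation), each assignment below is one node of the DAG, and evaluating the lines top to bottom is a topological traversal of the graph:

```python
import math

def f(x, y):
    v1 = x + y          # node v1 = x + y
    v2 = math.sin(x)    # node v2 = sin(x)
    v3 = v1 * v2        # node v3 = v1 * v2  (the output)
    return v3

print(f(1.0, 2.0))  # (1 + 2) * sin(1) ~= 2.524
```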
Forward Mode: Dual Numbers
Forward mode autodiff propagates derivatives alongside values in a single forward pass. It is formalized using dual numbers: numbers of the form $a + b\varepsilon$ where $\varepsilon^2 = 0$ (a formal nilpotent). Arithmetic extends as:
$$(a + b\varepsilon) + (c + d\varepsilon) = (a+c) + (b+d)\varepsilon$$ $$(a + b\varepsilon)(c + d\varepsilon) = ac + (ad + bc)\varepsilon$$
For an elementary function $g$, define $g(a + b\varepsilon) = g(a) + g'(a)b\varepsilon$. This rule propagates through compositions: seeding input $x_i$ as $x_i + 1\cdot\varepsilon$ (and every other input with dual part $0$) and evaluating $f$ yields $f(x)$ in the real part and $\partial f / \partial x_i$ in the dual part - one pass per input dimension.
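A minimal dual-number sketch in Python makes this concrete; the Dual class and the sin wrapper are illustrative names, and only the operations needed for the running example $(x + y)\cdot\sin(x)$ are implemented:

```python
import math

class Dual:
    """Dual number a + b*eps with eps**2 = 0; the dual part b carries the derivative."""
    def __init__(self, real, dual=0.0):
        self.real, self.dual = real, dual

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.real + other.real, self.dual + other.dual)
    __radd__ = __add__

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.real * other.real,
                    self.real * other.dual + self.dual * other.real)
    __rmul__ = __mul__

def sin(v):
    # g(a + b*eps) = g(a) + g'(a)*b*eps with g = sin
    return Dual(math.sin(v.real), math.cos(v.real) * v.dual)

def f(x, y):
    return (x + y) * sin(x)

# Seed x with dual part 1 to get df/dx; y gets dual part 0.
out = f(Dual(1.0, 1.0), Dual(2.0, 0.0))
print(out.real)  # f(1, 2) = 3 * sin(1)
print(out.dual)  # df/dx   = sin(1) + 3 * cos(1)
```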
When is forward mode efficient? Forward mode computes one Jacobian-vector product (JVP) per pass:
$$\text{JVP}(\dot{x}) = J_f \cdot \dot{x}, \quad J_f \in \mathbb{R}^{m \times n}.$$
To get the full Jacobian, run $n$ forward passes (one per input). This is efficient when $n \ll m$ - few inputs, many outputs.
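As a sketch, assuming a recent PyTorch version that provides torch.func.jvp, the loop below recovers a full $3 \times 2$ Jacobian one column per JVP; this column-by-column pattern is exactly what becomes expensive when $n$ is large:

```python
import torch
from torch.func import jvp

def f(x):
    # f: R^2 -> R^3, so n = 2 inputs and m = 3 outputs.
    return torch.stack([x[0] + x[1], torch.sin(x[0]), x[0] * x[1]])

x = torch.tensor([1.0, 2.0])

# One JVP per input dimension recovers the full 3x2 Jacobian column by column.
columns = []
for i in range(x.numel()):
    e_i = torch.zeros_like(x)
    e_i[i] = 1.0
    _, col = jvp(f, (x,), (e_i,))
    columns.append(col)
jacobian = torch.stack(columns, dim=1)
print(jacobian)
```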
Reverse Mode: Backpropagation
Reverse mode autodiff runs the computation forward to cache intermediate values, then traverses the graph backward to accumulate derivatives. It computes vector-Jacobian products (VJPs):
$$\text{VJP}(\bar{v}) = \bar{v}^T J_f, \quad \bar{v} \in \mathbb{R}^m.$$
For a scalar loss $L$ (i.e., $m = 1$), setting $\bar{v} = 1$ makes the VJP the full gradient: $\nabla_\theta L = J_f^T \bar{v}$. A single backward pass therefore yields all $p$ partial derivatives, regardless of how large $p$ is. For neural network training, where $p \gg 1$ and $m = 1$, reverse mode is the clear winner.
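A sketch of the same idea with torch.func.vjp (again assuming a recent PyTorch); the loss function and parameter count are arbitrary, and a single call to the returned VJP function produces the whole gradient:

```python
import torch
from torch.func import vjp

def loss(theta):
    # Scalar loss of many parameters; the particular form is arbitrary.
    return torch.sin(theta).sum() ** 2

theta = torch.randn(1000)

# One VJP with the scalar cotangent 1.0 yields the full gradient,
# regardless of how many parameters theta has.
value, vjp_fn = vjp(loss, theta)
(grad,) = vjp_fn(torch.tensor(1.0))
print(grad.shape)  # torch.Size([1000])
```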
Complexity. The time complexity of reverse mode is $O(T)$ where $T$ is the cost of the forward pass - a constant factor overhead (typically 2–5x) over the forward computation. The memory cost is $O(T)$ as well, since all intermediate activations must be stored for the backward pass.
Chain Rule as Matrix Product
For a composition $f = g \circ h$ with $h: \mathbb{R}^n \to \mathbb{R}^k$ and $g: \mathbb{R}^k \to \mathbb{R}^m$, the chain rule in matrix form is:
$$J_f = J_g \cdot J_h, \quad J_f \in \mathbb{R}^{m \times n}, \; J_g \in \mathbb{R}^{m \times k}, \; J_h \in \mathbb{R}^{k \times n}.$$
Forward mode evaluates this product right-to-left (starting from the input side) as $J_g \cdot (J_h \cdot \dot{x})$, computing a $k$-vector and then an $m$-vector. Reverse mode evaluates it left-to-right (starting from the output side) as $(\bar{v}^T J_g) \cdot J_h$, computing a $k$-vector and then an $n$-vector. The associativity of matrix multiplication means both orderings are correct; which is cheaper depends on the dimensions.
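The two orderings can be checked numerically; the dimensions below are arbitrary, and the asserts confirm that both association orders agree with products against the full Jacobian $J_f = J_g J_h$:

```python
import numpy as np

m, k, n = 4, 3, 2
J_g = np.random.randn(m, k)   # Jacobian of g
J_h = np.random.randn(k, n)   # Jacobian of h
x_dot = np.random.randn(n)    # forward-mode tangent (input perturbation)
v_bar = np.random.randn(m)    # reverse-mode cotangent

# Forward mode: multiply starting at the input side.
jvp = J_g @ (J_h @ x_dot)     # k-vector, then m-vector
# Reverse mode: multiply starting at the output side.
vjp = (v_bar @ J_g) @ J_h     # k-vector, then n-vector

# Both orderings agree with the full Jacobian product.
J_f = J_g @ J_h
assert np.allclose(jvp, J_f @ x_dot)
assert np.allclose(vjp, v_bar @ J_f)
```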
Explicit Backprop for a 2-Layer MLP
Consider the 2-layer MLP with parameters $W_1, W_2$ and scalar loss $L$:
$$\mathbf{z}_1 = W_1 \mathbf{x} + \mathbf{b}_1, \quad \mathbf{h} = \sigma(\mathbf{z}_1), \quad \mathbf{z}_2 = W_2 \mathbf{h} + \mathbf{b}_2, \quad L = \ell(\mathbf{z}_2, y).$$
Define the upstream gradient at each node as $\delta^{(\cdot)} = \partial L / \partial (\cdot)$. Then:
Output layer (assuming $\partial L / \partial \mathbf{z}_2 = \boldsymbol{\delta}^{(2)}$ is given by the loss derivative): $$\frac{\partial L}{\partial W_2} = \boldsymbol{\delta}^{(2)} \mathbf{h}^T, \quad \frac{\partial L}{\partial \mathbf{b}_2} = \boldsymbol{\delta}^{(2)}.$$
Hidden layer (backpropagate through $W_2$ and elementwise $\sigma$): $$\boldsymbol{\delta}^{(\mathbf{h})} = W_2^T \boldsymbol{\delta}^{(2)}, \quad \boldsymbol{\delta}^{(\mathbf{z}_1)} = \boldsymbol{\delta}^{(\mathbf{h})} \odot \sigma'(\mathbf{z}_1).$$
Input layer: $$\frac{\partial L}{\partial W_1} = \boldsymbol{\delta}^{(\mathbf{z}_1)} \mathbf{x}^T, \quad \frac{\partial L}{\partial \mathbf{b}_1} = \boldsymbol{\delta}^{(\mathbf{z}_1)}.$$
Each gradient is a local operation on the computational graph: multiply by the transposed weight matrix, then apply the elementwise derivative of the activation. This pattern generalizes to arbitrary depth by induction.
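The equations above translate directly into a short NumPy sketch; the sigmoid activation and squared-error loss are chosen only for concreteness, and the final lines spot-check one gradient entry against a finite difference:

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hid, n_out = 4, 5, 3
x = rng.normal(size=(n_in, 1))
y = rng.normal(size=(n_out, 1))
W1, b1 = rng.normal(size=(n_hid, n_in)), np.zeros((n_hid, 1))
W2, b2 = rng.normal(size=(n_out, n_hid)), np.zeros((n_out, 1))

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# Forward pass (cache z1 and h for the backward pass).
z1 = W1 @ x + b1
h = sigmoid(z1)
z2 = W2 @ h + b2
L = 0.5 * np.sum((z2 - y) ** 2)      # loss l(z2, y) = 0.5 * ||z2 - y||^2

# Backward pass, following the equations above.
delta2 = z2 - y                      # dL/dz2 for squared error
dW2 = delta2 @ h.T
db2 = delta2
delta_h = W2.T @ delta2
delta_z1 = delta_h * sigmoid(z1) * (1 - sigmoid(z1))  # sigma'(z) = sigma(z)(1 - sigma(z))
dW1 = delta_z1 @ x.T
db1 = delta_z1

# Quick sanity check of one entry of dW1 with a finite difference.
eps = 1e-6
W1p = W1.copy(); W1p[0, 0] += eps
Lp = 0.5 * np.sum((W2 @ sigmoid(W1p @ x + b1) + b2 - y) ** 2)
assert abs((Lp - L) / eps - dW1[0, 0]) < 1e-4
```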
Higher-Order Derivatives
Autodiff composes with itself. To compute Hessian-vector products $H \cdot v = \nabla^2 L \cdot v$ without materializing the full Hessian, run forward mode over reverse mode: apply forward mode to the gradient computation, using $v$ as the tangent (i.e., take the directional derivative of $\nabla L$ along $v$). This costs $O(T)$ - the same order as computing the gradient - and is used in second-order optimization methods and meta-learning.
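A sketch of forward-over-reverse with torch.func (assuming a recent PyTorch): grad builds the reverse-mode gradient function, jvp pushes the tangent $v$ through it, and the explicit Hessian is formed only to verify the result on a tiny problem.

```python
import torch
from torch.func import grad, jvp

def loss(x):
    # An arbitrary smooth scalar function, chosen only for illustration.
    return torch.sin(x).sum() + 0.5 * (x ** 2).sum()

x = torch.randn(5)
v = torch.randn(5)

# Forward-over-reverse: push the tangent v through the gradient function.
_, hvp = jvp(grad(loss), (x,), (v,))

# Check against the explicit Hessian (only feasible for tiny problems).
H = torch.func.hessian(loss)(x)
print(torch.allclose(hvp, H @ v, atol=1e-5))  # True
```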
Examples
PyTorch autograd. PyTorch builds a computational graph dynamically during the forward pass (“define-by-run”). Each tensor tracks its grad_fn, the operation that created it. Calling .backward() traverses this graph in reverse, accumulating .grad attributes. The interface hides the VJP machinery - users write only the forward pass.
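A minimal usage sketch (the shapes and the loss are arbitrary):

```python
import torch

W = torch.randn(3, 2, requires_grad=True)
x = torch.randn(2)

loss = torch.sin(W @ x).sum()   # forward pass builds the graph as it runs
print(loss.grad_fn)             # the operation that created `loss` (a SumBackward node)

loss.backward()                 # reverse traversal of the recorded graph
print(W.grad.shape)             # gradients accumulate into .grad -> torch.Size([3, 2])
```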
Gradient checkpointing. Reverse mode stores all forward activations for the backward pass, costing $O(T)$ memory, where $T$ is the number of operations. For very deep networks, this can exceed GPU memory. Gradient checkpointing (also called rematerialization) stores only every $\sqrt{T}$-th activation and recomputes the intermediate values within each segment during the backward pass. This trades memory from $O(T)$ to $O(\sqrt{T})$ at the cost of roughly one additional forward pass - a favorable trade for memory-constrained training.
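A sketch using torch.utils.checkpoint (assuming a reasonably recent PyTorch); the layer sizes and the per-block segmentation are illustrative rather than the optimal $\sqrt{T}$ schedule:

```python
import torch
from torch.utils.checkpoint import checkpoint

# Eight blocks; each block's internal activations are dropped after the
# forward pass and recomputed from the block's input during backward.
layers = torch.nn.ModuleList(
    [torch.nn.Sequential(torch.nn.Linear(256, 256), torch.nn.ReLU()) for _ in range(8)]
)
x = torch.randn(32, 256, requires_grad=True)

h = x
for layer in layers:
    h = checkpoint(layer, h, use_reentrant=False)

h.sum().backward()  # extra forward recomputation happens inside backward
```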