Helpful context:


GPT-3 has 175 billion parameters. During training, after each forward pass through the network, you need the gradient of the loss with respect to every one of those parameters. All 175 billion partial derivatives.

If you tried to compute each derivative by hand - perturbing one parameter, measuring how the loss changes, dividing - it would take far longer than the age of the universe.

Automatic differentiation (autodiff) does it exactly, and in roughly the same time as a single forward pass. It is not numerical approximation. It is not symbolic algebra that produces enormous closed-form expressions. It is something better than both: an algorithm that applies the chain rule systematically to a computation graph, accumulating exact derivatives with minimal overhead.

Understanding autodiff means understanding how any modern deep learning framework - PyTorch, JAX, TensorFlow - actually works under the hood.


Three Ways to Differentiate

There are three approaches to computing derivatives of a function implemented as a program. They differ in accuracy, cost, and what they can handle.

Numerical differentiation. Approximate the derivative by a finite difference:

$$\frac{df}{dx} \approx \frac{f(x + \varepsilon) - f(x)}{\varepsilon}$$

or the more accurate centered version:

$$\frac{df}{dx} \approx \frac{f(x + \varepsilon) - f(x - \varepsilon)}{2\varepsilon}$$

The centered version has error $O(\varepsilon^2)$ rather than $O(\varepsilon)$.

Problems: floating-point arithmetic introduces cancellation error when $\varepsilon$ is small (the numerator is the difference of two nearly equal numbers). And to compute the gradient of a scalar function of $n$ parameters, you need $n$ evaluations of $f$ - one per parameter. For 175 billion parameters, $175 \times 10^9$ forward passes per gradient step. Completely infeasible.

Numerical differentiation is useful for gradient checking (verifying that your autodiff implementation is correct) but not for training.

Symbolic differentiation. Apply the rules of calculus algebraically: the product rule, chain rule, etc., to a symbolic expression. A computer algebra system (Mathematica, SymPy) can do this.

Problems: symbolic differentiation suffers from expression swell. Differentiating a complex expression can produce an expression exponentially larger than the original. The derivative of $f \circ f \circ \cdots \circ f$ ($n$ times) grows without bound as $n$ increases. More practically, symbolic differentiation doesn’t handle loops, conditional statements, or dynamically-constructed programs - which is how all real neural networks are implemented.

Automatic differentiation. Record what computations are performed (in what’s called a Wengert list or computation graph), then apply the chain rule systematically to that record. Exact (within floating point). Handles loops, conditionals, recursion. Overhead proportional to the number of operations in the forward pass, not the number of parameters.

This is what PyTorch, JAX, and TensorFlow implement.


The Computational Graph

Every computation can be represented as a directed acyclic graph (DAG) where:

  • Nodes are variables (inputs, intermediate values, outputs).
  • Edges represent primitive operations: addition, multiplication, $\exp$, $\sin$, etc.

Each primitive operation knows two things: how to compute its output from its inputs (the forward rule), and how to compute the derivative of its output with respect to each input (the backward rule).

As a simple example, consider $f(x, y) = (x + y) \cdot \sin(x)$.

The computation graph has nodes:

$$a = x + y \qquad b = \sin(x) \qquad f = a \cdot b$$

During the forward pass, we evaluate each node in order: $a$, then $b$, then $f$. During the backward pass, we propagate derivative information in reverse order: $f$, then $a$ and $b$, then $x$ and $y$.

The chain rule tells us how to propagate: the derivative of the final output with respect to any intermediate node equals the derivative of the output with respect to that node’s outputs, times the local Jacobian at that node.


Forward Mode Autodiff

In forward mode, we propagate derivatives alongside the primal computation.

Along with each value $v$, we carry its tangent: $\dot{v} = dv/dx_{\text{input}}$ for some chosen input variable $x_{\text{input}}$.

At each primitive operation $v = f(u_1, u_2)$, we update the tangent using the chain rule:

$$\dot{v} = \frac{\partial f}{\partial u_1} \dot{u}_1 + \frac{\partial f}{\partial u_2} \dot{u}_2$$

For our example $f(x, y) = (x+y)\sin(x)$, suppose we want $\partial f/\partial x$. Initialize $\dot{x} = 1$, $\dot{y} = 0$.

Forward pass:

  • $a = x + y$, $\dot{a} = \dot{x} + \dot{y} = 1 + 0 = 1$
  • $b = \sin(x)$, $\dot{b} = \cos(x) \cdot \dot{x} = \cos(x)$
  • $f = a \cdot b$, $\dot{f} = \dot{a} \cdot b + a \cdot \dot{b} = \sin(x) + (x+y)\cos(x)$

We’ve computed $\partial f/\partial x = \sin(x) + (x+y)\cos(x)$ in a single forward pass.

To compute $\partial f/\partial y$ instead, we’d initialize $\dot{x} = 0$, $\dot{y} = 1$ and repeat.

The cost: one forward pass per input we want to differentiate with respect to. To compute the full gradient of a scalar function $f: \mathbb{R}^n \to \mathbb{R}$, we need $n$ forward passes.

What forward mode computes: Given a direction vector $v \in \mathbb{R}^n$, a single forward mode pass computes the Jacobian-vector product $J v$ where $J = df/dx \in \mathbb{R}^{m \times n}$ is the Jacobian. Setting $v = e_i$ (the $i$-th standard basis vector) gives the $i$-th column of $J$, which for $m = 1$ (scalar output) is the $i$-th partial derivative.

Forward mode is efficient when $n$ is small and $m$ is large - few inputs, many outputs. For example, computing sensitivities of many outputs to one parameter.


Reverse Mode Autodiff

Reverse mode is the mathematically dual approach. Instead of propagating derivatives forward from inputs, we propagate them backward from the output.

Forward pass: Evaluate the computation graph in topological order, storing all intermediate values.

Backward pass: Propagate adjoint values $\bar{v} = \partial L/\partial v$ (the derivative of the final output with respect to each node) in reverse topological order.

Initialize $\bar{f} = 1$ (since $\partial f/\partial f = 1$). For each node $v = g(u_1, u_2, \ldots)$ processed in the backward pass, update the adjoints of its inputs:

$$\bar{u}_i \mathrel{+}= \bar{v} \cdot \frac{\partial v}{\partial u_i}$$

The $\mathrel{+}=$ is important: if a node feeds into multiple downstream nodes, its adjoint accumulates contributions from each.

For our example, after the forward pass we have $a, b, f$ stored. Backward pass with $\bar{f} = 1$:

  • Node $f = a \cdot b$: $\bar{a} \mathrel{+}= \bar{f} \cdot b = b = \sin(x)$ and $\bar{b} \mathrel{+}= \bar{f} \cdot a = a = x + y$
  • Node $b = \sin(x)$: $\bar{x} \mathrel{+}= \bar{b} \cdot \cos(x) = (x+y)\cos(x)$
  • Node $a = x + y$: $\bar{x} \mathrel{+}= \bar{a} \cdot 1 = \sin(x)$ and $\bar{y} \mathrel{+}= \bar{a} \cdot 1 = \sin(x)$

Final adjoints: $\bar{x} = \sin(x) + (x+y)\cos(x)$ and $\bar{y} = \sin(x)$.

We’ve computed both $\partial f/\partial x$ and $\partial f/\partial y$ in a single forward pass + one backward pass.

The cost: one forward pass to compute and store intermediate values, one backward pass to compute all adjoints. Total: $O(1)$ overhead per operation, regardless of the number of input variables $n$.

What reverse mode computes: Given a covector $u \in \mathbb{R}^m$, a single reverse mode pass computes the vector-Jacobian product $u^T J \in \mathbb{R}^n$. For machine learning, $m = 1$ (scalar loss), so we initialize $\bar{L} = 1$ and recover all $n$ partial derivatives in one pass.


Why Reverse Mode Wins for ML

Consider a scalar-output function $L: \mathbb{R}^n \to \mathbb{R}$ - a loss function with $n$ parameters.

  • Forward mode: $n$ passes to compute the full gradient $\nabla L \in \mathbb{R}^n$.
  • Reverse mode: 1 forward + 1 backward pass to compute the full gradient.

For neural networks, $n$ is the number of parameters: typically millions to billions. Forward mode requires $n$ passes. Reverse mode requires 2 passes (roughly).

The speedup factor is $n/2$, which for large models is billions. This is why every neural network library implements reverse mode autodiff. The choice is not a matter of preference - it’s a computational necessity.

The reason is the shape of the problem: one output (the scalar loss), many inputs (the parameters). Reverse mode is optimal for one-output functions. Forward mode is optimal for one-input functions (or when there are far more outputs than inputs).


Memory: The Cost of Reverse Mode

Reverse mode requires storing all intermediate values computed during the forward pass. You need them during the backward pass to evaluate the local Jacobians.

For a network computing $h_1 = \sigma(W_1 x)$, $h_2 = \sigma(W_2 h_1)$, $\ldots$, $\hat{y} = W_L h_{L-1}$: to compute $\partial L/\partial W_l$, you need $h_{l-1}$ (the input to layer $l$). This means storing every hidden state from the forward pass.

Memory cost: proportional to depth × batch size × layer width. For a large model with depth 96, batch size 512, and hidden size 12288 (GPT-3 scale), this is enormous - tens of gigabytes of GPU memory just for activations.

Gradient checkpointing trades compute for memory. Instead of storing all intermediate activations, store only activations at every $k$-th layer (called checkpoints). During the backward pass, when you need an activation between checkpoints, recompute it from the nearest stored checkpoint.

Checkpointing at every $\sqrt{L}$ layers (where $L$ is the total number of layers) reduces memory from $O(L)$ to $O(\sqrt{L})$ at the cost of one additional forward pass per checkpoint segment. For a 96-layer network, checkpointing every 10 layers roughly reduces activation memory by $10\times$ at the cost of $\sim 1.1\times$ extra compute.

Discomfort check. “Backpropagation” and “reverse mode autodiff” are the same thing. They are two names for the same algorithm, discovered in two different communities. Reverse mode autodiff was developed in the automatic differentiation literature by Linnainmaa (1970) and others, originally for scientific computing. Rumelhart, Hinton, and Williams (1986) independently discovered the same algorithm for training neural networks and called it backpropagation. Their 1986 paper made it famous. The two names coexist in the literature, creating confusion. If you see “backprop,” think “reverse mode autodiff applied to a neural network loss.” They are the same.


The Chain Rule at the Core

Everything in autodiff is an application of the chain rule. In the scalar case:

$$\frac{d}{dx}(f \circ g)(x) = f'(g(x)) \cdot g'(x)$$

In the vector case, for $g: \mathbb{R}^n \to \mathbb{R}^m$ and $f: \mathbb{R}^m \to \mathbb{R}^k$:

$$J_{f \circ g}(x) = J_f(g(x)) \cdot J_g(x)$$

The Jacobian of the composition is the product of Jacobians. For forward mode, we compute this product left-to-right, multiplying by each Jacobian as we go forward. For reverse mode, we compute it right-to-left (starting from the loss end).

Matrix multiplication is associative, so both orders give the same result. The efficiency difference is pure cost: multiplying a $1 \times k$ vector by a $k \times n$ matrix costs $O(kn)$; multiplying a $m \times n$ matrix by an $n \times 1$ vector costs $O(mn)$. For our case ($k = 1$, scalar loss), reverse mode’s $1 \times m$ times $m \times n$ is much cheaper than forward mode’s $m \times n$ times $n \times p$ times $\ldots$.

The mathematical content is identical. The cost depends on the shape of the computation.


Higher-Order Derivatives

Autodiff of autodiff gives higher-order derivatives. If you can compute $\nabla L(w)$, you can differentiate that to get the Hessian $H = \nabla^2 L(w)$ - the matrix of second derivatives.

For a function with $n$ parameters, the Hessian is $n \times n$. At $n = 10^6$, this is $10^{12}$ entries - far too large to store explicitly.

But you can compute Hessian-vector products $Hv$ without forming $H$. The trick: $Hv = \nabla((\nabla L) \cdot v)$. Compute the dot product of the gradient with a fixed vector $v$, then differentiate that scalar with respect to the weights. One reverse mode pass gives $(\nabla L) \cdot v$ as a scalar for any given $w$; differentiating this scalar gives $Hv$ in another reverse mode pass.

Total cost: two forward passes + two backward passes per Hessian-vector product. This is efficient enough to use Hessian information in optimization (Newton-CG, natural gradient methods) without ever forming the full Hessian.

PyTorch’s torch.autograd.functional.hvp and JAX’s jax.hessian make this easy.


Implementation: How Frameworks Build Graphs

PyTorch uses define-by-run (also called dynamic graphs). As you write Python code that performs tensor operations, PyTorch records each operation in a computation graph on the fly. When you call .backward(), it traverses the recorded graph in reverse, computing gradients.

This means the computation graph is recreated at each forward pass - it can change based on control flow (if statements, loops). Debugging is easy because you can insert print statements mid-computation. The cost is that the graph-building overhead happens at every training step.

JAX uses function transformations. jax.grad(f) returns a new Python function that computes the gradient of f. The transformation is applied once; subsequent calls to the gradient function don’t rebuild a graph. JAX traces your function using abstract values to determine the computation structure, then compiles it with XLA. jax.jit adds just-in-time compilation; jax.vmap adds automatic vectorization over batches; jax.grad adds differentiation. These compose arbitrarily.

TensorFlow 2 supports both approaches: eager execution (like PyTorch’s dynamic graphs) and tf.function for tracing and compilation (like JAX’s approach).

ONNX and static graphs. Some production systems compile neural networks to static computation graphs that can be heavily optimized. These require knowing the graph structure ahead of time, which means no dynamic control flow. The price is optimization and portability.

All of these frameworks implement the same mathematical idea - compute forward, propagate adjoint values backward - with different engineering choices about when to build the graph and how to optimize execution.


Autodiff for Any Computation

The power of autodiff is that it applies to any differentiable computation, not just neural networks. Some examples:

Differentiable physics. Simulate a physical system (rigid bodies, fluids, molecular dynamics) using differentiable operations. Backpropagate through the simulation to find parameters (e.g., initial conditions, material properties) that minimize some objective.

Differentiable rendering. Render a 3D scene into a 2D image using differentiable rasterization or ray marching. Backpropagate the loss (how much does the rendered image differ from a target?) to optimize 3D scene parameters. This is how neural radiance fields (NeRF) are trained.

Differentiable algorithms. Sort a list, solve a linear system, find a shortest path - all made differentiable by relaxing discrete operations or using implicit differentiation. This allows end-to-end training of systems that include algorithmic components.

Scientific computing. Differentiate through PDE solvers, ODE integrators, optimization algorithms. This is the domain that originally motivated the development of reverse mode autodiff, long before neural networks made it famous.

In all cases, the mechanism is the same: record the computation graph, propagate adjoints backward.


Summary

Method Cost per gradient Accuracy Handles loops?
Numerical diff $O(n)$ evaluations of $f$ Approximate Yes
Symbolic diff Exponential expression growth Exact No
Forward mode autodiff $O(n)$ passes Exact Yes
Reverse mode autodiff $O(1)$ passes Exact Yes
Concept Key point
Computational graph DAG of primitive ops; each has a forward and backward rule
Forward mode Propagates tangents $\dot{v} = dv/dx_{\text{input}}$ forward; computes $Jv$
Reverse mode Propagates adjoints $\bar{v} = \partial L/\partial v$ backward; computes $v^TJ$
Why reverse mode for ML One output (loss), many inputs (parameters); factor-of-$n$ speedup
Backpropagation Same algorithm as reverse mode; historical name from the ML community
Memory cost Must store all forward activations; gradient checkpointing reduces this
Higher-order Autodiff of autodiff gives Hessian-vector products without forming the Hessian
Frameworks PyTorch (dynamic), JAX (functional transforms), TensorFlow 2 (both)

Autodiff is the infrastructure that makes modern deep learning computationally feasible. Without it, computing the gradient of a 175-billion-parameter model would require 175 billion forward passes per training step. With reverse mode autodiff, it requires two - one forward, one backward. The insight is pure chain rule: propagate derivative information backward through a computation graph, accumulating exact partial derivatives at each node.

Once you have the gradient, you hand it to an optimizer. Once you have an optimizer, you can train any model. This is the complete pipeline that turns a differentiable function and a loss into a trained system.


Read next: