The chain rule is perhaps the most-used theorem in all of calculus. It tells you how to differentiate a composition of functions, and it underlies every gradient computation in modern machine learning. Backpropagation is not a separate algorithm layered on top of calculus - it is the chain rule, applied systematically on a computational graph.

The Chain Rule: Single Variable

Theorem (Chain Rule). Let $g$ be differentiable at $a$, and let $f$ be differentiable at $g(a)$. Then $f \circ g$ is differentiable at $a$, and

$$(f \circ g)'(a) = f'(g(a)) \cdot g'(a).$$
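
For instance, taking $f(u) = \sin u$ and $g(x) = x^2$ gives $(\sin(x^2))' = \cos(x^2) \cdot 2x$: differentiate the outer function at the inner value, then multiply by the derivative of the inner function.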

Proof

The naive attempt - write the difference quotient and split it - runs into a division-by-zero problem when $g(x) - g(a) = 0$ near $a$ (which can happen even for non-constant $g$). The clean fix uses an auxiliary function.

Since $f$ is differentiable at $b = g(a)$, define

$$\epsilon(k) = \begin{cases} \dfrac{f(b+k) - f(b)}{k} - f'(b) & k \neq 0 \\ 0 & k = 0 \end{cases}$$

Then $\epsilon$ is continuous at $0$ and $f(b+k) - f(b) = (f'(b) + \epsilon(k))\,k$ for all $k$.

Set $k = g(x) - g(a)$. For $x \neq a$:

$$\frac{(f \circ g)(x) - (f \circ g)(a)}{x - a} = \frac{f(g(a)+k) - f(g(a))}{x-a} = \bigl(f'(g(a)) + \epsilon(k)\bigr)\cdot\frac{g(x)-g(a)}{x-a}.$$

As $x \to a$: $g(x) \to g(a)$ (differentiability implies continuity), so $k \to 0$, so $\epsilon(k) \to 0$. The right side therefore converges to $f'(g(a)) \cdot g'(a)$. $\blacksquare$

Iterated compositions

By induction, for a chain $h = f_n \circ f_{n-1} \circ \cdots \circ f_1$:

$$h'(x) = f_n'(f_{n-1}(\cdots f_1(x)\cdots)) \cdot f_{n-1}'(\cdots) \cdots f_1'(x).$$

This is the formula backpropagation computes - one factor per layer.
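
As a sanity check, here is a minimal Python sketch of this product of per-layer factors. The composition $h(x) = \sin(e^{x^2})$ and the evaluation point are arbitrary choices for illustration; a central difference serves as an independent check.

    import math

    # h(x) = f3(f2(f1(x))) with f1(x) = x**2, f2(u) = exp(u), f3(v) = sin(v)
    # (an arbitrary composition chosen only to illustrate the formula)
    def h(x):
        return math.sin(math.exp(x ** 2))

    def h_prime(x):
        u = x ** 2
        v = math.exp(u)
        # one factor per layer, outermost derivative first
        return math.cos(v) * math.exp(u) * (2 * x)

    x0, eps = 0.3, 1e-6
    numeric = (h(x0 + eps) - h(x0 - eps)) / (2 * eps)  # central-difference check
    print(h_prime(x0), numeric)                        # the two values agree closely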

Chain Rule: Multivariable Setting

Theorem. Let $\mathbf{g}: \mathbb{R}^m \to \mathbb{R}^n$ be differentiable at $\mathbf{a}$, and $\mathbf{f}: \mathbb{R}^n \to \mathbb{R}^k$ differentiable at $\mathbf{g}(\mathbf{a})$. Then $\mathbf{f} \circ \mathbf{g}$ is differentiable at $\mathbf{a}$ and its Jacobian satisfies

$$J_{\mathbf{f} \circ \mathbf{g}}(\mathbf{a}) = J_{\mathbf{f}}(\mathbf{g}(\mathbf{a})) \cdot J_{\mathbf{g}}(\mathbf{a}),$$

where the right side is matrix multiplication: a $k \times m$ matrix equals a $k \times n$ matrix times an $n \times m$ matrix.

The Jacobian $J_{\mathbf{f}}$ has $(i,j)$-entry $\partial f_i / \partial x_j$.
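
To make the matrix statement concrete, here is a short NumPy sketch (NumPy assumed available; the maps $\mathbf{g}: \mathbb{R}^2 \to \mathbb{R}^3$ and $\mathbf{f}: \mathbb{R}^3 \to \mathbb{R}^2$ are invented purely for illustration). It forms the chain-rule product of analytic Jacobians and compares it with a finite-difference Jacobian of the composite:

    import numpy as np

    def g(p):                          # g: R^2 -> R^3 (illustrative choice)
        a, b = p
        return np.array([a * b, a + b, np.sin(a)])

    def Jg(p):                         # 3 x 2 analytic Jacobian of g
        a, b = p
        return np.array([[b, a],
                         [1.0, 1.0],
                         [np.cos(a), 0.0]])

    def f(q):                          # f: R^3 -> R^2 (illustrative choice)
        u, v, w = q
        return np.array([u + v * w, u * v])

    def Jf(q):                         # 2 x 3 analytic Jacobian of f
        u, v, w = q
        return np.array([[1.0, w, v],
                         [v, u, 0.0]])

    p = np.array([0.4, -1.2])
    chain = Jf(g(p)) @ Jg(p)           # 2 x 2 chain-rule product

    # Independent check: finite-difference Jacobian of the composite f(g(.))
    h, numeric = 1e-6, np.zeros((2, 2))
    for j in range(2):
        e = np.zeros(2)
        e[j] = h
        numeric[:, j] = (f(g(p + e)) - f(g(p - e))) / (2 * h)

    print(np.max(np.abs(chain - numeric)))  # tiny (finite-difference error only)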

Scalar function of scalar paths

A frequently seen special case: $z = f(x, y)$ with $x = x(t)$, $y = y(t)$. Here $g: \mathbb{R} \to \mathbb{R}^2$ and $f: \mathbb{R}^2 \to \mathbb{R}$, so $J_f = [\partial f/\partial x,\ \partial f/\partial y]$ and $J_g = [dx/dt,\ dy/dt]^\top$. Their product gives the scalar

$$\frac{dz}{dt} = \frac{\partial f}{\partial x}\frac{dx}{dt} + \frac{\partial f}{\partial y}\frac{dy}{dt}.$$

More variables: if $z = f(x_1, \ldots, x_n)$ with each $x_i = x_i(t_1, \ldots, t_m)$,

$$\frac{\partial z}{\partial t_j} = \sum_{i=1}^{n} \frac{\partial f}{\partial x_i} \frac{\partial x_i}{\partial t_j}.$$

This sum-over-paths structure is exactly what a computational graph encodes.
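
A quick numerical check of the two-variable case - the function $f(x, y) = x^2 y$ and the path $x = \cos t$, $y = \sin t$ are arbitrary choices for illustration:

    import math

    # z = f(x, y) = x**2 * y along the path x = cos(t), y = sin(t) (illustrative)
    def dz_dt_chain(t):
        x, y = math.cos(t), math.sin(t)
        df_dx, df_dy = 2 * x * y, x ** 2           # partials of f
        dx_dt, dy_dt = -math.sin(t), math.cos(t)   # derivatives along the path
        return df_dx * dx_dt + df_dy * dy_dt       # the chain-rule sum

    def dz_dt_numeric(t, h=1e-6):
        z = lambda s: math.cos(s) ** 2 * math.sin(s)
        return (z(t + h) - z(t - h)) / (2 * h)     # central-difference check

    print(dz_dt_chain(0.7), dz_dt_numeric(0.7))    # the two values agree closely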

Implicit Differentiation

Suppose $F(x, y) = 0$ defines $y$ implicitly as a function of $x$ near a point where $F_y \neq 0$. Differentiating $F(x, y(x)) = 0$ via the chain rule:

$$\frac{\partial F}{\partial x} + \frac{\partial F}{\partial y}\frac{dy}{dx} = 0 \implies \frac{dy}{dx} = -\frac{F_x}{F_y}.$$

No need to solve for $y$ explicitly. This generalises to the Implicit Function Theorem in higher dimensions.
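
A small SymPy sketch of the same computation (SymPy assumed available; the unit circle is just an example):

    import sympy as sp

    # F(x, y) = x^2 + y^2 - 1 = 0 (the unit circle, chosen as an example)
    x, y = sp.symbols('x y')
    F = x**2 + y**2 - 1

    dydx = -sp.diff(F, x) / sp.diff(F, y)   # -F_x / F_y
    print(sp.simplify(dydx))                # -x/y, with no need to solve for y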

Backpropagation as Reverse-Mode Automatic Differentiation

Computational graphs

A computational graph represents $f$ as a directed acyclic graph (DAG): each node computes one elementary operation; edges carry intermediate values.

     x ──┐
         ├──[×]── u ──[exp]── v ──┐
     y ──┘                          ├──[+]── L
                              w ───┘

For $L = e^{xy} + w$:

  • Node 1: $u = x \cdot y$
  • Node 2: $v = e^u$
  • Node 3: $L = v + w$

Forward pass

Evaluate each node in topological order: $u \leftarrow x \cdot y$, $v \leftarrow e^u$, $L \leftarrow v + w$. Store intermediate values - they are needed in the backward pass.
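
In code, the forward pass is nothing more than evaluating the nodes in topological order and keeping the intermediates around (a minimal sketch; the input values are arbitrary):

    import math

    x, y, w = 1.5, -2.0, 0.3   # arbitrary sample inputs
    u = x * y                  # node 1
    v = math.exp(u)            # node 2 (stored: the backward pass reuses dv/du = e^u = v)
    L = v + w                  # node 3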

Backward pass

Define the adjoint (upstream gradient) of node $n$ as $\bar{n} = \partial L / \partial n$.

Seed: $\bar{L} = 1$.

Propagate in reverse topological order. At each node, multiply the incoming adjoint by the local Jacobian (a scalar here, since every node in this graph is scalar-valued):

$$\bar{v} = \bar{L} \cdot \frac{\partial L}{\partial v} = 1 \cdot 1 = 1$$

$$\bar{u} = \bar{v} \cdot \frac{\partial v}{\partial u} = 1 \cdot e^u = e^{xy}$$

$$\bar{x} = \bar{u} \cdot \frac{\partial u}{\partial x} = e^{xy} \cdot y, \qquad \bar{y} = \bar{u} \cdot \frac{\partial u}{\partial y} = e^{xy} \cdot x$$

$$\bar{w} = \bar{L} \cdot 1 = 1$$

The final adjoints $\bar{x}, \bar{y}, \bar{w}$ are exactly $\partial L/\partial x$, $\partial L/\partial y$, $\partial L/\partial w$. The chain rule guarantees correctness at every step.
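
Here is the full forward-plus-backward computation as a plain Python sketch (same arbitrary inputs as above), with the closed-form gradients as a check:

    import math

    # Forward pass: store every intermediate value
    x, y, w = 1.5, -2.0, 0.3   # arbitrary sample inputs
    u = x * y
    v = math.exp(u)
    L = v + w

    # Backward pass: seed the output adjoint, walk the graph in reverse order
    L_bar = 1.0
    v_bar = L_bar * 1.0        # dL/dv = 1
    w_bar = L_bar * 1.0        # dL/dw = 1
    u_bar = v_bar * v          # dv/du = e^u, already computed as v
    x_bar = u_bar * y          # du/dx = y
    y_bar = u_bar * x          # du/dy = x

    # Check against the closed-form gradients dL/dx = y e^{xy}, dL/dy = x e^{xy}
    assert abs(x_bar - y * math.exp(x * y)) < 1e-9
    assert abs(y_bar - x * math.exp(x * y)) < 1e-9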

Why reverse mode is efficient for scalar outputs

Suppose the network has $n$ scalar inputs and a single scalar output $L$. We want $\nabla L \in \mathbb{R}^n$, i.e., $n$ partial derivatives.

  • Forward (tangent) mode: propagate a tangent vector $\dot{x} = e_i$ (one standard basis vector) through the graph to get $\partial L / \partial x_i$. Repeat $n$ times. Cost: $O(n)$ passes.
  • Reverse mode: one backward pass computes all $n$ partial derivatives simultaneously. Cost: $O(1)$ passes (constant times the forward cost).

For neural networks where $n$ can be $10^8$ and there is one loss scalar, reverse mode is the only tractable choice.

If instead there are $k$ scalar outputs and you want the full $k \times n$ Jacobian, forward mode needs $n$ passes while reverse mode needs $k$ passes. When $k \gg n$ (e.g., computing sensitivities of many outputs to few inputs), forward mode wins.
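
In JAX terms (assuming JAX is available; the two functions below are invented examples), jacrev builds a Jacobian from one vector-Jacobian product per output, while jacfwd uses one Jacobian-vector product per input, so the shape of the problem dictates which is cheap:

    import jax
    import jax.numpy as jnp

    def wide(x):                            # R^1000 -> R: reverse mode needs 1 pass
        return jnp.sum(jnp.tanh(x))

    def tall(t):                            # R -> R^1000: forward mode needs 1 pass
        return jnp.tanh(t * jnp.arange(1000.0))

    grad_wide = jax.jacrev(wide)(jnp.ones(1000))   # shape (1000,), one backward pass
    jac_tall = jax.jacfwd(tall)(0.5)               # shape (1000,), one forward pass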

Connection to PyTorch and JAX

In PyTorch, every tensor operation records itself on a tape during the forward pass. Calling .backward() on a scalar loss traverses this tape in reverse, accumulating .grad values exactly as described above. The tape is a runtime representation of the computational graph.
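
A minimal PyTorch version of the running example (assuming PyTorch is installed; the input values are arbitrary):

    import torch

    x = torch.tensor(1.5, requires_grad=True)   # arbitrary sample inputs
    y = torch.tensor(-2.0, requires_grad=True)
    w = torch.tensor(0.3, requires_grad=True)

    L = torch.exp(x * y) + w   # forward pass: the operations are recorded on the tape
    L.backward()               # reverse pass: adjoints accumulate into .grad

    print(x.grad)              # y * exp(x*y)
    print(y.grad)              # x * exp(x*y)
    print(w.grad)              # 1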

JAX’s grad and vjp (vector-Jacobian product) are the mathematical primitives of reverse mode, made composable and JIT-compilable. jvp (Jacobian-vector product) is forward mode.
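
The same example written with JAX primitives (again assuming JAX is available):

    import jax
    import jax.numpy as jnp

    def loss(x, y, w):
        return jnp.exp(x * y) + w

    # Reverse mode: all three partials from one backward pass
    grads = jax.grad(loss, argnums=(0, 1, 2))(1.5, -2.0, 0.3)   # arbitrary inputs

    # vjp is the reverse-mode primitive: pull the seed 1.0 back through the graph
    L, pullback = jax.vjp(loss, 1.5, -2.0, 0.3)
    grads_vjp = pullback(1.0)

    # jvp is the forward-mode primitive: push the tangent e_x forward, get dL/dx only
    L, dL_dx = jax.jvp(loss, (1.5, -2.0, 0.3), (1.0, 0.0, 0.0))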

Neither framework does symbolic differentiation. They do automatic differentiation: exact derivatives (not numerical approximations) computed by applying the chain rule to elementary operations with known derivatives.

Summary of Key Results

  • Single variable: $(f \circ g)'(a) = f'(g(a))\,g'(a)$
  • Multivariable: $J_{f\circ g}(\mathbf{a}) = J_f(\mathbf{g}(\mathbf{a}))\cdot J_{\mathbf{g}}(\mathbf{a})$
  • Scalar along path: $dz/dt = \sum_i (\partial f/\partial x_i)(dx_i/dt)$
  • Implicit: $dy/dx = -F_x/F_y$
  • Backprop (reverse): $\bar{x} = \bar{y} \cdot \partial y / \partial x$ at each edge

The chain rule is not an approximation or a heuristic. It is an exact identity about how rates of change compose, and backpropagation is its algorithmic incarnation.

