Helpful context:


You have trained neural networks using backpropagation, or you have heard that backpropagation is how neural networks learn. But what is it, exactly? Most descriptions fall into one of two failure modes. The high-level version says something like “errors propagate backward through the network” - true but not mathematical. The code-focused version gives you pseudocode with arrays and loops - useful for implementation, but it conceals what is actually happening.

Neither answers the underlying mathematical question: what quantity does backpropagation compute, and why does the algorithm work?

Here is the answer in one sentence: backpropagation computes the partial derivative of the loss with respect to every weight in the network, using the chain rule of calculus, evaluated in reverse order through the computation.

That sentence contains everything. This post unpacks it, step by step, until nothing is hidden.


Section 1: What We Are Trying to Compute

Suppose you have a neural network with weights $w_1, w_2, \ldots, w_N$ - maybe millions of them. You have a loss function $L$ that measures how wrong the network’s predictions are. Training means finding weights that make $L$ small.

The method is gradient descent: repeatedly move the weights in the direction that decreases the loss. For weight $w_k$, the update is:

$$w_k \leftarrow w_k - \eta \frac{\partial L}{\partial w_k}$$

where $\eta > 0$ is the learning rate. To apply this update to every weight, you need $\partial L / \partial w_k$ for every $k$. That is the computation. Every other question about backpropagation is a question about how to compute these partial derivatives efficiently.

The network computes its output as a composition of functions. Each layer applies a function to the previous layer’s output. So the loss $L$ is a function of the weights, but it is not a direct function - it is a function of a function of a function, stacked as many times as there are layers. Computing $\partial L / \partial w_k$ requires differentiating through all those layers. That requires the chain rule.


Section 2: The Single-Variable Chain Rule

Before anything multidimensional, review the single-variable case. Suppose you have a composition of three functions:

$$L = f(a), \quad a = g(z), \quad z = h(w)$$

where $w$ is the weight. The loss $L$ depends on $w$ through two intermediate quantities. The chain rule says:

$$\frac{dL}{dw} = \frac{dL}{da} \cdot \frac{da}{dz} \cdot \frac{dz}{dw}$$

Three factors. Each one is the derivative of a single step of the computation. You compute them separately, then multiply.

Work through a specific example so this is not abstract. Suppose $h(w) = w^2$, $g(z) = \sin(z)$, and $f(a) = a^3$. Then $L = \sin(w^2)^3$. We want $dL/dw$.

Breaking it into steps:

  • $z = w^2$, so $dz/dw = 2w$.
  • $a = \sin(z)$, so $da/dz = \cos(z) = \cos(w^2)$.
  • $L = a^3$, so $dL/da = 3a^2 = 3\sin^2(w^2)$.

The chain rule gives:

$$\frac{dL}{dw} = 3\sin^2(w^2) \cdot \cos(w^2) \cdot 2w$$

Notice the order: you evaluate each factor at the correct intermediate value, not at $w$ directly. The factor $dL/da$ is evaluated at $a = \sin(w^2)$, not at $a = w$. This bookkeeping - evaluating local derivatives at the values computed in the forward pass - is exactly what the backward pass of backpropagation does.

This is the entire logic of backpropagation. In a real network, the scalars become vectors and matrices, and the derivatives become Jacobian matrices. But the structure - one factor per layer, multiplied together, evaluated at intermediate values from the forward pass - is identical.

Discomfort check. The Leibniz notation $\frac{dL}{da} \cdot \frac{da}{dz} \cdot \frac{dz}{dw}$ looks like the $da$’s and $dz$’s cancel, as in ordinary fractions. They do not cancel literally - these are derivatives, not fractions - but the chain rule is designed so that the mnemonic works. Treat it as a guide to what factors appear, not as arithmetic.


Section 3: From Scalars to Vectors - the Jacobian

A real neural network layer does not map a number to a number. It maps a vector to a vector. A layer $f: \mathbb{R}^n \to \mathbb{R}^m$ takes an $n$-dimensional input and produces an $m$-dimensional output.

For such a function, the derivative is not a single number. It is a matrix: the Jacobian.

Definition. For a differentiable function $f: \mathbb{R}^n \to \mathbb{R}^m$, the Jacobian at a point $x$ is the $m \times n$ matrix:

$$J_f(x) = \begin{pmatrix} \frac{\partial f_1}{\partial x_1} & \frac{\partial f_1}{\partial x_2} & \cdots & \frac{\partial f_1}{\partial x_n} \\ \frac{\partial f_2}{\partial x_1} & \frac{\partial f_2}{\partial x_2} & \cdots & \frac{\partial f_2}{\partial x_n} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial f_m}{\partial x_1} & \frac{\partial f_m}{\partial x_2} & \cdots & \frac{\partial f_m}{\partial x_n} \end{pmatrix}$$

The entry $(J_f)_{ij} = \partial f_i / \partial x_j$ tells you how much the $i$-th output changes when the $j$-th input changes, with everything else held fixed.

Before getting lost in indices, internalize the three special cases you will encounter constantly:

  • $f: \mathbb{R} \to \mathbb{R}$: the Jacobian is a $1 \times 1$ matrix, which is just the ordinary derivative $f'$.
  • $f: \mathbb{R}^n \to \mathbb{R}$: the Jacobian is a $1 \times n$ matrix - a row vector. This is the gradient, written as a row.
  • $f: \mathbb{R}^n \to \mathbb{R}^m$: the Jacobian is an $m \times n$ matrix.

In every case, the Jacobian is the derivative. The shape changes depending on the dimensions of the input and output, but the concept is the same.

The chain rule in matrix form. If $g: \mathbb{R}^n \to \mathbb{R}^m$ and $f: \mathbb{R}^m \to \mathbb{R}^k$, then the Jacobian of the composition $f \circ g$ is:

$$J_{f \circ g}(x) = J_f(g(x)) \cdot J_g(x)$$

This is matrix multiplication of a $k \times m$ matrix by an $m \times n$ matrix, giving a $k \times n$ matrix. In scalar calculus, we multiply derivatives. In vector calculus, we multiply Jacobian matrices. The structure is the same.

Discomfort check. When you multiply two matrices, dimensions must be compatible: the number of columns of the left matrix must equal the number of rows of the right matrix. Here $J_f$ is $k \times m$ and $J_g$ is $m \times n$, so the inner dimension $m$ matches. The result is $k \times n$: it maps $n$-dimensional inputs to $k$-dimensional outputs, which is exactly what $f \circ g$ does. The matrix multiplication encodes the composition.


Section 4: A Concrete Two-Layer Network

Stop thinking abstractly. Build a specific network and compute every Jacobian explicitly.

Setup. Input $x \in \mathbb{R}^d$. One hidden layer:

$$z = W_1 x + b_1, \quad z \in \mathbb{R}^m$$

$$h = \sigma(z), \quad h \in \mathbb{R}^m$$

where $\sigma$ is applied elementwise (for example, $\sigma(t) = 1/(1 + e^{-t})$ or $\sigma(t) = \max(0, t)$). Output layer:

$$\hat{y} = W_2 h + b_2, \quad \hat{y} \in \mathbb{R}^k$$

Loss (squared error):

$$L = |\hat{y} - y|^2 = \sum_{i=1}^k (\hat{y}_i - y_i)^2$$

where $y \in \mathbb{R}^k$ is the true label. We want $\partial L / \partial W_1$ and $\partial L / \partial W_2$.

The computation is a chain:

$$x \xrightarrow{W_1, b_1} z \xrightarrow{\sigma} h \xrightarrow{W_2, b_2} \hat{y} \xrightarrow{|\cdot - y|^2} L$$

Four steps. The chain rule says: take the derivative at each step, multiply them in order.

Step 1: derivative of the loss.

$L = \sum_i (\hat{y}_i - y_i)^2$, so $\partial L / \partial \hat{y}_i = 2(\hat{y}_i - y_i)$.

As a row vector (because $L: \mathbb{R}^k \to \mathbb{R}$, so the Jacobian is $1 \times k$):

$$\frac{\partial L}{\partial \hat{y}} = 2(\hat{y} - y)^T$$

This is the upstream gradient from the loss. It has shape $1 \times k$.

Step 2: derivative of the output layer.

The output layer is $\hat{y} = W_2 h + b_2$. We want $\partial \hat{y} / \partial h$.

$\hat{y}i = \sum_j (W_2){ij} h_j + (b_2)_i$, so $\partial \hat{y}i / \partial h_j = (W_2){ij}$.

The Jacobian is:

$$\frac{\partial \hat{y}}{\partial h} = W_2$$

This is the matrix $W_2$ itself, which has shape $k \times m$.

Step 3: derivative of the nonlinearity.

The nonlinearity is $h = \sigma(z)$ applied elementwise: $h_i = \sigma(z_i)$. Output $i$ depends only on input $i$. So:

$$\frac{\partial h_i}{\partial z_j} = \begin{cases} \sigma'(z_i) & i = j \\ 0 & i \neq j \end{cases}$$

The Jacobian is a diagonal matrix:

$$\frac{\partial h}{\partial z} = \text{diag}(\sigma'(z_1), \sigma'(z_2), \ldots, \sigma'(z_m))$$

This has shape $m \times m$. Most of it is zero, which is why elementwise activations are computationally convenient.

Step 4: derivative of the first layer.

The first layer is $z = W_1 x + b_1$. We want $\partial z / \partial W_1$.

This is slightly more involved because $W_1$ is a matrix, not a vector. Rather than computing a full Jacobian of $z$ with respect to every entry of $W_1$ (which would be a 3-dimensional tensor), we directly ask: what is $\partial L / \partial W_1$?

By the chain rule assembled so far, we know $\partial L / \partial z$ (computed in Section 5 below). Given that, observe $z_i = \sum_j (W_1)_{ij} x_j + (b_1)_i$, so:

$$\frac{\partial L}{\partial (W_1)_{ij}} = \frac{\partial L}{\partial z_i} \cdot x_j$$

In matrix form: $\partial L / \partial W_1 = (\partial L / \partial z)^T \cdot x^T$, where $\partial L / \partial z$ is treated as a column vector.


Section 5: Assembling the Chain

The chain rule for this network is:

$$\frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial h} \cdot \frac{\partial h}{\partial z} \cdot \frac{\partial z}{\partial W_1}$$

Let us compute backward, starting from the loss.

Gradient at $\hat{y}$:

$$\delta_{\hat{y}} = \frac{\partial L}{\partial \hat{y}} = 2(\hat{y} - y)^T \quad \text{(shape } 1 \times k\text{)}$$

Gradient at $h$ (multiply by $\partial \hat{y} / \partial h = W_2$):

$$\delta_h = \delta_{\hat{y}} \cdot W_2 = 2(\hat{y} - y)^T W_2 \quad \text{(shape } 1 \times m\text{)}$$

This is a row vector of length $m$. It tells you how the loss changes if you perturb each hidden unit.

Gradient at $z$ (multiply by $\partial h / \partial z = \text{diag}(\sigma'(z))$):

$$\delta_z = \delta_h \cdot \text{diag}(\sigma'(z_1), \ldots, \sigma'(z_m)) \quad \text{(shape } 1 \times m\text{)}$$

Because the Jacobian is diagonal, this is just elementwise multiplication:

$$(\delta_z)_i = (\delta_h)_i \cdot \sigma'(z_i)$$

Gradient at $W_1$:

$$\frac{\partial L}{\partial W_1} = \delta_z^T x^T \quad \text{(shape } m \times d\text{)}$$

This is the outer product of $\delta_z^T$ (a column vector of length $m$) with $x^T$ (a row vector of length $d$), giving an $m \times d$ matrix. That matrix has the same shape as $W_1$, which is exactly what you need for the gradient descent update.

Gradient at $W_2$:

Similarly, $\partial L / \partial W_2 = \delta_{\hat{y}}^T h^T$, the outer product of $\delta_{\hat{y}}^T$ (shape $k \times 1$) with $h^T$ (shape $1 \times m$), giving a $k \times m$ matrix with the same shape as $W_2$.

Discomfort check. Where does the outer product $\delta^T x^T$ come from? Think about it dimension by dimension. The gradient $\partial L / \partial (W_1)_{ij}$ is the number $(\delta_z)_i \cdot x_j$ - the $i$-th component of the error signal times the $j$-th component of the input. Assembling all $i$ and $j$ into a matrix gives $\delta_z^T x^T$. A column times a row gives a matrix. This formula will appear in every linear layer - it is not specific to this example.


Section 6: The Upstream Gradient

You may have heard the phrase “upstream gradient” in the context of automatic differentiation. Now that we have a concrete example, we can say precisely what it means.

In the backward pass, at each node in the computation graph, you receive the gradient of the loss with respect to the output of that node. This is the upstream gradient - it has already been computed by the layers above (closer to the loss).

Your job at each node is:

  1. Compute the local Jacobian (the derivative of your output with respect to your input).
  2. Multiply the upstream gradient by the local Jacobian to get the gradient with respect to your input.
  3. Pass that result backward to the next node.

For the output layer in our example: the upstream gradient at $\hat{y}$ is $2(\hat{y} - y)^T$. The local Jacobian $\partial \hat{y} / \partial h$ is $W_2$. Multiply: $(2(\hat{y}-y)^T) \cdot W_2$. This becomes the upstream gradient for the nonlinearity.

The pattern $\text{upstream} \cdot J_{\text{local}}$ repeats at every node. Backpropagation is this pattern, applied backward through all nodes.

Discomfort check. The upstream gradient arrives as a row vector (because $L$ is a scalar, so $\partial L / \partial (\text{anything})$ is a row vector by convention). The local Jacobian is a matrix. Multiplying a row vector by a matrix gives another row vector. This row vector is then passed to the next layer. The shapes propagate consistently if you keep track.


Section 7: Gradient Flow - What Each Gate Does

The chain rule tells you how to compute gradients. There is a separate question worth sitting with: what are the gradients doing as they flow backward?

Andrej Karpathy, in his essay “Yes You Should Understand Backprop,” argues that treating backpropagation as an automatic black box is a leaky abstraction - it hides the consequences you need to reason about when debugging and designing networks. His framing: think of the computation graph not as a function to differentiate but as a circuit through which a signal flows forward (activations) and backward (gradients). Each operation is a gate. Each gate has a distinct personality - a characteristic way of routing and transforming the upstream gradient. Once you internalize the four basic gate behaviors, you can predict gradient pathologies without computing anything.

The add gate: gradient distributor.

Forward: $z = x + y$. Local gradients $\partial z / \partial x = 1$ and $\partial z / \partial y = 1$. Backward:

$$ rac{\partial L}{\partial x} = g, \quad rac{\partial L}{\partial y} = g$$

The add gate copies the upstream gradient $g$ to both inputs unchanged. Neither input value affects how much gradient the other receives. The add gate cannot weaken a gradient.

This is the complete explanation of why residual connections preserve gradient flow. A residual block computes $h = x + f(x)$. In the backward pass, the add gate copies the upstream gradient to both branches: one copy flows into $f(x)$, one flows directly into $x$ with no modification whatsoever. No matter how deep the network, the skip path carries the upstream gradient back to earlier layers at full strength. Skip connections do not “help gradients flow” in some vague sense - the add gate is incapable of reducing the gradient that passes through it.

The multiply gate: gradient swapper.

Forward: $z = xy$. Backward:

$$ rac{\partial L}{\partial x} = g \cdot y, \quad rac{\partial L}{\partial y} = g \cdot x$$

The multiply gate switches: to find the gradient flowing into $x$, scale the upstream gradient by $y$ (the other input), and vice versa. The two inputs swap roles.

In a linear layer $z = wx$, the gradient with respect to the weight $w$ is $g \cdot x$ - the upstream error scaled by the input. This means each weight’s gradient is proportional to the value of the input feeding into it. If input features have wildly different scales - say one feature is always near $0.001$ and another near $1000$ - the corresponding weight gradients differ by a factor of $10^6$. Both weights update in every gradient step, but at completely different effective learning rates. This is not a subtle effect: it is the multiply gate making the gradient literally proportional to the input value. Input normalization corrects for this by ensuring all inputs have comparable magnitudes, so all weight gradients are on the same scale.

The max gate: gradient router.

Forward: $z = \max(x, y)$. Backward: gradient goes entirely to the winner, zero to the loser.

$$ rac{\partial L}{\partial x} = g \cdot \mathbf{1}[x \geq y], \quad rac{\partial L}{\partial y} = g \cdot \mathbf{1}[y > x]$$

The max gate is a binary switch that routes the full gradient to whichever input won the competition, and sends exactly zero to the other. $ ext{ReLU}(x) = \max(0, x)$ is this gate applied to one input against a fixed threshold of zero. When $x > 0$, the gradient passes through at full strength. When $x \leq 0$, the gate shuts completely.

This is the precise explanation of dead ReLU neurons. If a neuron’s pre-activation is negative on every training example - due to a bad initialization or an unlucky large update early in training - the max gate always routes zero gradient backward. Every weight feeding that neuron receives zero gradient. They never update. The neuron is permanently silent: not because the network decided to suppress it, but because the gate mechanically blocked every signal that might have corrected it.

The sigmoid gate: gradient squasher.

Forward: $\sigma(x) = 1/(1 + e^{-x})$. Backward:

$$ rac{\partial L}{\partial x} = g \cdot \sigma(x)(1 - \sigma(x))$$

The local gradient $\sigma(x)(1-\sigma(x))$ has a maximum of $ rac{1}{4}$ at $x = 0$ and falls toward zero at both extremes. Every gradient that passes through a sigmoid is multiplied by at most $0.25$.

This is not a probabilistic tendency. It is a hard ceiling. In a ten-layer network with sigmoid activations, the gradient at the first layer is at most $(0.25)^{10} pprox 10^{-6}$ of the loss gradient - already below double-precision floating-point significance and no information whatsoever about the loss can propagate to the earliest layers. Vanishing gradients in deep sigmoid networks are not a statistical artifact of poor initialization. They are the inevitable consequence of passing through ten consecutive gates that each cut the signal by at least 75%.

ReLU is the solution precisely because it has no squashing behavior for positive inputs: its local gradient is exactly 1 when active, so gradients pass through it at full strength. The max gate either passes the gradient unchanged or blocks it entirely - it cannot attenuate it.

The practical upshot.

Every gradient pathology in deep learning has a gate-level diagnosis:

Symptom Gate Cause
Vanishing gradients (deep sigmoid) Sigmoid squasher Each gate multiplies by $\leq 0.25$
Dead neurons (ReLU) Max router Gate stuck routing zero
Uneven weight updates Multiply swapper Input feature scales differ
Gradient flows through residuals Add distributor Gate copies gradient unchanged

You do not need to compute eigenvalues to diagnose a dead neuron. You need to know what the max gate does. The chain rule is the mechanism; gate intuition is the diagnostic language. Both matter, and neither is sufficient alone.

Discomfort check. Karpathy’s essay “Yes You Should Understand Backprop” argues that even though frameworks compute gradients automatically, the developer who does not understand the gate-level behavior will encounter situations where the training fails silently - a dead neuron, a saturated sigmoid, a gradient that is off by the batch size because the loss averaging was forgotten. The bugs are invisible unless you can picture what the gradient circuit is doing. This is not a call to hand-compute gradients for every network. It is a call to read the circuit, not just trust the machinery.

Section 8: Why Backward and Not Forward?

There are two orders in which you can multiply a chain of matrices. Suppose the chain is:

$$\frac{\partial L}{\partial x} = \underbrace{\frac{\partial L}{\partial \hat{y}}}{1 \times k} \cdot \underbrace{\frac{\partial \hat{y}}{\partial h}}{k \times m} \cdot \underbrace{\frac{\partial h}{\partial z}}{m \times m} \cdot \underbrace{\frac{\partial z}{\partial x}}{m \times d}$$

Matrix multiplication is associative, so you can choose the order. There are two natural choices:

Forward accumulation (left to right): start with $\partial L / \partial \hat{y}$ and multiply each subsequent Jacobian on the right.

At each step, the accumulated result is a row vector. Its size changes as you multiply: $1 \times k$, then $1 \times m$, then $1 \times m$, then $1 \times d$. Each multiplication involves a row vector times a matrix - cheap.

But this computes $\partial L / \partial x$ (derivative with respect to the input). If you also want $\partial L / \partial W_1$, you need a separate pass. If you want the derivative with respect to every weight in every layer, you need one pass per group of weights. For a network with $N$ parameter groups, this is $N$ passes.

Backward accumulation (right to left): start from the rightmost factor and accumulate toward the left.

This is what backpropagation does. You compute:

$$\left(\left(\frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial h}\right) \cdot \frac{\partial h}{\partial z}\right) \cdot \frac{\partial z}{\partial x}$$

At each step the accumulated object is still a row vector (because $L$ is scalar). But crucially: at each step, the current accumulated gradient is exactly the gradient of $L$ with respect to the output of the current layer. That gradient can be used to compute $\partial L / \partial W_l$ for that layer (by the outer product formula), and then passed to the next layer.

So one backward pass gives you $\partial L / \partial W_l$ for every layer $l$ simultaneously. The cost is one backward pass, comparable in expense to one forward pass. To get all the same information with forward mode, you would need one pass per parameter group - potentially millions.

The formal argument. For a function $L: \mathbb{R}^N \to \mathbb{R}$ with $N$ inputs and 1 output, backward mode computes all $N$ partial derivatives in 1 pass. Forward mode computes all $N$ partial derivatives in $N$ passes (one per input direction). For neural networks, $N$ is enormous and outputs are scalar. Backward mode wins by a factor of $N$.


Section 9: The Computational Graph

Any computation can be represented as a directed acyclic graph (DAG), where each node is an operation and each edge carries a value. The forward pass evaluates nodes in topological order, and the backward pass does the same in reverse.

For our two-layer network, the nodes are: multiply by $W_1$, add $b_1$, apply $\sigma$, multiply by $W_2$, add $b_2$, compute $|\cdot - y|^2$. Each has:

  • A forward rule: how to compute the output given the input.
  • A backward rule: how to compute the local Jacobian and multiply it with the upstream gradient.

The backward rule for each operation is computed once and stored. During the backward pass, the stored intermediate values from the forward pass (specifically, the input values at each node) are used to evaluate the local Jacobian.

This is why the forward pass must store intermediate activations. You need $z$ to compute $\text{diag}(\sigma'(z))$. You need $h$ to compute $\partial L / \partial W_2 = \delta_{\hat{y}}^T h^T$. You need $x$ to compute $\partial L / \partial W_1$. The storage cost of backpropagation is proportional to the number of activations you must remember.

Nodes with multiple outputs. What if one node feeds into two downstream nodes? For example, suppose the hidden activations $h$ are used both in the output layer and in a regularization term added to the loss. Then the loss depends on $h$ through two paths.

In a graph with two paths from $h$ to $L$, the chain rule gives:

$$\frac{\partial L}{\partial h_i} = \frac{\partial L_{\text{output}}}{\partial h_i} + \frac{\partial L_{\text{reg}}}{\partial h_i}$$

Gradients from multiple downstream nodes are summed. This is the rule: when a node fans out to multiple children in the computation graph, the upstream gradient at that node is the sum of the upstream gradients flowing back from all its children.

This rule is not an additional assumption - it follows directly from the multivariate chain rule. If $L = L_1(h) + L_2(h)$, then $\partial L / \partial h_i = \partial L_1 / \partial h_i + \partial L_2 / \partial h_i$. Summation of gradients is addition of partial derivatives.

The complete algorithm, stated precisely. Given a computation graph with nodes $v_1, \ldots, v_n$ in topological order (so $v_n$ is the loss):

Forward pass: for $i = 1, \ldots, n$, compute $v_i$ from the values of its inputs. Store all intermediate values.

Backward pass: initialize $\bar{v}_n = 1$ (the “upstream gradient” at the loss node is just 1, since $\partial L / \partial L = 1$). For $i = n-1, n-2, \ldots, 1$:

$$\bar{v}i = \sum{j: j \text{ depends on } i} \bar{v}_j \cdot \frac{\partial v_j}{\partial v_i}$$

At the end, $\bar{v}_i = \partial L / \partial v_i$ for every node $i$. In particular, $\bar{w}_k = \partial L / \partial w_k$ for every weight node $w_k$.

This is backpropagation. It is a dynamic programming algorithm that computes all partial derivatives simultaneously by traversing the graph once backward.


Section 10: Deriving the Jacobian of a Linear Layer from Scratch

The backward rules for a linear layer are used in every network, so they are worth deriving carefully rather than just stating.

Forward: $y = Wx$ where $W \in \mathbb{R}^{m \times n}$ and $x \in \mathbb{R}^n$. Output $y \in \mathbb{R}^m$.

Gradient with respect to $x$. We want $\partial L / \partial x$ given upstream gradient $g = \partial L / \partial y$ (a row vector of length $m$).

Start from the definition. $y_i = \sum_j W_{ij} x_j$, so:

$$\frac{\partial y_i}{\partial x_j} = W_{ij}$$

The Jacobian $\partial y / \partial x$ is the $m \times n$ matrix whose $(i,j)$ entry is $W_{ij}$ - that is, the Jacobian is $W$ itself.

By the chain rule, the gradient with respect to $x$ is:

$$\frac{\partial L}{\partial x} = g \cdot W \quad \text{(row vector } 1 \times n \text{)}$$

Or equivalently, as a column vector: $W^T g^T$.

Intuition: $W^T$ “transposes” the action of $W$. Where $W$ maps input directions to output directions in the forward pass, $W^T$ maps output directions to input directions in the backward pass. This is not a coincidence - it is the adjoint relationship between a linear map and its transpose.

Gradient with respect to $W$. We want $\partial L / \partial W_{ij}$ for each entry of $W$.

$$\frac{\partial L}{\partial W_{ij}} = \sum_l \frac{\partial L}{\partial y_l} \cdot \frac{\partial y_l}{\partial W_{ij}}$$

Since $y_l = \sum_k W_{lk} x_k$, we have $\partial y_l / \partial W_{ij} = x_j$ if $l = i$ and $0$ otherwise. So:

$$\frac{\partial L}{\partial W_{ij}} = \frac{\partial L}{\partial y_i} \cdot x_j = g_i \cdot x_j$$

Assembling over all $(i, j)$: the matrix $\partial L / \partial W$ has entry $(i, j)$ equal to $g_i x_j$. This is the outer product of $g^T$ (column, length $m$) with $x^T$ (row, length $n$):

$$\frac{\partial L}{\partial W} = g^T x^T \quad \text{(shape } m \times n \text{, same as } W \text{)}$$

This derivation uses only the definition of a partial derivative and the chain rule. No tricks, no formulas memorized without understanding.


Section 11: Gradients of Common Operations

Every layer you will encounter in practice has a backward rule derivable from the chain rule. The linear layer was worked out from scratch in Section 9. Here are the other essential ones.

Every layer you will encounter in practice has a backward rule derivable from the chain rule. Here are the most important ones.

Elementwise nonlinearity: $y = \sigma(x)$, so $y_i = \sigma(x_i)$.

Local Jacobian is diagonal: $\partial y / \partial x = \text{diag}(\sigma'(x_1), \ldots, \sigma'(x_n))$.

Given upstream $g$ (row vector of length $n$):

$$\frac{\partial L}{\partial x} = g \odot \sigma'(x)$$

where $\odot$ is elementwise multiplication. The diagonal structure of the Jacobian means we never need to form the full matrix - just multiply elementwise.

Softmax combined with cross-entropy loss.

This is a case where combining two operations simplifies the gradient dramatically. The softmax function $\hat{p}_i = e^{z_i} / \sum_j e^{z_j}$ followed by cross-entropy loss $L = -\sum_i y_i \log \hat{p}_i$ (where $y$ is a one-hot vector) has the combined gradient:

$$\frac{\partial L}{\partial z} = \hat{p} - y$$

Predicted probability minus true probability. This clean formula is why softmax plus cross-entropy is the standard output layer for classification - not just for convenience but because the gradient is numerically stable and easy to interpret.

Discomfort check. Why does the gradient $\hat{p} - y$ not involve the derivative of the softmax? It does - it is just hidden. The softmax Jacobian has a specific form that, when combined with the cross-entropy gradient, telescopes to $\hat{p} - y$. If you work through the full chain rule for softmax and cross-entropy separately, and then multiply, you will recover this formula. The simplification is a consequence of the special structure of the exponential function and the logarithm being inverses.


Section 12: What Vanishing Gradients Actually Are

You have probably heard that deep networks suffer from vanishing gradients. Now you can see exactly why.

The gradient flowing back to layer $l$ is a product of local Jacobians from layers $l+1$ through $L$:

$$\frac{\partial L}{\partial h_l} = \frac{\partial L}{\partial \hat{y}} \cdot J_{f_L} \cdot J_{f_{L-1}} \cdots J_{f_{l+1}}$$

For a network with sigmoid activations: the Jacobian of the nonlinearity at layer $l$ is $\text{diag}(\sigma'(z_l))$, where $\sigma'(z) = \sigma(z)(1 - \sigma(z))$. The maximum value of $\sigma'$ is $1/4$ (achieved at $z = 0$). At every other point it is smaller.

If you multiply 50 Jacobians, each with largest singular value at most $1/4$, the product has singular values at most $(1/4)^{50} \approx 10^{-30}$. The gradient is numerically zero. The first layers receive no gradient signal and stop learning.

ReLU solves this for the nonlinearity: $\text{ReLU}'(z) = 1$ for $z > 0$ and $0$ for $z < 0$. Gradients pass through unchanged (or not at all, but not diminished). This is the primary reason ReLU replaced sigmoid in deep networks - not a vague notion of “better gradients” but a precise statement about the singular values of the Jacobian chain.

Residual connections (as in ResNets) provide another solution. A residual layer computes $h = x + f(x)$ instead of $h = f(x)$. The Jacobian is $I + J_f$ instead of $J_f$. The identity term guarantees the Jacobian always has eigenvalues of at least 1, so gradients cannot vanish along the skip path.

The problem of exploding gradients is the opposite. If the weight matrices are initialized large, the Jacobian of each linear layer has singular values greater than 1. Multiplying 50 such Jacobians amplifies the gradient exponentially. The gradient at the first layer is enormous, causing wildly large updates that destabilize training.

Weight initialization. The standard solutions - Xavier initialization and He initialization - choose the scale of the initial weight entries based on the dimensions of each layer, so that the Jacobian of each layer has singular values close to 1 at initialization. Specifically:

  • Xavier (for $\tanh$ activations): initialize $W_{ij} \sim \text{Uniform}\left(-\sqrt{6/(n_{\text{in}} + n_{\text{out}})}, +\sqrt{6/(n_{\text{in}} + n_{\text{out}})}\right)$
  • He (for ReLU): initialize $W_{ij} \sim \mathcal{N}(0, 2/n_{\text{in}})$

Both are derived by requiring the variance of the output of each layer to equal the variance of its input, so the activation scale is preserved as the signal passes forward and the gradient scale is preserved as the gradient passes backward. Both are consequences of the Jacobian analysis above.


Section 13: Backpropagation Is Not Optional

A question sometimes asked: why can we not just compute $\partial L / \partial w_k$ by finite differences - perturb $w_k$ slightly, measure how $L$ changes, divide?

The numerical derivative $\partial L / \partial w_k \approx (L(w_k + \epsilon) - L(w_k)) / \epsilon$ is correct in principle. But it requires one full forward pass per weight. A network with one million weights requires one million forward passes. At typical network sizes, this is computationally impossible.

Backpropagation computes all $N$ partial derivatives in the cost of roughly two forward passes (one for the forward pass, one for the backward pass). This is not an approximation - it is exact (up to floating-point precision). The efficiency gain is a factor of $N/2$, which for large models is tens of millions.

The efficiency follows entirely from the structure of the chain rule and the choice to evaluate it in reverse order, reusing intermediate results.

Numerical gradient checking. Even though finite differences cannot be used to train large networks, they are useful for verifying that your backpropagation implementation is correct. For a small test network, compute gradients with both backpropagation and finite differences, then compare:

$$\text{relative error} = \frac{|\nabla_{\text{backprop}} L - \nabla_{\text{finite diff}} L|}{|\nabla_{\text{backprop}} L| + |\nabla_{\text{finite diff}} L|}$$

A relative error below $10^{-5}$ usually indicates a correct implementation. Larger errors indicate a bug in the backward pass. Gradient checking is the standard debugging tool for backpropagation implementations.

The centered finite difference formula $[L(w_k + \epsilon) - L(w_k - \epsilon)] / (2\epsilon)$ is more accurate than the one-sided version, giving error $O(\epsilon^2)$ instead of $O(\epsilon)$. Use $\epsilon \approx 10^{-4}$ for checking with double precision.

Memory and the backward pass. One practical constraint: the backward pass must access the activations computed during the forward pass (you need $h$ to compute $\partial L / \partial W_2$, you need $z$ to evaluate $\sigma'(z)$, etc.). For deep networks, storing all intermediate activations requires memory proportional to the number of layers times the batch size times the layer width.

For very deep networks or large batches, this can exceed GPU memory. The solution is gradient checkpointing: recompute activations on demand during the backward pass instead of storing them all. You store only activations at checkpointed layers (every $\sqrt{L}$ layers, say), and recompute intermediate activations when needed. This reduces memory from $O(L)$ to $O(\sqrt{L})$ at the cost of one additional forward pass per segment - a factor-of-2 increase in computation in exchange for a square-root reduction in memory.


Section 14: Beyond Feedforward Networks

The same principles apply to any computation that can be expressed as a graph of differentiable operations. Here are two important cases.

Recurrent neural networks. A recurrent network processes a sequence $x_1, x_2, \ldots, x_T$ by maintaining a hidden state $h_t$ that is updated at each step:

$$h_t = \sigma(W_h h_{t-1} + W_x x_t + b)$$

The loss $L$ depends on the hidden states $h_1, \ldots, h_T$, which in turn depend on the same weight matrices $W_h$ and $W_x$ at every timestep. To compute $\partial L / \partial W_h$, unroll the recurrence and apply the chain rule through time - this is called Backpropagation Through Time (BPTT). The gradient with respect to $W_h$ accumulates contributions from every timestep:

$$\frac{\partial L}{\partial W_h} = \sum_{t=1}^T \frac{\partial L_t}{\partial h_t} \cdot \frac{\partial h_t}{\partial W_h}$$

The vanishing gradient problem is especially severe for RNNs: the gradient from $L_T$ back to $h_1$ passes through $T$ applications of the weight matrix Jacobian. For long sequences, this product typically shrinks to zero. LSTMs and GRUs solve this by adding gating mechanisms that create identity-like paths through time, analogous to residual connections in space.

Attention mechanisms. The attention operation in a transformer is:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) V$$

where $Q, K, V$ are matrices. This is a composition of matrix multiplication, scaling, softmax (applied rowwise), and another matrix multiplication. Each operation has a backward rule - the chain rule applies exactly as before. The backward rule for attention is derivable from the rules in Section 10, though the resulting formulas involve matrix products rather than simple elementwise operations.

The practical point: no special mathematical treatment is required for attention. It is a differentiable computation, and its gradients are computed by backpropagation through its computational graph, just as for any other operation.


Section 15: The Complete Picture

Let us return to the one-sentence summary: backpropagation computes the partial derivative of the loss with respect to every weight in the network, using the chain rule of calculus, evaluated in reverse order through the computation.

Each word now has a precise meaning:

  • “partial derivative” - the gradient $\partial L / \partial w_k$ for each weight $w_k$
  • “chain rule” - Jacobians multiplied in order: $J_{f \circ g} = J_f \cdot J_g$
  • “evaluated in reverse order” - starting from the loss and moving backward through layers, so that the upstream gradient is available at each step
  • “reusing intermediate results” - the forward pass stores activations; the backward pass uses them to compute local Jacobians without repeating the forward computation

The word “backpropagation” names the algorithm. The mathematics is: reverse-mode automatic differentiation applied to a composition of functions representing a neural network.

Any other neural network computation you encounter - attention, convolution, normalization layers, residual connections - has a backward rule derivable by the same principles. The local Jacobian changes. The pattern does not.


Section 16: Batch Training and the Gradient Sum

In practice, you do not compute the gradient on a single example. You compute it over a minibatch of $B$ examples $\{(x_i, y_i)\}_{i=1}^B$ and average:

$$L_{\text{batch}} = \frac{1}{B} \sum_{i=1}^B L_i$$

where $L_i$ is the loss on the $i$-th example.

By linearity of the derivative, $\partial L_{\text{batch}} / \partial w_k = \frac{1}{B} \sum_i \partial L_i / \partial w_k$. The batch gradient is the average of the per-example gradients.

Computationally, this does not require $B$ separate backward passes. For a linear layer $y = Wx$, the per-example gradient is $\delta_i^T x_i^T$ (an outer product). Over a batch, the gradient is:

$$\frac{\partial L_{\text{batch}}}{\partial W} = \frac{1}{B} \sum_{i=1}^B \delta_i^T x_i^T = \frac{1}{B} \Delta^T X$$

where $\Delta \in \mathbb{R}^{B \times m}$ stacks the $\delta_i$ as rows and $X \in \mathbb{R}^{B \times d}$ stacks the $x_i$ as rows. This is a single matrix multiplication: the entire batch is processed simultaneously.

This is why GPUs are effective for deep learning. A GPU performs many matrix multiplications in parallel. By organizing the batch as a matrix, the forward pass (and backward pass) become sequences of matrix multiplications over the entire batch, which GPUs can execute at maximum hardware utilization.


Section 17: Gradient Formulas Reference

For any layer with upstream gradient $g = \partial L / \partial y$ (treat as a row vector):

Operation Forward Backward: $\partial L / \partial x$ Backward: $\partial L / \partial W$
Linear $y = Wx$ $y = Wx$ $g W$ (or $W^T g^T$ as column) $g^T x^T$ (outer product)
Affine $y = Wx + b$ $y = Wx + b$ $g W$ $g^T x^T$ for $W$; $g^T$ for $b$
Elementwise $y = \sigma(x)$ $y_i = \sigma(x_i)$ $g \odot \sigma'(x)$ n/a
Softmax + cross-entropy $L = -\sum y_i \log \hat{p}_i$ $\hat{p} - y$ n/a
Squared loss $L = |y - t|^2$ $L = |y - t|^2$ $2(y - t)^T$ n/a
Batch norm $y = (x - \mu)/\sigma$ normalize by batch stats depends on $x$, $\mu$, $\sigma$ jointly n/a

The batch norm gradient is more complex because $\mu$ and $\sigma$ depend on the entire batch. Its derivation follows the same chain rule principles, but requires differentiating through the batch statistics as well as the normalization itself. It is worked out in any autodiff framework’s documentation.

These are not memorization tasks. Each follows from the Jacobian computation described in Section 4 and the chain rule in Section 3. If you forget a formula, rederive it: write out the Jacobian, multiply by the upstream gradient, simplify.

One final observation. Every formula in this table was derived without knowing anything specific about neural networks - only using the chain rule and the shape conventions established in Section 3. When a new operation is added to a neural network library (a new attention variant, a new normalization scheme, a new output head), adding backpropagation support for it requires exactly the same analysis: compute the local Jacobian, multiply by the upstream gradient, determine the gradient with respect to any learned parameters by the outer product formula. The framework is universal.


Read next: