Helpful context:


A neural network layer takes a vector $x \in \mathbb{R}^n$ and produces a vector $y \in \mathbb{R}^m$. To train the network, you need to compute the derivative of the loss with respect to every parameter - and the loss is a scalar, but the layer’s output is a vector. The derivative of a scalar-valued function of a vector is itself a vector (the gradient). But the derivative of a vector-valued function of a vector is a matrix.

What is that matrix? Why is it a matrix? And what does it have to do with the chain rule and backpropagation?

These are the questions this post answers. The answers unify everything: gradients, Jacobians, backpropagation, and the chain rule are all the same idea - the derivative as a linear map - expressed in different dimensions.


Section 1: The Derivative Is Always a Linear Map

Start with the most familiar case.

For $f: \mathbb{R} \to \mathbb{R}$, the derivative $f'(a)$ is a number. At the point $a$, the best linear approximation to $f$ is:

$$f(a + h) \approx f(a) + f'(a) \cdot h$$

The number $f'(a)$ is a linear map from $\mathbb{R}$ to $\mathbb{R}$: it multiplies $h$ by a scalar. In matrix language, it is a $1 \times 1$ matrix.

For $f: \mathbb{R}^n \to \mathbb{R}$, the gradient $\nabla f(x)$ is a vector. At the point $x$, the best linear approximation is:

$$f(x + \delta) \approx f(x) + \nabla f(x) \cdot \delta$$

The gradient $\nabla f(x)$ (as a row vector) is a linear map from $\mathbb{R}^n$ to $\mathbb{R}$: it sends a displacement vector $\delta$ to a scalar via the dot product. In matrix language, it is a $1 \times n$ matrix.

For $f: \mathbb{R}^n \to \mathbb{R}^m$, at the point $x$, the best linear approximation is:

$$f(x + \delta) \approx f(x) + J \delta$$

where $J$ is some $m \times n$ matrix - called the Jacobian. This matrix $J$ is a linear map from $\mathbb{R}^n$ to $\mathbb{R}^m$: it sends a displacement vector $\delta \in \mathbb{R}^n$ to an output displacement in $\mathbb{R}^m$.

The unifying principle: the derivative is always the best linear approximation. It is always a linear map. In coordinates, it is always a matrix. What changes as we go from 1D to $n$D to vector-valued functions is just the shape of that matrix.

Function Derivative Matrix shape
$f: \mathbb{R} \to \mathbb{R}$ $f'(x)$ - a number $1 \times 1$
$f: \mathbb{R}^n \to \mathbb{R}$ $\nabla f(x)$ - a vector $1 \times n$ (row vector)
$f: \mathbb{R}^n \to \mathbb{R}^m$ $J_f(x)$ - the Jacobian $m \times n$

These are not three different things. They are one thing - the derivative as a linear map - in three different dimensional settings.


Section 2: The Total Derivative - Formal Definition

Definition. A function $f: \mathbb{R}^n \to \mathbb{R}^m$ is differentiable at $x$ if there exists a linear map $Df(x): \mathbb{R}^n \to \mathbb{R}^m$ such that:

$$\lim_{|\delta| \to 0} \frac{|f(x + \delta) - f(x) - Df(x)\delta|}{|\delta|} = 0$$

When such a linear map exists, it is unique and called the total derivative (or Fréchet derivative) of $f$ at $x$.

In words: $Df(x)$ is the unique linear map such that $f(x + \delta) \approx f(x) + Df(x)\delta$ with an error that is smaller than $|\delta|$ as $\delta \to 0$.

This is exactly the same definition as for $f: \mathbb{R} \to \mathbb{R}$, where $Df(x) = f'(x)$ is the number that satisfies $f(x + h) \approx f(x) + f'(x) h$ with error $o(h)$. The only change is that everything is now vector-valued and the linear map is a matrix.

Discomfort check. The derivative being a linear map - not just a number - might feel strange. But you have been using this idea all along. When you write $f(x + h) \approx f(x) + f'(x) h$, the right-hand side is the linear function $h \mapsto f'(x) h$ added to $f(x)$. The “linear function $h \mapsto f'(x) h$” is a linear map from $\mathbb{R}$ to $\mathbb{R}$, encoded by the number $f'(x)$. In higher dimensions, the same linear map exists, but it needs a matrix to encode it.


Section 3: The Jacobian - The Derivative in Coordinates

Given $f = (f_1, \ldots, f_m): \mathbb{R}^n \to \mathbb{R}^m$, the matrix of $Df(x)$ in the standard basis is the Jacobian:

$$J_f(x) = \begin{pmatrix} \frac{\partial f_1}{\partial x_1} & \frac{\partial f_1}{\partial x_2} & \cdots & \frac{\partial f_1}{\partial x_n} \\ \frac{\partial f_2}{\partial x_1} & \frac{\partial f_2}{\partial x_2} & \cdots & \frac{\partial f_2}{\partial x_n} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial f_m}{\partial x_1} & \frac{\partial f_m}{\partial x_2} & \cdots & \frac{\partial f_m}{\partial x_n} \end{pmatrix}$$

The $(i, j)$ entry is $J_{ij} = \frac{\partial f_i}{\partial x_j}$: how does the $i$-th output change when you change the $j$-th input?

Connecting back to Functions & Mappings. We proved there that every linear map $T: \mathbb{R}^n \to \mathbb{R}^m$ can be represented as an $m \times n$ matrix, where the $j$-th column is where $T$ sends the $j$-th basis vector. The Jacobian is exactly this matrix for the linear map $Df(x)$. The $j$-th column of $J_f$ tells you how a unit perturbation in the $x_j$-direction propagates through $f$ to affect all outputs.

Special cases:

  • $m = 1$: the Jacobian is a $1 \times n$ matrix (row vector). Transposed, this is the gradient $\nabla f$. So the Jacobian of a scalar-valued function is the gradient transposed: $J_f = \nabla f^T$.

  • $m = n$ and $f$ is a linear map given by matrix $A$: $f(x) = Ax$. The Jacobian of a linear map is the matrix itself: $J_{Ax} = A$. The derivative of a linear function is the linear function. This makes sense: the best linear approximation to a linear function at any point is the function itself.

Why the Jacobian IS the linear map from Functions & Mappings. The Jacobian at a point $x$ is the best linear approximation to $f$ near $x$. If $f$ itself were linear ($f(x) = Ax$), the Jacobian would be $A$ everywhere. For nonlinear $f$, the Jacobian is a different matrix at each point - the local linear approximation at that point. Every concept from Functions & Mappings (injectivity, surjectivity, rank, kernel, image) applies to the Jacobian, and it tells you about the local behavior of $f$:

  • The rank of $J_f(x)$ tells you how many independent directions $f$ maps to near $x$.
  • $\ker(J_f(x)) = \{0\}$ (injective Jacobian) means $f$ is locally injective - no two nearby points map to the same output.
  • $\text{Im}(J_f(x)) = \mathbb{R}^m$ (surjective Jacobian) means $f$ locally covers all of $\mathbb{R}^m$.
  • Full rank Jacobian ($\text{rank}(J_f) = \min(m, n)$) is the condition for the inverse function theorem.

Section 4: Computing Jacobians - Worked Examples

Example 1: A linear map. $f(x) = Ax$ where $A$ is an $m \times n$ matrix.

$J_f(x) = A$ everywhere. The Jacobian of a linear map is constant - the map itself.

Example 2: A quadratic map. $f: \mathbb{R}^2 \to \mathbb{R}^2$ defined by $f(x, y) = (x^2 + y, xy)$.

$$J_f = \begin{pmatrix} \frac{\partial(x^2 + y)}{\partial x} & \frac{\partial(x^2 + y)}{\partial y} \\ \frac{\partial(xy)}{\partial x} & \frac{\partial(xy)}{\partial y} \end{pmatrix} = \begin{pmatrix} 2x & 1 \\ y & x \end{pmatrix}$$

At $(1, 2)$: $J_f = \begin{pmatrix} 2 & 1 \\ 2 & 1 \end{pmatrix}$. This matrix has rank 1 ($\det = 2 - 2 = 0$): the map $f$ is locally non-injective at this point. Two different nearby inputs get sent to the same output direction.

At $(1, 0)$: $J_f = \begin{pmatrix} 2 & 1 \\ 0 & 1 \end{pmatrix}$. This has rank 2 ($\det = 2 \neq 0$): the map is locally invertible at this point.

Example 3: Softmax. The softmax function maps $\mathbb{R}^n \to \mathbb{R}^n$:

$$\text{softmax}(x)i = \frac{e^{x_i}}{\sum{j=1}^n e^{x_j}} = s_i$$

The Jacobian entries are:

$$\frac{\partial s_i}{\partial x_j} = \begin{cases} s_i(1 - s_i) & i = j \\ -s_i s_j & i \neq j \end{cases}$$

In matrix form: $J = \text{diag}(s) - s s^T$ (diagonal matrix of softmax outputs minus their outer product). This Jacobian is always singular: its columns sum to zero (because $\sum_i s_i = 1$ implies $\sum_i \frac{\partial s_i}{\partial x_j} = 0$). Softmax is not locally invertible - multiple inputs map to the same output direction (adding a constant to all $x_i$ does not change the softmax output).

Example 4: A neural network layer with ReLU. $f: \mathbb{R}^n \to \mathbb{R}^m$ defined by $f(x) = \text{ReLU}(Wx + b)$, where $W$ is $m \times n$.

The composition: $x \mapsto z = Wx + b \mapsto f = \text{ReLU}(z)$.

The Jacobian of $\text{ReLU}(z)$ with respect to $z$: ReLU acts component-wise, so its Jacobian is diagonal:

$$J_{\text{ReLU}}(z) = \text{diag}(\mathbf{1}[z_1 > 0], \mathbf{1}[z_2 > 0], \ldots, \mathbf{1}[z_m > 0])$$

where $\mathbf{1}[z_i > 0]$ is 1 if the neuron is active and 0 if not.

The Jacobian of $Wx + b$ with respect to $x$ is $W$.

The Jacobian of the full layer (by the chain rule, which we will see next):

$$J_f(x) = J_{\text{ReLU}}(Wx + b) \cdot W = \text{diag}(\mathbf{1}[z > 0]) \cdot W$$

This is $W$ with the rows corresponding to “dead” neurons (where $z_i \leq 0$) zeroed out.


Section 5: The Chain Rule as Matrix Multiplication

In single-variable calculus, if $h = g \circ f$ (apply $f$ first, then $g$), then:

$$h'(x) = g'(f(x)) \cdot f'(x)$$

The derivatives multiply. This is the chain rule.

In multivariable calculus, the analog is:

Theorem (Multivariable Chain Rule). If $f: \mathbb{R}^n \to \mathbb{R}^m$ is differentiable at $x$ and $g: \mathbb{R}^m \to \mathbb{R}^p$ is differentiable at $f(x)$, then $h = g \circ f: \mathbb{R}^n \to \mathbb{R}^p$ is differentiable at $x$ and:

$$J_h(x) = J_g(f(x)) \cdot J_f(x)$$

where $\cdot$ denotes matrix multiplication. The Jacobians multiply.

The mapping is exact. In 1D: $h'(x) = g'(f(x)) \cdot f'(x)$ (scalar multiplication). In higher dimensions: $J_h = J_g \cdot J_f$ (matrix multiplication). Scalar multiplication is the special case of matrix multiplication when all matrices are $1 \times 1$.

The dimensions work out: $J_g$ is $p \times m$, $J_f$ is $m \times n$, so $J_g \cdot J_f$ is $p \times n$, which is the right shape for a map from $\mathbb{R}^n$ to $\mathbb{R}^p$.

Discomfort check. Why does the chain rule become matrix multiplication? In 1D, $h'(x) = g'(f(x)) \cdot f'(x)$ measures the rate of change of $g \circ f$ at $x$. In higher dimensions, the “rate of change” is a linear map. The total linear map from input displacements to output displacements is found by first applying the linear approximation of $f$ (sending $\delta \in \mathbb{R}^n$ to $J_f \delta \in \mathbb{R}^m$), then applying the linear approximation of $g$ (sending $J_f \delta \in \mathbb{R}^m$ to $J_g \cdot J_f \delta \in \mathbb{R}^p$). Composing two linear maps is matrix multiplication. The chain rule IS composition of linear maps.

Example. Let $f(x) = (x_1^2, x_2^2): \mathbb{R}^2 \to \mathbb{R}^2$ and $g(y) = y_1 + y_2: \mathbb{R}^2 \to \mathbb{R}$.

$h(x) = g(f(x)) = x_1^2 + x_2^2$.

Jacobian of $f$: $J_f = \begin{pmatrix} 2x_1 & 0 \\ 0 & 2x_2 \end{pmatrix}$ (a $2 \times 2$ matrix).

Jacobian of $g$: $J_g = (1, 1)$ (a $1 \times 2$ row vector).

By the chain rule: $J_h = J_g \cdot J_f = (1, 1) \begin{pmatrix} 2x_1 & 0 \\ 0 & 2x_2 \end{pmatrix} = (2x_1, 2x_2)$.

This is the gradient $\nabla h = (2x_1, 2x_2)$, which is correct.


Section 6: Backpropagation IS the Jacobian Chain Rule

A neural network with $L$ layers computes:

$$\text{output} = f_L \circ f_{L-1} \circ \cdots \circ f_2 \circ f_1(x)$$

The loss is:

$$\ell = \text{loss}(f_L \circ \cdots \circ f_1(x), y)$$

To train the network, you need $\frac{\partial \ell}{\partial \theta_i}$ for every parameter $\theta_i$ in every layer. By the chain rule:

$$J_\ell = J_{\text{loss}} \cdot J_{f_L} \cdot J_{f_{L-1}} \cdots J_{f_1}$$

This is a product of matrices. The total Jacobian of the loss is the product of Jacobians of each layer.

The forward-backward distinction. To compute this product, you can go left-to-right (forward mode) or right-to-left (backward mode).

Forward mode (left-to-right): $J_\ell = J_{\text{loss}} \cdot (J_{f_L} \cdot (J_{f_{L-1}} \cdots J_{f_1}))$. Compute from right to left, building up the product one layer at a time. Each intermediate product is a matrix. If the input dimension is $n$ and you are computing the gradient of $\ell$ (a scalar) with respect to all parameters, you need to carry a matrix at each step. Cost: one forward pass per input dimension $n$.

Backward mode (right-to-left): $(J_{\text{loss}} \cdot J_{f_L}) \cdot J_{f_{L-1}} \cdots J_{f_1}$. Start with $J_{\text{loss}}$ (a row vector, since loss is scalar), then multiply by $J_{f_L}$ on the right, and so on. At every step, you have a row vector - not a matrix. Cost: one backward pass for all parameters simultaneously.

Discomfort check. Why does the direction of matrix multiplication matter for cost? Consider a product $A \cdot B \cdot C$ where $A$ is $1 \times m$, $B$ is $m \times n$, and $C$ is $n \times p$. Computing $(A \cdot B) \cdot C$: first multiply $1 \times m$ by $m \times n$ to get $1 \times n$ (cost: $mn$), then multiply $1 \times n$ by $n \times p$ to get $1 \times p$ (cost: $np$). Total: $mn + np$. Computing $A \cdot (B \cdot C)$: first multiply $m \times n$ by $n \times p$ to get $m \times p$ (cost: $mnp$!), then multiply $1 \times m$ by $m \times p$. The second approach is much more expensive because you form a large intermediate matrix. Backpropagation is efficient precisely because the loss is scalar ($m = 1$): starting from the loss and going backward keeps all intermediate products as vectors, not matrices.

Backpropagation is backward mode automatic differentiation. The “backward pass” of neural network training is right-to-left evaluation of the Jacobian chain product, starting from the loss gradient (a scalar) and propagating it backward through each layer. At each layer, the local Jacobian is multiplied in from the left.

For a neural network with $N$ parameters, backward mode computes all $N$ partial derivatives $\frac{\partial \ell}{\partial \theta_i}$ in one backward pass with cost proportional to the forward pass. This is why training is feasible: $N$ can be $10^8$, but backpropagation still only takes twice the compute of the forward pass.


Section 7: The Gradient of a Matrix - Loss Derivatives in Practice

In machine learning, parameters are often organized as matrices. For a linear layer with weight matrix $W \in \mathbb{R}^{m \times n}$, the output is $y = Wx$. The loss $\ell$ depends on $y$ and hence on $W$.

What is $\frac{\partial \ell}{\partial W}$? This means: how does the scalar loss change as we vary each entry $W_{ij}$?

The answer is a matrix of the same shape as $W$: the $(i, j)$ entry of $\frac{\partial \ell}{\partial W}$ is $\frac{\partial \ell}{\partial W_{ij}}$.

Example: Linear regression loss. $\ell = |Wx - y|^2 = (Wx - y)^T(Wx - y)$.

We want $\frac{\partial \ell}{\partial W}$.

By the chain rule: $\frac{\partial \ell}{\partial W_{ij}} = \frac{\partial \ell}{\partial z_i} \cdot \frac{\partial z_i}{\partial W_{ij}}$, where $z = Wx - y$.

$\frac{\partial \ell}{\partial z_i} = 2z_i$ (since $\ell = \sum_k z_k^2$).

$\frac{\partial z_i}{\partial W_{ij}} = x_j$ (since $z_i = \sum_k W_{ik} x_k - y_i$, so the $j$-th term gives $x_j$).

Therefore: $\frac{\partial \ell}{\partial W_{ij}} = 2z_i x_j$.

In matrix form: $\frac{\partial \ell}{\partial W} = 2z x^T = 2(Wx - y)x^T$.

This is an outer product. The gradient with respect to $W$ is the outer product of the error vector $(Wx - y)$ with the input $x$. This is the update rule for linear regression gradient descent:

$$W \leftarrow W - \alpha \cdot 2(Wx - y)x^T$$

Each parameter $W_{ij}$ is adjusted proportionally to the prediction error in output $i$ times the corresponding input feature $j$. This is exactly the Hebbian learning intuition: connections between active inputs and erroneous outputs get the largest update.


Section 8: Divergence - When the Output Is a Vector Field

So far we have focused on Jacobians of functions $f: \mathbb{R}^n \to \mathbb{R}^m$. Now we consider functions $F: \mathbb{R}^n \to \mathbb{R}^n$ - vector fields, which assign a vector to each point in space.

Given a vector field $F = (F_1, F_2, F_3): \mathbb{R}^3 \to \mathbb{R}^3$, the Jacobian is a $3 \times 3$ matrix. Two special combinations of this matrix have names:

Divergence. The trace of the Jacobian:

$$\nabla \cdot F = \frac{\partial F_1}{\partial x} + \frac{\partial F_2}{\partial y} + \frac{\partial F_3}{\partial z}$$

The divergence measures how much the field “spreads out” at a point. If $\nabla \cdot F > 0$, the field has a source (fluid flowing outward). If $\nabla \cdot F < 0$, it has a sink. If $\nabla \cdot F = 0$ everywhere, the field is incompressible (volume is conserved).

In machine learning: the divergence appears in normalizing flows. If $x(t)$ evolves according to the ODE $\dot{x} = v_\theta(x, t)$, the log-density of $x(t)$ changes at the rate $-\nabla \cdot v_\theta$. Computing the full Jacobian to find the divergence costs $O(n^2)$ in general. Hutchinson’s trace estimator approximates the trace using random vectors: $\text{tr}(J) \approx \mathbb{E}[\epsilon^T J \epsilon]$ for $\epsilon \sim \mathcal{N}(0, I)$, reducing the cost to $O(n)$.

Curl. The antisymmetric part of the Jacobian (in 3D):

$$\nabla \times F = \left(\frac{\partial F_3}{\partial y} - \frac{\partial F_2}{\partial z}, \quad \frac{\partial F_1}{\partial z} - \frac{\partial F_3}{\partial x}, \quad \frac{\partial F_2}{\partial x} - \frac{\partial F_1}{\partial y}\right)$$

The curl measures the infinitesimal rotation of the field at a point. A field with zero curl everywhere is called conservative (or irrotational) - it is the gradient of some scalar potential, $F = \nabla \phi$. Conservative fields arise naturally in physics: gravitational fields and electrostatic fields are both conservative.


Section 9: The Laplacian - Divergence of the Gradient

The gradient $\nabla f$ of a scalar field is a vector field. Take the divergence of that vector field:

$$\Delta f = \nabla \cdot (\nabla f) = \sum_{i=1}^n \frac{\partial^2 f}{\partial x_i^2}$$

This is the Laplacian of $f$, written $\Delta f$ or $\nabla^2 f$.

The Laplacian measures how $f$ at a point compares to its average over nearby points. Specifically, $\Delta f(x) > 0$ means $f(x)$ is below its local average (the function wants to increase), and $\Delta f(x) < 0$ means $f(x)$ is above its local average.

Functions with $\Delta f = 0$ everywhere are harmonic - they sit at exactly their local average everywhere. Harmonic functions model steady-state temperature distributions, electrostatic potentials in charge-free space, and probability currents.

The graph Laplacian. For discrete data on a graph $G = (V, E)$ with adjacency matrix $A$ and degree matrix $D$, the analogous object is $L = D - A$. For a signal $x \in \mathbb{R}^{|V|}$ (a value at each node):

$$x^T L x = \sum_{(i,j) \in E} (x_i - x_j)^2$$

This is the total variation of the signal: how much it varies between connected nodes. Minimizing $x^T L x$ encourages smooth signals (nearby nodes have similar values). This is the discrete analog of the continuous Dirichlet energy $\int |\nabla f|^2$.

Graph neural networks use $L$ to aggregate neighborhood information. Spectral graph convolutions multiply in the eigenbasis of $L$ - the “graph Fourier transform” - analogous to how classical convolution multiplies in the Fourier basis (eigenbasis of the continuous Laplacian).


Section 10: The Inverse Function Theorem

For a function $f: \mathbb{R} \to \mathbb{R}$, if $f'(a) \neq 0$ at some point $a$, then $f$ is locally invertible near $a$: there is a neighborhood of $a$ on which $f$ is one-to-one and onto its image.

For $f: \mathbb{R}^n \to \mathbb{R}^n$, the analog is:

Inverse Function Theorem. If $f: \mathbb{R}^n \to \mathbb{R}^n$ is continuously differentiable and $J_f(a)$ is invertible (has nonzero determinant) at $a$, then $f$ is locally invertible near $a$, and the derivative of the local inverse is $(J_f(a))^{-1}$.

The mapping: “$f'(a) \neq 0$” becomes “$J_f(a)$ is invertible (full rank)”.

This theorem is fundamental in differential geometry and physics: it tells you when a change of variables is valid, when a system of equations can be solved locally, and when a neural network layer is locally reversible.

Discomfort check. Why does the invertibility of $J_f$ at one point imply invertibility of $f$ in a neighborhood? Because $J_f$ is the best linear approximation to $f$ near the point. If the linear approximation is invertible, then for small perturbations, the full function is approximately linear, and approximately linear with an invertible matrix is invertible. More precisely, if $J_f(a)$ is invertible, the linear map $h \mapsto J_f(a)h$ stretches every direction. By continuity of $f$, for small enough perturbations, $f$ also stretches every direction, ensuring it is one-to-one.


Section 11: Forward vs. Backward Mode - Why Both Matter

We established that backpropagation uses backward mode (right-to-left Jacobian multiplication). But forward mode (left-to-right) is sometimes better.

Forward mode is efficient when you need to compute $J_f v$ for a specific vector $v$ (a Jacobian-vector product). Cost: one forward pass. Useful when:

  • The function has few inputs and many outputs ($n \ll m$): to compute $J_f$ column by column, each column costs one forward pass, for $n$ total passes.
  • You want directional derivatives: $J_f v = D_v f$, the derivative of $f$ in direction $v$.

Backward mode is efficient when you need to compute $u^T J_f$ for a specific vector $u$ (a vector-Jacobian product). Cost: one backward pass. Useful when:

  • The function has many inputs and few outputs ($m \ll n$): to compute the gradient $\nabla \ell = J_\ell$ (where $\ell$ is scalar), one backward pass gives all $n$ partial derivatives simultaneously.
  • This is exactly the neural network training case: millions of parameters ($n$), scalar loss ($m = 1$).

Higher-order derivatives. The Hessian $H$ is the Jacobian of the gradient $\nabla f$. To compute $Hv$ (a Hessian-vector product):

  • Use forward-over-backward: run backward mode to get $\nabla f$, then run forward mode to differentiate $\nabla f$ in direction $v$.
  • Cost: two passes through the network (one forward, one backward).

This is used in Newton-CG methods that solve $Hd = -\nabla f$ approximately using conjugate gradient, each iteration of which requires one Hessian-vector product.


Section 12: Score Functions and the Gradient of a Log-Density

A pattern that appears throughout probabilistic ML: the score function $\nabla_x \log p(x)$.

For a probability density $p(x)$, the score is the gradient of the log-density with respect to the data point $x$. It points in the direction of increasing probability - toward high-probability regions.

In diffusion models. Diffusion models (DDPM, score matching) learn to estimate $s_\theta(x) \approx \nabla_x \log p(x)$. Given a noisy observation $\tilde{x} = x + \sigma \epsilon$:

  • The score of the noise distribution is $\nabla_{\tilde{x}} \log p(\tilde{x}|x) = -\epsilon/\sigma$.
  • A network $s_\theta$ is trained to estimate this score.
  • At sampling time, the score field is used to reverse the diffusion: $x_{t-1} = x_t + \alpha s_\theta(x_t) + \text{noise}$.

The score function is a vector field on data space. Its divergence appears in the Fokker-Planck equation governing how probability density evolves. Its geometry (level sets, gradient flow) determines the structure of the learned data distribution.

Stein’s identity. For any smooth function $\phi(x)$ and a distribution $p(x)$ with score $s(x) = \nabla_x \log p(x)$:

$$\mathbb{E}_p[\phi(x) s(x)^T + \nabla_x \phi(x)] = 0$$

This identity connects the score (a first-order object) to expectations of functions of $x$. It is used in score matching to estimate $s(x)$ without access to the normalizing constant of $p(x)$.


Section 13: The Jacobian Determinant and Change of Variables

For a map $f: \mathbb{R}^n \to \mathbb{R}^n$ (square Jacobian), the determinant of the Jacobian plays a special role: it measures how the map scales volume.

If the Jacobian at a point $x$ is $J_f(x)$, then a small $n$-dimensional cube of volume $dV$ near $x$ gets mapped to a region of volume $|\det J_f(x)| \cdot dV$ near $f(x)$.

  • $|\det J_f| > 1$: the map expands volume locally.
  • $|\det J_f| < 1$: the map compresses volume locally.
  • $|\det J_f| = 0$: the map is locally degenerate - it collapses volume (not locally invertible).
  • $|\det J_f| = 1$: the map preserves volume (volume-preserving or measure-preserving).

The change-of-variables formula for integration:

$$\int_{f(U)} g(y) dy = \int_U g(f(x)) |\det J_f(x)| dx$$

This generalizes the single-variable substitution rule $\int g(f(x)) f'(x) dx = \int g(y) dy$: the single-variable term $|f'(x)|$ becomes $|\det J_f(x)|$.

In machine learning: normalizing flows. A normalizing flow learns an invertible map $f_\theta: \mathbb{R}^n \to \mathbb{R}^n$ that transforms a simple base distribution (Gaussian) into a complex data distribution. The change-of-variables formula gives:

$$\log p_X(x) = \log p_Z(f_\theta^{-1}(x)) + \log |\det J_{f_\theta^{-1}}(x)|$$

The log-probability of a data point $x$ equals the log-probability of its preimage under the base distribution, plus a correction term accounting for how the map expands or compresses volume. Training the flow requires computing $\log |\det J|$, which costs $O(n^3)$ in general.

Specially structured flows choose $f$ so that $J$ is triangular (lower or upper): the determinant of a triangular matrix is just the product of diagonal entries, computable in $O(n)$. Real NVP, GLOW, and many other architectures use this trick.


Section 14: Jacobians in Attention - Gradients Through Softmax

The self-attention mechanism in transformers computes:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d}}\right) V$$

The softmax is applied to each row of the score matrix $A = QK^T / \sqrt{d}$, giving attention weights $S = \text{softmax}(A)$. The output is $O = SV$.

To train the transformer, you need $\frac{\partial L}{\partial Q}$, $\frac{\partial L}{\partial K}$, and $\frac{\partial L}{\partial V}$, where $L$ is the loss.

Gradient through $V$: $O = SV$, so $\frac{\partial L}{\partial V} = S^T \frac{\partial L}{\partial O}$. This is straightforward matrix backprop.

Gradient through the attention weights: $O = SV$ gives $\frac{\partial L}{\partial S} = \frac{\partial L}{\partial O} V^T$. Then you need to backpropagate through the softmax.

For a single row $s = \text{softmax}(a)$ with $a \in \mathbb{R}^n$, the Jacobian is:

$$J_{\text{softmax}} = \text{diag}(s) - s s^T$$

(derived in Section 4). Applying this Jacobian: given incoming gradient $g = \frac{\partial L}{\partial s}$, the gradient with respect to $a$ is:

$$\frac{\partial L}{\partial a} = J_{\text{softmax}}^T g = \text{diag}(s) g - s s^T g = s \odot g - s(s \cdot g)$$

where $\odot$ is element-wise multiplication. In words: subtract the weighted average $s \cdot g$ from each component.

This formula is efficient: it costs $O(n)$ per row (not $O(n^2)$ for forming the full Jacobian). The Jacobian is never explicitly stored - only the Jacobian-vector product is computed, directly from the softmax output $s$ and the upstream gradient $g$.

This is the general pattern in deep learning: never form the Jacobian explicitly, only compute Jacobian-vector products as needed during backpropagation.

Discomfort check. If you never form the Jacobian explicitly, in what sense does it exist? It exists mathematically as the derivative (the best linear approximation) of each layer. The backpropagation algorithm implicitly evaluates products of Jacobians from right to left - it never stores them but uses the formula for each Jacobian-vector product as needed. The Jacobian is real; what you skip is materializing it in memory.


Section 15: Implicit Differentiation and Implicit Layers

The chain rule through a Jacobian applies whenever you can describe the computation explicitly: output as a function of input. But what if the computation is defined implicitly?

The implicit function theorem. Suppose $g(x, y) = 0$ defines $y$ implicitly as a function of $x$ near a point $(x_0, y_0)$ where $\frac{\partial g}{\partial y}$ is invertible. Then:

$$\frac{\partial y}{\partial x} = -\left(\frac{\partial g}{\partial y}\right)^{-1} \frac{\partial g}{\partial x}$$

In single-variable calculus, this reduces to $dy/dx = -(g_y)^{-1} g_x = -g_x/g_y$, which is the implicit differentiation formula you know.

Example. $g(x, y) = x^2 + y^2 - 1 = 0$ (the unit circle). Then $g_x = 2x$ and $g_y = 2y$. The implicit derivative: $dy/dx = -2x / 2y = -x/y$. Matches what you compute by implicit differentiation.

In machine learning: deep equilibrium models. A deep equilibrium model (DEQ) defines its output $z$ as the fixed point of a map: $z = f_\theta(z, x)$. This is an implicit equation $g(z, x, \theta) = z - f_\theta(z, x) = 0$.

To train the model, you need $\frac{\partial z}{\partial \theta}$ (how the fixed point changes as parameters change). By the implicit function theorem:

$$\frac{\partial z}{\partial \theta} = -\left(\frac{\partial g}{\partial z}\right)^{-1} \frac{\partial g}{\partial \theta} = (I - J_f)^{-1} \frac{\partial f}{\partial \theta}$$

where $J_f = \frac{\partial f}{\partial z}$ is the Jacobian of $f$ with respect to its first argument. This formula says: backpropagating through an infinite-depth network (the fixed-point equation) is equivalent to solving a linear system $(I - J_f)v = \frac{\partial f}{\partial \theta}$.

In practice, this linear system is solved approximately using a fixed-point iteration, making DEQs computationally similar in cost to a fixed-depth network.

The key insight. You do not need to know how $y$ depends on $x$ explicitly to compute $dy/dx$ - you only need the Jacobians of $g$ evaluated at the solution. The chain rule extends to implicit computations through the implicit function theorem.


Section 16: Gradient Flow and Continuous-Time Optimization

Gradient descent is a discrete-time algorithm. What is the continuous-time limit?

Define a function $x(t)$ that evolves by the ordinary differential equation:

$$\frac{dx}{dt} = -\nabla f(x(t))$$

This is the gradient flow - the path of steepest descent, traveled continuously. As $t \to \infty$, $x(t)$ approaches a critical point of $f$ (where $\nabla f = 0$).

The connection to gradient descent: gradient descent with step size $\alpha$ approximates the gradient flow by Euler discretization:

$$x(t + \alpha) \approx x(t) + \alpha \frac{dx}{dt}\bigg|_{t} = x(t) - \alpha \nabla f(x(t))$$

Smaller step sizes give better approximations to the continuous flow.

Why continuous-time is useful for analysis. The gradient flow satisfies:

$$\frac{d}{dt} f(x(t)) = \nabla f(x(t)) \cdot \frac{dx}{dt} = \nabla f \cdot (-\nabla f) = -|\nabla f|^2 \leq 0$$

The loss decreases monotonically along the gradient flow. (This uses the chain rule for partial derivatives from Section 9 of the Gradients post.) For gradient descent with a finite step size, the loss can temporarily increase if the step size is too large.

The Polyak-Lojasiewicz condition. If the function satisfies $|\nabla f(x)|^2 \geq \mu (f(x) - f^\ast)$ for some $\mu > 0$ and global minimum $f^\ast$, then the gradient flow converges exponentially:

$$f(x(t)) - f^\ast \leq (f(x(0)) - f^\ast) e^{-2\mu t}$$

This is proved by applying the chain rule to $f(x(t)) - f^\ast$:

$$\frac{d}{dt}[f(x(t)) - f^\ast] = -|\nabla f|^2 \leq -\mu(f(x(t)) - f^\ast)$$

Then Gronwall’s inequality gives the exponential bound.

This framework connects optimization theory to dynamical systems. The behavior of gradient descent - convergence rates, oscillations, sensitivity to learning rate - can be analyzed using tools from ODEs and control theory applied to the gradient flow.


Section 17: Putting It All Together - The Derivative at Every Scale

We started with the question: what is the derivative of a function from $\mathbb{R}^n$ to $\mathbb{R}^m$?

The answer: it is the Jacobian, the $m \times n$ matrix of all partial derivatives. It is the best linear approximation to the function near a point. It is a linear map from input space to output space.

And this is the same answer at every scale:

  • $f: \mathbb{R} \to \mathbb{R}$: Jacobian is $1 \times 1$ - a number.
  • $f: \mathbb{R}^n \to \mathbb{R}$: Jacobian is $1 \times n$ - the gradient (transposed).
  • $f: \mathbb{R}^n \to \mathbb{R}^m$: Jacobian is $m \times n$ - the full Jacobian matrix.

The chain rule multiplies Jacobians as matrices, generalizing scalar multiplication. Backpropagation is backward-mode Jacobian chain evaluation, efficient because the loss is scalar. The gradient of a matrix-valued parameter is an outer product. The Laplacian is the trace of the Jacobian of the gradient.

All of vector calculus for machine learning flows from this one unified view: the derivative is always the best linear approximation, always representable as a matrix, always composable by matrix multiplication.


Summary

Concept Single-Variable Multivariable (General)
Derivative of $f: \mathbb{R}^n \to \mathbb{R}^m$ $f'(x)$ - a number ($1 \times 1$ matrix) $J_f(x)$ - an $m \times n$ matrix
Best linear approximation $f(x+h) \approx f(x) + f'(x) h$ $f(x + \delta) \approx f(x) + J_f(x) \delta$
Chain rule $h'(x) = g'(f(x)) \cdot f'(x)$ $J_h(x) = J_g(f(x)) \cdot J_f(x)$ (matrix product)
Gradient (scalar output) $f'(x)$ $\nabla f = J_f^T$ (transposed Jacobian)
Local invertibility condition $f'(a) \neq 0$ $J_f(a)$ invertible (inverse function theorem)
Backpropagation Right-to-left multiplication Right-to-left Jacobian chain multiplication
Divergence $f'(x)$ (scalar functions have no analog) $\text{tr}(J_F)$ for vector fields $F$
Laplacian $f''(x)$ (of the gradient) $\text{tr}(J_{\nabla f}) = \sum_i \partial^2 f / \partial x_i^2$
Parameter gradient $\partial \ell / \partial W$ $\partial \ell / \partial w$ Outer product $(J_\ell)^T x^T$ for linear layers

Every concept in vector calculus for ML reduces to one idea: the derivative is a linear map. Express it as a matrix, compose it by multiplication, evaluate it efficiently by choosing the right direction of traversal.


Read Next: