Gradients & Partial Derivatives
In single-variable calculus, the derivative measures how a function changes as its one input changes. In $n$ dimensions there are $n$ input directions, and the resulting objects - partial derivatives, the gradient, the Jacobian - encode how the output changes along each of them. These objects are the backbone of every gradient-based optimization algorithm used in machine learning.
Partial Derivatives
Let $f: \mathbb{R}^n \to \mathbb{R}$ and let $e_i$ be the $i$-th standard basis vector. The partial derivative of $f$ with respect to $x_i$ at a point $x$ is
$$\frac{\partial f}{\partial x_i}(x) = \lim_{h \to 0} \frac{f(x + h e_i) - f(x)}{h}$$
provided this limit exists. This is just the ordinary derivative of the function $t \mapsto f(x_1, \ldots, x_{i-1}, x_i + t, x_{i+1}, \ldots, x_n)$ evaluated at $t = 0$ - all other inputs are held constant.
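A minimal numerical sketch of this definition, using an illustrative function $f(x_1, x_2) = x_1^2 x_2 + \sin(x_2)$ and a finite difference quotient compared against jax.grad (the function, the point, and the step size are assumptions for illustration):

```python
import jax
import jax.numpy as jnp

# illustrative function f(x1, x2) = x1^2 * x2 + sin(x2)
def f(x):
    return x[0] ** 2 * x[1] + jnp.sin(x[1])

def partial_fd(f, x, i, h=1e-3):
    # difference quotient along the i-th standard basis vector;
    # all other inputs are held constant
    e_i = jnp.zeros_like(x).at[i].set(1.0)
    return (f(x + h * e_i) - f(x)) / h

x = jnp.array([1.0, 2.0])
print(partial_fd(f, x, 0))   # ≈ 2 * x1 * x2 = 4.0 (finite h, so only approximate)
print(jax.grad(f)(x)[0])     # autodiff gives the same partial derivative
```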
Caution. Existence of all partial derivatives at a point does not imply continuity or even existence of the directional derivative in other directions. The classic counterexample is
$$f(x, y) = \begin{cases} \dfrac{xy}{x^2 + y^2} & (x,y) \neq (0,0) \\ 0 & (x,y) = (0,0) \end{cases}$$
which has $\partial f/\partial x(0,0) = \partial f/\partial y(0,0) = 0$ but is not continuous at the origin.
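A quick numerical check of this counterexample (plain Python, values chosen for illustration): the partials at the origin are zero because $f$ vanishes on both axes, yet along the diagonal $y = x$ the function is constantly $1/2$, so it cannot be continuous at the origin.

```python
def f(x, y):
    # the counterexample above: partials exist at the origin, but f is not continuous there
    return 0.0 if (x, y) == (0.0, 0.0) else x * y / (x**2 + y**2)

h = 1e-6
print((f(h, 0.0) - f(0.0, 0.0)) / h)   # ≈ 0  ->  ∂f/∂x(0,0) = 0
print((f(0.0, h) - f(0.0, 0.0)) / h)   # ≈ 0  ->  ∂f/∂y(0,0) = 0

# along y = x the value stays at 1/2, nowhere near f(0,0) = 0
print(f(1e-9, 1e-9))                   # = 0.5
```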
The Gradient
For $f: \mathbb{R}^n \to \mathbb{R}$ with all partial derivatives existing at $x$, the gradient is the column vector
$$\nabla f(x) = \left(\frac{\partial f}{\partial x_1}(x),\ \frac{\partial f}{\partial x_2}(x),\ \ldots,\ \frac{\partial f}{\partial x_n}(x)\right)^T \in \mathbb{R}^n$$
When $f$ is differentiable at $x$ and $\nabla f(x) \neq 0$, the gradient points in the direction of steepest ascent of $f$. More precisely, among all unit vectors $v$, the directional derivative $D_v f(x) = \nabla f(x)^T v$ is maximized by $v = \nabla f(x) / |\nabla f(x)|$, with maximum value $|\nabla f(x)|$.
[Figure: gradient vector field drawn over the contour lines (level sets) of f(x, y); the arrows are gradient vectors, each perpendicular to the contour through its base point.]
The gradient is always perpendicular to the level sets $\{x : f(x) = c\}$. This is not a coincidence: if $\gamma(t)$ is a curve with $f(\gamma(t)) = c$, differentiating gives $\nabla f(\gamma(t))^T \gamma'(t) = 0$, so the gradient is orthogonal to every tangent direction of the level set.
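A small sketch of this orthogonality, assuming the illustrative function $f(x, y) = x^2 + y^2$ and the curve $\gamma(t) = (2\cos t, 2\sin t)$, which stays inside the level set $f = 4$:

```python
import jax
import jax.numpy as jnp

f = lambda p: p[0] ** 2 + p[1] ** 2                        # level sets are circles f = c
gamma = lambda t: jnp.array([2.0 * jnp.cos(t),
                             2.0 * jnp.sin(t)])            # a curve inside the level set f = 4

t = 0.7
grad_f = jax.grad(f)(gamma(t))      # gradient at a point of the level set
tangent = jax.jacobian(gamma)(t)    # gamma'(t), tangent to the level set
print(jnp.dot(grad_f, tangent))     # ≈ 0: the gradient is ⟂ to the level set
```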
Directional Derivative
The directional derivative of $f$ at $x$ in the direction $v \in \mathbb{R}^n$ is
$$D_v f(x) = \lim_{h \to 0} \frac{f(x + hv) - f(x)}{h}$$
Theorem. If $f$ is differentiable at $x$, then for any $v$,
$$D_v f(x) = \nabla f(x)^T v$$
In particular, if $|v| = 1$ then $D_v f(x) = \nabla f(x)^T v \in [-|\nabla f(x)|,\ |\nabla f(x)|]$, with the extreme values attained at $v = \pm \nabla f(x)/|\nabla f(x)|$.
Differentiability (existence of the Fréchet derivative, defined below) is essential here; existence of all directional derivatives is not sufficient.
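A sketch verifying the theorem numerically for an illustrative smooth function on $\mathbb{R}^3$ (the function, point, and direction are assumptions for illustration); the forward-mode jax.jvp computes the directional derivative from the definition, and it matches $\nabla f(x)^T v$:

```python
import jax
import jax.numpy as jnp

f = lambda x: jnp.sum(jnp.sin(x)) * x[0]   # smooth scalar function on R^3

x = jnp.array([0.3, -1.0, 2.0])
v = jnp.array([1.0, 2.0, 0.5])

# directional derivative D_v f(x) via a forward-mode JVP
_, dvf = jax.jvp(f, (x,), (v,))

# the theorem: D_v f(x) = ∇f(x)^T v
print(dvf, jnp.dot(jax.grad(f)(x), v))     # equal up to floating point
```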
Total Derivative and the Fréchet Derivative
For $f: \mathbb{R}^n \to \mathbb{R}^m$, the Fréchet derivative (total derivative) at $x$ is the linear map $L: \mathbb{R}^n \to \mathbb{R}^m$ satisfying
$$\lim_{|h| \to 0} \frac{|f(x+h) - f(x) - Lh|}{|h|} = 0$$
When such $L$ exists, it is unique, and $f$ is said to be differentiable at $x$. The matrix representing $L$ with respect to the standard bases is the Jacobian.
Theorem. If all partial derivatives of $f$ exist and are continuous in a neighborhood of $x$, then $f$ is Fréchet differentiable at $x$ (i.e., $C^1$ implies differentiable).
Jacobian Matrix
For $f = (f_1, \ldots, f_m): \mathbb{R}^n \to \mathbb{R}^m$, the Jacobian is the $m \times n$ matrix
$$(J_f)_{ij}(x) = \frac{\partial f_i}{\partial x_j}(x)$$
Explicitly:
$$J_f(x) = \begin{pmatrix} \frac{\partial f_1}{\partial x_1} & \cdots & \frac{\partial f_1}{\partial x_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial f_m}{\partial x_1} & \cdots & \frac{\partial f_m}{\partial x_n} \end{pmatrix}$$
When $m = 1$, the Jacobian is the gradient transposed: $J_f = \nabla f^T$ (a row vector). When $m = n$, $\det(J_f)$ is the Jacobian determinant used in change-of-variables for integration.
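A short sketch computing a Jacobian with jax.jacobian for an illustrative map $f: \mathbb{R}^3 \to \mathbb{R}^2$ (the functions and the point are assumptions), and checking that in the scalar case the Jacobian row carries the same entries as the gradient:

```python
import jax
import jax.numpy as jnp

# illustrative f: R^3 -> R^2
def f(x):
    return jnp.array([x[0] * x[1], jnp.sin(x[2]) + x[0]])

x = jnp.array([1.0, 2.0, 3.0])
J = jax.jacobian(f)(x)
print(J.shape)                              # (2, 3): m x n, row i is the gradient of f_i

# scalar case m = 1: the Jacobian (row) has the same entries as the gradient
g = lambda x: jnp.dot(x, x)
print(jax.jacobian(g)(x), jax.grad(g)(x))   # same numbers: J_g = ∇g^T
```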
Clairaut’s Theorem (Symmetry of Mixed Partials)
Theorem (Clairaut–Schwarz). Let $f: \mathbb{R}^n \to \mathbb{R}$ and suppose that $\frac{\partial^2 f}{\partial x_i \partial x_j}$ and $\frac{\partial^2 f}{\partial x_j \partial x_i}$ both exist and are continuous at $x$. Then
$$\frac{\partial^2 f}{\partial x_i \partial x_j}(x) = \frac{\partial^2 f}{\partial x_j \partial x_i}(x)$$
Continuity of the mixed partials is genuinely needed; without it, mixed partials can disagree. As a consequence, the Hessian matrix $H_{ij} = \frac{\partial^2 f}{\partial x_i \partial x_j}$ is symmetric for any $C^2$ function.
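A minimal check of this symmetry, assuming an illustrative $C^2$ function on $\mathbb{R}^3$: the Hessian computed by jax.hessian equals its own transpose.

```python
import jax
import jax.numpy as jnp

f = lambda x: x[0] ** 3 * x[1] + jnp.exp(x[0] * x[2])   # an illustrative C^2 function on R^3

x = jnp.array([0.5, -1.0, 2.0])
H = jax.hessian(f)(x)
print(jnp.allclose(H, H.T))   # True: mixed partials agree, the Hessian is symmetric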
Chain Rule in Multiple Variables
Theorem (Multivariable chain rule). Let $g: \mathbb{R}^n \to \mathbb{R}^m$ be differentiable at $x$, and let $f: \mathbb{R}^m \to \mathbb{R}^p$ be differentiable at $g(x)$. Then $f \circ g: \mathbb{R}^n \to \mathbb{R}^p$ is differentiable at $x$ and
$$J_{f \circ g}(x) = J_f(g(x)) \cdot J_g(x)$$
where $\cdot$ denotes matrix multiplication. For the scalar case $f: \mathbb{R}^n \to \mathbb{R}$ and $g: \mathbb{R}^k \to \mathbb{R}^n$:
$$\frac{\partial (f \circ g)}{\partial t_j}(t) = \sum_{i=1}^{n} \frac{\partial f}{\partial x_i}(g(t)) \cdot \frac{\partial g_i}{\partial t_j}(t) = \nabla f(g(t))^T \frac{\partial g}{\partial t_j}(t)$$
This is the foundation of backpropagation: for a composition $L = \ell \circ f_K \circ \cdots \circ f_1$, the gradient with respect to any layer’s parameters is a product of Jacobians accumulated by the chain rule.
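A sketch checking the chain rule numerically for two illustrative maps (the functions and the point are assumptions): the Jacobian of the composition equals the matrix product of the individual Jacobians.

```python
import jax
import jax.numpy as jnp

g = lambda x: jnp.array([x[0] * x[1], jnp.sin(x[1]), x[0] + x[2]])   # R^3 -> R^3
f = lambda y: jnp.array([y[0] + y[1] * y[2], jnp.cos(y[0])])          # R^3 -> R^2

x = jnp.array([1.0, 2.0, 3.0])

lhs = jax.jacobian(lambda x: f(g(x)))(x)           # Jacobian of f ∘ g, shape (2, 3)
rhs = jax.jacobian(f)(g(x)) @ jax.jacobian(g)(x)   # J_f(g(x)) · J_g(x)
print(jnp.allclose(lhs, rhs))                      # True
```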
The Gradient as a Covector
There is a more coordinate-free way to understand the gradient. For $f$ differentiable at $x$, the differential of $f$ at $x$ is the linear functional
$$df_x: \mathbb{R}^n \to \mathbb{R}, \quad df_x(v) = \lim_{h \to 0} \frac{f(x+hv) - f(x)}{h}$$
This is a covector (element of the dual space $(\mathbb{R}^n)^\ast$, also called a differential 1-form). It is coordinate-free. The gradient $\nabla f(x)$ is the representation of this covector as a vector via the standard inner product: $df_x(v) = \langle \nabla f(x), v \rangle$.
This distinction matters when the space has a non-Euclidean metric $g$: in that case $\nabla_g f(x) = g^{-1} df_x$, which is the natural gradient used in information geometry and Fisher-Rao optimization.
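A minimal sketch of this idea, assuming a toy objective and an arbitrary symmetric positive-definite matrix $G$ standing in for the metric $g$ (both are illustrative assumptions): the components of the covector $df_x$ are the ordinary partials, and the metric-dependent gradient is obtained by solving $G u = \nabla f(x)$ rather than forming $G^{-1}$ explicitly.

```python
import jax
import jax.numpy as jnp

f = lambda x: jnp.sum((x - 1.0) ** 2)        # toy objective

x = jnp.array([0.0, 0.0])
euclidean_grad = jax.grad(f)(x)              # components of the covector df_x

# assumed metric: any symmetric positive-definite matrix playing the role of g
G = jnp.array([[2.0, 0.5],
               [0.5, 1.0]])

# metric-dependent gradient: solve G u = ∇f instead of computing G^{-1} df_x directly
natural_grad = jnp.linalg.solve(G, euclidean_grad)
print(euclidean_grad, natural_grad)          # different directions in general
```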
Examples
Gradient descent. The iteration $x_{k+1} = x_k - \eta \nabla f(x_k)$ with step size $\eta > 0$ decreases $f$ locally because $f(x_k - \eta \nabla f(x_k)) \approx f(x_k) - \eta |\nabla f(x_k)|^2 < f(x_k)$ for small enough $\eta$ (provided $\nabla f(x_k) \neq 0$). The gradient is the workhorse of every first-order optimizer (SGD, Adam, RMSProp).
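A minimal sketch of this iteration on a toy objective (the function, starting point, step size, and iteration count are assumptions for illustration):

```python
import jax
import jax.numpy as jnp

f = lambda x: jnp.sum(x ** 2) + jnp.sin(x[0])   # toy objective
grad_f = jax.grad(f)

x = jnp.array([3.0, -2.0])
eta = 0.1                                       # step size
for _ in range(100):
    x = x - eta * grad_f(x)                     # x_{k+1} = x_k - eta * ∇f(x_k)
print(x, f(x))                                  # near a local minimizer, with small gradient
```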
Computing gradients in PyTorch/JAX. Modern frameworks compute gradients via reverse-mode automatic differentiation (backpropagation). The computation graph of $f$ is stored during the forward pass; the backward pass traverses it in reverse, accumulating the vector-Jacobian products $J^T v$ needed for the gradient.
- Forward-mode AD evaluates $J_f(x) v$ for a fixed $v$ in a single forward pass. Cost: one forward pass per input dimension. Efficient when $n \ll m$.
- Reverse-mode AD evaluates $J_f(x)^T u$ for a fixed $u$ in a single backward pass. Cost: one backward pass per output dimension. Efficient when $m \ll n$ - which is the case for scalar losses in neural networks ($m = 1$, $n$ = millions of parameters).
JAX implements both modes and allows composing them for higher-order derivatives (e.g., jax.hessian(f) is jax.jacfwd(jax.jacrev(f))).
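A sketch of the two modes and their composition, using an illustrative map from $\mathbb{R}^3$ to $\mathbb{R}^2$ and an illustrative scalar function (all names and values are assumptions):

```python
import jax
import jax.numpy as jnp

# illustrative f: R^n -> R^m with n = 3, m = 2
f = lambda x: jnp.array([jnp.sum(x ** 2), x[0] * x[1]])
x = jnp.array([1.0, 2.0, 3.0])

# forward mode: one JVP gives J_f(x) v for a chosen input direction v
v = jnp.array([1.0, 0.0, 0.0])
_, Jv = jax.jvp(f, (x,), (v,))       # shape (2,) = m

# reverse mode: one VJP gives J_f(x)^T u for a chosen output direction u
u = jnp.array([1.0, 0.0])
_, vjp_fn = jax.vjp(f, x)
(JTu,) = vjp_fn(u)                   # shape (3,) = n

# composing the modes gives higher-order derivatives, e.g. a Hessian of a scalar function
h = lambda x: jnp.sum(jnp.sin(x)) * x[0]
H = jax.jacfwd(jax.jacrev(h))(x)     # forward-over-reverse composition
print(Jv, JTu, H.shape)
```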