Gradients & Partial Derivatives - Slopes in Every Direction at Once
Helpful context:
- Derivatives - The Geometry of Instantaneous Change
- Functions & Mappings - One Input, One Output, No Exceptions
Imagine you are standing inside a large building and you want to know how warm or cold it is. You take a thermometer reading: say, 21 degrees Celsius. But that is just the temperature at the spot where you are standing. Walk three meters to your right and it might be 23 degrees, because you are closer to a heating vent. Walk forward two meters and it might be 19 degrees, because you are near an exterior wall. Temperature in a room is not a single number. It is a function of where you are.
Write it as $T(x, y, z)$: the temperature at position $(x, y, z)$ in the room. This is a function from $\mathbb{R}^3$ to $\mathbb{R}$. Three inputs, one output.
Now you want to walk somewhere warmer. You have a natural question: in which direction should I walk so that the temperature rises fastest? And a related question: if I walk in a specific direction - say, northeast - how fast does the temperature change?
These are the questions that partial derivatives and the gradient answer. And as we will see, the answers connect back to everything you already know about single-variable derivatives - just extended into more dimensions.
Section 1: The Single-Variable Foundation
Before we extend to multiple variables, recall what the derivative means for $f: \mathbb{R} \to \mathbb{R}$.
You have a function $f(x)$ - one input, one output. The derivative at a point $x$ is the number $f'(x)$, defined as:
$$f'(x) = \lim_{h \to 0} \frac{f(x + h) - f(x)}{h}$$
This number is the slope of the tangent line to the graph of $f$ at $x$. It measures how fast $f$ is changing at $x$: positive means $f$ is increasing, negative means decreasing, large magnitude means fast change, small magnitude means slow change.
The key feature: $f'(x)$ is a single number. There is only one direction to move along the real line (left or right), so there is only one rate of change.
For $f(x) = x^2$, we have $f'(x) = 2x$. At $x = 3$, the derivative is $6$: the function is increasing at a rate of 6 units per unit of $x$.
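As a quick numerical sanity check, the limit definition can be approximated with a small but finite $h$. The sketch below is only an illustration; the helper name `difference_quotient` and the particular values of `h` are assumptions made for this example.

```python
def f(x):
    return x ** 2


def difference_quotient(func, x, h):
    """Slope of the secant line through (x, func(x)) and (x + h, func(x + h))."""
    return (func(x + h) - func(x)) / h


# As h shrinks, the quotient should approach f'(3) = 6.
for h in (1.0, 0.1, 0.001):
    print(h, difference_quotient(f, 3.0, h))
```

For this particular $f$, the quotient works out algebraically to $6 + h$, so the printed values close in on 6 as $h$ gets smaller.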
This is the foundation. Now let us ask: what happens when the function has two inputs?
Section 2: Two Inputs - The New Complication
Consider $f(x, y) = x^2 + y^2$. This is the distance squared from the origin. The output is one number, but the inputs form a 2D plane.
At the point $(1, 1)$, $f(1, 1) = 2$.
Now you nudge the input. But which way? You could:
- Move in the $x$-direction: go from $(1, 1)$ to $(1 + h, 1)$.
- Move in the $y$-direction: go from $(1, 1)$ to $(1, 1 + h)$.
- Move diagonally: go from $(1, 1)$ to $(1 + h, 1 + h)$.
- Move in any other direction you choose.
Each direction gives a different rate of change. “How fast is $f$ changing?” is no longer a single question - it depends on which direction you are asking about.
This is the essential new feature of multivariable calculus: the rate of change depends on the direction.
Discomfort check. In single-variable calculus, the derivative gave us one number for each point. Why should we now need many numbers? The answer is that in single-variable calculus, you had only two directions to move (positive and negative), and they gave equal and opposite rates of change, so one number (the slope) captured everything. In $\mathbb{R}^2$, there are infinitely many directions. Each requires its own answer.
Section 3: Partial Derivatives - Freezing All But One
The simplest approach to a multivariable function: handle one variable at a time, freeze the rest.
The partial derivative with respect to $x_1$: set all other variables to fixed values and differentiate only with respect to $x_1$, exactly as you would in single-variable calculus.
Formally, for $f: \mathbb{R}^n \to \mathbb{R}$, the partial derivative of $f$ with respect to $x_i$ at the point $(x_1, \ldots, x_n)$ is:
$$\frac{\partial f}{\partial x_i} = \lim_{h \to 0} \frac{f(x_1, \ldots, x_{i-1}, x_i + h, x_{i+1}, \ldots, x_n) - f(x_1, \ldots, x_n)}{h}$$
In words: hold all variables except $x_i$ constant, and compute the ordinary single-variable derivative in $x_i$.
The symbol $\partial$ (a stylized “d”) indicates a partial derivative. Read $\frac{\partial f}{\partial x}$ as “partial $f$ partial $x$.”
Example 1. Let $f(x, y) = x^2 + y^2$.
To find $\frac{\partial f}{\partial x}$: treat $y$ as a constant, differentiate in $x$.
$$\frac{\partial f}{\partial x} = 2x$$
To find $\frac{\partial f}{\partial y}$: treat $x$ as a constant, differentiate in $y$.
$$\frac{\partial f}{\partial y} = 2y$$
At the point $(1, 1)$: $\frac{\partial f}{\partial x} = 2$, $\frac{\partial f}{\partial y} = 2$.
Example 2. Let $f(x, y) = x^2 y + \sin(xy)$.
$$\frac{\partial f}{\partial x} = 2xy + y\cos(xy)$$
Here, when differentiating with respect to $x$, the term $x^2 y$ becomes $2xy$ (treating $y$ as a constant, so $y$ stays as a factor), and the term $\sin(xy)$ becomes $y\cos(xy)$ by the chain rule (the inner function is $xy$, whose derivative with respect to $x$ is $y$).
$$\frac{\partial f}{\partial y} = x^2 + x\cos(xy)$$
Here, differentiating $x^2 y$ with respect to $y$ gives $x^2$ (since $x^2$ is the constant coefficient of $y$), and $\sin(xy)$ gives $x\cos(xy)$.
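One way to gain confidence in hand-computed partials like these is to compare them with a finite-difference estimate. The sketch below does this for Example 2, assuming a centered difference (a symmetric variant of the limit quotient); the helper `partial` and the test point are arbitrary choices for illustration.

```python
import math


def f(x, y):
    return x ** 2 * y + math.sin(x * y)


def df_dx(x, y):
    # Analytic partial from the text: 2xy + y cos(xy)
    return 2 * x * y + y * math.cos(x * y)


def df_dy(x, y):
    # Analytic partial from the text: x^2 + x cos(xy)
    return x ** 2 + x * math.cos(x * y)


def partial(func, point, i, h=1e-6):
    """Centered finite difference in coordinate i, all other coordinates frozen."""
    plus = list(point)
    minus = list(point)
    plus[i] += h
    minus[i] -= h
    return (func(*plus) - func(*minus)) / (2 * h)


point = (1.5, -0.7)  # arbitrary test point
print(df_dx(*point), partial(f, point, 0))
print(df_dy(*point), partial(f, point, 1))
```

If the analytic formulas are right, each printed pair should agree to several decimal places.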
Example 3. Let $f(x, y, z) = x^2 yz + e^{xz}$.
$$\frac{\partial f}{\partial x} = 2xyz + ze^{xz}$$ $$\frac{\partial f}{\partial y} = x^2 z$$ $$\frac{\partial f}{\partial z} = x^2 y + xe^{xz}$$
Notice: $\frac{\partial f}{\partial y} = x^2 z$ because $y$ appears only in the term $x^2 yz$, whose $y$-derivative is $x^2 z$. The term $e^{xz}$ does not depend on $y$, so it contributes nothing.
Discomfort check. “Freezing other variables” sounds ad hoc. Why is this the right thing to do? Because it answers a precise question: how does $f$ change if I move only in the $x$-direction, holding everything else fixed? This is literally what partial differentiation computes. If you are standing in a room and you want to know how fast temperature changes as you walk east (without going north/south or up/down), you hold the other coordinates fixed and differentiate only in the east direction. The operation matches the question.
Section 4: The Gradient - Collecting All Partial Derivatives
For $f: \mathbb{R}^n \to \mathbb{R}$, there are $n$ partial derivatives - one per coordinate direction. Collect them into a single vector:
$$\nabla f(x) = \left(\frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}, \ldots, \frac{\partial f}{\partial x_n}\right)$$
This vector is called the gradient of $f$ at the point $x$, written $\nabla f$ (read “del $f$” or “grad $f$”).
The mapping to trace: In single-variable calculus, $f: \mathbb{R} \to \mathbb{R}$ has a derivative $f'(x)$, which is a single number. In multivariable calculus, $f: \mathbb{R}^n \to \mathbb{R}$ has a gradient $\nabla f(x)$, which is a vector in $\mathbb{R}^n$. The gradient is the vector-valued analog of the scalar derivative.
Both measure “rate of change.” But the rate of change now has a direction - because there are infinitely many directions to move in $\mathbb{R}^n$, and the gradient encodes information about all of them simultaneously.
Example. For $f(x, y) = x^2 + y^2$:
$$\nabla f(x, y) = (2x, 2y)$$
At $(1, 1)$: $\nabla f = (2, 2)$. At $(3, 0)$: $\nabla f = (6, 0)$. At $(0, 0)$: $\nabla f = (0, 0)$.
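The idea of "collecting the partials into a vector" translates directly into code. The following is a minimal sketch, assuming a centered-difference estimate for each partial; `numerical_gradient` is a hypothetical helper name, not a library function.

```python
def f(x, y):
    return x ** 2 + y ** 2


def numerical_gradient(func, point, h=1e-6):
    """Collect one centered-difference partial per coordinate into a list."""
    grad = []
    for i in range(len(point)):
        plus = list(point)
        minus = list(point)
        plus[i] += h
        minus[i] -= h
        grad.append((func(*plus) - func(*minus)) / (2 * h))
    return grad


# The three points from the example above.
for point in [(1.0, 1.0), (3.0, 0.0), (0.0, 0.0)]:
    print(point, "->", numerical_gradient(f, point))
```

Up to rounding, the output should match the analytic gradient $(2x, 2y)$ at each point: one arrow per input point.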
For the temperature function $T(x, y, z)$:
$$\nabla T = \left(\frac{\partial T}{\partial x}, \frac{\partial T}{\partial y}, \frac{\partial T}{\partial z}\right)$$
This is a vector in 3D space that points in the direction in which the temperature increases fastest.
Discomfort check. The gradient $\nabla f(x)$ is a vector in the domain space $\mathbb{R}^n$, not in the range space $\mathbb{R}$. For $f: \mathbb{R}^2 \to \mathbb{R}$, the gradient is a 2D vector, not a real number. This is worth pausing on. The gradient lives in the same space as the inputs, not the outputs. Think of it as: the gradient is an arrow attached to each input point, indicating which way to move (in input space) to increase the output fastest.
Section 5: Directional Derivatives - The Full Rate of Change
The gradient tells us the rate of change in each coordinate direction. But what about an arbitrary direction - say, northeast, or at some angle?
Let $v$ be a unit vector in $\mathbb{R}^n$ (so $|v| = 1$). The directional derivative of $f$ at $x$ in the direction $v$ is:
$$D_v f(x) = \lim_{h \to 0} \frac{f(x + hv) - f(x)}{h}$$
This is the rate of change of $f$ as you move from $x$ in the direction $v$.
Theorem. If $f$ is differentiable at $x$, then for any unit vector $v$:
$$D_v f(x) = \nabla f(x) \cdot v$$
where $\cdot$ is the ordinary dot product.
This is the central theorem about gradients, and it deserves emphasis: the directional derivative in any direction is just the dot product of the gradient with that direction vector.
Verification on coordinate directions. The standard basis vector $e_i$ points in the $x_i$-direction. The directional derivative in direction $e_i$ is:
$$D_{e_i} f(x) = \nabla f(x) \cdot e_i = \frac{\partial f}{\partial x_i}$$
The partial derivatives are special cases - they are directional derivatives in the coordinate directions. The gradient unifies all of them.
Example. For $f(x, y) = x^2 + y^2$ at $(1, 1)$, we have $\nabla f = (2, 2)$.
What is the rate of change in the direction $v = (1/\sqrt{2}, 1/\sqrt{2})$ (northeast)?
$$D_v f = (2, 2) \cdot (1/\sqrt{2}, 1/\sqrt{2}) = 2/\sqrt{2} + 2/\sqrt{2} = 2\sqrt{2} \approx 2.83$$
What about in the direction $v = (1, 0)$ (east)?
$$D_v f = (2, 2) \cdot (1, 0) = 2$$
And due south, $v = (0, -1)$?
$$D_v f = (2, 2) \cdot (0, -1) = -2$$
Moving south decreases $f$ at a rate of 2 per unit distance.
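The three computations above can be reproduced in a few lines, and compared against the limit definition evaluated at a small finite $h$. This is a sketch under that assumption; the function names are made up for the example.

```python
import math


def f(x, y):
    return x ** 2 + y ** 2


def grad_f(x, y):
    return (2 * x, 2 * y)


def directional_derivative(point, v):
    """Dot product of the gradient at `point` with the unit vector v."""
    g = grad_f(*point)
    return g[0] * v[0] + g[1] * v[1]


def limit_quotient(point, v, h=1e-6):
    """Finite-h version of the limit definition of D_v f."""
    x, y = point
    return (f(x + h * v[0], y + h * v[1]) - f(x, y)) / h


s = 1 / math.sqrt(2)
directions = {"northeast": (s, s), "east": (1.0, 0.0), "south": (0.0, -1.0)}
for name, v in directions.items():
    print(name, directional_derivative((1.0, 1.0), v), limit_quotient((1.0, 1.0), v))
```

The two columns should agree closely, which is the theorem in action: the dot product gives the same number as the limit, without taking a new limit for every direction.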
Section 6: Why the Gradient Points Uphill
Now comes the key geometric fact.
Theorem. Among all unit vectors $v$, the directional derivative $D_v f(x) = \nabla f(x) \cdot v$ is maximized when $v$ points in the same direction as $\nabla f(x)$.
Proof. By the Cauchy-Schwarz inequality:
$$\nabla f \cdot v \leq |\nabla f| \cdot |v| = |\nabla f|$$
with equality when $v = \nabla f / |\nabla f|$ (assuming $\nabla f \neq 0$). So the maximum directional derivative is $|\nabla f|$, achieved by moving in the gradient direction. $\square$
In words: moving in the direction of the gradient gives the steepest increase in $f$. Moving opposite to the gradient gives the steepest decrease. Moving perpendicular to the gradient gives zero instantaneous rate of change: to first order, $f$ stays the same.
The gradient points uphill. This is not just a memorable phrase - it is the geometric content of the theorem. The gradient is the direction of steepest ascent. Its magnitude tells you how steep that ascent is.
Back to the room: the gradient of the temperature function $\nabla T$ at your location points in the direction you should walk to get warm the fastest. And $|\nabla T|$ tells you how fast the temperature will increase per meter of walking in that direction.
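The steepest-ascent theorem can also be checked by brute force: try many unit vectors and see which one gives the largest directional derivative. The sketch below scans whole-degree angles for the running example $f(x, y) = x^2 + y^2$ at $(1, 1)$; the one-degree grid is an arbitrary choice.

```python
import math

grad = (2.0, 2.0)  # gradient of x^2 + y^2 at (1, 1)

best_angle, best_rate = None, -math.inf
for k in range(360):
    theta = math.radians(k)
    v = (math.cos(theta), math.sin(theta))  # unit vector at angle theta
    rate = grad[0] * v[0] + grad[1] * v[1]  # directional derivative in direction v
    if rate > best_rate:
        best_angle, best_rate = k, rate

# Best angle in degrees, best rate found, and the gradient's length for comparison.
print(best_angle, best_rate, math.hypot(*grad))
```

The winning angle is the one the gradient $(2, 2)$ itself points along, and the winning rate matches $|\nabla f|$ up to rounding.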
Level sets are perpendicular to the gradient. A level set of $f$ is the set of all inputs where $f$ has the same value: $\{x : f(x) = c\}$. In 2D, level sets are curves (level curves). In 3D, level sets are surfaces (level surfaces).
If you move along a level set, $f$ does not change. So the rate of change of $f$ in any direction tangent to the level set is zero. But $D_v f = \nabla f \cdot v = 0$ means $v$ is perpendicular to $\nabla f$. So the gradient is perpendicular to all directions tangent to the level set - meaning it is perpendicular to the level set itself.
For the temperature function: the level sets are surfaces of equal temperature (isotherms; on a 2D floor plan they appear as curves). The gradient at any point is perpendicular to the local isotherm, pointing toward warmer regions.
Section 7: Gradient Descent
Here is the most important application: you want to minimize a function $f: \mathbb{R}^n \to \mathbb{R}$.
In single-variable calculus, to minimize $f(x)$, you find where $f'(x) = 0$. For complicated functions on high-dimensional spaces, you cannot just solve for the zero of $\nabla f$ algebraically - there are too many variables and the equations are nonlinear.
Instead, use the geometry: the gradient points uphill, so the negative gradient points downhill. To decrease $f$, move in the direction $-\nabla f$.
Gradient descent is the algorithm:
$$x_{\text{new}} = x_{\text{old}} - \alpha \nabla f(x_{\text{old}})$$
where $\alpha > 0$ is a small number called the learning rate (or step size).
At each step: compute the gradient at the current point, move a small amount in the direction of steepest descent, repeat.
The single-variable analog. For $f: \mathbb{R} \to \mathbb{R}$, this reduces to:
$$x_{\text{new}} = x_{\text{old}} - \alpha f'(x_{\text{old}})$$
You already know this: if $f'(x) > 0$, $f$ is increasing, so you move left (negative direction) to decrease $f$. If $f'(x) < 0$, you move right. Gradient descent in 1D is just following the slope downhill.
The only difference in $n$ dimensions: the slope (scalar) becomes the gradient (vector). Everything else is the same.
Why this is important. Training a neural network means minimizing a loss function $L(\theta)$ where $\theta$ is a vector of millions of parameters. The loss function generally has no closed-form minimizer. Gradient descent, usually in a stochastic variant such as SGD, is how nearly every neural network is trained. The gradient $\nabla_\theta L$ tells the optimizer how to adjust each parameter to reduce the loss.
Example. Minimize $f(x, y) = x^2 + y^2$. Start at $(3, 4)$.
$\nabla f = (2x, 2y)$.
At $(3, 4)$: gradient is $(6, 8)$.
With learning rate $\alpha = 0.1$:
$$(x_{\text{new}}, y_{\text{new}}) = (3, 4) - 0.1 \cdot (6, 8) = (3 - 0.6, 4 - 0.8) = (2.4, 3.2)$$
$f(2.4, 3.2) = 5.76 + 10.24 = 16$, down from $f(3, 4) = 25$. We moved closer to the minimum at $(0, 0)$.
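Repeating that update in a loop gives the whole algorithm. Here is a minimal sketch for the same function and starting point; the number of steps (20) is an arbitrary choice for illustration.

```python
def grad_f(x, y):
    return (2 * x, 2 * y)


def gradient_descent(start, alpha, steps):
    """Repeat x_new = x_old - alpha * grad f(x_old), recording every iterate."""
    x, y = start
    path = [(x, y)]
    for _ in range(steps):
        gx, gy = grad_f(x, y)
        x, y = x - alpha * gx, y - alpha * gy
        path.append((x, y))
    return path


path = gradient_descent((3.0, 4.0), alpha=0.1, steps=20)
for step in (0, 1, 5, 20):
    x, y = path[step]
    print(step, (x, y), x ** 2 + y ** 2)
```

For this particular $f$, each step multiplies both coordinates by $1 - 2\alpha = 0.8$, so the iterates shrink geometrically toward $(0, 0)$.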
Section 8: Critical Points in Multiple Variables
In single-variable calculus, a critical point is where $f'(c) = 0$. The tangent is horizontal. At such points, $f$ might have a local minimum, a local maximum, or a horizontal inflection point (as $f(x) = x^3$ does at $x = 0$).
In multivariable calculus, the analog is:
Definition. A critical point of $f: \mathbb{R}^n \to \mathbb{R}$ is a point where $\nabla f = 0$ - where all partial derivatives are simultaneously zero.
The mapping is direct: $f'(x) = 0$ becomes $\nabla f(x) = 0$.
But there are now more possible behaviors:
- Local minimum: $f$ increases in every direction from the critical point. Like the bottom of a bowl.
- Local maximum: $f$ decreases in every direction. Like the top of a hill.
- Saddle point: $f$ increases in some directions and decreases in others. Like the center of a mountain pass - uphill in the direction you came from, downhill in the direction you are going.
Saddle points exist in 2D and higher but not in 1D. The closest 1D analog, an inflection point with zero slope, is different: $f$ rises on one side and falls on the other along the same line, and there is no second, independent direction to move in. This is one of the genuinely new features of higher dimensions.
Example. $f(x, y) = x^2 - y^2$.
$\nabla f = (2x, -2y)$.
Setting $\nabla f = 0$: $x = 0$ and $y = 0$. The only critical point is the origin.
At the origin: $f$ increases in the $x$-direction (since $\frac{\partial^2 f}{\partial x^2} = 2 > 0$) and decreases in the $y$-direction (since $\frac{\partial^2 f}{\partial y^2} = -2 < 0$). This is a saddle point.
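Evaluating $f$ a short distance from the origin along each axis makes the saddle visible. The step length of 0.1 below is an arbitrary choice.

```python
def f(x, y):
    return x ** 2 - y ** 2


step = 0.1
print("at origin:", f(0.0, 0.0))
print("along +x:", f(step, 0.0), " along -x:", f(-step, 0.0))  # both above f(0, 0)
print("along +y:", f(0.0, step), " along -y:", f(0.0, -step))  # both below f(0, 0)
```

The origin is a minimum along the $x$-axis and a maximum along the $y$-axis at the same time, which is exactly what neither a bowl nor a hilltop can do.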
To classify critical points, we need the Hessian - the matrix of second partial derivatives. That is the next post.
Section 9: The Chain Rule for Multiple Variables
In single-variable calculus, if $y = f(x)$ and $x = g(t)$ (so $y$ is indirectly a function of $t$), then:
$$\frac{dy}{dt} = f'(x) \cdot g'(t) = \frac{dy}{dx} \cdot \frac{dx}{dt}$$
The chain rule: multiply the rates of change.
In multivariable calculus, suppose $x = (x_1, \ldots, x_n)$ is a function of $t$ (so each $x_i = x_i(t)$), and $f$ is a function of $x$. Then $f(x(t))$ is a scalar function of $t$, and:
$$\frac{d}{dt} f(x(t)) = \nabla f(x(t)) \cdot x'(t) = \sum_{i=1}^n \frac{\partial f}{\partial x_i} \cdot \frac{dx_i}{dt}$$
where $x'(t) = (x_1'(t), \ldots, x_n'(t))$ is the velocity vector.
The mapping from single-variable: $f'(x) \cdot g'(t)$ becomes $\nabla f \cdot x'(t)$. The scalar product of derivatives becomes a dot product of the gradient with the velocity vector.
Why this matters. Gradient descent is an iteration where $x$ changes with each step (like a discrete “time”). The chain rule tells you how fast the loss changes per step. In backpropagation, this chain rule is applied recursively through layers of a neural network.
Example. Let $f(x, y) = x^2 + y^2$, and let the point move along the path $x(t) = \cos t$, $y(t) = \sin t$ (the unit circle).
$\nabla f = (2x, 2y)$. $x'(t) = (-\sin t, \cos t)$.
$$\frac{d}{dt} f(x(t)) = (2\cos t, 2\sin t) \cdot (-\sin t, \cos t) = -2\cos t \sin t + 2\sin t \cos t = 0$$
The answer is zero - because the unit circle is the level set $f(x, y) = 1$ of $f(x, y) = x^2 + y^2$, and moving along a level set produces no change in $f$. The gradient is perpendicular to the velocity, so the dot product is zero. The chain rule confirms the geometry.
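The multivariable chain rule can be checked numerically by differentiating $t \mapsto f(x(t))$ directly and comparing with $\nabla f \cdot x'(t)$. The sketch below uses the circle from the example, plus a second path, a spiral, that is not in the text and is included only as a hypothetical case where the answer is nonzero.

```python
import math


def f(x, y):
    return x ** 2 + y ** 2


def grad_f(x, y):
    return (2 * x, 2 * y)


def chain_rule_rate(path, velocity, t):
    """grad f(x(t)) . x'(t)"""
    g = grad_f(*path(t))
    v = velocity(t)
    return g[0] * v[0] + g[1] * v[1]


def numeric_rate(path, t, h=1e-6):
    """Centered difference of the scalar function t -> f(x(t))."""
    return (f(*path(t + h)) - f(*path(t - h))) / (2 * h)


def circle(t):  # path from the text: the unit circle
    return (math.cos(t), math.sin(t))


def circle_velocity(t):
    return (-math.sin(t), math.cos(t))


def spiral(t):  # hypothetical second path that leaves the level set
    return (t * math.cos(t), t * math.sin(t))


def spiral_velocity(t):
    return (math.cos(t) - t * math.sin(t), math.sin(t) + t * math.cos(t))


t = 0.7
print(chain_rule_rate(circle, circle_velocity, t), numeric_rate(circle, t))
print(chain_rule_rate(spiral, spiral_velocity, t), numeric_rate(spiral, t))
```

On the spiral, $f(x(t)) = t^2$, so both columns should come out close to $2t$; on the circle both should be close to zero.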
Section 10: Computing Gradients - Worked Examples
Let us build computational fluency with several examples.
Example 1. $f(x, y, z) = xyz$.
$$\frac{\partial f}{\partial x} = yz, \quad \frac{\partial f}{\partial y} = xz, \quad \frac{\partial f}{\partial z} = xy$$
$$\nabla f = (yz, xz, xy)$$
At $(2, 3, 4)$: $\nabla f = (12, 8, 6)$.
Example 2. $f(x_1, x_2, x_3, x_4) = x_1^2 + 2x_2^2 + 3x_3^2 + 4x_4^2$.
$$\nabla f = (2x_1, 4x_2, 6x_3, 8x_4)$$
This is a weighted sum of squares. Each component of the gradient is proportional to the coefficient of its variable and to that variable's current value. When the coordinates are of comparable size, the gradient points most strongly along the most heavily weighted variable.
Example 3: The loss function for linear regression. Given data points and a prediction $\hat{y} = w \cdot x$ (dot product), the squared-error loss (half the sum of squared errors, which has the same minimizer as the mean squared error) is:
$$L(w) = \frac{1}{2}|Xw - y|^2 = \frac{1}{2}(Xw - y)^T(Xw - y)$$
where $X$ is a matrix of inputs and $y$ is a vector of targets.
Expanding and differentiating:
$$\nabla_w L = X^T(Xw - y)$$
This is the gradient of the loss with respect to the weight vector $w$. Setting this to zero gives the normal equations $X^T X w = X^T y$, whose solution is the least-squares estimate.
In gradient descent: $w_{\text{new}} = w_{\text{old}} - \alpha X^T(Xw_{\text{old}} - y)$. Each step adjusts $w$ by an amount proportional to the prediction errors $(Xw - y)$, weighted by the input features $X^T$.
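To make the update concrete, here is a small sketch in plain Python, without a linear algebra library. The dataset is made up for illustration (generated from $y = 1 + 2 \cdot \text{feature}$, with a constant column for the intercept), and the learning rate and iteration count are assumptions chosen so that this tiny problem converges.

```python
def predict(X, w):
    """Matrix-vector product Xw: one prediction per row of X."""
    return [sum(x_ij * w_j for x_ij, w_j in zip(row, w)) for row in X]


def loss(X, y, w):
    """L(w) = 1/2 * |Xw - y|^2"""
    return 0.5 * sum((p - t) ** 2 for p, t in zip(predict(X, w), y))


def grad_loss(X, y, w):
    """X^T (Xw - y): component j sums residual * feature j over the rows."""
    residuals = [p - t for p, t in zip(predict(X, w), y)]
    return [sum(row[j] * r for row, r in zip(X, residuals)) for j in range(len(w))]


# Made-up data from y = 1 + 2 * feature; the first column is a constant 1.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [1.0, 3.0, 5.0, 7.0]

w = [0.0, 0.0]
alpha = 0.05  # assumed step size, small enough for this X
for _ in range(2000):
    g = grad_loss(X, y, w)
    w = [w_j - alpha * g_j for w_j, g_j in zip(w, g)]

print(w, loss(X, y, w))
```

Because the data here are exactly linear, the weights should settle near the generating values (intercept 1, slope 2), which is also the solution of the normal equations.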
Section 11: Gradient as a Linear Approximation
In single-variable calculus, the derivative gives the best linear approximation:
$$f(x + h) \approx f(x) + f'(x) h$$
for small $h$. This is the tangent line approximation.
In multivariable calculus, the gradient gives the analogous approximation:
$$f(x + \delta) \approx f(x) + \nabla f(x) \cdot \delta$$
for small displacement vectors $\delta$. This is the tangent hyperplane approximation.
The mapping is precise: the scalar $f'(x)$ becomes the vector $\nabla f(x)$, and the product $f'(x) h$ becomes the dot product $\nabla f(x) \cdot \delta$.
This approximation is the foundation of gradient descent: we assume the loss function is approximately linear near the current point, and move to the minimum of that linear approximation (which is infinitely far in the negative gradient direction - hence the need for a small step size $\alpha$ to prevent overshooting).
For twice continuously differentiable $f$, the error in the approximation is on the order of $|\delta|^2$ - second-order small. This is why the approximation is good for small steps but breaks down for large ones.
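The second-order behavior of the error is easy to observe: halve the displacement and the error should drop by roughly a factor of four. A sketch for $f(x, y) = x^2 + y^2$ at $(1, 1)$, with an arbitrarily chosen displacement direction:

```python
def f(x, y):
    return x ** 2 + y ** 2


def grad_f(x, y):
    return (2 * x, 2 * y)


def linear_approx_error(point, delta):
    """|f(x + delta) - (f(x) + grad f(x) . delta)|"""
    x, y = point
    dx, dy = delta
    g = grad_f(x, y)
    linear = f(x, y) + g[0] * dx + g[1] * dy
    return abs(f(x + dx, y + dy) - linear)


point = (1.0, 1.0)
for scale in (0.4, 0.2, 0.1):
    delta = (scale, -0.5 * scale)  # same direction, shrinking length
    print(scale, linear_approx_error(point, delta))
```

For this quadratic the error is exactly $|\delta|^2$, so the factor of four is exact up to rounding; for a general smooth function it holds only approximately, and only for small $\delta$.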
Section 12: Higher Dimensions and Level Sets
Everything we have discussed generalizes to any dimension.
For $f: \mathbb{R}^n \to \mathbb{R}$:
- The gradient $\nabla f$ is a vector in $\mathbb{R}^n$.
- Level sets are $(n-1)$-dimensional hypersurfaces $\{x \in \mathbb{R}^n : f(x) = c\}$.
- The gradient is perpendicular to the level set at every point.
In 2D: level sets are curves, gradient is a 2D vector perpendicular to those curves. In 3D: level sets are surfaces, gradient is a 3D vector perpendicular to those surfaces. In $n$D: level sets are $(n-1)$-dimensional hypersurfaces (generally curved, not flat hyperplanes), gradient is an $n$D vector perpendicular to them.
The geometric intuition stays the same across dimensions - only the visual changes.
Why perpendicularity matters in ML. When training a neural network, the gradient of the loss is perpendicular to the “level surface” of constant loss. Two directions in weight space produce the same first-order change in loss precisely when their difference is perpendicular to $\nabla L$ - that is, when their difference is tangent to the level surface. Understanding the geometry of level surfaces helps understand which parameter changes actually matter.
Section 13: When Partial Derivatives Are Not Enough
A subtle but important point: the existence of all partial derivatives does not guarantee that the function behaves as expected.
Consider:
$$f(x, y) = \begin{cases} \frac{xy}{x^2 + y^2} & (x, y) \neq (0, 0) \\ 0 & (x, y) = (0, 0) \end{cases}$$
Compute the partial derivatives at the origin:
$$\frac{\partial f}{\partial x}(0, 0) = \lim_{h \to 0} \frac{f(h, 0) - f(0, 0)}{h} = \lim_{h \to 0} \frac{0 - 0}{h} = 0$$
Similarly $\frac{\partial f}{\partial y}(0, 0) = 0$.
Both partial derivatives exist. But this function is not continuous at the origin. Approach along the line $y = x$:
$$f(t, t) = \frac{t \cdot t}{t^2 + t^2} = \frac{t^2}{2t^2} = \frac{1}{2}$$
So as $(x, y) \to (0, 0)$ along $y = x$, $f$ approaches $1/2$, not $0 = f(0, 0)$.
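The mismatch between the axes and the diagonal shows up immediately if you evaluate the function numerically. A short sketch (the values of `h` are arbitrary):

```python
def f(x, y):
    if x == 0 and y == 0:
        return 0.0
    return x * y / (x * x + y * y)


for h in (1e-1, 1e-4, 1e-8):
    partial_x = (f(h, 0.0) - f(0.0, 0.0)) / h  # difference quotient along the x-axis
    partial_y = (f(0.0, h) - f(0.0, 0.0)) / h  # difference quotient along the y-axis
    diagonal = f(h, h)                          # function value on the line y = x
    print(h, partial_x, partial_y, diagonal)
```

However small `h` becomes, the two axis quotients stay at zero while the value on the diagonal stays at one half, never approaching $f(0, 0) = 0$.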
Discomfort check. How can both partial derivatives be zero (suggesting no change at the origin) while the function is not even continuous there? Because partial derivatives measure change in only two directions - along the axes. They say nothing about what happens in diagonal directions. A function can behave perfectly well along the axes but badly along diagonals. This is why the full notion of differentiability (the Fréchet derivative) requires the function to be well-approximated by a linear map in all directions simultaneously, not just along coordinate axes.
The correct notion: $f$ is differentiable at $x$ if there exists a linear map $L: \mathbb{R}^n \to \mathbb{R}$ such that:
$$\lim_{|\delta| \to 0} \frac{|f(x + \delta) - f(x) - L(\delta)|}{|\delta|} = 0$$
When this holds, $L(\delta) = \nabla f(x) \cdot \delta$, and the gradient is exactly this linear map. For functions with continuous partial derivatives (the case you will encounter in practice), differentiability is guaranteed.
Section 14: The Gradient in Machine Learning
Let us see what the gradient looks like in the context of machine learning, where $f$ is a loss function with millions of inputs.
A neural network has parameters $\theta \in \mathbb{R}^n$ where $n$ might be $10^8$. The loss $L(\theta)$ takes this hundred-million-dimensional parameter vector and outputs a scalar. The gradient $\nabla_\theta L$ is a vector of the same dimension $n$, telling you how much the loss changes per unit change in each parameter.
The training loop is gradient descent:
$$\theta_{t+1} = \theta_t - \alpha \nabla_\theta L(\theta_t)$$
Each parameter $\theta_i$ is updated by subtracting the corresponding component of the gradient, scaled by the learning rate. Parameters to which the loss is more sensitive (larger $|\partial L / \partial \theta_i|$) get adjusted more. Parameters that do not affect the loss locally ($\partial L / \partial \theta_i = 0$) are not adjusted at all.
Computing the gradient efficiently. Naively, you could compute each partial derivative separately by perturbing one parameter at a time and measuring the change in loss. That would require $n$ forward passes through the network, which for $n = 10^8$ is not feasible.
Backpropagation computes all $n$ partial derivatives in a single backward pass through the network, using the chain rule. This is the algorithmic trick that makes neural network training feasible. The chain rule for partial derivatives, applied recursively layer by layer, gives the full gradient at a cost comparable to a small constant number of forward passes, rather than $n$ of them.
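To see the idea at toy scale, consider a hypothetical two-parameter "network" (not from the text): $h = \tanh(w_1 x)$, $\hat{y} = w_2 h$, $L = \frac{1}{2}(\hat{y} - y)^2$. The sketch below writes out the backward pass by hand and compares it with finite differences; all names and numbers are assumptions made for the example.

```python
import math


def forward(w1, w2, x, y):
    """Forward pass; returns the loss and the intermediates the backward pass reuses."""
    h = math.tanh(w1 * x)
    y_hat = w2 * h
    loss = 0.5 * (y_hat - y) ** 2
    return loss, (h, y_hat)


def backward(w1, w2, x, y, cache):
    """One backward sweep: chain rule applied from the loss back to each weight."""
    h, y_hat = cache
    d_y_hat = y_hat - y            # dL/dy_hat
    d_w2 = d_y_hat * h             # dL/dw2
    d_h = d_y_hat * w2             # dL/dh
    d_w1 = d_h * (1 - h ** 2) * x  # dL/dw1, using tanh'(z) = 1 - tanh(z)^2
    return d_w1, d_w2


w1, w2, x, y = 0.5, -1.2, 0.8, 0.3
loss, cache = forward(w1, w2, x, y)
grads = backward(w1, w2, x, y, cache)

# Finite-difference comparison: two extra forward passes per parameter.
eps = 1e-6
fd_w1 = (forward(w1 + eps, w2, x, y)[0] - forward(w1 - eps, w2, x, y)[0]) / (2 * eps)
fd_w2 = (forward(w1, w2 + eps, x, y)[0] - forward(w1, w2 - eps, x, y)[0]) / (2 * eps)
print(grads, (fd_w1, fd_w2))
```

The backward pass reuses the intermediate values from one forward pass and produces every partial derivative at once, whereas the finite-difference route needs extra forward passes for each parameter.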
Section 15: The Gradient and Level Curves - A Complete Picture
Let us work out the geometry of the gradient completely for a specific function, so the picture is unambiguous.
Take $f(x, y) = x^2 + 4y^2$. This is an elliptic paraboloid - it looks like a bowl that is wider in the $x$-direction than in the $y$-direction, because $f$ rises four times as fast along $y$.
Level curves. The level curve $f = c$ is the ellipse $x^2 + 4y^2 = c$. As $c$ increases from 0, the ellipses grow outward. The curves are closer together where $f$ changes fast and farther apart where it changes slowly.
The gradient field. $\nabla f = (2x, 8y)$.
At the point $(2, 1)$: $\nabla f = (4, 8)$. The gradient points mostly in the $y$-direction. Why? Because $f$ increases much faster in the $y$-direction (coefficient 4 in $4y^2$) than in the $x$-direction (coefficient 1 in $x^2$). At this point $f$ increases at a rate of 8 per unit of $y$, but only 4 per unit of $x$.
Perpendicularity check. The level curve through $(2, 1)$ is $x^2 + 4y^2 = 8$. Parameterize it near $(2, 1)$: differentiating $x^2 + 4y^2 = 8$ implicitly gives $2x + 8y\frac{dy}{dx} = 0$, so $\frac{dy}{dx} = -\frac{x}{4y} = -\frac{2}{4} = -\frac{1}{2}$. The tangent direction to the level curve at $(2, 1)$ is proportional to $(1, -1/2)$, or equivalently $(2, -1)$.
Check perpendicularity: $(4, 8) \cdot (2, -1) = 8 - 8 = 0$. The gradient $(4, 8)$ is perpendicular to the tangent $(2, -1)$. Confirmed.
Following the gradient uphill. Starting from $(1, 0.5)$, a gradient ascent step with step size $\alpha = 0.1$ takes us away from the minimum:
Step 1: $\nabla f(1, 0.5) = (2, 4)$. Moving in gradient direction: $(1, 0.5) + 0.1 \cdot (2, 4) = (1.2, 0.9)$. The function value increases from $f(1, 0.5) = 1 + 1 = 2$ to $f(1.2, 0.9) = 1.44 + 3.24 = 4.68$. Confirmed: following $+\nabla f$ increases $f$.
Gradient descent moves in $-\nabla f$: $(1, 0.5) - 0.1 \cdot (2, 4) = (0.8, 0.1)$. Function value: $f(0.8, 0.1) = 0.64 + 0.04 = 0.68 < 2$. Confirmed: moving opposite to gradient decreases $f$.
Why gradient descent can be slow. Notice the step moved much more in the $y$-direction (from 0.5 to 0.1, a change of 0.4) than in the $x$-direction (from 1 to 0.8, a change of 0.2). But the minimum is at $(0, 0)$, and reaching it from $(1, 0.5)$ requires twice as much movement in $x$ as in $y$. The gradient is “pulling” us too hard in $y$ relative to $x$, so the step does not point straight at the minimum. This is the condition number problem - a preview of the Hessian material.
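Iterating the descent step from $(1, 0.5)$ a few more times makes the imbalance visible. A minimal sketch with the same step size (five iterations is an arbitrary choice):

```python
def grad_f(x, y):
    return (2 * x, 8 * y)


x, y = 1.0, 0.5
alpha = 0.1
for step in range(1, 6):
    gx, gy = grad_f(x, y)
    x, y = x - alpha * gx, y - alpha * gy
    print(step, round(x, 4), round(y, 4), round(x ** 2 + 4 * y ** 2, 4))
```

Each step multiplies $x$ by $0.8$ but $y$ by $0.2$, so $y$ collapses almost immediately while $x$ creeps toward zero. A larger $\alpha$ would speed up $x$, but for this function any $\alpha$ above $0.25$ makes the $y$ update diverge, so the steep direction limits the step size for everyone.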
Section 16: Natural Gradient and Riemannian Geometry
The gradient as we have defined it depends on the geometry of the parameter space. We assumed the standard Euclidean metric: the distance between $\theta$ and $\theta + \delta$ is $|\delta|_2 = \sqrt{\sum_i \delta_i^2}$.
But this is not always the right metric. Parameters of a probability distribution live in a curved space where the natural notion of distance is the Fisher information metric.
The problem with Euclidean gradient for probability models. Suppose you have a parameterized distribution $p(x; \theta)$. Moving in the direction $\delta\theta$ in parameter space changes the distribution. But the same Euclidean step $|\delta\theta| = 0.1$ might cause a large change in distribution for some directions and a small change for others, depending on how the parameters affect the distribution shape.
The natural gradient. Define the Fisher information matrix:
$$F(\theta) = \mathbb{E}_{x \sim p(x;\theta)}\left[\nabla_\theta \log p(x;\theta) \, \nabla_\theta \log p(x;\theta)^T\right]$$
This matrix measures the curvature of the probability model with respect to its parameters. The natural gradient is:
$$\tilde{\nabla}_\theta L = F(\theta)^{-1} \nabla_\theta L$$
This rescales the gradient by the inverse Fisher information, correcting for the non-Euclidean geometry of the space of probability distributions.
Why it matters. The natural gradient update $\theta \leftarrow \theta - \alpha F^{-1} \nabla L$ takes steps of equal size in distribution space (measured by KL divergence), not in parameter space. This can dramatically speed up convergence when different parameters have very different effects on the distribution.
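A one-parameter case shows the rescaling at work. The sketch below assumes a Bernoulli model with $P(x = 1) = p$ and takes the loss to be the average negative log-likelihood of data in which a fraction $m$ of the observations equal 1; the model, the value of `m`, and the function names are all choices made for this illustration.

```python
def score(x, p):
    """d/dp of log p(x; p) for a Bernoulli distribution with P(x = 1) = p."""
    return x / p - (1 - x) / (1 - p)


def fisher(p):
    """E[score^2], computed exactly by summing over the two outcomes x = 0, 1."""
    return (1 - p) * score(0, p) ** 2 + p * score(1, p) ** 2


def grad_nll(p, m):
    """Gradient of the average negative log-likelihood when a fraction m of the data is 1."""
    return -(m * score(1, p) + (1 - m) * score(0, p))


m = 0.3  # hypothetical fraction of ones in the data
for p in (0.01, 0.5, 0.9):
    euclidean = grad_nll(p, m)
    natural = euclidean / fisher(p)  # F is 1x1 here, so F^{-1} is a division
    print(p, euclidean, natural)
```

The Euclidean gradient blows up as $p$ approaches 0 or 1, while the natural gradient in this model simplifies to $p - m$: a step that depends only on how far the model is from the data, not on where in parameter space you happen to be.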
Connection to Newton’s method. For exponential family models in their natural parameterization, the Fisher information equals the Hessian of the negative log-likelihood. So when $L$ is the negative log-likelihood, the natural gradient update is exactly a Newton step for those models.
In practice, the Fisher matrix is approximated (for example by the Kronecker-factored blocks of K-FAC, or by the diagonal approximation used in EWC) because computing $F^{-1}$ exactly is as expensive as inverting the Hessian. But the idea that “the right metric for optimization is the one that respects the geometry of the output space” is fundamental.
Discomfort check. The standard gradient $\nabla L$ implicitly uses the identity metric on parameter space - it treats all parameter directions as equally important. The natural gradient $F^{-1} \nabla L$ uses the Fisher metric - it treats parameter directions as important in proportion to their effect on the model distribution. Neither is more “correct” mathematically; the choice of metric is a modeling decision. The Euclidean gradient is simple and cheap. The natural gradient is geometrically principled but expensive. Most deep learning uses the Euclidean gradient with various heuristics (Adam, learning rate schedules) to compensate for ignoring this geometry.
Section 17: Numerical Gradient Checking
When implementing gradient computations (in a custom layer or loss function), how do you verify that your gradient is correct?
The answer: finite differences. For small $h > 0$, the partial derivative is approximated by:
$$\frac{\partial f}{\partial x_i} \approx \frac{f(x + h e_i) - f(x - h e_i)}{2h}$$
This is the centered difference approximation. It is more accurate than the one-sided approximation $\frac{f(x + h e_i) - f(x)}{h}$ because the error is $O(h^2)$ rather than $O(h)$.
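The difference in accuracy is easy to see on a function whose derivative is known exactly. The sketch below uses $\sin$ at $x = 1$, an arbitrary test case, and prints the error of both approximations as $h$ shrinks by factors of ten.

```python
import math


def one_sided(func, x, h):
    return (func(x + h) - func(x)) / h


def centered(func, x, h):
    return (func(x + h) - func(x - h)) / (2 * h)


x = 1.0
exact = math.cos(x)  # exact derivative of sin at x
errors = {}
for h in (1e-1, 1e-2, 1e-3):
    err_one = abs(one_sided(math.sin, x, h) - exact)
    err_cen = abs(centered(math.sin, x, h) - exact)
    errors[h] = (err_one, err_cen)
    print(h, err_one, err_cen)
```

Each tenfold reduction in $h$ should cut the one-sided error by roughly ten and the centered error by roughly a hundred, matching $O(h)$ versus $O(h^2)$. For much smaller $h$, floating-point rounding eventually dominates and both errors start growing again.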
To check your analytically computed gradient $\nabla f$ against the numerical gradient:
- Pick a random test point $x$.
- For each coordinate $i$, compute the finite difference approximation $g_i$.
- Compute the relative error: $\frac{|\nabla f(x) - g|}{|\nabla f(x)| + |g|}$.
- If this is less than $10^{-5}$ or so (for 64-bit floating point), the gradient is likely correct.
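The checklist above can be sketched as follows. The function under test is $f(x, y, z) = xyz$ from the worked examples; the deliberately buggy gradient, the helper names, and the fixed test point are assumptions added for illustration (a random point would be more typical, but a fixed one keeps the run reproducible).

```python
import math


def numerical_gradient(func, x, h=1e-5):
    """Centered differences, one coordinate at a time (2n function evaluations)."""
    grad = []
    for i in range(len(x)):
        plus = list(x)
        minus = list(x)
        plus[i] += h
        minus[i] -= h
        grad.append((func(plus) - func(minus)) / (2 * h))
    return grad


def relative_error(a, b):
    """|a - b| / (|a| + |b|) with Euclidean norms; 0 if both vectors are zero."""
    diff = math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))
    scale = math.sqrt(sum(ai ** 2 for ai in a)) + math.sqrt(sum(bi ** 2 for bi in b))
    return diff / scale if scale > 0 else 0.0


def f(x):
    return x[0] * x[1] * x[2]


def grad_correct(x):
    return [x[1] * x[2], x[0] * x[2], x[0] * x[1]]


def grad_buggy(x):
    # Deliberate sign error in the second component.
    return [x[1] * x[2], -x[0] * x[2], x[0] * x[1]]


x = [2.0, 3.0, 4.0]
g = numerical_gradient(f, x)
print(relative_error(grad_correct(x), g))  # expected to be tiny
print(relative_error(grad_buggy(x), g))    # expected to be far above 1e-5
```

A correct gradient gives a relative error many orders of magnitude below the threshold, while a single sign error pushes it to order one, so the check separates the two cases cleanly.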
The catch. Computing the numerical gradient requires $2n$ function evaluations (one at $x + h e_i$ and one at $x - h e_i$ for each of the $n$ parameters). For a neural network with $n = 10^8$ parameters, this is completely infeasible in production.
Gradient checking is used during development: you test on a small network or a small subset of parameters to verify correctness, then run with the analytical gradient in production.
Common bugs caught by gradient checking:
- Sign errors: $+\nabla f$ instead of $-\nabla f$.
- Missing chain rule terms: forgetting to multiply by an upstream gradient.
- Wrong shapes: transposing a matrix gradient when you should not.
- Broadcasting errors: summing over the wrong axis.
The practice of gradient checking is a concrete application of the limit definition of the derivative: the partial derivative is approximated by a small finite difference, and if your analytical formula agrees with this approximation, you have the right formula.
Summary
| Concept | Single-Variable Analog | Multivariable Version |
|---|---|---|
| Function type | $f: \mathbb{R} \to \mathbb{R}$ | $f: \mathbb{R}^n \to \mathbb{R}$ |
| Derivative | $f'(x)$ - a number | $\nabla f(x)$ - a vector in $\mathbb{R}^n$ |
| Rate of change | One number (the slope) | Depends on direction |
| Partial derivative | $f'(x)$ (differentiate in $x$) | $\frac{\partial f}{\partial x_i}$ (freeze others, differentiate in $x_i$) |
| Gradient | $f'(x)$ (one component) | $\nabla f = (\partial f/\partial x_1, \ldots, \partial f/\partial x_n)$ |
| Directional derivative | $f'(x) \cdot \text{sign}(v)$ | $\nabla f \cdot v$ (dot product) |
| Steepest ascent direction | $\text{sign}(f'(x))$ | $\nabla f / |\nabla f|$ |
| Critical point | $f'(c) = 0$ | $\nabla f(c) = 0$ |
| Descent update | $x \leftarrow x - \alpha f'(x)$ | $x \leftarrow x - \alpha \nabla f(x)$ |
| Chain rule | $\frac{d}{dt} f(x(t)) = f'(x) x'(t)$ | $\frac{d}{dt} f(x(t)) = \nabla f \cdot x'(t)$ |
| Linear approximation | $f(x+h) \approx f(x) + f'(x)h$ | $f(x+\delta) \approx f(x) + \nabla f \cdot \delta$ |
The gradient is not a mysterious new object. It is the natural extension of the derivative to functions of multiple variables. Everywhere $f'(x)$ appears in single-variable calculus, $\nabla f(x)$ appears in its place - a vector where there used to be a scalar, a dot product where there used to be a product. The intuition transfers completely.