Derivatives - The Geometry of Instantaneous Change
Helpful context:
- Limits & Continuity - What a Function Intends to Do
- The Number e - The Base That Is Its Own Derivative
A ball falls. Its height is $s(t) = 5t^2$ meters after $t$ seconds. You ask: how fast is it moving at the instant $t = 2$?
Not over the first two seconds. Not over the next half second. Right now, at $t = 2$.
This is the velocity problem, and it is the question that invented calculus. Average velocity is easy - distance over time. But instantaneous velocity requires dividing zero distance by zero time, which is $0/0$, which is meaningless. Unless you treat it as a limit.
Compute the average velocity from $t = 2$ to $t = 3$:
$$\frac{s(3) - s(2)}{3 - 2} = \frac{45 - 20}{1} = 25 \text{ m/s}.$$
Too coarse. Narrow the interval to $t = 2$ to $t = 2.1$:
$$\frac{s(2.1) - s(2)}{0.1} = \frac{22.05 - 20}{0.1} = 20.5 \text{ m/s}.$$
Better. Now from $t = 2$ to $t = 2 + h$ for a general small $h$:
$$\frac{s(2+h) - s(2)}{h} = \frac{5(2+h)^2 - 20}{h} = \frac{20h + 5h^2}{h} = 20 + 5h.$$
The $h$ in the denominator cancelled. Now take $h \to 0$: this approaches $20$. The instantaneous velocity at $t = 2$ is exactly $20$ m/s.
We just computed $s'(2)$ without calling it that. The derivative is this limit. Every piece of notation, every rule, every theorem in differential calculus is an elaboration of the maneuver we just performed.
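The shrinking-interval computation above is easy to replay numerically. The following is a minimal sketch; the function names `s` and `average_velocity` are illustrative choices, not anything fixed by the text.

```python
def s(t):
    """Height function from the text: s(t) = 5 t^2 (meters)."""
    return 5.0 * t**2


def average_velocity(t0, h):
    """Average velocity of s over the interval [t0, t0 + h]."""
    return (s(t0 + h) - s(t0)) / h


# Shrink the interval and watch the average velocity settle near 20 m/s.
for h in [1.0, 0.1, 0.01, 0.001, 1e-6]:
    print(f"h = {h:<8g} average velocity = {average_velocity(2.0, h):.6f}")
```

The printed values follow the pattern $20 + 5h$ from the algebra, up to floating-point rounding.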
Section 1: The Definition
Fix a function $f$ and a point $x$. Choose a nearby point $x + h$. The secant line through $(x, f(x))$ and $(x+h, f(x+h))$ has slope:
$$m_{\text{sec}} = \frac{f(x+h) - f(x)}{h}.$$
The numerator is the rise; the denominator is the run; the ratio is how much $f$ changes per unit change in input, on average over the interval $[x, x+h]$.
f(x) = x² with secant lines converging to the tangent. At h=1 the secant (red) has slope 5; at h=0.5 (orange) slope 4.5; as h→0 both approach the tangent (gray) with slope 4 at x=2.
As $h \to 0$, the second point slides along the curve toward the first. The secant line rotates. In the limit, it approaches the tangent line - the line that just touches the curve at $(x, f(x))$ with the same slope the curve has at that point.
Definition (The Derivative). The derivative of $f$ at $x$ is:
$$f'(x) = \lim_{h \to 0} \frac{f(x+h) - f(x)}{h},$$
provided this limit exists. If it does, $f$ is differentiable at $x$. The derivative is the slope of the tangent line.
Discomfort check. At $h = 0$, the fraction $\frac{f(x+h) - f(x)}{h}$ becomes $\frac{0}{0}$, which is undefined. So how can we “take the limit as $h \to 0$”?
This is precisely the subtlety of limits: we ask what value the expression approaches as $h \to 0$, never what it equals at $h = 0$. The expression is not defined at $h = 0$ - and it doesn’t need to be. The limit is defined by the behavior near $h = 0$.
For polynomials, algebraic cancellation removes $h$ from the denominator before we let $h \to 0$. In the falling-ball example, $(20h + 5h^2)/h = 20 + 5h$ - now there is no division by zero, and the limit is straightforwardly $20$. The limit “works” because the numerator and denominator vanish in a coordinated way: the numerator $20h + 5h^2 = h(20 + 5h)$ carries a factor of $h$ that matches the $h$ in the denominator. Cancelling it is legitimate whenever $h \neq 0$, and we only take the limit after that cancellation.
Notations
The derivative has accumulated notation across three centuries of use. All of the following mean the same thing:
$$f'(x), \quad \frac{df}{dx}, \quad \frac{d}{dx}f(x), \quad Df(x), \quad \dot{f}(x) \text{ (physics, time derivatives)}.$$
Leibniz’s $\frac{df}{dx}$ is the most flexible. When you write $\frac{dy}{dx}$, you are thinking of $y = f(x)$ and asking how $y$ changes per unit change in $x$. Leibniz notation is essential for the chain rule.
Section 2: The Derivative as a Function
The derivative $f'(a)$ at a specific point $a$ is a number. But we can ask: at which points does $f$ have a derivative, and what is its value? This gives a new function $f': x \mapsto f'(x)$, defined wherever the limit exists. The process of finding $f'$ from $f$ is differentiation.
Example. Let $f(x) = x^2$. Compute $f'(x)$ from the definition:
$$f'(x) = \lim_{h \to 0} \frac{(x+h)^2 - x^2}{h} = \lim_{h \to 0} \frac{x^2 + 2xh + h^2 - x^2}{h} = \lim_{h \to 0} \frac{2xh + h^2}{h} = \lim_{h \to 0}(2x + h) = 2x.$$
So $(x^2)' = 2x$. This is itself a function of $x$. It tells us:
- At $x = 2$: the parabola $y = x^2$ has slope $4$.
- At $x = -3$: slope $-6$. The function is steeply decreasing.
- At $x = 0$: slope $0$. The tangent is horizontal - this is the vertex of the parabola.
The entire shape of the parabola is encoded in $f'(x) = 2x$: positive slope on the right, negative on the left, zero at the bottom.
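The idea that differentiation turns one function into another can be mirrored directly in code. This is only a sketch: `derivative` is a hypothetical helper that approximates $f'$ with a central difference quotient, and the step size `h` is an arbitrary illustrative choice.

```python
def derivative(f, h=1e-5):
    """Return a new function that approximates f' with a central difference.

    The step size h is an illustrative choice, not a tuned value.
    """
    def f_prime(x):
        return (f(x + h) - f(x - h)) / (2.0 * h)
    return f_prime


def square(x):
    return x**2


slope = derivative(square)  # slope is itself a function of x

for x in (2.0, -3.0, 0.0):
    print(f"x = {x:5.1f}  numerical slope = {slope(x):9.6f}  exact 2x = {2 * x:9.6f}")
```

The numerical slopes should agree with $2x$ to many decimal places at each of the three points discussed above.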
Section 3: Differentiation Rules
Computing from the limit definition every time would be tedious. We derive rules once and apply them freely.
The Constant Rule
$(c)' = 0$ for any constant $c$.
A constant function never changes. Formally: $\frac{c - c}{h} = 0$ for all $h$, so the limit is $0$.
The Power Rule
$$(x^n)' = nx^{n-1}.$$
Proof (for positive integer $n$). Use the binomial theorem:
$$(x+h)^n = x^n + nx^{n-1}h + \binom{n}{2}x^{n-2}h^2 + \cdots + h^n.$$
Subtract $x^n$ and divide by $h$:
$$\frac{(x+h)^n - x^n}{h} = nx^{n-1} + \binom{n}{2}x^{n-2}h + \cdots + h^{n-1}.$$
Every term except the first contains $h$ as a factor. As $h \to 0$, they all vanish. The limit is $nx^{n-1}$. $\blacksquare$
The rule extends to all real exponents $n \in \mathbb{R}$, at least for $x > 0$ (proved via logarithmic differentiation):
- $(x^3)' = 3x^2$
- $(x^{1/2})' = \frac{1}{2}x^{-1/2} = \frac{1}{2\sqrt{x}}$
- $(x^{-1})' = -x^{-2} = -\frac{1}{x^2}$
- $(x^{\pi})' = \pi x^{\pi - 1}$
The Sum Rule and Linearity
$$(f + g)' = f' + g', \quad (cf)' = cf'.$$
Differentiation is a linear operation: it commutes with addition and scalar multiplication. The limit of a sum is the sum of the limits. This means:
$$(3x^4 - 7x^2 + 5)' = 12x^3 - 14x.$$
The Product Rule
$$(fg)' = f'g + fg'.$$
Proof. Add and subtract $f(x)g(x+h)$ in the numerator:
$$\frac{f(x+h)g(x+h) - f(x)g(x)}{h}$$
$$= \frac{f(x+h)g(x+h) - f(x)g(x+h) + f(x)g(x+h) - f(x)g(x)}{h}$$
$$= \frac{f(x+h) - f(x)}{h} \cdot g(x+h) + f(x) \cdot \frac{g(x+h) - g(x)}{h}.$$
As $h \to 0$: the first fraction converges to $f'(x)$; $g(x+h) \to g(x)$ (by continuity, since differentiability implies continuity); the second fraction converges to $g'(x)$. $\blacksquare$
Intuition. If both $f$ and $g$ change by small amounts $\Delta f$ and $\Delta g$, the product changes by:
$$\Delta(fg) = (f + \Delta f)(g + \Delta g) - fg = f\Delta g + g\Delta f + \Delta f \cdot \Delta g.$$
The last term $\Delta f \cdot \Delta g$ is second-order small and vanishes in the limit. What remains: $f\Delta g + g\Delta f$, which is $fg' + gf'$.
Mnemonic. “Derivative of first times second, plus first times derivative of second.”
Example. $(x^2 \sin x)' = 2x \sin x + x^2 \cos x$.
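The claim in the intuition above, that the cross term $\Delta f \cdot \Delta g$ is second-order small, can be watched numerically. The sketch below uses the example $x^2 \sin x$ at $x = 1$; the helper name `product_increment_terms` is hypothetical.

```python
import math


def product_increment_terms(f, g, x, h):
    """Split the change in f*g over [x, x + h] into first- and second-order parts."""
    df = f(x + h) - f(x)
    dg = g(x + h) - g(x)
    first_order = f(x) * dg + g(x) * df   # f*dg + g*df
    second_order = df * dg                # the cross term
    total = f(x + h) * g(x + h) - f(x) * g(x)
    return total, first_order, second_order


f = lambda x: x**2
g = math.sin
x = 1.0
exact = 2.0 * x * math.sin(x) + x**2 * math.cos(x)  # product rule value

for h in (0.1, 0.01, 0.001):
    total, first, second = product_increment_terms(f, g, x, h)
    print(f"h = {h:<6g} total/h = {total / h:.6f}  first/h = {first / h:.6f}  cross/h = {second / h:.6f}")
print(f"product rule value: {exact:.6f}")
```

As $h$ shrinks, the cross term divided by $h$ heads to zero while the first-order part divided by $h$ approaches the product rule value.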
The Quotient Rule
$$\left(\frac{f}{g}\right)' = \frac{f'g - fg'}{g^2}, \quad g \neq 0.$$
Derivation. Write $f/g = f \cdot g^{-1}$ and apply the product rule:
$$(f \cdot g^{-1})' = f' \cdot g^{-1} + f \cdot (g^{-1})'.$$
We need $(g^{-1})' = -g'/g^2$ (proved by implicit differentiation or by the chain rule once we have it). Substituting:
$$= \frac{f'}{g} - \frac{fg'}{g^2} = \frac{f'g - fg'}{g^2}. \quad \blacksquare$$
Mnemonic. “Lo d-hi minus hi d-lo, over lo-squared.” ($\frac{\text{denom} \cdot (\text{numer})' - \text{numer} \cdot (\text{denom})'}{(\text{denom})^2}$.)
The Chain Rule
This is the most important rule in calculus.
Theorem. If $g$ is differentiable at $x$ and $f$ is differentiable at $g(x)$, then:
$$\frac{d}{dx}f(g(x)) = f'(g(x)) \cdot g'(x).$$
In Leibniz notation: if $y = f(u)$ and $u = g(x)$, then:
$$\frac{dy}{dx} = \frac{dy}{du} \cdot \frac{du}{dx}.$$
Discomfort check. The Leibniz form $\frac{dy}{du} \cdot \frac{du}{dx}$ looks like you are “canceling” the $du$’s, as if they were ordinary fractions. You cannot literally do this - $dy$, $du$, $dx$ are not numbers, they are notation for limiting processes. But the notation was chosen so that the chain rule looks like cancellation, which is why the mnemonic works. It becomes a genuine statement when we move to differentials in differential geometry - but that requires more machinery. For now: the cancellation is a mnemonic, not a proof.
Proof sketch. The naive proof writes:
$$\frac{f(g(x+h)) - f(g(x))}{h} = \frac{f(g(x+h)) - f(g(x))}{g(x+h) - g(x)} \cdot \frac{g(x+h) - g(x)}{h}.$$
As $h \to 0$, the right factor approaches $g'(x)$, and the left factor approaches $f'(g(x))$. But this argument has a flaw: if $g(x+h) = g(x)$ for arbitrarily small $h \neq 0$, we are dividing by zero. The rigorous proof handles this by rewriting $f$ near $g(x)$ as $f(g(x)) + [f'(g(x)) + \varepsilon(k)] \cdot k$ where $\varepsilon(k) \to 0$ as $k \to 0$.
Examples.
$\displaystyle\frac{d}{dx}\sin(x^2) = \cos(x^2) \cdot 2x$
Here: outer function $f(u) = \sin u$, inner function $g(x) = x^2$. $f'(u) = \cos u$, $g'(x) = 2x$. Chain rule: $\cos(g(x)) \cdot g'(x) = \cos(x^2) \cdot 2x$.
$\displaystyle\frac{d}{dx}e^{-x^2/2} = e^{-x^2/2} \cdot \left(-x\right)$
This is the derivative of the Gaussian kernel - it appears constantly in probability and statistics. Outer: $f(u) = e^u$, inner: $g(x) = -x^2/2$. $f'(u) = e^u$, $g'(x) = -x$. Chain rule: $e^{-x^2/2} \cdot (-x)$.
The chain rule generalizes to any number of compositions:
$$\frac{d}{dx}f(g(h(x))) = f'(g(h(x))) \cdot g'(h(x)) \cdot h'(x).$$
This is the mathematical structure of backpropagation - more on this in Section 13.
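The multi-layer chain rule amounts to one pass through the composition, multiplying local slopes as you go. Here is a small sketch under assumed choices: the composition $\sin(e^{x^2})$ is just an illustration, and `chain_derivative` is a hypothetical helper.

```python
import math


def chain_derivative(layers, x):
    """Differentiate a composition given as [(func, dfunc), ...], innermost first.

    Returns the value of the composition at x and its derivative there.
    """
    value = x
    derivative = 1.0
    for func, dfunc in layers:
        derivative *= dfunc(value)  # local slope, evaluated at this layer's input
        value = func(value)
    return value, derivative


layers = [
    (lambda x: x**2, lambda x: 2.0 * x),  # h(x) = x^2
    (math.exp, math.exp),                 # g(u) = e^u
    (math.sin, math.cos),                 # f(v) = sin v
]

value, slope = chain_derivative(layers, 0.5)
closed_form = math.cos(math.exp(0.25)) * math.exp(0.25) * (2.0 * 0.5)
print(f"chain rule product = {slope:.10f}, closed form = {closed_form:.10f}")
```

Each factor in the running product is one of the terms $f'(g(h(x)))$, $g'(h(x))$, $h'(x)$, evaluated at the input that layer actually receives.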
Section 4: Standard Function Derivatives
Trigonometric Functions
Use the addition formula $\sin(x+h) = \sin x \cos h + \cos x \sin h$:
$$\frac{\sin(x+h) - \sin x}{h} = \sin x \cdot \frac{\cos h - 1}{h} + \cos x \cdot \frac{\sin h}{h}.$$
Two fundamental limits (proved via the squeeze theorem and geometric inequalities):
$$\lim_{h \to 0}\frac{\sin h}{h} = 1, \qquad \lim_{h \to 0}\frac{\cos h - 1}{h} = 0.$$
Therefore $(\sin x)' = \sin x \cdot 0 + \cos x \cdot 1 = \cos x$.
Similarly, $(\cos x)' = -\sin x$.
For the tangent: $(\tan x)' = \sec^2 x$. Proof: apply the quotient rule to $\sin x / \cos x$.
The Exponential Function
The number $e \approx 2.71828\ldots$ is defined by the property $\lim_{h \to 0} \frac{e^h - 1}{h} = 1$. This gives:
$$\frac{d}{dx}e^x = \lim_{h \to 0}\frac{e^{x+h} - e^x}{h} = e^x \cdot \lim_{h \to 0}\frac{e^h - 1}{h} = e^x \cdot 1 = e^x.$$
The exponential function is its own derivative. This is not a coincidence - it is essentially the definition of $e$. No other base has this property. The exponential is the unique function (up to scaling) satisfying $f' = f$.
For a general base: $(a^x)' = a^x \ln a$. (Follows from writing $a^x = e^{x \ln a}$ and applying the chain rule.)
The Natural Logarithm
Using the inverse function theorem: if $f$ is differentiable and invertible with $f' \neq 0$, then $(f^{-1})'(x) = \frac{1}{f'(f^{-1}(x))}$.
Apply to $f(x) = e^x$, so $f^{-1}(x) = \ln x$:
$$(\ln x)' = \frac{1}{e^{\ln x}} = \frac{1}{x}, \quad x > 0.$$
Summary table of standard derivatives:
| Function | Derivative |
|---|---|
| $c$ (constant) | $0$ |
| $x^n$ | $nx^{n-1}$ |
| $e^x$ | $e^x$ |
| $\ln x$ | $1/x$ |
| $\sin x$ | $\cos x$ |
| $\cos x$ | $-\sin x$ |
| $\tan x$ | $\sec^2 x$ |
| $a^x$ | $a^x \ln a$ |
| $\log_a x$ | $\frac{1}{x \ln a}$ |
| $\arcsin x$ | $\frac{1}{\sqrt{1-x^2}}$ |
| $\arctan x$ | $\frac{1}{1+x^2}$ |
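A table like this is easy to spot-check with difference quotients. The sketch below compares each entry against a central difference at $x = 0.5$; the base $a = 2$, the exponent $n = 3$, the constant $7$, and the step size are arbitrary illustrative choices.

```python
import math


def central_difference(f, x, h=1e-6):
    """Approximate f'(x) with a symmetric difference quotient."""
    return (f(x + h) - f(x - h)) / (2.0 * h)


A = 2.0  # base used for a^x and log_a x (arbitrary illustrative choice)

# (label, function, derivative claimed in the table)
TABLE = [
    ("c",        lambda x: 7.0,            lambda x: 0.0),
    ("x^3",      lambda x: x**3,           lambda x: 3.0 * x**2),
    ("e^x",      math.exp,                 math.exp),
    ("ln x",     lambda x: math.log(x),    lambda x: 1.0 / x),
    ("sin x",    math.sin,                 math.cos),
    ("cos x",    math.cos,                 lambda x: -math.sin(x)),
    ("tan x",    math.tan,                 lambda x: 1.0 / math.cos(x) ** 2),
    ("a^x",      lambda x: A**x,           lambda x: A**x * math.log(A)),
    ("log_a x",  lambda x: math.log(x, A), lambda x: 1.0 / (x * math.log(A))),
    ("arcsin x", math.asin,                lambda x: 1.0 / math.sqrt(1.0 - x**2)),
    ("arctan x", math.atan,                lambda x: 1.0 / (1.0 + x**2)),
]

x0 = 0.5  # inside the domain of every function in the table
errors = {label: abs(central_difference(f, x0) - df(x0)) for label, f, df in TABLE}
for label, err in errors.items():
    print(f"{label:9s} |numerical - table| = {err:.2e}")
```

Agreement at one point proves nothing, but a mismatch would flag a misremembered formula immediately.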
Section 5: Geometric Interpretation - Tangent Lines and Linearization
The tangent line to $y = f(x)$ at $x = a$ is:
$$y = f(a) + f'(a)(x - a).$$
This is the best linear approximation to $f$ near $a$. How good is it? For small $\Delta x = x - a$:
$$f(a + \Delta x) \approx f(a) + f'(a)\Delta x.$$
This is linearization: replace the function by the tangent line locally. The error is proportional to $(\Delta x)^2$ - second-order small.
f(x) = √x (blue) and its tangent line at a=4: L(x) = 2 + ¼(x−4) (orange). Near x=4 the two curves are nearly indistinguishable - that is linearization.
Example. Approximate $\sqrt{1.02}$. Let $f(x) = \sqrt{x} = x^{1/2}$, so $f'(x) = \frac{1}{2}x^{-1/2}$. At $a = 1$: $f(1) = 1$ and $f'(1) = 1/2$.
$$\sqrt{1.02} \approx 1 + \frac{1}{2}(0.02) = 1.01.$$
Actual value: $1.00995\ldots$. The error is less than $0.0001$.
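To see the second-order error claim in action, compare the tangent-line estimate with the true square root for several step sizes. This is a sketch; `linearize` is a hypothetical helper, and the comparison value $1/8$ comes from $|f''(1)|/2$ for $f(x) = \sqrt{x}$.

```python
import math


def linearize(f, f_prime, a):
    """Return the tangent-line approximation L(x) = f(a) + f'(a)(x - a)."""
    fa, slope = f(a), f_prime(a)
    return lambda x: fa + slope * (x - a)


tangent = linearize(math.sqrt, lambda x: 0.5 / math.sqrt(x), 1.0)

for dx in (0.2, 0.02, 0.002):
    error = tangent(1.0 + dx) - math.sqrt(1.0 + dx)
    print(f"dx = {dx:<6g} error = {error:.3e}  error / dx^2 = {error / dx**2:.4f}")
```

Shrinking $\Delta x$ by a factor of 10 shrinks the error by roughly a factor of 100, and the ratio error$/(\Delta x)^2$ settles near $1/8$.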
This is the first-order Taylor approximation. It is used everywhere:
- Physics: small-angle approximation $\sin\theta \approx \theta$ for small $\theta$ (linearization of $\sin$ at $0$).
- Engineering: perturbation analysis around an operating point.
- ML: gradient descent updates the parameters by the tangent-plane approximation to the loss.
- Numerical methods: Newton’s method for root-finding uses the tangent line to predict the root.
The linearization is the foundation of Taylor series, which we will develop later.
Section 6: Differentiability vs. Continuity
Theorem. Differentiability implies continuity.
Proof. Suppose $f'(a)$ exists. We want to show $f$ is continuous at $a$, i.e., $\lim_{h \to 0}f(a+h) = f(a)$.
Write:
$$f(a+h) - f(a) = \frac{f(a+h) - f(a)}{h} \cdot h.$$
As $h \to 0$, the fraction converges to $f'(a)$ (by assumption) and $h \to 0$. By the product rule for limits:
$$\lim_{h \to 0}[f(a+h) - f(a)] = f'(a) \cdot 0 = 0.$$
So $f(a+h) \to f(a)$. The function is continuous at $a$. $\blacksquare$
The converse fails. Continuity does not imply differentiability.
Example. $f(x) = |x|$ is continuous everywhere, but not differentiable at $x = 0$.
From the right: $\lim_{h \to 0^+}\frac{|0 + h| - 0}{h} = \lim_{h \to 0^+}\frac{h}{h} = 1$.
From the left: $\lim_{h \to 0^-}\frac{|0 + h| - 0}{h} = \lim_{h \to 0^-}\frac{-h}{h} = -1$.
The one-sided limits disagree. The tangent line “wants to be” two different lines - the function has a corner at the origin. No unique tangent exists.
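The disagreement between the two sides shows up immediately if you compute the difference quotients for small positive and negative $h$. A minimal sketch, with `one_sided_quotient` as a hypothetical helper name:

```python
def one_sided_quotient(f, a, h):
    """Difference quotient of f at a with a signed step h."""
    return (f(a + h) - f(a)) / h


steps = (0.1, 0.01, 0.001)
right = [one_sided_quotient(abs, 0.0, h) for h in steps]   # h > 0
left = [one_sided_quotient(abs, 0.0, -h) for h in steps]   # h < 0

print("from the right:", right)
print("from the left: ", left)
```

No matter how small the step, the right-hand quotients stay at $1$ and the left-hand quotients stay at $-1$, so no single limit exists.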
Discomfort check. Why does differentiability imply continuity but not vice versa? Intuitively, differentiability is a stronger condition. To have a well-defined tangent line, the function must be “smooth” at the point, which is more than merely “no jumps.” Continuity rules out vertical jumps but permits corners and cusps. Differentiability rules out corners and cusps too.
The proof above shows the mechanism: if $f$ has a finite derivative at $a$, the difference $f(a+h) - f(a)$ is $\approx f'(a) \cdot h$, which forces the difference to go to zero as $h \to 0$. The derivative “controls” how fast $f$ can change. But the converse fails because continuity only says the difference goes to zero - it says nothing about how fast. The ratio $\frac{f(a+h)-f(a)}{h}$ can still fail to converge: it may settle on different values from each side (as for $|x|$ at $0$) or blow up (as for $\sqrt[3]{x}$ at $0$, where the ratio is $h^{-2/3}$).
Weierstrass’s construction (1872) shows the extreme case: a function that is continuous everywhere but differentiable nowhere. It oscillates so rapidly at every scale that no tangent line can be assigned at any point.
In ML: ReLU. The ReLU activation $\text{ReLU}(x) = \max(0, x)$ is continuous but not differentiable at $x = 0$. In practice, training is unaffected because the set $\{x = 0\}$ has measure zero - the probability that a neuron’s input is exactly $0$ is negligible. At $x = 0$, a subgradient (any value in $[0, 1]$) is used. The nondifferentiability is theoretically present but practically irrelevant.
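In code, the subgradient choice is just a convention for one input value. The sketch below is a hypothetical scalar implementation, not any particular framework's; the default of $0$ at the kink is an assumption made here for illustration.

```python
def relu(x):
    """ReLU(x) = max(0, x)."""
    return max(0.0, x)


def relu_grad(x, value_at_zero=0.0):
    """Derivative of ReLU away from 0, and a chosen subgradient at 0.

    value_at_zero must lie in [0, 1]; the default 0.0 is an assumed convention.
    """
    if not 0.0 <= value_at_zero <= 1.0:
        raise ValueError("a subgradient of ReLU at 0 must lie in [0, 1]")
    if x > 0.0:
        return 1.0
    if x < 0.0:
        return 0.0
    return value_at_zero


for x in (-2.0, 0.0, 3.0):
    print(f"x = {x:5.1f}  relu = {relu(x):4.1f}  grad = {relu_grad(x):4.1f}")
```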
Section 7: Higher Derivatives
If $f'$ is itself differentiable, we can differentiate again. The second derivative is:
$$f''(x) = (f')'(x) = \frac{d^2 f}{dx^2}.$$
More generally, the $n$-th derivative is $f^{(n)}(x)$.
Geometric meaning of $f''$:
- $f''(x) > 0$: the slope $f'$ is increasing. The function is concave up (opens upward, like a bowl that holds water). The curve bends above the tangent line.
- $f''(x) < 0$: the slope $f'$ is decreasing. The function is concave down (like a hill, or a bowl held upside down). The curve bends below the tangent line.
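- $f''(x) = 0$ is necessary (when $f''$ exists) at an inflection point, where the concavity changes sign. It is not sufficient: $x^4$ has $f''(0) = 0$ but no inflection at $0$.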
Physics. If $s(t)$ is position, then:
- $v(t) = s'(t)$ is velocity
- $a(t) = s''(t)$ is acceleration
$F = ma$ is Newton’s second law: force equals mass times the second derivative of position. Differential equations - the language of physics - are equations that relate a function to its derivatives.
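For a concrete check of velocity and acceleration, take a position function $s(t) = 5t^2$ and difference it once and twice. This is a sketch with assumed step size `h`; the second derivative uses the standard symmetric second difference.

```python
def s(t):
    """Position: s(t) = 5 t^2 (meters)."""
    return 5.0 * t**2


def first_derivative(f, t, h=1e-3):
    """Central difference approximation of f'(t)."""
    return (f(t + h) - f(t - h)) / (2.0 * h)


def second_derivative(f, t, h=1e-3):
    """Symmetric second difference approximation of f''(t)."""
    return (f(t + h) - 2.0 * f(t) + f(t - h)) / h**2


velocity = first_derivative(s, 2.0)
acceleration = second_derivative(s, 2.0)
print(f"v(2) is about {velocity:.6f} m/s, a(2) is about {acceleration:.6f} m/s^2")
```

The exact values are $s'(2) = 20$ and $s''(2) = 10$; the acceleration is constant, as expected for a quadratic position function.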
Section 8: Critical Points and Optimization
A critical point of $f$ is a point where $f'(c) = 0$ or $f'(c)$ does not exist. At a critical point, the tangent line is horizontal (or does not exist).
At a critical point, three cases are possible:
- Local minimum: $f$ decreases then increases through $c$.
- Local maximum: $f$ increases then decreases through $c$.
- Saddle point (inflection with zero slope): $f$ is monotone through $c$.
First Derivative Test. Check the sign of $f'$ on either side of $c$:
- Sign changes from $-$ to $+$: local minimum.
- Sign changes from $+$ to $-$: local maximum.
- Sign does not change: neither.
Second Derivative Test. If $f'(c) = 0$:
- $f''(c) > 0$: local minimum. (The slope is zero and increasing - the function is at the bottom of a bowl.)
- $f''(c) < 0$: local maximum. (The slope is zero and decreasing - the function is at the top of a hill.)
- $f''(c) = 0$: inconclusive.
Proof sketch. If $f''(c) > 0$, then $f'$ is increasing near $c$. Since $f'(c) = 0$, we have $f' < 0$ just left of $c$ (the function is decreasing) and $f' > 0$ just right of $c$ (the function is increasing). So $f$ dips to a minimum at $c$. $\blacksquare$
When $f''(c) = 0$, anything can happen: $x^4$ has a minimum at $0$ with $f''(0) = 0$; $x^3$ has an inflection point at $0$ with $f'(0) = f''(0) = 0$.
Example. Minimize $f(x) = x^2 - 4x + 7$.
$f'(x) = 2x - 4 = 0 \Rightarrow x = 2$. Then $f''(x) = 2 > 0$, so $x = 2$ is a minimum. Minimum value: $f(2) = 4 - 8 + 7 = 3$.
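The two tests combine naturally: try the second derivative test first, and fall back to the sign-change test when it is inconclusive. The sketch below assumes you supply $f'$ and $f''$ by hand, and it probes $f'$ at a fixed distance `delta` on each side of $c$, which is only trustworthy when no other critical point lies that close.

```python
def second_derivative_test(f_second, c):
    """Classify a critical point c using the sign of f''(c)."""
    value = f_second(c)
    if value > 0:
        return "local minimum"
    if value < 0:
        return "local maximum"
    return "inconclusive"


def first_derivative_test(f_prime, c, delta=1e-3):
    """Classify c by the sign of f' just left and just right of c."""
    left, right = f_prime(c - delta), f_prime(c + delta)
    if left < 0 < right:
        return "local minimum"
    if left > 0 > right:
        return "local maximum"
    return "neither"


def classify(f_prime, f_second, c):
    """Second derivative test, falling back to the first derivative test."""
    verdict = second_derivative_test(f_second, c)
    if verdict == "inconclusive":
        verdict = first_derivative_test(f_prime, c)
    return verdict


print(classify(lambda x: 2 * x - 4, lambda x: 2.0, 2.0))       # x^2 - 4x + 7 at x = 2
print(classify(lambda x: 4 * x**3, lambda x: 12 * x**2, 0.0))  # x^4 at x = 0
print(classify(lambda x: 3 * x**2, lambda x: 6 * x, 0.0))      # x^3 at x = 0
```

The quadratic is settled by $f''$ alone, while $x^4$ and $x^3$ both need the fallback and come out differently, matching the discussion above.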
This is the foundation of all continuous optimization. In machine learning, training minimizes a loss function $L(\theta)$ by following the gradient - the direct application of these ideas in high dimensions.
Section 9: Implicit Differentiation
So far, $y = f(x)$ is explicit: $y$ is a function of $x$ given by a formula. Sometimes $y$ is defined implicitly by an equation relating $x$ and $y$.
When to reach for implicit differentiation. Three situations:
- The equation defines a curve you cannot solve for $y$ globally. The unit circle $x^2 + y^2 = 1$ would split into two separate branches. More complex equations like $y^5 + 2xy = x^3$ have no closed-form solution for $y$ at all. Implicit differentiation gives $dy/dx$ without ever isolating $y$.
- You want the derivative of an inverse function. If $y = \arcsin x$, the equation it satisfies is $\sin y = x$. Differentiate both sides: $\cos y \cdot \frac{dy}{dx} = 1$, so $\frac{dy}{dx} = \frac{1}{\cos y}$. Since $y \in [-\pi/2, \pi/2]$, we have $\cos y = \sqrt{1 - \sin^2 y} = \sqrt{1 - x^2}$, giving $\frac{dy}{dx} = \frac{1}{\sqrt{1-x^2}}$. The general strategy: whenever you want $(f^{-1})'$, write $f(y) = x$, differentiate implicitly, solve for $\frac{dy}{dx}$. This is how all inverse trig derivatives are derived.
- Two quantities are related by a constraint and both change over time. A ladder of length $L$ with foot at $x$ and top at $y$ satisfies $x^2 + y^2 = L^2$. Differentiate with respect to time: $2x\dot{x} + 2y\dot{y} = 0$, so $\dot{y} = -(x/y)\dot{x}$. This is the technique of related rates: the constraint equation links the rates of change even when the quantities themselves are not given explicitly.
Example: the unit circle $x^2 + y^2 = 1$.
This is not a function (it fails the vertical line test - two $y$-values for each $x$). But along each branch, $y$ is locally a function of $x$. To find $dy/dx$, differentiate both sides with respect to $x$, treating $y$ as an (implicit) function of $x$:
$$\frac{d}{dx}(x^2 + y^2) = \frac{d}{dx}(1)$$
$$2x + 2y\frac{dy}{dx} = 0$$
$$\frac{dy}{dx} = -\frac{x}{y}.$$
Check with geometry: at $(1/\sqrt{2},\ 1/\sqrt{2})$, the slope is $-1$. The radius to that point makes a $45^\circ$ angle with the $x$-axis, and the tangent is perpendicular to the radius, so its slope must be $-1$. At $(1, 0)$, the slope is $-1/0$, undefined - the tangent is vertical. Both are geometrically correct.
Another example: $y^5 + 2xy = x^3$.
Differentiate:
$$5y^4\frac{dy}{dx} + 2y + 2x\frac{dy}{dx} = 3x^2$$
$$(5y^4 + 2x)\frac{dy}{dx} = 3x^2 - 2y$$
$$\frac{dy}{dx} = \frac{3x^2 - 2y}{5y^4 + 2x}.$$
We cannot solve for $y$ explicitly here - but we can still differentiate.
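Even without a formula for $y$, the implicit derivative can be tested: solve the equation numerically for $y$ at nearby values of $x$ and compare the resulting slope with the formula. The sketch below does this near $x = 1$ using bisection; the bracket $[0, 1]$ and the helper names are assumptions that hold for $x$ close to $1$, not a general solver.

```python
def solve_y(x, iterations=200):
    """Solve y^5 + 2xy = x^3 for y by bisection on [0, 1].

    Assumes x is near 1, where y^5 + 2xy - x^3 is negative at y = 0,
    positive at y = 1, and increasing in y.
    """
    lo, hi = 0.0, 1.0
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if mid**5 + 2.0 * x * mid - x**3 > 0.0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)


def implicit_slope(x, y):
    """dy/dx from implicit differentiation of y^5 + 2xy = x^3."""
    return (3.0 * x**2 - 2.0 * y) / (5.0 * y**4 + 2.0 * x)


x0, h = 1.0, 1e-4
y0 = solve_y(x0)
finite_difference = (solve_y(x0 + h) - solve_y(x0 - h)) / (2.0 * h)

print(f"y(1) is about {y0:.6f}")
print(f"implicit formula:  {implicit_slope(x0, y0):.6f}")
print(f"finite difference: {finite_difference:.6f}")
```

The two slopes should agree to several decimal places, even though $y$ was only ever known numerically.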
Logarithmic differentiation. For products and powers, take $\ln$ of both sides first.
When to use it. Two signals: (1) variable appears in both base and exponent - expressions like $x^x$ or $(\sin x)^{x^2}$, where neither the power rule (fixed exponent) nor the exponential rule (fixed base) applies; (2) the function is a complicated product, quotient, or power - taking $\ln$ converts multiplication to addition and pulls exponents down as multipliers, collapsing a chain of nested product and chain rules into straightforward algebra.
Example 1. Differentiate $y = x^x$.
Take $\ln$: $\ln y = x \ln x$. Differentiate:
$$\frac{1}{y}\frac{dy}{dx} = \ln x + x \cdot \frac{1}{x} = \ln x + 1.$$
$$\frac{dy}{dx} = y(\ln x + 1) = x^x(\ln x + 1).$$
Example 2. Differentiate $y = \dfrac{(x+1)^3(x-2)^4}{\sqrt{x^2+1}}$.
Take $\ln$: $\ln y = 3\ln(x+1) + 4\ln(x-2) - \tfrac{1}{2}\ln(x^2+1)$.
Differentiate both sides:
$$\frac{y'}{y} = \frac{3}{x+1} + \frac{4}{x-2} - \frac{x}{x^2+1}.$$
Multiply through by $y$:
$$y' = \frac{(x+1)^3(x-2)^4}{\sqrt{x^2+1}} \cdot \left(\frac{3}{x+1} + \frac{4}{x-2} - \frac{x}{x^2+1}\right).$$
No chain of nested product and quotient rules. The $\ln$ absorbs the structure.
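The same bookkeeping works for any product of powers: accumulate $y'/y$ as a sum of terms $p_i\, u_i'/u_i$, then multiply by $y$. The sketch below applies it to Example 2 at $x = 3$; `log_derivative` is a hypothetical helper, and it assumes every factor is positive at the chosen point.

```python
def log_derivative(factors, x):
    """Return (y, y') for y = product of u_i(x) ** p_i, using y'/y = sum p_i u_i'/u_i.

    factors is a list of (u, u_prime, p). Assumes each u_i(x) > 0 at x.
    """
    y = 1.0
    relative_rate = 0.0  # accumulates y'/y term by term
    for u, u_prime, p in factors:
        y *= u(x) ** p
        relative_rate += p * u_prime(x) / u(x)
    return y, y * relative_rate


# y = (x+1)^3 (x-2)^4 / sqrt(x^2 + 1)
factors = [
    (lambda x: x + 1.0,    lambda x: 1.0,     3.0),
    (lambda x: x - 2.0,    lambda x: 1.0,     4.0),
    (lambda x: x**2 + 1.0, lambda x: 2.0 * x, -0.5),
]

y, y_prime = log_derivative(factors, 3.0)
print(f"y(3) = {y:.6f}, y'(3) = {y_prime:.6f}")
```

At $x = 3$ the sum is $\frac{3}{4} + \frac{4}{1} - \frac{3}{10} = 4.45$, so $y'(3) = 4.45\, y(3)$.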
Section 10: L’Hôpital’s Rule
When a limit produces the indeterminate form $0/0$ or $\infty/\infty$, the numerator and denominator are in a race: both are heading toward the same value, and the ratio depends on which one gets there faster. L’Hôpital’s rule replaces the question “which function is larger near $a$?” with “which function has the larger slope near $a$?” - a question derivatives answer directly.
Theorem (L’Hôpital’s Rule). If $\lim_{x \to a} f(x) = \lim_{x \to a} g(x) = 0$ (or both $\pm\infty$), and $g'(x) \neq 0$ near $a$, and $\lim_{x \to a}\frac{f'(x)}{g'(x)}$ exists, then:
$$\lim_{x \to a}\frac{f(x)}{g(x)} = \lim_{x \to a}\frac{f'(x)}{g'(x)}.$$
The rule also applies as $x \to \pm\infty$.
Geometric Intuition
Focus on the $0/0$ case, where $f(a) = g(a) = 0$. Both curves pass through the same point $(a, 0)$. Near $a$, each function is well approximated by its tangent line at $a$. Since both tangent lines pass through $(a, 0)$:
$$f(x) \approx f'(a)(x - a), \qquad g(x) \approx g'(a)(x - a).$$
Their ratio near $a$:
$$\frac{f(x)}{g(x)} \approx \frac{f'(a)(x - a)}{g'(a)(x - a)} = \frac{f'(a)}{g'(a)}.$$
The $(x - a)$ factors cancel because both functions depart from zero at the same location. All that survives is the ratio of departure rates - the ratio of slopes.
Two curves that both touch zero at $x = a$ have a ratio near $a$ determined entirely by their slopes there. The $0/0$ form is not a mystery or a failure of arithmetic - it is a question about the relative speed of approach to zero, which derivatives measure exactly.
Rigorous Proof (the $0/0$ case)
The proof uses a generalization of the Mean Value Theorem.
Theorem (Cauchy Mean Value Theorem). If $f$ and $g$ are continuous on $[a, b]$ and differentiable on $(a, b)$, then there exists $c \in (a, b)$ with:
$$f'(c)\bigl(g(b) - g(a)\bigr) = g'(c)\bigl(f(b) - f(a)\bigr).$$
When $g(b) \neq g(a)$ and $g'(c) \neq 0$, this gives $\dfrac{f'(c)}{g'(c)} = \dfrac{f(b) - f(a)}{g(b) - g(a)}$.
Proof. Define $h(x) = f(x)\bigl(g(b) - g(a)\bigr) - g(x)\bigl(f(b) - f(a)\bigr)$. Check: $h(a) = f(a)g(b) - g(a)f(b)$ and $h(b) = -f(b)g(a) + g(b)f(a) = h(a)$. By Rolle’s Theorem (Section 11), there exists $c \in (a, b)$ with $h'(c) = 0$, which is exactly the Cauchy condition. $\blacksquare$
Proof of L’Hôpital’s Rule ($0/0$ case). Assume $f(a) = g(a) = 0$ and $\lim_{x \to a}\frac{f'(x)}{g'(x)} = L$.
For any $x \neq a$ near $a$, apply the Cauchy MVT to $f$ and $g$ on the interval between $a$ and $x$. This gives a point $c$ strictly between $a$ and $x$ satisfying:
$$\frac{f'(c)}{g'(c)} = \frac{f(x) - f(a)}{g(x) - g(a)} = \frac{f(x)}{g(x)}.$$
Since $c$ lies strictly between $a$ and $x$, we have $|c - a| < |x - a|$. As $x \to a$, the point $c$ is squeezed toward $a$ as well. Therefore:
$$\lim_{x \to a}\frac{f(x)}{g(x)} = \lim_{x \to a}\frac{f'(c)}{g'(c)} = \lim_{c \to a}\frac{f'(c)}{g'(c)} = L. \quad \blacksquare$$
The CMVT converts the ratio $f(x)/g(x)$ into $f'(c)/g'(c)$ at some intermediate point, and squeezing forces $c \to a$ when $x \to a$. This is what makes the step from ratio of functions to ratio of derivatives legitimate - not just an approximation, but an exact equality at each $x$ via the intermediate point $c$.
Warning
Apply L’Hôpital only to $0/0$ or $\infty/\infty$ forms. Not to $1/0$ or $3/0$. And differentiate numerator and denominator separately - the quotient rule does not apply here.
If the resulting $f'(x)/g'(x)$ is again indeterminate, apply L’Hôpital again. But if $\lim f'(x)/g'(x)$ does not exist, that does not mean the original limit does not exist - the rule only applies when the derivative limit exists.
Example 1. $\lim_{x \to 0}\frac{\sin x}{x}$. Form: $0/0$.
$$\lim_{x \to 0}\frac{\sin x}{x} = \lim_{x \to 0}\frac{\cos x}{1} = 1.$$
(This is a consistency check, not a proof: the computation uses $(\sin x)' = \cos x$, which was itself derived from this limit, so citing L’Hôpital to establish the limit would be circular. The limit was proved rigorously via the Squeeze Theorem in the limits post.)
Example 2. $\lim_{x \to \infty}\frac{x}{e^x}$. Form: $\infty/\infty$.
$$\lim_{x \to \infty}\frac{x}{e^x} = \lim_{x \to \infty}\frac{1}{e^x} = 0.$$
Exponentials beat polynomials. Apply repeatedly for $x^n/e^x$: $n$ applications give $n!/e^x \to 0$.
Example 3. $\lim_{x \to 0^+} x \ln x$. Form: $0 \cdot (-\infty)$. Rewrite as $\frac{\ln x}{1/x}$, form $\infty/\infty$:
$$\lim_{x \to 0^+}\frac{\ln x}{1/x} = \lim_{x \to 0^+}\frac{1/x}{-1/x^2} = \lim_{x \to 0^+}(-x) = 0.$$
The product $x \ln x \to 0$ as $x \to 0^+$. This appears in information-theoretic entropy: $0 \ln 0 = 0$ by convention, justified by this limit.
Example 4. $\lim_{x \to 0^+} x^x$. Form: $0^0$.
Write $x^x = e^{x \ln x}$. From Example 3, $x \ln x \to 0$. Since $e^u$ is continuous: $\lim_{x \to 0^+} x^x = e^0 = 1$.
Exponential indeterminate forms: $1^\infty$, $0^0$, $\infty^0$. These occur when both the base and exponent vary and neither dominates. The standard maneuver: write $f(x)^{g(x)} = e^{g(x)\ln f(x)}$. The exponent $g(x)\ln f(x)$ is now a $0 \cdot \infty$ or $\infty \cdot 0$ form - rewrite it as a ratio and apply L’Hôpital, then exponentiate the result.
Example: $\lim_{x \to \infty}(1 + 1/x)^x$ (form $1^\infty$). The exponent $x\ln(1+1/x)$ is $\infty \cdot 0$; rewrite as $\frac{\ln(1+1/x)}{1/x}$, which is $0/0$. L’Hôpital gives $1$. The limit is $e^1 = e$. This recovers the compound interest definition of $e$ from a single application of L’Hôpital.
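It can help to tabulate the original ratio $f/g$ next to the derivative ratio $f'/g'$ as $x$ approaches the limit point. The sketch below does this for Examples 1 to 3 and then evaluates $(1 + 1/x)^x$ for growing $x$; the sample points are arbitrary choices.

```python
import math

# (label, original ratio f/g, derivative ratio f'/g', sample points approaching the limit)
EXAMPLES = [
    ("sin x / x as x -> 0",
     lambda x: math.sin(x) / x, math.cos, (1e-1, 1e-2, 1e-4)),
    ("x / e^x as x -> infinity",
     lambda x: x / math.exp(x), lambda x: 1.0 / math.exp(x), (5.0, 20.0, 50.0)),
    ("ln x / (1/x) as x -> 0+",
     lambda x: math.log(x) / (1.0 / x), lambda x: -x, (1e-2, 1e-4, 1e-8)),
]

for label, ratio, derivative_ratio, points in EXAMPLES:
    print(label)
    for x in points:
        print(f"  x = {x:<8g} f/g = {ratio(x): .6e}   f'/g' = {derivative_ratio(x): .6e}")

# The 1^infinity example: (1 + 1/x)^x should approach e.
compound = [(1.0 + 1.0 / x) ** x for x in (1e2, 1e4, 1e6)]
print("compound values:", compound, "e =", math.e)
```

Notice that the two ratios are generally not equal at any given $x$; the rule only says they share the same limit.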
Section 11: The Mean Value Theorem
The MVT is the rigorous version of “if you average 60 mph over a trip, your speedometer must read exactly 60 mph at some instant.”
Theorem (Rolle’s Theorem). If $f$ is continuous on $[a, b]$, differentiable on $(a, b)$, and $f(a) = f(b)$, then there exists $c \in (a, b)$ with $f'(c) = 0$.
Proof. By the Extreme Value Theorem, $f$ attains its maximum and minimum on $[a, b]$. If both occur at endpoints, then since $f(a) = f(b)$, the function is constant and $f' = 0$ everywhere. Otherwise, at least one extremum is at an interior point $c$. Say it is a maximum (for a minimum the inequalities reverse). Then the difference quotient $\frac{f(c+h)-f(c)}{h}$ is $\leq 0$ for $h > 0$ and $\geq 0$ for $h < 0$, so the limit is simultaneously $\leq 0$ and $\geq 0$, giving $f'(c) = 0$. $\blacksquare$
Theorem (Mean Value Theorem). If $f$ is continuous on $[a, b]$ and differentiable on $(a, b)$, then there exists $c \in (a, b)$ with:
$$f'(c) = \frac{f(b) - f(a)}{b - a}.$$
Proof. Apply Rolle to $g(x) = f(x) - \left[f(a) + \frac{f(b)-f(a)}{b-a}(x-a)\right]$, the function that subtracts the secant line from $f$. Then $g(a) = g(b) = 0$, so Rolle gives $c$ with $g'(c) = 0$, which means $f'(c) = \frac{f(b)-f(a)}{b-a}$. $\blacksquare$
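The theorem only asserts that some $c$ exists, but for a specific function you can hunt for it. The sketch below takes $f(x) = x^3$ on $[0, 2]$ (an assumed example, with secant slope $4$) and finds $c$ by bisection on $f'(c) - \text{slope}$; this relies on that difference having opposite signs at the two endpoints, which the theorem does not guarantee in general.

```python
def mean_value_point(f, f_prime, a, b, iterations=100):
    """Find c in (a, b) where f'(c) equals the secant slope, by bisection.

    Assumes f'(x) - slope is continuous and has opposite signs at a and b.
    """
    slope = (f(b) - f(a)) / (b - a)
    lo, hi = a, b
    sign_at_lo = f_prime(lo) - slope
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if (f_prime(mid) - slope) * sign_at_lo > 0.0:
            lo = mid   # same sign as the left end: the root is to the right
        else:
            hi = mid
    return 0.5 * (lo + hi), slope


c, slope = mean_value_point(lambda x: x**3, lambda x: 3.0 * x**2, 0.0, 2.0)
print(f"secant slope = {slope}, c is about {c:.6f}")
```

Here $3c^2 = 4$ gives $c = 2/\sqrt{3} \approx 1.1547$, strictly inside the interval.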
Consequences.
- If $f' > 0$ on $(a, b)$: $f$ is strictly increasing.
- If $f' = 0$ everywhere: $f$ is constant.
- If $|f'(x)| \leq M$: then $|f(x) - f(y)| \leq M|x - y|$ for all $x, y$. This is the Lipschitz condition. Bounded derivative implies Lipschitz continuity.
The MVT is the reason gradient descent can be analyzed: if we bound $|\nabla L|$, we control how much the loss can change per step.
Recognizing Which Tool to Use
Differentiation offers several techniques, and problems in the wild do not announce which one applies. The skill is reading the structure of the expression and matching it to the right tool.
Chain rule. Signal: a function applied to a non-trivial function. You see something-of-something: $\sin(x^2)$, $e^{x^3}$, $\sqrt{x^2+1}$, $(3x+1)^{10}$. Ask: can I name an outer function and an inner function? If yes, chain rule. Differentiate the outer (evaluate at the inner), multiply by the derivative of the inner.
Product rule. Signal: two non-constant expressions multiplied together: $x^2\sin x$, $e^x\ln x$. The quotient rule is not a separate rule - write $f/g = f \cdot g^{-1}$ and apply product plus chain rule. Both give $\frac{f'g - fg'}{g^2}$, so you only need to remember one idea.
Implicit differentiation. Signal: an equation relating $x$ and $y$ where you cannot (or should not) isolate $y$ explicitly. Also: any time you want the derivative of an inverse function. Differentiate both sides with respect to $x$. Every $y$ term acquires a factor of $\frac{dy}{dx}$ from the chain rule ($y$ is a function of $x$ even without a formula). Collect those terms and solve.
Logarithmic differentiation. Signal: variable appears in both base and exponent ($x^x$, $(\cos x)^{x^2}$), or the expression is a complicated product/quotient/power where taking $\ln$ would simplify the structure. Take $\ln$ of both sides - $\ln$ converts multiplication to addition, division to subtraction, powers to multipliers - differentiate, then multiply through by $y$.
L’Hôpital’s rule. Signal: a limit evaluates to $0/0$ or $\infty/\infty$. Differentiate numerator and denominator separately (not the quotient rule - separately). For $0 \cdot \infty$: rewrite as a ratio ($f/(1/g)$ or $g/(1/f)$) to produce $0/0$ or $\infty/\infty$. For $1^\infty$, $0^0$, $\infty^0$: write the expression as $e^{(\text{exponent})\ln(\text{base})}$, reduce the exponent to $0 \cdot \infty$, apply L’Hôpital, then exponentiate.
The underlying theme: each tool exists because the algebra of the expression has a specific structure. When you can read that structure, the tool is obvious - not because you memorized rules, but because you understand what each rule does.
Section 12: The Bridge to Multivariable Calculus
Everything so far has been for $f: \mathbb{R} \to \mathbb{R}$. One input, one output, one derivative.
But most interesting functions have many inputs. A neural network loss $L(w_1, w_2, \ldots, w_n)$ has millions. A physics simulation depends on positions and velocities of many particles. A probability model may depend on hundreds of parameters. The question “how fast is the output changing right now?” becomes “how fast is the output changing with respect to each of the million inputs?”
This is not merely an elaboration. The generalization requires genuinely new ideas - and we are now equipped to understand them.
The derivative generalizes as follows:
Partial derivatives. Fix all inputs except one. The partial derivative $\frac{\partial f}{\partial x_i}$ is the rate of change of $f$ with respect to $x_i$ while holding all other inputs fixed. It is literally the single-variable derivative of $t \mapsto f(x_1, \ldots, x_{i-1}, t, x_{i+1}, \ldots, x_n)$ evaluated at $t = x_i$.
The gradient. Collect all partial derivatives into a vector:
$$\nabla f = \left(\frac{\partial f}{\partial x_1},\ \frac{\partial f}{\partial x_2},\ \ldots,\ \frac{\partial f}{\partial x_n}\right).$$
This is the multivariable analog of $f'$. It points in the direction of steepest ascent of $f$ at the current point.
The Jacobian. For a vector-valued function $\mathbf{f}: \mathbb{R}^n \to \mathbb{R}^m$, the Jacobian is the $m \times n$ matrix of all partial derivatives $J_{ij} = \frac{\partial f_i}{\partial x_j}$. It is the matrix analog of the derivative.
The Hessian. The matrix of second partial derivatives $H_{ij} = \frac{\partial^2 f}{\partial x_i \partial x_j}$. It generalizes $f''$ and encodes the curvature of a multivariable function.
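To make these four objects concrete, here is a minimal numerical sketch. It estimates partial derivatives with central differences, which is only an approximation and not how ML frameworks compute gradients. The function `example_f`, the helper names, and the step sizes are all assumptions made for illustration.

```python
from typing import Callable, List

Vector = List[float]


def partial(f: Callable[[Vector], float], x: Vector, i: int, h: float = 1e-5) -> float:
    """Central-difference estimate of the i-th partial derivative of f at x.

    Only coordinate i is perturbed; every other input is held fixed.
    """
    up, down = list(x), list(x)
    up[i] += h
    down[i] -= h
    return (f(up) - f(down)) / (2 * h)


def gradient(f: Callable[[Vector], float], x: Vector, h: float = 1e-5) -> Vector:
    """Collect all partial derivatives into one vector."""
    return [partial(f, x, i, h) for i in range(len(x))]


def hessian(f: Callable[[Vector], float], x: Vector, h: float = 1e-4) -> List[Vector]:
    """Matrix of second partials, built by differentiating the partials again."""
    n = len(x)
    return [
        [partial(lambda y: partial(f, y, j, h), x, i, h) for j in range(n)]
        for i in range(n)
    ]


def example_f(x: Vector) -> float:
    # Hypothetical test function: f(x1, x2) = x1^2 * x2 + 3 * x2
    return x[0] ** 2 * x[1] + 3 * x[1]


point = [1.0, 2.0]
# Analytically: grad f = (2*x1*x2, x1^2 + 3) = (4, 4) at (1, 2),
# and the Hessian is [[2*x2, 2*x1], [2*x1, 0]] = [[4, 2], [2, 0]].
print(gradient(example_f, point))
print(hessian(example_f, point))
```

A Jacobian would be assembled the same way, with one gradient row per output component.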
We will develop all of this in the multivariate calculus posts. For now, just know: the derivative of a single-variable function is the atom from which all of multivariable calculus is built. The pattern is always the same - limit of a difference quotient - and the intuitions transfer.
Section 13: Derivatives in Machine Learning
The derivative is not a curiosity of classical mathematics. It is the engine of modern ML.
Gradient Descent
The central training algorithm in ML. Given a loss $L(\mathbf{w})$ as a function of the model’s weights, gradient descent updates:
$$\mathbf{w} \leftarrow \mathbf{w} - \eta \nabla L(\mathbf{w}),$$
where $\eta > 0$ is the learning rate. The gradient $\nabla L$ points uphill; we move downhill. The derivative tells you which way is downhill.
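The update is, essentially, a linearization: we approximate the loss near the current weights by its tangent hyperplane and take a small step in the direction along which that approximation decreases fastest. A hyperplane has no minimum, so the step must be kept short; that is the job of $\eta$. If the learning rate is small enough (at most the reciprocal of the Lipschitz constant of $\nabla L$ is a standard sufficient condition, which follows from the MVT), the loss decreases at each step where the gradient is nonzero.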
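The update rule can be sketched in a few lines of plain Python. The quadratic loss below is a hypothetical stand-in for a real model's loss, picked because its gradient and the Lipschitz constant of that gradient (here $4$) are easy to read off. The names `loss`, `grad_loss`, and `gradient_descent` are illustrative.

```python
from typing import Callable, List

Vector = List[float]


def loss(w: Vector) -> float:
    # Hypothetical loss with its minimum at w = (3, -1).
    return (w[0] - 3.0) ** 2 + 2.0 * (w[1] + 1.0) ** 2


def grad_loss(w: Vector) -> Vector:
    # Gradient of the loss above, computed by hand.
    return [2.0 * (w[0] - 3.0), 4.0 * (w[1] + 1.0)]


def gradient_descent(
    grad: Callable[[Vector], Vector], w0: Vector, eta: float, steps: int
) -> List[Vector]:
    """Apply w <- w - eta * grad(w) repeatedly and return every iterate."""
    w = list(w0)
    history = [list(w)]
    for _ in range(steps):
        g = grad(w)
        w = [wi - eta * gi for wi, gi in zip(w, g)]
        history.append(list(w))
    return history


# The Hessian is diag(2, 4), so the gradient is 4-Lipschitz; eta = 0.2 < 1/4.
history = gradient_descent(grad_loss, [0.0, 0.0], eta=0.2, steps=50)
print(history[-1], loss(history[-1]))
```

Raising `eta` above $0.5$ in this example makes the second coordinate overshoot and diverge, which is the failure the step-size condition rules out.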
Backpropagation
A neural network is a composition of functions:
$$\text{output} = f_n(f_{n-1}(\cdots f_2(f_1(\mathbf{x}, \mathbf{w_1}), \mathbf{w_2})\cdots, \mathbf{w_{n-1}}), \mathbf{w_n}).$$
To minimize the loss, we need $\frac{\partial L}{\partial w_{ij}}$ for every weight in every layer. Backpropagation computes these efficiently using the chain rule: differentiate the outer layer, multiply by the derivative of the next layer, and so on backward through the network.
For a network with $n$ layers, the chain rule gives:
$$\frac{\partial L}{\partial \mathbf{w_1}} = \frac{\partial L}{\partial \mathbf{a_n}} \cdot \frac{\partial \mathbf{a_n}}{\partial \mathbf{a_{n-1}}} \cdots \frac{\partial \mathbf{a_2}}{\partial \mathbf{a_1}} \cdot \frac{\partial \mathbf{a_1}}{\partial \mathbf{w_1}}.$$
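The remarkable fact: reverse-mode autodiff (backpropagation) computes all partial derivatives $\frac{\partial L}{\partial w_i}$ simultaneously at a cost that is a small constant multiple of the cost of the forward pass, regardless of how many weights there are. This is because the intermediate Jacobians are never formed in full: as we walk backward through the computation graph, we carry a single vector (the gradient of $L$ with respect to the current layer’s output) and multiply it by one local Jacobian at a time.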
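Here is a hand-written backward pass for a deliberately tiny network, as a sketch of the bookkeeping rather than of any framework's implementation. The architecture (one scalar `tanh` unit followed by a scalar linear unit, with a squared-error loss) and all names are assumptions chosen to keep every step visible.

```python
import math
from typing import Tuple


def forward(x: float, w1: float, w2: float) -> Tuple[float, float]:
    """Two-layer scalar network: a1 = tanh(w1 * x), a2 = w2 * a1."""
    a1 = math.tanh(w1 * x)
    a2 = w2 * a1
    return a1, a2


def loss_and_grads(x: float, y: float, w1: float, w2: float) -> Tuple[float, float, float]:
    """Return (loss, dL/dw1, dL/dw2) using one forward and one backward sweep."""
    a1, a2 = forward(x, w1, w2)
    loss = 0.5 * (a2 - y) ** 2

    # Backward sweep: start at the loss and reuse each intermediate result.
    dL_da2 = a2 - y                      # derivative of 0.5 * (a2 - y)^2
    dL_dw2 = dL_da2 * a1                 # a2 = w2 * a1
    dL_da1 = dL_da2 * w2                 # pass the signal back through layer 2
    dL_dz1 = dL_da1 * (1.0 - a1 ** 2)    # tanh'(z) = 1 - tanh(z)^2
    dL_dw1 = dL_dz1 * x                  # z1 = w1 * x
    return loss, dL_dw1, dL_dw2


print(loss_and_grads(x=0.5, y=0.3, w1=0.8, w2=-1.2))
```

Note that `dL_da2` is computed once and reused for both weights. That reuse is where the efficiency of the backward sweep comes from.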
The Sigmoid and Its Derivative
The logistic (sigmoid) function is:
$$\sigma(x) = \frac{1}{1 + e^{-x}}.$$
Its derivative, computed via the quotient rule:
$$\sigma'(x) = \frac{e^{-x}}{(1+e^{-x})^2} = \frac{1}{1+e^{-x}} \cdot \frac{e^{-x}}{1+e^{-x}} = \sigma(x)(1-\sigma(x)).$$
This formula $\sigma' = \sigma(1-\sigma)$ is elegant: the derivative is expressed in terms of the function’s own value. If $\sigma(x) = 0.7$, then $\sigma'(x) = 0.7 \cdot 0.3 = 0.21$. No additional computation is needed once the forward pass is done.
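A short check of the identity, as a sketch: it compares $\sigma(1-\sigma)$ against a central-difference estimate of the slope. The function names are illustrative, and this plain `sigmoid` is not hardened against overflow for very negative inputs.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_prime_from_output(s: float) -> float:
    """Derivative of the sigmoid, given only its forward-pass value s."""
    return s * (1.0 - s)


def central_difference(f, x: float, h: float = 1e-6) -> float:
    return (f(x + h) - f(x - h)) / (2 * h)


# Choose x so that sigmoid(x) = 0.7, i.e. x = ln(0.7 / 0.3).
x = math.log(7.0 / 3.0)
s = sigmoid(x)
print(s, sigmoid_prime_from_output(s), central_difference(sigmoid, x))
```

The second and third printed values should agree to several decimal places, and both sit close to $0.21$.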
Activation Functions and Vanishing Gradients
ReLU: $f(x) = \max(0, x)$. Derivative: $f'(x) = 1$ for $x > 0$, $f'(x) = 0$ for $x < 0$. Not differentiable at $0$; in practice implementations pick a subgradient there (any value in $[0, 1]$, commonly $0$).
The sigmoid derivative $\sigma(1-\sigma) \leq 1/4$ everywhere (maximized at $\sigma = 1/2$). In a network with 50 sigmoid layers, the gradient of the loss with respect to the first layer passes through 50 multiplications by $\sigma'(x_i) \leq 1/4$. Setting the weight factors aside, the product of these activation derivatives is at most $(1/4)^{50} \approx 10^{-30}$, and usually far smaller - effectively zero for the purpose of updating weights. Gradients vanish. This is why deep networks trained with sigmoid were difficult before skip connections and better initialization.
ReLU largely avoids this: its derivative is exactly $1$ on the positive side, so gradients pass through active units unchanged (units with negative input pass no gradient at all). That is a main reason ReLU became the default activation in deep learning.
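The arithmetic behind the vanishing-gradient argument is easy to reproduce. The sketch below multiplies only the activation derivatives along a 50-layer chain and ignores the weights entirely, which is a simplifying assumption. The pre-activation values are made up for illustration.

```python
import math
from typing import Callable, List


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_prime(x: float) -> float:
    s = sigmoid(x)
    return s * (1.0 - s)


def relu_prime(x: float) -> float:
    # Subgradient convention: use 0 at x == 0.
    return 1.0 if x > 0 else 0.0


def chain_factor(pre_activations: List[float], derivative: Callable[[float], float]) -> float:
    """Product of activation derivatives along a chain of layers (weights ignored)."""
    product = 1.0
    for z in pre_activations:
        product *= derivative(z)
    return product


depth = 50
# Best case for sigmoid: every pre-activation is 0, so every factor is exactly 1/4.
sigmoid_best = chain_factor([0.0] * depth, sigmoid_prime)
# ReLU with every unit active: every factor is 1.
relu_active = chain_factor([1.0] * depth, relu_prime)
# ReLU with one inactive unit on the path: the whole product is 0.
relu_one_dead = chain_factor([1.0] * (depth - 1) + [-1.0], relu_prime)

print(sigmoid_best, relu_active, relu_one_dead)
```

The last case shows the trade-off: ReLU passes gradients through untouched along active paths, but an inactive unit blocks them completely.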
The Rigorous Underpinning
Differentiability at a Point
$f$ is differentiable at $a$ if the limit $\lim_{h \to 0}\frac{f(a+h)-f(a)}{h}$ exists and is finite. Equivalently: there exists a number $L$ such that
$$f(a+h) = f(a) + Lh + o(h) \quad \text{as } h \to 0,$$
where $o(h)$ denotes a term satisfying $o(h)/h \to 0$. The number $L$ is $f'(a)$. This “little-oh” formulation generalizes cleanly to higher dimensions.
Proof of the Chain Rule
Let $u = g(x)$ and $y = f(u)$, with $g$ differentiable at $x$ and $f$ differentiable at $u$. Define the helper function:
$$\phi(k) = \begin{cases} \frac{f(u+k) - f(u)}{k} - f'(u) & k \neq 0 \\ 0 & k = 0. \end{cases}$$
Then $f(u+k) - f(u) = (f'(u) + \phi(k))k$, and $\phi(k) \to 0$ as $k \to 0$ (since $f$ is differentiable at $u$). Setting $k = g(x+h) - g(x)$:
$$f(g(x+h)) - f(g(x)) = [f'(g(x)) + \phi(g(x+h)-g(x))] \cdot [g(x+h) - g(x)].$$
Divide by $h$ and take $h \to 0$: the second bracket divided by $h$ converges to $g'(x)$; the first bracket converges to $f'(g(x))$ because $g(x+h) - g(x) \to 0$ (by continuity of $g$) and $\phi \to 0$ as its argument $\to 0$. $\blacksquare$
The Mean Value Theorem (Revisited)
The MVT is not just a theorem about speedometers. It is the engine of most calculus proofs. It converts information about derivatives into information about function values. Lipschitz conditions, monotonicity, the fundamental theorem of calculus, the error bounds in Taylor’s theorem - all flow through the MVT.
Taylor’s Theorem (Preview)
The derivative gives the best linear approximation. The second derivative gives the best quadratic approximation. More generally, if $f$ is $n$ times differentiable at $a$:
$$f(a + h) = f(a) + f'(a)h + \frac{f''(a)}{2!}h^2 + \cdots + \frac{f^{(n)}(a)}{n!}h^n + R_n(h),$$
where the remainder $R_n(h) = o(h^n)$. This is the Taylor polynomial approximation. We will develop Taylor series - infinite-degree Taylor polynomials - in a dedicated post, and discover that many standard functions (such as $e^x$, $\sin x$, and $\cos x$) equal their Taylor series everywhere, while others (such as $\ln(1+x)$) do so only on a limited interval.
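The statement $R_n(h) = o(h^n)$ can be watched numerically. This sketch uses $f = e^x$ at $a = 0$ (an assumption made for convenience, since every derivative there equals $1$) and prints $R_n(h)/h^n$ for shrinking $h$. The case $n = 1$ is exactly the little-oh definition of differentiability above.

```python
import math


def taylor_exp(h: float, n: int) -> float:
    """Degree-n Taylor polynomial of e^x about 0, evaluated at h."""
    return sum(h ** k / math.factorial(k) for k in range(n + 1))


def scaled_remainder(h: float, n: int) -> float:
    """R_n(h) / h^n, which should tend to 0 as h -> 0."""
    return (math.exp(h) - taylor_exp(h, n)) / h ** n


for n in (1, 2):
    ratios = [scaled_remainder(h, n) for h in (1e-1, 1e-2, 1e-3)]
    print(n, ratios)
```

For $n = 1$ the ratios shrink roughly like $h/2$, and for $n = 2$ roughly like $h/6$, matching the first omitted term of the expansion.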
Summary Table
| Concept | Definition / Formula | Geometric meaning |
|---|---|---|
| Derivative $f'(a)$ | $\lim_{h \to 0}\frac{f(a+h)-f(a)}{h}$ | Slope of tangent line at $a$ |
| Chain rule | $(f \circ g)' = f'(g) \cdot g'$ | Rate of composition = product of rates |
| Product rule | $(fg)' = f'g + fg'$ | How a product changes |
| $f'(a) = 0$ | Critical point | Tangent is horizontal |
| $f''(a) > 0$ | Second derivative positive | Concave up; local min candidate |
| $f''(a) < 0$ | Second derivative negative | Concave down; local max candidate |
| Linearization | $f(a+h) \approx f(a) + f'(a)h$ | Best linear approximation near $a$ |
| MVT | $\exists c: f'(c) = \frac{f(b)-f(a)}{b-a}$ | Average rate equals instantaneous rate somewhere |
| L’Hôpital | $\lim \frac{f}{g} = \lim \frac{f'}{g'}$ (when $0/0$ or $\infty/\infty$, and the limit of $f'/g'$ exists) | Derivatives resolve indeterminate limits |
Read next: