The derivative is calculus’s answer to a deceptively simple question: how fast is a quantity changing right now? Average rates of change are easy: total distance divided by elapsed time gives average speed. But an instantaneous rate of change seems to require averaging over an interval of length zero, which makes no sense until you phrase it as a limit.

The Tangent Problem: Limit of Secant Slopes

Fix a function $f$ and a point $a$. The secant line through $(a, f(a))$ and a nearby point $(a+h, f(a+h))$ has slope:

$$m_{\text{sec}} = \frac{f(a+h) - f(a)}{h}.$$

As $h \to 0$, the second point slides toward the first, and the secant line rotates toward the tangent line at $a$.

       f(x)
        |                  * (a+h, f(a+h))
        |                *  |
        |             * /   |
        |          *  /     |
        |       * /  secant |
        |    * /             |
        | * /
   f(a) +--* (a, f(a))
        |   \
        |    tangent (limit of secants)
        +----+----------+----> x
             a         a+h

Definition (Derivative). The derivative of $f$ at $a$ is:

$$f'(a) = \lim_{h \to 0} \frac{f(a+h) - f(a)}{h},$$

provided this limit exists. If it does, $f$ is differentiable at $a$.

The derivative is a number - the slope of the tangent line. The function $f': x \mapsto f'(x)$ assigns this slope to each point where it exists.
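To watch the limit happen numerically, here is a minimal sketch in plain Python (the helper name `difference_quotient` and the example $f(x) = x^2$, $a = 3$ are illustrative): the secant slopes approach the tangent slope $f'(3) = 6$ as $h$ shrinks.

```python
def difference_quotient(f, a, h):
    """Slope of the secant line through (a, f(a)) and (a+h, f(a+h))."""
    return (f(a + h) - f(a)) / h

f = lambda x: x ** 2
a = 3.0
for h in [1.0, 0.1, 0.01, 0.001, 1e-6]:
    print(f"h = {h:>8}: secant slope = {difference_quotient(f, a, h):.6f}")
# The slopes tend to 6.0, the tangent slope f'(3) = 2*3.
```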

Alternative Notations

Leibniz: $\frac{df}{dx}$, $\frac{dy}{dx}$, $\frac{d}{dx}[f(x)]$.

Newton: $\dot{f}$ (for time derivatives in physics).

Lagrange: $f'(x)$, $f''(x)$, $f^{(n)}(x)$.

Euler: $Df$, $D^2 f$.

All mean the same thing. In a multivariable context, the partial derivative $\frac{\partial f}{\partial x}$ differentiates with respect to $x$ holding other variables fixed.

Differentiability Implies Continuity

This is a fundamental theorem, not an intuitive claim. Differentiability is strictly stronger than continuity.

Theorem. If $f$ is differentiable at $a$, then $f$ is continuous at $a$.

Proof. We want to show $\lim_{x \to a} f(x) = f(a)$, i.e., $\lim_{h \to 0} [f(a+h) - f(a)] = 0$.

Write:

$$f(a+h) - f(a) = \frac{f(a+h) - f(a)}{h} \cdot h.$$

As $h \to 0$, the first factor approaches $f'(a)$ (by the assumption of differentiability) and the second factor $h \to 0$. By the limit law for products:

$$\lim_{h \to 0} [f(a+h) - f(a)] = f'(a) \cdot 0 = 0. \quad \blacksquare$$

The converse fails: $f(x) = |x|$ is continuous at $0$ but not differentiable there. Weierstrass’s nowhere-differentiable function (mentioned in the history post) is continuous everywhere, differentiable nowhere.
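A quick numerical look at the $|x|$ example (plain Python; the sample values of $h$ are arbitrary): the difference quotients from the right and from the left settle at different values, so the two-sided limit defining $f'(0)$ does not exist.

```python
f = abs
for h in [0.1, 0.001, 1e-6]:
    right = (f(0 + h) - f(0)) / h       # difference quotient with h > 0
    left = (f(0 - h) - f(0)) / (-h)     # difference quotient with h < 0
    print(f"h = {h}: right = {right:+.3f}, left = {left:+.3f}")
# right -> +1 while left -> -1, so |x| has no derivative at 0.
```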

Derivative Rules

Rather than computing from the definition every time, we derive rules.

Power Rule

Theorem. If $f(x) = x^n$ for $n \in \mathbb{Z}^+$, then $f'(x) = nx^{n-1}$.

Proof (from definition). Use the binomial theorem:

$$(x+h)^n = x^n + nx^{n-1}h + \binom{n}{2}x^{n-2}h^2 + \cdots + h^n.$$

So:

$$\frac{(x+h)^n - x^n}{h} = nx^{n-1} + \binom{n}{2}x^{n-2}h + \cdots + h^{n-1}.$$

Every term except the first contains $h$ as a factor. As $h \to 0$, all such terms vanish, leaving $nx^{n-1}$. $\blacksquare$

The rule extends to arbitrary real exponents (proved via implicit differentiation or logarithmic differentiation): $(x^r)' = rx^{r-1}$ for $x > 0$.
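A symbolic spot-check of the power rule, assuming SymPy is available (the variable names are illustrative):

```python
import sympy as sp

x = sp.symbols('x', positive=True)
n = sp.symbols('n')

print(sp.diff(x ** n, x))                   # expect n*x**(n - 1)
print(sp.diff(x ** sp.Rational(1, 2), x))   # expect 1/(2*sqrt(x)), i.e. (1/2)*x**(-1/2)
print(sp.diff(x ** -3, x))                  # expect -3/x**4, the rule with a negative exponent
```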

Product Rule

Theorem. $(fg)'(x) = f'(x)g(x) + f(x)g'(x)$.

Proof. The key trick is to add and subtract $f(x)g(x+h)$:

$$\frac{f(x+h)g(x+h) - f(x)g(x)}{h}$$ $$= \frac{f(x+h)g(x+h) - f(x)g(x+h) + f(x)g(x+h) - f(x)g(x)}{h}$$ $$= \frac{f(x+h) - f(x)}{h} \cdot g(x+h) + f(x) \cdot \frac{g(x+h) - g(x)}{h}.$$

As $h \to 0$: the first fraction $\to f'(x)$; $g(x+h) \to g(x)$ (by continuity, since $g$ is differentiable); the second fraction $\to g'(x)$. $\blacksquare$
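A numerical sanity check of the product rule (plain Python; `central_diff` is an illustrative helper based on the symmetric difference quotient):

```python
import math

def central_diff(f, x, h=1e-6):
    """Symmetric difference quotient, an O(h^2) approximation to f'(x)."""
    return (f(x + h) - f(x - h)) / (2 * h)

f, fp = math.sin, math.cos    # a function and its known derivative
g, gp = math.exp, math.exp

x = 0.7
numeric = central_diff(lambda t: f(t) * g(t), x)
analytic = fp(x) * g(x) + f(x) * gp(x)     # product rule
print(numeric, analytic)                   # agree to roughly 1e-9
```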

Quotient Rule

From the product rule applied to $f = (f/g) \cdot g$:

$$\left(\frac{f}{g}\right)' = \frac{f'g - fg'}{g^2}, \quad g \neq 0.$$

Chain Rule

Theorem. If $g$ is differentiable at $x$ and $f$ is differentiable at $g(x)$, then:

$$(f \circ g)'(x) = f'(g(x)) \cdot g'(x).$$

In Leibniz notation: if $y = f(u)$ and $u = g(x)$, then $\frac{dy}{dx} = \frac{dy}{du} \cdot \frac{du}{dx}$.

The proof requires care (a naive proof has a subtle flaw when $g'(x) = 0$). See the dedicated post on the chain rule and backpropagation for the full treatment.

Derivatives of Basic Functions from First Principles

Derivative of $e^x$

We use the characterization that $e$ is the unique base for which $\lim_{h \to 0} \frac{e^h - 1}{h} = 1$ (this limit is, in fact, one way to define $e$).

$$\frac{d}{dx}e^x = \lim_{h \to 0} \frac{e^{x+h} - e^x}{h} = e^x \cdot \lim_{h \to 0} \frac{e^h - 1}{h} = e^x \cdot 1 = e^x.$$

The exponential function equals its own derivative. This property characterizes $e^x$ uniquely (among functions with $f(0) = 1$), and is the source of its ubiquity in differential equations and probability.
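The defining limit is easy to watch numerically (plain Python; `math.expm1` computes $e^h - 1$ without catastrophic cancellation for small $h$):

```python
import math

# (e^h - 1)/h should approach 1 as h -> 0.
for h in [0.1, 0.01, 1e-4, 1e-8]:
    print(f"h = {h:>8}: (e^h - 1)/h = {math.expm1(h) / h:.10f}")
```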

Derivative of $\ln x$

Using the inverse function theorem (if $f$ and $f^{-1}$ are differentiable, $(f^{-1})'(x) = \frac{1}{f'(f^{-1}(x))}$) applied to $e^x$:

$$(\ln x)' = \frac{1}{e^{\ln x}} = \frac{1}{x}, \quad x > 0.$$

Alternatively, differentiate both sides of $e^{\ln x} = x$ with the chain rule: $e^{\ln x} \cdot (\ln x)' = 1$, so $(\ln x)' = 1/e^{\ln x} = 1/x$.

Derivative of $\sin x$

Using the sum formula $\sin(x + h) = \sin x \cos h + \cos x \sin h$:

$$\frac{\sin(x+h) - \sin x}{h} = \sin x \cdot \frac{\cos h - 1}{h} + \cos x \cdot \frac{\sin h}{h}.$$

Two standard limits: $\lim_{h \to 0} \frac{\sin h}{h} = 1$ (proved via squeeze theorem using the geometric inequality $\cos h < \frac{\sin h}{h} < 1$) and $\lim_{h \to 0} \frac{\cos h - 1}{h} = 0$ (follows from the first). Therefore:

$$(\sin x)' = \sin x \cdot 0 + \cos x \cdot 1 = \cos x.$$

Similarly, $(\cos x)' = -\sin x$.
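The two limits driving this computation can also be checked numerically (plain Python; the sample values of $h$ are arbitrary):

```python
import math

for h in [0.5, 0.1, 0.01, 1e-4]:
    print(f"h = {h:>6}: sin(h)/h = {math.sin(h) / h:.8f}, "
          f"(cos(h) - 1)/h = {(math.cos(h) - 1) / h:.8f}")
# sin(h)/h -> 1 and (cos(h) - 1)/h -> 0, the limits used above.
```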

Higher Derivatives

The derivative of the derivative is the second derivative. If $f'$ is itself differentiable, define:

$$f''(x) = (f')'(x) = \frac{d^2f}{dx^2} = \frac{d}{dx}\left(\frac{df}{dx}\right).$$

More generally, the $n$-th derivative is written $f^{(n)}(x)$ or $\frac{d^n f}{dx^n}$.

  • $f'$ measures rate of change of $f$.
  • $f''$ measures rate of change of $f'$, i.e., the concavity of $f$ (how its graph bends).

In physics: position $s(t)$, velocity $v(t) = \dot{s}(t)$, acceleration $a(t) = \ddot{s}(t) = s''(t)$.
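Numerically, the second derivative can be approximated by the second central difference $\frac{f(x+h) - 2f(x) + f(x-h)}{h^2}$. A minimal sketch (plain Python; $s(t) = t^3$ is a hypothetical position function with $s''(t) = 6t$):

```python
def second_diff(f, x, h=1e-4):
    """Second central difference, an approximation to f''(x)."""
    return (f(x + h) - 2 * f(x) + f(x - h)) / (h * h)

s = lambda t: t ** 3        # position
t = 2.0
print(second_diff(s, t))    # approximately 12.0, matching the acceleration s''(2) = 6*2
```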

Mean Value Theorem

The MVT is one of the most important theorems in calculus. It gives a rigorous version of the claim “if you average 60 mph over a trip, you must have been going exactly 60 mph at some instant.”

Theorem (Rolle’s Theorem). If $f$ is continuous on $[a, b]$, differentiable on $(a, b)$, and $f(a) = f(b)$, then there exists $c \in (a, b)$ such that $f'(c) = 0$.

Proof. By the Extreme Value Theorem, $f$ attains a maximum and minimum on $[a, b]$. If both are attained at the endpoints, then $f$ is constant (since $f(a) = f(b)$) and $f' = 0$ everywhere. Otherwise, at least one extremum is in the interior at some point $c$. At an interior extremum, $f'(c) = 0$. (Proof: if $f$ has a maximum at $c \in (a, b)$, then the difference quotient $\frac{f(c+h)-f(c)}{h}$ is $\leq 0$ for $h > 0$ and $\geq 0$ for $h < 0$; the limit must be simultaneously $\leq 0$ and $\geq 0$, so $f'(c) = 0$.) $\blacksquare$

Theorem (Mean Value Theorem). If $f$ is continuous on $[a, b]$ and differentiable on $(a, b)$, then there exists $c \in (a, b)$ such that:

$$f'(c) = \frac{f(b) - f(a)}{b - a}.$$

Proof. Apply Rolle’s theorem to the auxiliary function:

$$g(x) = f(x) - \left[f(a) + \frac{f(b) - f(a)}{b - a}(x - a)\right].$$

This subtracts the secant line from $f$. Then $g(a) = g(b) = 0$, so Rolle gives $c$ with $g'(c) = 0$, which means $f'(c) = \frac{f(b)-f(a)}{b-a}$. $\blacksquare$
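A small numerical illustration (plain Python; the choice $f(x) = x^3$ on $[0, 2]$ is a hypothetical example): the average slope is $4$, and since $f'(x) = 3x^2$ is increasing here, bisection locates the promised $c$ with $f'(c) = 4$, namely $c = 2/\sqrt{3}$.

```python
import math

f = lambda x: x ** 3
fp = lambda x: 3 * x ** 2
a, b = 0.0, 2.0
avg = (f(b) - f(a)) / (b - a)       # average slope over [a, b]: 4.0

# fp is continuous and increasing on [a, b], so bisection finds c with fp(c) = avg.
lo, hi = a, b
for _ in range(60):
    mid = (lo + hi) / 2
    if fp(mid) < avg:
        lo = mid
    else:
        hi = mid

print(avg, lo, 2 / math.sqrt(3))    # c is approximately 1.1547, as the MVT promises
```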

Consequences of MVT:

  • If $f' > 0$ on $(a, b)$, then $f$ is strictly increasing on $[a, b]$.
  • If $f' = 0$ everywhere, then $f$ is constant.
  • If $|f'(x)| \leq M$ for all $x$, then $|f(x) - f(y)| \leq M|x - y|$ for all $x, y$; that is, $f$ is Lipschitz with constant $M$.

Monotonicity and Concavity

Monotonicity is governed by the sign of $f'$:

| $f'(x)$ on $(a, b)$ | Behavior of $f$ on $[a, b]$ |
|---|---|
| $> 0$ | Strictly increasing |
| $< 0$ | Strictly decreasing |
| $= 0$ | Constant |

Concavity is governed by the sign of $f''$:

| $f''(x)$ on $(a, b)$ | Behavior |
|---|---|
| $> 0$ | Concave up (bowl-shaped, $f'$ increasing) |
| $< 0$ | Concave down ($f'$ decreasing) |

An inflection point is a point where the concavity changes, i.e., where $f''$ changes sign.

Critical Points and the Second Derivative Test

A critical point of $f$ is a point $c$ where $f'(c) = 0$ or $f'(c)$ does not exist.

Theorem (Second Derivative Test). Let $f'(c) = 0$.

  • If $f''(c) > 0$: $f$ has a local minimum at $c$.
  • If $f''(c) < 0$: $f$ has a local maximum at $c$.
  • If $f''(c) = 0$: the test is inconclusive.

Proof sketch. If $f''(c) > 0$, then $f'$ is increasing near $c$. Since $f'(c) = 0$, we have $f' < 0$ just left of $c$ and $f' > 0$ just right of $c$, so $f$ is decreasing then increasing - a local minimum. $\blacksquare$

When $f''(c) = 0$, anything can happen: $f(x) = x^4$ has a minimum at $0$ with $f''(0) = 0$; $f(x) = x^3$ has an inflection point at $0$ with $f'(0) = f''(0) = 0$.
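A sketch of the test in action, assuming SymPy is available (the polynomial $x^3 - 3x$ is a hypothetical example):

```python
import sympy as sp

x = sp.symbols('x')
f = x ** 3 - 3 * x                          # f' = 3x^2 - 3, critical points at x = -1, 1
fp, fpp = sp.diff(f, x), sp.diff(f, x, 2)

for c in sp.solve(fp, x):
    curvature = fpp.subs(x, c)              # sign of f'' at the critical point
    kind = ("local min" if curvature > 0
            else "local max" if curvature < 0
            else "inconclusive")
    print(c, curvature, kind)               # -1: f'' = -6, local max; 1: f'' = 6, local min
```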

L’Hôpital’s Rule

Theorem (L’Hôpital’s Rule). If $\lim_{x \to a} f(x) = \lim_{x \to a} g(x) = 0$ (or both $= \pm\infty$), and $g'(x) \neq 0$ near $a$, and $\lim_{x \to a} \frac{f'(x)}{g'(x)}$ exists, then:

$$\lim_{x \to a} \frac{f(x)}{g(x)} = \lim_{x \to a} \frac{f'(x)}{g'(x)}.$$

When it applies: only to the indeterminate forms $\frac{0}{0}$ and $\frac{\infty}{\infty}$, not to determinate forms like $\frac{1}{0}$ or $\frac{3}{0}$.

Common errors:

  • Applying it when the form is not indeterminate: $\lim_{x \to 0} \frac{x+1}{\sin x}$ is not $\frac{0}{0}$; plugging in gives the form $\frac{1}{0}$ (the one-sided limits are $\pm\infty$), so L’Hôpital does not apply.
  • Differentiating the ratio as a quotient (use the quotient rule) instead of differentiating numerator and denominator separately.
  • Applying repeatedly without checking the condition holds each time.

Example: $\lim_{x \to 0} \frac{\sin x}{x}$. Both numerator and denominator $\to 0$. Applying L’Hôpital:

$$\lim_{x \to 0} \frac{\sin x}{x} = \lim_{x \to 0} \frac{\cos x}{1} = 1.$$

(Though we already know this from the squeeze theorem argument - which, importantly, is how we proved $(\sin x)' = \cos x$ in the first place. Using L’Hôpital to prove $(\sin x)' = \cos x$ and then using $(\sin x)' = \cos x$ to apply L’Hôpital to $\frac{\sin x}{x}$ is circular.)
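Symbolic software can evaluate such limits directly, which makes a useful cross-check on a hand-applied L’Hôpital step. A sketch assuming SymPy is available:

```python
import sympy as sp

x = sp.symbols('x')
print(sp.limit(sp.sin(x) / x, x, 0))                  # 1, the 0/0 example above
print(sp.limit((sp.exp(x) - 1 - x) / x ** 2, x, 0))   # 1/2, another genuine 0/0 form
print(sp.limit((x + 1) / sp.sin(x), x, 0, '+'))       # oo: a 1/0 form, not indeterminate
```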

The CS Perspective: Automatic Differentiation

In neural network training, you compute gradients of a loss function with respect to millions of parameters. Doing this by hand, or by numerical finite differences, would be impractical. Automatic differentiation (autodiff) is the exact, efficient computation of derivatives using the chain rule.

Every computation in a neural network is a composition of elementary functions: additions, multiplications, $\exp$, $\log$, etc. Autodiff builds a computation graph where each node is an elementary operation, and backpropagation traverses this graph in reverse, applying the chain rule at each step.

The result is not an approximation. Given $y = f(g(h(x)))$:

$$\frac{dy}{dx} = f'(g(h(x))) \cdot g'(h(x)) \cdot h'(x).$$

Autodiff computes this exactly (up to floating-point rounding). The cost of computing all partial derivatives $\frac{\partial L}{\partial \theta_i}$ simultaneously via reverse-mode autodiff (backprop) is roughly the same as computing the forward pass - a remarkable fact that is non-obvious and worth understanding deeply.
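Here is a minimal sketch of reverse-mode autodiff on scalars (plain Python; the `Value` class and its methods are illustrative, not the API of any real framework). Each operation records its inputs together with the local derivative, and `backward` walks the recorded graph in reverse, multiplying local derivatives along each path exactly as the chain rule dictates.

```python
import math

class Value:
    """A scalar that remembers how it was computed, so gradients can flow backward."""
    def __init__(self, data, parents=()):
        self.data = data
        self.grad = 0.0
        self._parents = parents              # pairs (parent Value, local derivative)

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        return Value(self.data + other.data, [(self, 1.0), (other, 1.0)])

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        return Value(self.data * other.data, [(self, other.data), (other, self.data)])

    def exp(self):
        out = math.exp(self.data)
        return Value(out, [(self, out)])     # d/du e^u = e^u

    def backward(self, seed=1.0):
        # Accumulate d(output)/d(self), then push the chain rule to each parent.
        self.grad += seed
        for parent, local in self._parents:
            parent.backward(seed * local)

# y = exp(x*w) + x, so dy/dx = w*e^{xw} + 1 and dy/dw = x*e^{xw}.
x, w = Value(0.5), Value(2.0)
y = (x * w).exp() + x
y.backward()
print(x.grad, w.grad)   # approximately 6.4366 and 1.3591, matching the formulas above
```

This recursive version revisits shared subexpressions once per path; production frameworks topologically sort the graph and sweep it once in reverse, which is what keeps the backward pass roughly as cheap as the forward pass.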

Gradient as derivative in higher dimensions. For a scalar function $f: \mathbb{R}^n \to \mathbb{R}$, the derivative generalizes to the gradient $\nabla f = \left(\frac{\partial f}{\partial x_1}, \ldots, \frac{\partial f}{\partial x_n}\right)$. This is a vector pointing in the direction of steepest ascent. Gradient descent moves opposite to this direction: $\theta \leftarrow \theta - \eta \nabla_\theta \mathcal{L}(\theta)$.
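A tiny gradient-descent loop to make the update rule concrete (plain Python; the quadratic loss and the learning rate are hypothetical):

```python
# Minimize L(theta) = (theta_1 - 3)^2 + (theta_2 + 1)^2 by moving against the gradient.
theta = [0.0, 0.0]
eta = 0.1                                               # learning rate

def grad(theta):
    return [2 * (theta[0] - 3), 2 * (theta[1] + 1)]     # the gradient of L

for _ in range(100):
    g = grad(theta)
    theta = [t - eta * gi for t, gi in zip(theta, g)]   # theta <- theta - eta * grad L

print(theta)    # close to the minimizer (3, -1)
```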

The MVT has a multivariate analog: for a differentiable $f: \mathbb{R}^n \to \mathbb{R}$, there exists $c$ on the line segment between $x$ and $y$ such that $f(y) - f(x) = \nabla f(c) \cdot (y - x)$. This is used in convergence proofs for gradient descent: if $\|\nabla f\|$ is bounded, the function cannot change too fast, which constrains how large the learning-rate steps can safely be.


The derivative gives us local information about a function: slope, rate of change, direction of increase. In the next post, we pull back to see the chain rule in full generality - and why it is, essentially, the entire story of backpropagation.

