A Taylor series represents a smooth function as an infinite sum of polynomial terms built from its derivatives at a single point. This might seem like a curiosity, but it is one of the most practically useful ideas in analysis, powering everything from numerical solvers to the activation functions studied in neural networks.

Taylor Polynomials

Given a function $f$ that is $n$ times differentiable at a point $a$, the degree-$n$ Taylor polynomial of $f$ centered at $a$ is

$$P_n(x) = \sum_{k=0}^{n} \frac{f^{(k)}(a)}{k!}(x - a)^k$$

where $f^{(k)}$ denotes the $k$-th derivative of $f$ and $f^{(0)} = f$. This polynomial is uniquely determined by the requirement that $P_n^{(k)}(a) = f^{(k)}(a)$ for every $k = 0, 1, \ldots, n$. Geometrically, $P_1(x) = f(a) + f'(a)(x-a)$ is the tangent line, $P_2$ curves to match concavity, and higher degrees add progressively finer corrections.

f(x)
  |       /  true f
  |      / ~~~~~
  |     /~~      ~~~
  |    /   P_2 (parabola)
  |   / ----
  |  /--     P_1 (tangent line)
  | /
  |/
  +----------------------------> x
       a
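
To make this concrete, here is a minimal Python sketch (the helper `taylor_poly` is illustrative, not a library function) that evaluates $P_n$ from a list of derivative values and shows the error shrinking for $f = e^x$ at $x = 1$:

```python
import math

def taylor_poly(derivs_at_a, a, x):
    """Evaluate P_n(x) = sum_k f^(k)(a)/k! * (x - a)^k from the list
    of derivative values [f(a), f'(a), ..., f^(n)(a)]."""
    return sum(d / math.factorial(k) * (x - a) ** k
               for k, d in enumerate(derivs_at_a))

# Example: f = exp, a = 0, so every derivative at 0 equals 1.
for n in (1, 2, 5):
    p = taylor_poly([1.0] * (n + 1), a=0.0, x=1.0)
    print(n, p, math.exp(1.0) - p)   # error drops as n grows
```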

Taylor’s Theorem with Remainder

The key theorem that makes Taylor polynomials more than a formal exercise is:

Theorem (Taylor’s Theorem). Let $f$ be $(n+1)$ times continuously differentiable on an open interval containing $a$ and $x$. Then

$$f(x) = \sum_{k=0}^{n} \frac{f^{(k)}(a)}{k!}(x-a)^k + R_n(x)$$

where $R_n(x)$ is the remainder (error) of the approximation.

Lagrange Form of the Remainder

The most useful form of the remainder is due to Lagrange. Under the same hypotheses,

$$R_n(x) = \frac{f^{(n+1)}(c)}{(n+1)!}(x - a)^{n+1}$$

for some $c$ strictly between $a$ and $x$ (existence guaranteed by a generalized mean value theorem). The point $c$ is not known explicitly, but the formula lets us bound the error: if $|f^{(n+1)}(t)| \leq M$ for all $t$ between $a$ and $x$, then

$$|R_n(x)| \leq \frac{M}{(n+1)!}|x - a|^{n+1}$$

This bound shrinks as $n$ increases for fixed $x$, provided the derivative bounds $M$ grow more slowly than $(n+1)!$, confirming that higher-degree polynomials give better approximations.
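
As a sanity check, the following illustrative Python sketch (with $f = e^x$, $a = 0$, and the crude bound $M = e$ on $[0, 1]$) compares the actual truncation error against the Lagrange bound:

```python
import math

def lagrange_bound(M, n, x, a=0.0):
    # |R_n(x)| <= M / (n+1)! * |x - a|^(n+1)
    return M / math.factorial(n + 1) * abs(x - a) ** (n + 1)

x = 1.0
for n in range(1, 8):
    p = sum(x**k / math.factorial(k) for k in range(n + 1))  # P_n for exp at 0
    actual = abs(math.exp(x) - p)
    bound = lagrange_bound(M=math.e, n=n, x=x)  # |exp^(n+1)(t)| <= e on [0, 1]
    print(f"n={n}: actual={actual:.2e}  bound={bound:.2e}")  # bound >= actual
```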

Standard Maclaurin Series

A Maclaurin series is a Taylor series centered at $a = 0$. The following are valid on the stated domains.

Exponential function (entire): $$e^x = \sum_{k=0}^{\infty} \frac{x^k}{k!} = 1 + x + \frac{x^2}{2!} + \frac{x^3}{3!} + \cdots, \quad x \in \mathbb{R}$$

Sine (entire): $$\sin x = \sum_{k=0}^{\infty} \frac{(-1)^k x^{2k+1}}{(2k+1)!} = x - \frac{x^3}{6} + \frac{x^5}{120} - \cdots$$

Cosine (entire): $$\cos x = \sum_{k=0}^{\infty} \frac{(-1)^k x^{2k}}{(2k)!} = 1 - \frac{x^2}{2} + \frac{x^4}{24} - \cdots$$

Natural logarithm (radius of convergence 1): $$\ln(1+x) = \sum_{k=1}^{\infty} \frac{(-1)^{k+1} x^k}{k} = x - \frac{x^2}{2} + \frac{x^3}{3} - \cdots, \quad x \in (-1, 1]$$

Geometric series (radius of convergence 1): $$\frac{1}{1-x} = \sum_{k=0}^{\infty} x^k = 1 + x + x^2 + x^3 + \cdots, \quad x \in (-1, 1)$$

Binomial series (generalized binomial, $\alpha \in \mathbb{R}$): $$(1+x)^\alpha = \sum_{k=0}^{\infty} \binom{\alpha}{k} x^k, \quad \binom{\alpha}{k} = \frac{\alpha(\alpha-1)\cdots(\alpha-k+1)}{k!}$$

valid for $|x| < 1$ (and at endpoints depending on $\alpha$). When $\alpha$ is a non-negative integer this reduces to the finite binomial theorem.
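
For illustration, here is a small Python sketch that sums truncated Maclaurin series and compares them against the standard library. Note how slowly $\ln(1+x)$ converges compared with the entire functions, reflecting its radius of convergence of 1:

```python
import math

def maclaurin_exp(x, n):
    return sum(x**k / math.factorial(k) for k in range(n + 1))

def maclaurin_sin(x, n):
    return sum((-1)**k * x**(2*k + 1) / math.factorial(2*k + 1)
               for k in range(n + 1))

def maclaurin_log1p(x, n):          # valid for -1 < x <= 1
    return sum((-1)**(k + 1) * x**k / k for k in range(1, n + 1))

x = 0.5
print(maclaurin_exp(x, 10) - math.exp(x))      # negligible after few terms
print(maclaurin_sin(x, 5) - math.sin(x))       # entire: converges very fast
print(maclaurin_log1p(x, 30) - math.log1p(x))  # slow: error ~ 0.5^31 / 31
```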

Radius of Convergence

A power series $\sum_{k=0}^\infty c_k (x-a)^k$ converges on an interval centered at $a$ (possibly just the single point $a$, possibly all of $\mathbb{R}$). The radius of convergence $R$ is defined by the Cauchy–Hadamard formula:

$$\frac{1}{R} = \limsup_{k \to \infty} |c_k|^{1/k}$$

Equivalently, by the ratio test, if $\lim_{k\to\infty} |c_{k+1}/c_k|$ exists, then $R = \lim_{k\to\infty} |c_k/c_{k+1}|$. The series converges absolutely for $|x - a| < R$ and diverges for $|x - a| > R$; behavior at the endpoints $x = a \pm R$ must be checked separately.
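
As a worked example, for $\ln(1+x)$ the coefficients are $c_k = (-1)^{k+1}/k$, so the ratio test gives

$$R = \lim_{k\to\infty} \left|\frac{c_k}{c_{k+1}}\right| = \lim_{k\to\infty} \frac{k+1}{k} = 1,$$

consistent with the domain stated earlier: the series converges at $x = 1$ (the alternating harmonic series) and diverges at $x = -1$ (the harmonic series).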

Theorem (Uniqueness of power series representation). If $\sum_{k=0}^\infty c_k (x-a)^k = f(x)$ for all $x$ in some open interval containing $a$, then $c_k = f^{(k)}(a)/k!$ for every $k$. A function can have at most one power series expansion centered at a given point.

Analytic Functions

A function $f$ is real analytic at $a$ if it equals its Taylor series in some open interval around $a$. Equivalently, $f$ is analytic near $a$ if and only if it is $C^\infty$ there and, on each compact subinterval $[a-r, a+r]$ of that neighborhood, there exists a constant $C$ such that

$$|f^{(k)}(x)| \leq C^{k+1} k! \quad \text{for all } k \geq 0 \text{ and all } x \in [a-r, a+r]$$

All the standard elementary functions ($e^x$, $\sin x$, $\cos x$, $\ln x$, polynomials, rational functions away from poles) are analytic on their domains. Analyticity is strictly stronger than infinite differentiability.

Non-Analytic Smooth Functions

The existence of $C^\infty$ functions that are not analytic is a subtle but important fact. The standard example is

$$f(x) = \begin{cases} e^{-1/x^2} & x \neq 0 \\ 0 & x = 0 \end{cases}$$

One can show by induction that $f^{(k)}(0) = 0$ for every $k \geq 0$, so the Taylor series of $f$ at 0 is identically zero. Yet $f(x) > 0$ for all $x \neq 0$, so $f$ is not equal to its Taylor series on any punctured neighborhood of 0. Such functions are called flat at the origin.

This has practical consequences: you cannot, in general, recover a smooth function from its derivatives at a single point. Analytic functions are special.
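
A quick numerical look (an illustrative Python sketch) makes the flatness vivid; the values near 0 are so small that no finite-degree polynomial information at the origin can see them:

```python
import math

def f(x):
    return math.exp(-1.0 / x**2) if x != 0 else 0.0

# Every Taylor coefficient of f at 0 vanishes, and the raw values show
# why: f decays to 0 faster than any power of x as x -> 0.
for x in (0.5, 0.2, 0.1, 0.05):
    print(f"f({x}) = {f(x):.3e}")
# f(0.1) = exp(-100) ~ 3.7e-44, already far below any floating-point noise
```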

Error Bounds in Practice

Suppose we approximate $\sin x$ by its degree-5 Taylor polynomial at 0:

$$\sin x \approx x - \frac{x^3}{6} + \frac{x^5}{120}$$

The error is $R_5(x) = \frac{\sin^{(6)}(c)}{6!} x^6 = \frac{-\sin c}{720} x^6$ for some $c$ between 0 and $x$. Since $|\sin c| \leq 1$,

$$|R_5(x)| \leq \frac{|x|^6}{720}$$

For $|x| \leq 0.1$ this gives $|R_5| \leq 10^{-6}/720 \approx 1.4 \times 10^{-9}$, well below single-precision machine epsilon ($\approx 1.2 \times 10^{-7}$). Polynomial approximations of this kind, typically minimax polynomials over a reduced argument range rather than raw Taylor truncations, are how transcendental functions are implemented in hardware and in math libraries.
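
The following illustrative Python sketch checks this bound numerically. The observed error is in fact about two orders of magnitude below the bound, because the $x^6$ coefficient of $\sin$ vanishes and the true leading error term is $x^7/5040$:

```python
import math

def sin5(x):
    # degree-5 Maclaurin polynomial for sin
    return x - x**3 / 6 + x**5 / 120

worst = 0.0
for i in range(-100, 101):
    x = i / 1000.0                                # sweep |x| <= 0.1
    worst = max(worst, abs(math.sin(x) - sin5(x)))

print(f"max observed error: {worst:.2e}")         # ~2.0e-11
print(f"Lagrange bound:     {0.1**6 / 720:.2e}")  # ~1.4e-09
```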

Multivariable Taylor Expansion

For $f: \mathbb{R}^n \to \mathbb{R}$ that is twice continuously differentiable, the second-order Taylor expansion around $x$ is

$$f(x + h) = f(x) + \nabla f(x)^T h + \frac{1}{2} h^T H(x) h + o(\|h\|^2)$$

where $\nabla f(x) \in \mathbb{R}^n$ is the gradient and $H(x) \in \mathbb{R}^{n \times n}$ is the Hessian matrix with entries $H_{ij} = \frac{\partial^2 f}{\partial x_i \partial x_j}(x)$. If $f$ is three times continuously differentiable, the remainder sharpens to $O(\|h\|^3)$.

The full $k$-th order term involves a degree-$k$ symmetric tensor of partial derivatives contracted with $h^{\otimes k}$, but in practice the first two correction terms already encode the essential geometry for optimization: the gradient gives the direction of steepest ascent, and the Hessian describes the local curvature around the point.
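
To see the quadratic model in action, here is an illustrative sketch (assuming NumPy; the example function $f(x, y) = e^x \sin y$ and its hand-written gradient and Hessian are ours) verifying that the model error decays like $\|h\|^3$ for this smooth $f$:

```python
import numpy as np

def f(v):
    x, y = v
    return np.exp(x) * np.sin(y)

def grad(v):
    x, y = v
    return np.array([np.exp(x) * np.sin(y), np.exp(x) * np.cos(y)])

def hess(v):
    x, y = v
    return np.array([[np.exp(x) * np.sin(y),  np.exp(x) * np.cos(y)],
                     [np.exp(x) * np.cos(y), -np.exp(x) * np.sin(y)]])

x0 = np.array([0.3, 0.7])
for eps in (1e-1, 1e-2, 1e-3):
    h = eps * np.array([1.0, -2.0])
    model = f(x0) + grad(x0) @ h + 0.5 * h @ hess(x0) @ h
    print(f"eps={eps}: error = {abs(f(x0 + h) - model):.2e}")  # shrinks ~ eps^3
```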

Examples

Newton’s method. Finding a zero of $g: \mathbb{R} \to \mathbb{R}$ by linearizing: $0 \approx g(x_k) + g'(x_k)(x_{k+1} - x_k)$, giving $x_{k+1} = x_k - g(x_k)/g'(x_k)$. This is a degree-1 Taylor approximation to $g$. For optimization (finding $\nabla f = 0$), using the degree-2 Taylor approximation of $f$ yields $x_{k+1} = x_k - H(x_k)^{-1}\nabla f(x_k)$. Quadratic convergence near a simple root (or a minimum with positive-definite Hessian) follows from the second-order accuracy of the approximation.
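
A minimal sketch of the root-finding iteration (illustrative Python; the function names are ours):

```python
def newton(g, dg, x0, tol=1e-12, max_iter=20):
    """Root finding by repeatedly solving the degree-1 Taylor model
    g(x_k) + g'(x_k)(x - x_k) = 0 for x."""
    x = x0
    for _ in range(max_iter):
        step = g(x) / dg(x)
        x -= step
        print(x)               # the number of correct digits roughly doubles
        if abs(step) < tol:
            break
    return x

# Example: sqrt(2) as the root of g(x) = x^2 - 2.
newton(lambda x: x * x - 2, lambda x: 2 * x, x0=1.0)
```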

Neural network activations. Activation functions like $\sigma(x) = 1/(1+e^{-x})$ and $\tanh(x)$ are often analyzed via their Taylor series near 0. The linear term $\sigma'(0) \cdot x = x/4$ controls the initial signal propagation; the cubic correction $-x^3/3$ for $\tanh$ is the first sign of its saturation nonlinearity. Understanding which terms matter at initialization informs weight-initialization schemes.
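
The claim is easy to check numerically (illustrative Python): near 0 the cubic truncation tracks $\tanh$ closely, but it cannot reproduce saturation:

```python
import math

def tanh3(x):
    # cubic Maclaurin approximation: tanh(x) = x - x^3/3 + O(x^5)
    return x - x**3 / 3

for x in (0.1, 0.5, 1.0, 2.0):
    print(f"x={x}: tanh={math.tanh(x):+.4f}  cubic={tanh3(x):+.4f}")
# Near 0 the two agree; by x ~ 1 the truncation departs, and for large x
# it diverges while tanh saturates at 1.
```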

Automatic differentiation. Forward-mode AD computes the Jacobian-vector product $J_f(x) v$ in a single pass; this is the directional derivative and corresponds to evaluating the first-order Taylor coefficient. Higher-order AD produces higher Taylor coefficients efficiently by propagating dual numbers or hyperdual numbers.
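
Here is a minimal dual-number sketch (illustrative Python supporting only `+` and `*`, not a real AD library) showing the first-order Taylor coefficient propagating through arithmetic:

```python
class Dual:
    """Dual number a + b*eps with eps^2 = 0; forward-mode AD propagates
    the first-order Taylor coefficient alongside the value."""
    def __init__(self, val, dot=0.0):
        self.val, self.dot = val, dot
    def __add__(self, o):
        o = o if isinstance(o, Dual) else Dual(o)
        return Dual(self.val + o.val, self.dot + o.dot)
    __radd__ = __add__
    def __mul__(self, o):
        o = o if isinstance(o, Dual) else Dual(o)
        return Dual(self.val * o.val, self.val * o.dot + self.dot * o.val)
    __rmul__ = __mul__

def f(x):
    return 3 * x * x + 2 * x + 1   # f'(x) = 6x + 2

x = Dual(2.0, 1.0)                 # seed dot = 1: differentiate w.r.t. x
y = f(x)
print(y.val, y.dot)                # 17.0 14.0
```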

