Taylor Series
A Taylor series represents a smooth function as an infinite sum of polynomial terms built from its derivatives at a single point. This might seem like a curiosity, but it is one of the most practically useful ideas in all of analysis, powering everything from numerical solvers to the activation functions studied in neural networks.
Taylor Polynomials
Given a function $f$ that is $n$ times differentiable at a point $a$, the degree-$n$ Taylor polynomial of $f$ centered at $a$ is
$$P_n(x) = \sum_{k=0}^{n} \frac{f^{(k)}(a)}{k!}(x - a)^k$$
where $f^{(k)}$ denotes the $k$-th derivative of $f$ and $f^{(0)} = f$. This polynomial is uniquely determined by the requirement that $P_n^{(k)}(a) = f^{(k)}(a)$ for every $k = 0, 1, \ldots, n$. Geometrically, $P_1(x) = f(a) + f'(a)(x-a)$ is the tangent line, $P_2$ curves to match concavity, and higher degrees add progressively finer corrections.
[Figure: the graph of $f$ near $x = a$, with the tangent line $P_1$ and the parabola $P_2$; $P_1$ matches the slope of $f$ at $a$, while $P_2$ also matches its concavity.]
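To make the definition concrete, here is a minimal sketch (the helper `taylor_poly` is illustrative, not a library function) that evaluates $P_n$ for $f = e^x$ centered at $a = 0$, where every derivative equals 1:

```python
from math import exp, factorial

def taylor_poly(derivs_at_a, a, x):
    """Evaluate P_n(x) = sum_k f^(k)(a) / k! * (x - a)^k
    from a list [f(a), f'(a), ..., f^(n)(a)] of derivatives at a."""
    return sum(d / factorial(k) * (x - a) ** k
               for k, d in enumerate(derivs_at_a))

# For f = exp centered at a = 0, every derivative equals 1.
p6 = taylor_poly([1.0] * 7, a=0.0, x=0.5)
print(p6, exp(0.5))  # the two values agree to roughly five decimal places
```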
Taylor’s Theorem with Remainder
The key theorem that makes Taylor polynomials more than a formal exercise is:
Theorem (Taylor’s Theorem). Let $f$ be $(n+1)$ times continuously differentiable on an open interval containing $a$ and $x$. Then
$$f(x) = \sum_{k=0}^{n} \frac{f^{(k)}(a)}{k!}(x-a)^k + R_n(x)$$
where $R_n(x)$ is the remainder (error) of the approximation.
Lagrange Form of the Remainder
The most useful form of the remainder is due to Lagrange. Under the same hypotheses,
$$R_n(x) = \frac{f^{(n+1)}(c)}{(n+1)!}(x - a)^{n+1}$$
for some $c$ strictly between $a$ and $x$ (existence guaranteed by a generalized mean value theorem). The point $c$ is not known explicitly, but the formula lets us bound the error: if $|f^{(n+1)}(t)| \leq M$ for all $t$ between $a$ and $x$, then
$$|R_n(x)| \leq \frac{M}{(n+1)!}|x - a|^{n+1}$$
This bound shrinks as $n$ increases (for fixed $x$ and well-behaved $f$), confirming that higher-degree polynomials give better approximations.
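As a quick numerical check of this bound (the function name is illustrative), compare the actual truncation error of $e^x$ at $x = 1$ with the Lagrange bound using $M = e$, valid on $[0, 1]$:

```python
from math import exp, factorial

def exp_error_and_bound(n, x=1.0):
    """Actual truncation error of the degree-n Taylor polynomial of e^x
    at a = 0, next to the Lagrange bound with M = e (valid on [0, 1])."""
    p_n = sum(x ** k / factorial(k) for k in range(n + 1))
    actual = abs(exp(x) - p_n)
    bound = exp(1.0) * abs(x) ** (n + 1) / factorial(n + 1)
    return actual, bound

for n in (2, 5, 10):
    print(n, *exp_error_and_bound(n))  # the actual error never exceeds the bound
```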
Standard Maclaurin Series
A Maclaurin series is a Taylor series centered at $a = 0$. The following are valid on the stated domains.
Exponential function (entire): $$e^x = \sum_{k=0}^{\infty} \frac{x^k}{k!} = 1 + x + \frac{x^2}{2!} + \frac{x^3}{3!} + \cdots, \quad x \in \mathbb{R}$$
Sine (entire): $$\sin x = \sum_{k=0}^{\infty} \frac{(-1)^k x^{2k+1}}{(2k+1)!} = x - \frac{x^3}{6} + \frac{x^5}{120} - \cdots$$
Cosine (entire): $$\cos x = \sum_{k=0}^{\infty} \frac{(-1)^k x^{2k}}{(2k)!} = 1 - \frac{x^2}{2} + \frac{x^4}{24} - \cdots$$
Natural logarithm (radius of convergence 1): $$\ln(1+x) = \sum_{k=1}^{\infty} \frac{(-1)^{k+1} x^k}{k} = x - \frac{x^2}{2} + \frac{x^3}{3} - \cdots, \quad x \in (-1, 1]$$
Geometric series (radius of convergence 1): $$\frac{1}{1-x} = \sum_{k=0}^{\infty} x^k = 1 + x + x^2 + x^3 + \cdots, \quad x \in (-1, 1)$$
Binomial series (generalized binomial, $\alpha \in \mathbb{R}$): $$(1+x)^\alpha = \sum_{k=0}^{\infty} \binom{\alpha}{k} x^k, \quad \binom{\alpha}{k} = \frac{\alpha(\alpha-1)\cdots(\alpha-k+1)}{k!}$$
valid for $|x| < 1$ (and at endpoints depending on $\alpha$). When $\alpha$ is a non-negative integer this reduces to the finite binomial theorem.
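These expansions are easy to check numerically; a small sketch (helper names are illustrative) comparing partial sums against the standard library:

```python
from math import factorial, log, sin

def sin_series(x, terms=10):
    """Partial sum of the Maclaurin series for sin x."""
    return sum((-1) ** k * x ** (2 * k + 1) / factorial(2 * k + 1)
               for k in range(terms))

def log1p_series(x, terms=200):
    """Partial sum of ln(1+x) = sum (-1)^(k+1) x^k / k, |x| < 1."""
    return sum((-1) ** (k + 1) * x ** k / k for k in range(1, terms + 1))

print(sin_series(1.0), sin(1.0))    # rapid convergence: sin is entire
print(log1p_series(0.5), log(1.5))  # slower: radius of convergence is 1
```

Note the contrast: the entire functions converge after a handful of terms, while the logarithm needs many more as $x$ approaches the edge of its interval of convergence.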
Radius of Convergence
A power series $\sum_{k=0}^\infty c_k (x-a)^k$ converges on an interval centered at $a$ (possibly only at $a$ itself, possibly all of $\mathbb{R}$). The radius of convergence $R$ is given by the Cauchy–Hadamard formula:
$$\frac{1}{R} = \limsup_{k \to \infty} |c_k|^{1/k}$$
Equivalently, by the ratio test, if $\lim_{k\to\infty} |c_{k+1}/c_k|$ exists, then $R = \lim_{k\to\infty} |c_k/c_{k+1}|$. The series converges absolutely for $|x - a| < R$ and diverges for $|x - a| > R$; behavior at the endpoints $x = a \pm R$ must be checked separately.
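For the $\ln(1+x)$ series, the ratio-test estimate can be watched converging to $R = 1$ (helper names are illustrative):

```python
def c_log(k):
    """Maclaurin coefficients of ln(1+x): c_k = (-1)^(k+1) / k."""
    return (-1) ** (k + 1) / k

def ratio_radius(c, k):
    """Ratio-test estimate |c_k / c_(k+1)| of the radius of convergence."""
    return abs(c(k) / c(k + 1))

for k in (10, 100, 1000):
    print(k, ratio_radius(c_log, k))  # (k+1)/k, approaching R = 1
```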
Theorem (Uniqueness of power series representation). If $\sum_{k=0}^\infty c_k (x-a)^k = f(x)$ for all $x$ in some open interval containing $a$, then $c_k = f^{(k)}(a)/k!$ for every $k$. A function can have at most one power series expansion centered at a given point.
Analytic Functions
A function $f$ is real analytic at $a$ if it equals its Taylor series in some open interval around $a$. Equivalently, $f$ is analytic on an open interval if and only if for every compact subinterval $[a-r, a+r]$ there exists a constant $C > 0$ such that
$$|f^{(k)}(x)| \leq C^{k+1} k! \quad \text{for all } k \geq 0 \text{ and all } x \in [a-r, a+r]$$
All the standard elementary functions ($e^x$, $\sin x$, $\cos x$, $\ln x$, polynomials, rational functions away from poles) are analytic on their domains. Analyticity is strictly stronger than infinite differentiability.
Non-Analytic Smooth Functions
The existence of $C^\infty$ functions that are not analytic is a subtle but important fact. The standard example is
$$f(x) = \begin{cases} e^{-1/x^2} & x \neq 0 \\ 0 & x = 0 \end{cases}$$
One can show by induction that $f^{(k)}(0) = 0$ for every $k \geq 0$, so the Taylor series of $f$ at 0 is identically zero. Yet $f(x) > 0$ for all $x \neq 0$, so $f$ is not equal to its Taylor series on any punctured neighborhood of 0. Such functions are called flat at the origin.
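A quick numerical look at this function (the name `flat` is illustrative) shows the mechanism: $e^{-1/x^2}$ decays to 0 faster than any power of $x$, which is why every derivative at the origin vanishes:

```python
from math import exp

def flat(x):
    """e^(-1/x^2) for x != 0, and 0 at x = 0: smooth everywhere,
    strictly positive away from 0, yet flat at the origin."""
    return exp(-1.0 / x ** 2) if x != 0 else 0.0

# flat(x) / x^10 -> 0 as x -> 0, even though we divide by a tiny power;
# the same holds for any exponent, forcing every derivative at 0 to be 0.
for x in (0.5, 0.2, 0.1):
    print(x, flat(x), flat(x) / x ** 10)
```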
This has practical consequences: you cannot, in general, recover a smooth function from its derivatives at a single point. Analytic functions are special.
Error Bounds in Practice
Suppose we approximate $\sin x$ by its degree-5 Taylor polynomial at 0:
$$\sin x \approx x - \frac{x^3}{6} + \frac{x^5}{120}$$
The error is $R_5(x) = \frac{\sin^{(6)}(c)}{6!} x^6 = \frac{-\sin c}{720} x^6$ for some $c$ between 0 and $x$. Since $|\sin c| \leq 1$,
$$|R_5(x)| \leq \frac{|x|^6}{720}$$
For $|x| \leq 0.1$ this gives $|R_5| \leq 10^{-6}/720 \approx 1.4 \times 10^{-9}$, well below single-precision machine epsilon ($\approx 1.2 \times 10^{-7}$). Hardware and software implementations of transcendental functions work on the same principle, though production libraries typically combine argument reduction with minimax polynomials rather than truncated Taylor series.
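The bound is easy to verify on a grid (a minimal sketch; names are illustrative):

```python
from math import sin

def sin_p5(x):
    """Degree-5 Maclaurin polynomial for sin."""
    return x - x ** 3 / 6 + x ** 5 / 120

# Worst observed error over a grid in [-0.1, 0.1], versus the Lagrange bound.
worst = max(abs(sin(x) - sin_p5(x))
            for x in (i / 1000 for i in range(-100, 101)))
print(worst, 0.1 ** 6 / 720)  # the observed error stays below the bound
```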
Multivariable Taylor Expansion
For $f: \mathbb{R}^n \to \mathbb{R}$ that is twice continuously differentiable, the second-order Taylor expansion around $x$ is
$$f(x + h) = f(x) + \nabla f(x)^T h + \frac{1}{2} h^T H(x) h + O(\|h\|^3)$$
where $\nabla f(x) \in \mathbb{R}^n$ is the gradient and $H(x) \in \mathbb{R}^{n \times n}$ is the Hessian matrix with entries $H_{ij} = \frac{\partial^2 f}{\partial x_i \partial x_j}(x)$.
The full $k$-th order term involves a degree-$k$ symmetric tensor of partial derivatives contracted with $h^{\otimes k}$, but in practice the first two correction terms already encode the essential geometry for optimization: the gradient gives the direction of steepest ascent; the Hessian describes how that direction is curved.
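A small sketch of the second-order expansion for a hypothetical cubic $f(x, y) = x^2 + 3xy + y^3$, whose gradient and Hessian are known in closed form (all names are illustrative):

```python
def f(x, y):
    return x ** 2 + 3 * x * y + y ** 3

def grad(x, y):
    return (2 * x + 3 * y, 3 * x + 3 * y ** 2)

def hess(x, y):
    return ((2, 3), (3, 6 * y))  # symmetric 2x2 Hessian

def second_order(x, y, hx, hy):
    """f(x) + grad(x)^T h + (1/2) h^T H(x) h."""
    gx, gy = grad(x, y)
    H = hess(x, y)
    quad = H[0][0] * hx * hx + 2 * H[0][1] * hx * hy + H[1][1] * hy * hy
    return f(x, y) + gx * hx + gy * hy + 0.5 * quad

x, y, hx, hy = 1.0, 2.0, 1e-2, -1e-2
err = abs(f(x + hx, y + hy) - second_order(x, y, hx, hy))
print(err)  # on the order of |h|^3 = 1e-6, as the O(|h|^3) term predicts
```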
Examples
Newton’s method. Finding a zero of $g: \mathbb{R} \to \mathbb{R}$ by linearizing: $0 \approx g(x_k) + g'(x_k)(x_{k+1} - x_k)$, giving $x_{k+1} = x_k - g(x_k)/g'(x_k)$. This is a degree-1 Taylor approximation to $g$. For optimization (finding $\nabla f = 0$), using the degree-2 Taylor approximation of $f$ yields $x_{k+1} = x_k - H(x_k)^{-1}\nabla f(x_k)$. Quadratic convergence follows from the second-order accuracy of the approximation.
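The scalar iteration fits in a few lines; a minimal sketch (names are illustrative), applied to $g(x) = x^2 - 2$:

```python
def newton(g, dg, x0, tol=1e-12, max_iter=50):
    """Newton's method: repeatedly solve the degree-1 Taylor model of g."""
    x = x0
    for _ in range(max_iter):
        step = g(x) / dg(x)  # solves 0 = g(x) + g'(x) * (x_next - x)
        x -= step
        if abs(step) < tol:
            break
    return x

# Square root of 2 as the positive zero of g(x) = x^2 - 2.
root = newton(lambda x: x * x - 2, lambda x: 2 * x, x0=1.0)
print(root)  # approximately 1.41421356...
```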
Neural network activations. Activation functions like $\sigma(x) = 1/(1+e^{-x})$ and $\tanh(x)$ are often analyzed via their Taylor series near 0. The linear term $\sigma'(0) \cdot x$ controls the initial signal propagation; the cubic correction $-x^3/3$ for $\tanh$ produces the saturation nonlinearity. Understanding which terms matter at initialization informs weight-initialization schemes.
Automatic differentiation. Forward-mode AD computes the Jacobian-vector product $J_f(x) v$ in a single pass; this is the directional derivative and corresponds to evaluating the first-order Taylor coefficient. Higher-order AD produces higher Taylor coefficients efficiently by propagating dual numbers or hyperdual numbers.
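Forward-mode AD with dual numbers takes only a few lines; a minimal sketch (the `Dual` class here is illustrative, not any particular library's API):

```python
class Dual:
    """Dual number a + b*eps with eps^2 = 0: carries a first-order
    Taylor coefficient (a directional derivative) through arithmetic."""
    def __init__(self, val, dot=0.0):
        self.val, self.dot = val, dot

    def __add__(self, o):
        o = o if isinstance(o, Dual) else Dual(o)
        return Dual(self.val + o.val, self.dot + o.dot)
    __radd__ = __add__

    def __mul__(self, o):
        o = o if isinstance(o, Dual) else Dual(o)
        # The product rule falls out of (a + b eps)(c + d eps).
        return Dual(self.val * o.val, self.val * o.dot + self.dot * o.val)
    __rmul__ = __mul__

def poly(x):
    return 3 * x * x + 2 * x + 1  # derivative: 6x + 2

y = poly(Dual(2.0, 1.0))  # seeding dot = 1.0 selects the d/dx direction
print(y.val, y.dot)       # value 17.0 and derivative 14.0
```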