

Every physicist knows that for small angles, $\sin(\theta) \approx \theta$.

If you are analyzing the motion of a pendulum, you replace $\sin(\theta)$ with $\theta$ and suddenly the differential equation becomes solvable. Engineers do the same thing when they linearize around an operating point. Astronomers use it for small angular corrections. It is everywhere.

But wait. $\sin(\theta)$ is a transcendental function - it is defined via the geometry of circles, it oscillates between -1 and 1, it has infinitely complex structure. How does a straight line $\theta$ approximate it? Why does this work at all? And how accurate is it, exactly?

If you have never thought carefully about this, it probably feels like an engineering trick rather than a mathematical fact. It is not a trick. It is an instance of a profound idea: if you know a smooth function’s value and all its derivatives at a single point, you can usually reconstruct the function everywhere nearby - and for many functions, everywhere, period. The approximation $\sin(\theta) \approx \theta$ is the first-order Taylor approximation. The full Taylor series is the complete reconstruction.

This post builds that idea from scratch. By the end, you will understand not just the formula, but why the coefficients have factorials in them, why the series sometimes fails to represent the function everywhere, and why the entire framework is the foundation of scientific computing, optimization theory, and neural network analysis.


Section 1: The Problem of Local Reconstruction

A smooth function is one that is infinitely differentiable - you can take the derivative as many times as you like and always get another well-defined function. $\sin x$, $e^x$, and polynomials are smooth; $|x|$ is not (its graph has a corner at 0, so it is not even differentiable there).

Suppose someone hands you a smooth function $f$ and tells you only its value at $x = 0$: $f(0) = 3$. You know nothing else. What is $f(0.01)$?

You cannot answer. The function could do anything.

Now they also tell you $f'(0) = 2$. Now you know the function’s value and slope at $x = 0$. The tangent line is $y = 3 + 2x$. Near $x = 0$, the function approximately follows this line. So $f(0.01) \approx 3 + 2(0.01) = 3.02$. Better, but still an approximation - you have no information about how the function curves.

They also tell you $f''(0) = -6$. Now you know the value, slope, and curvature at $x = 0$. You can build a quadratic that matches all three.
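As a quick worked instance (borrowing the factor of $1/2$ on the curvature term, which is derived in Section 2), the quadratic that matches these three numbers is:

$$P_2(x) = 3 + 2x + \frac{-6}{2}x^2 = 3 + 2x - 3x^2,$$

so $P_2(0.01) = 3 + 0.02 - 0.0003 = 3.0197$, a small downward correction to the tangent-line estimate $3.02$.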

This is the pattern. Each additional derivative gives you one more piece of local information, and you can incorporate it into a better approximating polynomial. The question is: if you know all the derivatives at $x = 0$ - infinitely many of them - can you exactly reconstruct $f$?

For most functions you will ever meet, the answer is yes. The Taylor series is the reconstruction.


Section 2: Building the Approximation Step by Step

Let $f$ be a smooth function. We want to build a polynomial $P(x)$ that matches $f$ at $x = a$ as closely as possible.

Degree 0: match the value.

The simplest approximation: $P_0(x) = f(a)$. This is a constant - it matches $f$ at exactly one point and knows nothing about what happens nearby.

Degree 1: match the value and slope.

We want $P_1(x) = c_0 + c_1(x - a)$ such that $P_1(a) = f(a)$ and $P_1'(a) = f'(a)$.

From the first condition: $c_0 = f(a)$.

From the second: $P_1'(x) = c_1$, so $c_1 = f'(a)$.

Therefore:

$$P_1(x) = f(a) + f'(a)(x - a).$$

This is the tangent line. The linearization we developed in the derivatives post.

Degree 2: match value, slope, and curvature.

We want $P_2(x) = c_0 + c_1(x-a) + c_2(x-a)^2$ with $P_2(a) = f(a)$, $P_2'(a) = f'(a)$, $P_2''(a) = f''(a)$.

From $P_2(a) = f(a)$: $c_0 = f(a)$.

$P_2'(x) = c_1 + 2c_2(x-a)$, so $P_2'(a) = c_1 = f'(a)$.

$P_2''(x) = 2c_2$, so $P_2''(a) = 2c_2 = f''(a)$, giving $c_2 = f''(a)/2$.

Therefore:

$$P_2(x) = f(a) + f'(a)(x-a) + \frac{f''(a)}{2}(x-a)^2.$$

Discomfort check. Why is there a $2$ in the denominator? Not because of a choice - we are forced into it. We need the second derivative of $P_2$ at $x = a$ to equal $f''(a)$. The second derivative of $c_2(x-a)^2$ is $2c_2$. To make $2c_2 = f''(a)$, we must have $c_2 = f''(a)/2$. The factorial in the denominator is not decoration. It is forced by the algebra of differentiation.

Degree $n$: the general pattern.

Continuing, we want $P_n(x) = \sum_{k=0}^{n} c_k (x-a)^k$ with $P_n^{(k)}(a) = f^{(k)}(a)$ for every $k = 0, 1, \ldots, n$.

Differentiating $c_k(x-a)^k$ exactly $k$ times gives $k! \cdot c_k$ at $x = a$ (everything else vanishes). So:

$$P_n^{(k)}(a) = k! \cdot c_k = f^{(k)}(a) \implies c_k = \frac{f^{(k)}(a)}{k!}.$$

The degree-$n$ Taylor polynomial of $f$ centered at $a$:

$$P_n(x) = \sum_{k=0}^{n} \frac{f^{(k)}(a)}{k!}(x - a)^k.$$

This polynomial is uniquely determined by the requirement that its value and first $n$ derivatives match those of $f$ at $x = a$. No freedom, no choices. The coefficients are completely forced.

Discomfort check. We claimed $c_k = f^{(k)}(a)/k!$ because differentiating $c_k(x-a)^k$ $k$ times at $x = a$ gives $k! c_k$. Let us verify for $k = 3$. The term is $c_3(x-a)^3$. First derivative: $3c_3(x-a)^2$. Second: $6c_3(x-a)$. Third: $6c_3 = 3! c_3$. At $x = a$, the $(x-a)$ factors all vanish. Yes: the $k$-th derivative of $(x-a)^k$ at $x = a$ is exactly $k!$. The factorials come from the power rule applied repeatedly.
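The formula $c_k = f^{(k)}(a)/k!$ translates almost directly into code. The following is a minimal sketch, not a library routine: the function name `taylor_polynomial` and the convention that `derivs[k]` holds $f^{(k)}(a)$ are assumptions made for illustration. It reuses the numbers from Section 1 ($f(0) = 3$, $f'(0) = 2$, $f''(0) = -6$).

```python
from math import factorial

def taylor_polynomial(derivs, a, x):
    """Evaluate sum_k derivs[k] / k! * (x - a)**k.

    Assumption: derivs[k] holds f^(k)(a); the list length sets the degree.
    """
    return sum(d / factorial(k) * (x - a) ** k for k, d in enumerate(derivs))

# Running example from Section 1: f(0) = 3, f'(0) = 2, f''(0) = -6.
derivs_at_0 = [3.0, 2.0, -6.0]
for degree in range(3):
    approx = taylor_polynomial(derivs_at_0[: degree + 1], 0.0, 0.01)
    print(f"degree {degree}: f(0.01) is approximately {approx}")
```

Each extra derivative in the list adds one term, so the degree-0, degree-1, and degree-2 estimates of $f(0.01)$ can be compared side by side.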


Section 3: The Taylor Series

If $f$ is infinitely differentiable at $a$, we can take the approximation all the way - infinitely many terms:

$$f(x) = \sum_{n=0}^{\infty} \frac{f^{(n)}(a)}{n!}(x - a)^n$$

$$= f(a) + f'(a)(x-a) + \frac{f''(a)}{2!}(x-a)^2 + \frac{f'''(a)}{3!}(x-a)^3 + \cdots$$

This is the Taylor series of $f$ centered at $a$.

The special case $a = 0$ is called the Maclaurin series:

$$f(x) = \sum_{n=0}^{\infty} \frac{f^{(n)}(0)}{n!} x^n = f(0) + f'(0)x + \frac{f''(0)}{2!}x^2 + \cdots$$

Two immediate warnings. First: the Taylor series always exists for infinitely differentiable functions. Writing it down requires no convergence. Second: the Taylor series may or may not converge. And even if it converges, it may or may not converge to $f$. We will return to both points. For now, compute.


Section 4: The Three Fundamental Maclaurin Series

These are the most important Taylor series you will ever meet. Derive them, do not just memorize them.

The exponential function $e^x$.

We need all derivatives of $e^x$ at $x = 0$. Since $(e^x)' = e^x$, every derivative is $e^x$. At $x = 0$: $f^{(n)}(0) = e^0 = 1$ for all $n$.

$$e^x = \sum_{n=0}^{\infty} \frac{1}{n!} x^n = 1 + x + \frac{x^2}{2!} + \frac{x^3}{3!} + \frac{x^4}{4!} + \cdots$$

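We showed in the sequences post that this series converges for all $x$ (ratio test gives $\rho = 0$). Convergence alone does not prove that the sum is $e^x$; that follows from the remainder bound in Section 6: every derivative of $e^x$ is at most $e^{|x|}$ between $0$ and $x$, so the error of the degree-$n$ polynomial is at most $e^{|x|}|x|^{n+1}/(n+1)!$, which tends to zero. So $e^x$ equals its Taylor series everywhere on $\mathbb{R}$.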

One remarkable check: differentiate the series term by term:

$$\frac{d}{dx}\sum_{n=0}^{\infty} \frac{x^n}{n!} = \sum_{n=1}^{\infty} \frac{n x^{n-1}}{n!} = \sum_{n=1}^{\infty} \frac{x^{n-1}}{(n-1)!} = \sum_{m=0}^{\infty} \frac{x^m}{m!}.$$

The derivative of the series is the series itself. The exponential is its own derivative, confirmed at the level of power series. This is not circular reasoning - we proved convergence independently and can differentiate term by term on the interval of convergence.

The sine function $\sin(x)$.

Derivatives of $\sin(x)$: $\sin'(x) = \cos(x)$, $\sin''(x) = -\sin(x)$, $\sin'''(x) = -\cos(x)$, $\sin^{(4)}(x) = \sin(x)$, and the cycle repeats with period 4.

At $x = 0$: $\sin(0) = 0$, $\cos(0) = 1$, $-\sin(0) = 0$, $-\cos(0) = -1$, $\sin(0) = 0$, $\ldots$

The pattern: $0, 1, 0, -1, 0, 1, 0, -1, \ldots$ The even derivatives vanish. The odd derivatives at 0 alternate between $+1$ and $-1$.

$$\sin(x) = x - \frac{x^3}{3!} + \frac{x^5}{5!} - \frac{x^7}{7!} + \cdots = \sum_{n=0}^{\infty} \frac{(-1)^n}{(2n+1)!} x^{2n+1}.$$

This is a purely odd-power series (no even powers), which makes sense: $\sin(x)$ is an odd function, $\sin(-x) = -\sin(x)$.

From this: $\sin(\theta) \approx \theta - \frac{\theta^3}{6}$ for small $\theta$. The leading term $\theta$ is the small-angle approximation. The next correction $-\theta^3/6$ tells you how the approximation degrades as $\theta$ grows. For $\theta = 0.1$ radians: $\sin(0.1) \approx 0.1 - 0.1^3/6 \approx 0.0998333$. Actual value: $0.0998334$. Agreement to six decimal places, from two terms.

The cosine function $\cos(x)$.

By the same analysis: $\cos(0) = 1$, $-\sin(0) = 0$, $-\cos(0) = -1$, $\sin(0) = 0$, $\ldots$

Pattern: $1, 0, -1, 0, 1, 0, \ldots$ Only even powers survive.

$$\cos(x) = 1 - \frac{x^2}{2!} + \frac{x^4}{4!} - \frac{x^6}{6!} + \cdots = \sum_{n=0}^{\infty} \frac{(-1)^n}{(2n)!} x^{2n}.$$

This is an even-power series, matching the fact that $\cos$ is an even function.

Notice: differentiating the sine series term by term gives the cosine series, and differentiating the cosine series gives minus the sine series. The series is consistent with the derivative relationships, as it must be.
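The three series can be checked numerically by summing a finite number of terms. This is a small sketch using only the standard library; the function names `exp_series`, `sin_series`, and `cos_series` are illustrative, and the sums are written for clarity rather than speed or numerical robustness.

```python
import math

def exp_series(x, n_terms):
    """Partial sum of sum_{n>=0} x^n / n!, using n_terms terms."""
    return sum(x ** n / math.factorial(n) for n in range(n_terms))

def sin_series(x, n_terms):
    """Partial sum of sum_{n>=0} (-1)^n x^(2n+1) / (2n+1)!."""
    return sum((-1) ** n * x ** (2 * n + 1) / math.factorial(2 * n + 1)
               for n in range(n_terms))

def cos_series(x, n_terms):
    """Partial sum of sum_{n>=0} (-1)^n x^(2n) / (2n)!."""
    return sum((-1) ** n * x ** (2 * n) / math.factorial(2 * n)
               for n in range(n_terms))

x = 0.1
print("sin, two terms:", sin_series(x, 2), "library:", math.sin(x))
print("cos, two terms:", cos_series(x, 2), "library:", math.cos(x))
print("exp, five terms:", exp_series(x, 5), "library:", math.exp(x))
```

Increasing `n_terms` shows how quickly the partial sums settle onto the library values for small $x$.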


Section 5: Euler’s Formula - The Crown Jewel

Here is the point where the Taylor series reveals something that could not be seen from the definitions alone.

What happens when we compute $e^{ix}$, where $i = \sqrt{-1}$?

We use the exponential series, replacing $x$ with $ix$:

$$e^{ix} = \sum_{n=0}^{\infty} \frac{(ix)^n}{n!} = 1 + ix + \frac{(ix)^2}{2!} + \frac{(ix)^3}{3!} + \frac{(ix)^4}{4!} + \cdots$$

Compute the powers of $i$: $i^0 = 1$, $i^1 = i$, $i^2 = -1$, $i^3 = -i$, $i^4 = 1$, and the cycle repeats.

$$e^{ix} = 1 + ix - \frac{x^2}{2!} - \frac{ix^3}{3!} + \frac{x^4}{4!} + \frac{ix^5}{5!} - \cdots$$

Separate real and imaginary parts:

$$e^{ix} = \left(1 - \frac{x^2}{2!} + \frac{x^4}{4!} - \cdots\right) + i\left(x - \frac{x^3}{3!} + \frac{x^5}{5!} - \cdots\right)$$

$$e^{ix} = \cos(x) + i\sin(x).$$

This is Euler’s formula. Setting $x = \pi$:

$$e^{i\pi} = \cos(\pi) + i\sin(\pi) = -1 + 0 = -1.$$

So $e^{i\pi} + 1 = 0$. This single equation connects the five most fundamental constants in mathematics: $e$, $i$, $\pi$, $1$, and $0$.

The connection between the exponential function and the trigonometric functions - which seem to have completely different origins, one from growth/decay and the other from geometry - is revealed through their Taylor series. The series are not just computational tools. They expose structure that is otherwise invisible.
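Euler's formula can also be checked numerically by feeding a purely imaginary argument into the exponential series. A minimal sketch, assuming Python's built-in complex numbers; `exp_series_complex` is a name chosen here for illustration, and 30 terms is an arbitrary cutoff that is ample for $|z|$ of order $\pi$.

```python
import math

def exp_series_complex(z, n_terms=30):
    """Partial sum of sum_{n>=0} z^n / n! for a complex argument z."""
    total = 0j
    term = 1 + 0j  # z^0 / 0!
    for n in range(n_terms):
        total += term
        term *= z / (n + 1)  # turns z^n / n! into z^(n+1) / (n+1)!
    return total

x = math.pi
lhs = exp_series_complex(1j * x)            # series for e^{ix}
rhs = complex(math.cos(x), math.sin(x))     # cos(x) + i sin(x)
print(lhs, rhs)
```

With $x = \pi$ both sides should agree with $-1$ up to floating-point rounding.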


Section 6: Taylor’s Theorem - How Good Is the Approximation?

The Taylor polynomial $P_n(x)$ approximates $f(x)$ near $x = a$. But how close is the approximation? This is not a vague question - it has a precise quantitative answer.

Theorem (Taylor’s Theorem with Lagrange Remainder). If $f$ is $(n+1)$ times continuously differentiable on an interval containing $a$ and $x$, then:

$$f(x) = P_n(x) + R_n(x)$$

where the remainder $R_n(x)$ satisfies:

$$R_n(x) = \frac{f^{(n+1)}(c)}{(n+1)!}(x-a)^{n+1}$$

for some $c$ strictly between $a$ and $x$.

We cannot find $c$ explicitly (it is guaranteed to exist by a generalized mean value theorem). But we do not need to. We only need to bound $|f^{(n+1)}|$. If $|f^{(n+1)}(t)| \leq M$ for all $t$ between $a$ and $x$, then:

$$|R_n(x)| \leq \frac{M}{(n+1)!} |x - a|^{n+1}.$$

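This is the key. For fixed $x$, the bound shrinks like $|x - a|^{n+1}/(n+1)!$ as $n$ increases, provided the derivative bound $M$ does not grow too quickly with $n$. For nice functions such as $e^x$, $\sin$, and $\cos$, this goes to zero - and the Taylor series converges to $f$.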

The small-angle approximation. Apply to $\sin(\theta) \approx \theta$.

This is the degree-1 Taylor approximation at $a = 0$. The remainder:

$$|\sin(\theta) - \theta| = |R_1(\theta)| \leq \frac{|\sin''(c)|}{2!}|\theta|^2 = \frac{|\sin c|}{2} |\theta|^2 \leq \frac{|\theta|^2}{2}.$$

For $\theta = 0.1$ radians: error $\leq 0.01/2 = 0.005$. Actual error: $|0.09983 - 0.1| \approx 0.0002$. The bound is conservative, but it is a rigorous guarantee.

For $\theta = 0.3$ radians: error $\leq 0.09/2 = 0.045$. Actual: $|\sin(0.3) - 0.3| \approx |0.29552 - 0.3| \approx 0.0045$. Still conservative.

The approximation $\sin(\theta) \approx \theta$ is good to within $\theta^2/2$. This quantifies “good for small $\theta$” - and makes the approximation a rigorous tool, not just a hand-wave.

Improving accuracy. The degree-3 approximation: $\sin(\theta) \approx \theta - \theta^3/6$.

The remainder is now bounded by $|\theta|^5/120$. (The degree-4 coefficient of $\sin$ vanishes, so $P_3 = P_4$ and we may use the bound for $R_4$.) For $\theta = 0.3$: error $\leq (0.3)^5/120 \approx 0.00002$. One extra nonzero term cuts the error bound from $0.045$ to $0.00002$. Taylor series converge fast when $|x - a|$ is small.
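The bounds above can be compared against the true errors directly. This sketch assumes only that every derivative of $\sin$ is bounded by 1 in absolute value; the helper names are illustrative. Because the even-degree coefficients of $\sin$ vanish, the degree-3 polynomial equals the degree-4 one, so the code uses the $n = 4$ bound $|\theta|^5/5!$ for it.

```python
import math

def sin_taylor(theta, degree):
    """Taylor polynomial of sin at 0, keeping powers up to `degree`."""
    return sum((-1) ** n * theta ** (2 * n + 1) / math.factorial(2 * n + 1)
               for n in range((degree + 1) // 2))

def lagrange_bound(theta, n):
    """Bound |R_n(theta)| <= |theta|^(n+1) / (n+1)!, using |sin^(k)| <= 1."""
    return abs(theta) ** (n + 1) / math.factorial(n + 1)

for theta in (0.1, 0.3):
    err1 = abs(math.sin(theta) - sin_taylor(theta, 1))
    err3 = abs(math.sin(theta) - sin_taylor(theta, 3))
    print(f"theta={theta}: degree 1 error {err1:.2e} (bound {lagrange_bound(theta, 1):.2e}), "
          f"degree 3 error {err3:.2e} (bound {lagrange_bound(theta, 4):.2e})")
```

In every case the measured error should sit below the corresponding bound, which is all the theorem promises.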


Section 7: Computing the Standard Series

Let us work out two more important series.

$\ln(1+x)$.

Derivatives of $\ln(1+x)$: $f(x) = \ln(1+x)$, $f'(x) = 1/(1+x)$, $f''(x) = -1/(1+x)^2$, $f'''(x) = 2/(1+x)^3$, $f^{(n)}(x) = (-1)^{n-1}(n-1)!/(1+x)^n$ for $n \geq 1$.

At $x = 0$: $f^{(n)}(0) = (-1)^{n-1}(n-1)!$ for $n \geq 1$, and $f(0) = 0$.

$$c_n = \frac{f^{(n)}(0)}{n!} = \frac{(-1)^{n-1}(n-1)!}{n!} = \frac{(-1)^{n-1}}{n}.$$

$$\ln(1+x) = x - \frac{x^2}{2} + \frac{x^3}{3} - \frac{x^4}{4} + \cdots = \sum_{n=1}^{\infty} \frac{(-1)^{n-1}}{n} x^n.$$

By the ratio test, the radius of convergence is 1. At $x = 1$, this becomes the alternating harmonic series, which converges to $\ln 2$ (confirming a result from the series post). At $x = -1$, it becomes $-\sum 1/n$, which diverges.

$\frac{1}{1-x}$.

This is the geometric series: $\sum_{n=0}^{\infty} x^n$. Differentiating:

$$\frac{1}{(1-x)^2} = \sum_{n=1}^{\infty} n x^{n-1}.$$

Differentiating again:

$$\frac{2}{(1-x)^3} = \sum_{n=2}^{\infty} n(n-1) x^{n-2}.$$

Term-by-term differentiation of power series generates new power series for related functions within the radius of convergence.


Section 8: Radius of Convergence - When Series Fails

The exponential series $\sum x^n/n!$ converges everywhere. The logarithm series $\sum (-1)^{n-1}x^n/n$ converges only for $-1 < x \leq 1$ (it converges at the endpoint $x = 1$ and diverges at $x = -1$).

Now consider $f(x) = \frac{1}{1+x^2}$. This function is defined and infinitely differentiable on all of $\mathbb{R}$ - it is smooth everywhere, with no problems. So surely its Taylor series at $x = 0$ converges everywhere?

No. The Taylor series converges only for $|x| < 1$.

To see why, use the geometric series formula with $x$ replaced by $-x^2$:

$$\frac{1}{1+x^2} = \frac{1}{1-(-x^2)} = \sum_{n=0}^{\infty} (-x^2)^n = \sum_{n=0}^{\infty} (-1)^n x^{2n},$$

valid only when $|-x^2| = x^2 < 1$, i.e., $|x| < 1$.

Why does a perfectly smooth function have a Taylor series that fails to converge for $|x| \geq 1$? The function is fine there. But the series breaks down.

The answer requires looking at the complex plane. The function $f(z) = \frac{1}{1+z^2}$ has singularities at $z = \pm i$ (the complex square roots of $-1$). These are points where the function blows up in the complex plane - not on the real line, where $1 + x^2 > 0$ always, but in the complex plane. The radius of convergence of the Taylor series at $x = 0$ is exactly the distance from 0 to the nearest singularity in the complex plane: distance from $0$ to $\pm i$ is 1. So $R = 1$.

The real function looks fine, but it has complex singularities at distance 1, and those singularities limit the radius of convergence of the real Taylor series. This is a phenomenon you cannot see without complex analysis - the real behavior of a series is constrained by complex singularities you cannot even see on the real line.
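The breakdown at $|x| = 1$ is easy to observe numerically. A minimal sketch (the function names are illustrative): inside the radius of convergence the partial sums settle onto $1/(1+x^2)$, while outside it they grow without bound even though the function itself is perfectly tame there.

```python
def series_partial_sum(x, n_terms):
    """Partial sum of sum_{n>=0} (-1)^n x^(2n), the Maclaurin series of 1/(1+x^2)."""
    return sum((-1) ** n * x ** (2 * n) for n in range(n_terms))

def f(x):
    return 1.0 / (1.0 + x * x)

for x in (0.5, 1.5):
    sums = [series_partial_sum(x, n) for n in (5, 10, 20, 40)]
    print(f"x={x}: f(x)={f(x):.6f}, partial sums={sums}")
```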

Discomfort check. This is one of the most surprising facts in analysis. A function that is smooth on all of $\mathbb{R}$ can have a Taylor series that diverges for real inputs. The explanation requires the complex plane. The real line is embedded in the complex plane, and singularities off the real line still constrain what happens on it. If you have only seen real analysis, this feels like a violation. It is not - it is a genuine constraint imposed by complex geometry. Complex analysis is the right tool to fully understand convergence of real Taylor series.


Section 9: When a Function Is Not Its Taylor Series

There is an even more disturbing possibility: the Taylor series might converge, but not to the function.

Consider the function:

$$f(x) = \begin{cases} e^{-1/x^2} & x \neq 0 \\ 0 & x = 0 \end{cases}$$

This function is infinitely differentiable everywhere, including at $x = 0$ - you can verify by induction that $f^{(n)}(0) = 0$ for every $n$. (The exponential decay crushes every polynomial as $x \to 0$.)

The Taylor series at $x = 0$ is therefore:

$$\sum_{n=0}^{\infty} \frac{0}{n!} x^n = 0.$$

The identically zero function. The Taylor series converges - to zero - everywhere.

But $f(x) = e^{-1/x^2} > 0$ for every $x \neq 0$. The function is not zero. The Taylor series, which converges everywhere, converges to the wrong function.
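A numerical look makes the flatness concrete. This is a sketch with an illustrative function name `flat`; the exponent 10 in the ratio is an arbitrary choice, meant only to show that $f(x)$ eventually shrinks faster than a fixed power of $x$ as $x \to 0$.

```python
import math

def flat(x):
    """The smooth but non-analytic function e^(-1/x^2), extended by 0 at x = 0."""
    return 0.0 if x == 0 else math.exp(-1.0 / (x * x))

for x in (1.0, 0.5, 0.2, 0.1):
    # The Taylor series at 0 sums to 0 everywhere, yet f(x) > 0 for x != 0.
    print(f"x={x}: f(x)={flat(x):.3e}, f(x)/x^10={flat(x) / x ** 10:.3e}")
```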

Discomfort check. This is the sharpest possible separation between “Taylor series exists” and “Taylor series equals the function.” A function can be infinitely differentiable, can have a Taylor series that converges everywhere, and yet the Taylor series can fail to represent it outside a single point. This means you cannot recover a smooth function from its derivatives at one point alone - unless you know the function is analytic. The Taylor series expresses local information. That local information is not always enough to reconstruct the global function.

Analytic functions. A function is called real analytic at $a$ if it equals its Taylor series in some open interval around $a$. The exponential, sine, cosine, and polynomials are analytic everywhere, and the other standard elementary functions (such as $\ln x$ and $\sqrt{x}$ for $x > 0$) are analytic on the open intervals where they are defined and smooth. The function $e^{-1/x^2}$ (extended by 0 at the origin) is smooth but not analytic at the origin. Analyticity is strictly stronger than infinite differentiability.

For analytic functions - the functions you spend most of your time with in practice - the Taylor series is a complete representation. That is why it is so powerful.


Section 10: Taylor Series in Higher Dimensions

The idea of the Taylor series extends naturally to functions of several variables. This is where it connects directly to optimization and machine learning.

Let $f: \mathbb{R}^n \to \mathbb{R}$ be a smooth function, and let $\mathbf{x}$ be a point in $\mathbb{R}^n$. Consider expanding $f(\mathbf{x} + \boldsymbol{\delta})$ around $\mathbf{x}$, where $\boldsymbol{\delta}$ is a small displacement.

First-order (linear) approximation:

$$f(\mathbf{x} + \boldsymbol{\delta}) \approx f(\mathbf{x}) + \nabla f(\mathbf{x}) \cdot \boldsymbol{\delta}.$$

Here $\nabla f(\mathbf{x})$ is the gradient - the vector of partial derivatives. This is the multivariable analogue of the tangent line. It says: the change in $f$ is approximately the dot product of the gradient with the displacement.

This is the foundation of gradient descent. The gradient tells you how $f$ changes when you move in any direction. Moving opposite to the gradient moves downhill.

Second-order (quadratic) approximation:

$$f(\mathbf{x} + \boldsymbol{\delta}) \approx f(\mathbf{x}) + \nabla f(\mathbf{x}) \cdot \boldsymbol{\delta} + \frac{1}{2} \boldsymbol{\delta}^T H(\mathbf{x}) \boldsymbol{\delta}.$$

Here $H(\mathbf{x})$ is the Hessian matrix, with entries $H_{ij} = \frac{\partial^2 f}{\partial x_i \partial x_j}$. This is the multivariable analogue of $f''$. The term $\boldsymbol{\delta}^T H \boldsymbol{\delta}$ captures how $f$ curves - whether the curvature is positive (a bowl) or negative (a hill) or mixed.

Why the factor of $1/2$? Same reason as in one dimension: matching the second derivative of the approximation to the second derivative of $f$ requires dividing by $2! = 2$.

Discomfort check. In one variable, $f''(a) > 0$ means the function is concave up at $a$, which looks like a bowl in two dimensions. In $n$ variables, the Hessian is an $n \times n$ matrix. Whether the approximation looks like a bowl depends on whether $H$ is positive definite. To understand that: the Hessian of a smooth function is symmetric, and every symmetric matrix has special directions called eigenvectors, along which the matrix acts by pure scaling. The scaling factors are the eigenvalues. For the Hessian, an eigenvalue in a given direction tells you how sharply $f$ curves in that direction - positive means curves up, negative means curves down. $H$ is positive definite when all its eigenvalues are positive, meaning $f$ curves up in every direction - a true bowl. If some eigenvalues are positive and some negative, the function curves up in some directions and down in others - a saddle point. The Hessian captures this geometry completely, and its eigenstructure determines the local shape of $f$.
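To see the first- and second-order approximations side by side, here is a small two-variable sketch. The test function $f(x, y) = e^x \cos y$ is a hypothetical example chosen because its gradient and Hessian are easy to write by hand; it does not come from the discussion above. The eigenvalue helper uses the closed form for a symmetric $2 \times 2$ matrix.

```python
import math

def f(x, y):
    return math.exp(x) * math.cos(y)

def gradient(x, y):
    return (math.exp(x) * math.cos(y), -math.exp(x) * math.sin(y))

def hessian(x, y):
    ec, es = math.exp(x) * math.cos(y), math.exp(x) * math.sin(y)
    return ((ec, -es), (-es, -ec))

def taylor_approximations(x, y, dx, dy):
    """Return (first-order, second-order) estimates of f(x + dx, y + dy)."""
    gx, gy = gradient(x, y)
    (hxx, hxy), (_, hyy) = hessian(x, y)
    first = f(x, y) + gx * dx + gy * dy
    second = first + 0.5 * (hxx * dx * dx + 2 * hxy * dx * dy + hyy * dy * dy)
    return first, second

def symmetric_eigenvalues(h):
    """Eigenvalues of a symmetric 2x2 matrix ((a, b), (b, d)), smallest first."""
    (a, b), (_, d) = h
    mean, radius = (a + d) / 2, math.hypot((a - d) / 2, b)
    return mean - radius, mean + radius

first, second = taylor_approximations(0.0, 0.0, 0.1, 0.05)
print("exact:", f(0.1, 0.05), "first-order:", first, "second-order:", second)
print("Hessian eigenvalues at the origin:", symmetric_eigenvalues(hessian(0.0, 0.0)))
```

At the origin the two eigenvalues have opposite signs, so this particular function has saddle-like curvature there: it curves up along $x$ and down along $y$.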


Section 11: Why Taylor Series Matter for ML and Optimization

This is not a detour. The Taylor series is the mathematical engine of modern optimization.

Newton’s method. To minimize $f(x)$ in one dimension, we want $f'(x) = 0$. Newton’s method applies a linear (first-order Taylor) approximation to $g(x) = f'(x)$:

$$0 \approx g(x_k) + g'(x_k)(x_{k+1} - x_k) = f'(x_k) + f''(x_k)(x_{k+1} - x_k).$$

Solving for $x_{k+1}$:

$$x_{k+1} = x_k - \frac{f'(x_k)}{f''(x_k)}.$$

In multiple dimensions, the Newton step for minimizing $f$ is:

$$\mathbf{x}_{k+1} = \mathbf{x}_k - H(\mathbf{x}_k)^{-1} \nabla f(\mathbf{x}_k).$$

This is a second-order method: it uses both the gradient (first-order information) and the Hessian (second-order information). The second-order Taylor expansion is the foundation. Newton’s method has quadratic convergence near a minimum with positive definite Hessian - the error is roughly squared at each step - precisely because the quadratic approximation is exact to second order.
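The one-dimensional update is short enough to run by hand. The sketch below applies it to a hypothetical test problem, $f(x) = e^x - 2x$, whose minimum sits at $x = \ln 2$; the function name `newton_minimize` and the starting point are illustrative choices, and no safeguards (line search, checks on the sign of $f''$) are included.

```python
import math

def newton_minimize(df, d2f, x0, steps):
    """Iterate x <- x - f'(x) / f''(x); returns the list of iterates."""
    xs = [x0]
    for _ in range(steps):
        x = xs[-1]
        xs.append(x - df(x) / d2f(x))
    return xs

# Hypothetical test problem: f(x) = e^x - 2x, so f'(x) = e^x - 2 and f''(x) = e^x.
df = lambda x: math.exp(x) - 2.0
d2f = lambda x: math.exp(x)

iterates = newton_minimize(df, d2f, x0=0.0, steps=6)
for k, x in enumerate(iterates):
    print(k, x, abs(x - math.log(2.0)))
```

Printing the error at each step shows the number of correct digits roughly doubling once the iterates are close to the minimum.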

Gradient descent is first-order. Standard gradient descent uses only $\nabla f$ - it is a first-order method. It ignores the curvature information in $H$. When the curvature is highly non-uniform (some directions very curved, others almost flat), gradient descent is inefficient - it has to use a tiny step size everywhere to avoid overshooting in the highly curved directions.

Second-order methods like Newton’s method, and quasi-Newton methods like L-BFGS (Limited-memory BFGS - an algorithm that approximates the inverse Hessian from recent gradient history, without ever computing or storing the full $n \times n$ matrix), incorporate curvature information and adapt their step size to the local shape of $f$: larger steps in flat directions, smaller steps in curved ones. They are faster in terms of iterations, at the cost of the extra computation needed to track or approximate $H$.
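The contrast shows up even on a toy problem. The sketch below uses a hypothetical quadratic, $f(x, y) = \tfrac{1}{2}(x^2 + 100y^2)$, whose Hessian is $\mathrm{diag}(1, 100)$; the learning rate $0.015$ is an assumption chosen to stay below the stability limit $2/100$ set by the steep direction. It is an illustration of the step-size argument, not a benchmark.

```python
def grad(p):
    """Gradient of f(x, y) = 0.5 * (x^2 + 100 y^2)."""
    x, y = p
    return (x, 100.0 * y)

def gradient_descent(p, lr, steps):
    for _ in range(steps):
        g = grad(p)
        p = (p[0] - lr * g[0], p[1] - lr * g[1])
    return p

def newton_step(p):
    g = grad(p)
    # The Hessian is diag(1, 100), so applying H^{-1} divides componentwise.
    return (p[0] - g[0] / 1.0, p[1] - g[1] / 100.0)

start = (1.0, 1.0)
print("gradient descent, 100 steps:", gradient_descent(start, lr=0.015, steps=100))
print("Newton, 1 step:", newton_step(start))
```

Gradient descent wipes out the steep $y$ component quickly but crawls along the flat $x$ direction, while the Newton step rescales each direction by its curvature and lands on the minimum of this quadratic at once.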

Loss functions near optima. Near a local minimum $\mathbf{w}^{\ast}$ of a smooth loss $L(\mathbf{w})$, the gradient vanishes: $\nabla L(\mathbf{w}^{\ast}) = 0$. The second-order Taylor expansion gives:

$$L(\mathbf{w}^{\ast} + \boldsymbol{\delta}) \approx L(\mathbf{w}^{\ast}) + \frac{1}{2} \boldsymbol{\delta}^T H(\mathbf{w}^{\ast}) \boldsymbol{\delta}.$$

The behavior of the loss near the minimum is entirely controlled by the Hessian. If $H$ is positive definite, the minimum is a true bowl - nearby perturbed models are worse. The eigenvalues of $H$ determine how quickly the loss grows in each direction. Analysis of generalization, sharpness of minima, and stability of trained models all come back to this quadratic approximation.

Softmax and the exponential. The softmax function $\text{softmax}(z_i) = e^{z_i} / \sum_j e^{z_j}$ uses the exponential, which equals its Taylor series everywhere. The cross-entropy loss $-\log(\text{softmax})$ involves the logarithm, which is analytic on $(0, \infty)$: around each $a > 0$ it equals its Taylor series for $0 < x < 2a$. Analyzing the behavior of these functions for small logit perturbations uses Taylor expansion. Initializing networks so that activations are in the linear regime of the activation function is a first-order Taylor consideration.

Neural network activations. The sigmoid $\sigma(x) = 1/(1+e^{-x})$ has the Taylor expansion near $x = 0$:

$$\sigma(x) = \frac{1}{2} + \frac{x}{4} - \frac{x^3}{48} + \cdots$$

The linear coefficient $1/4 = \sigma'(0)$ is the initial signal-propagation rate. The cubic correction $-x^3/48$ produces the saturation nonlinearity. Weight initialization schemes (like Glorot initialization) are designed so that, at the start of training, neuron inputs are near 0 and the sigmoid is operating in its approximately linear regime - the first-order Taylor approximation is valid.
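The size of the "approximately linear regime" can be read off numerically. A minimal sketch with illustrative function names, comparing the sigmoid with its linear and cubic Taylor approximations at a few sample inputs:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_linear(x):
    return 0.5 + x / 4.0

def sigmoid_cubic(x):
    return 0.5 + x / 4.0 - x ** 3 / 48.0

for x in (0.1, 0.5, 1.0, 3.0):
    print(f"x={x}: sigmoid={sigmoid(x):.6f}, "
          f"linear={sigmoid_linear(x):.6f}, cubic={sigmoid_cubic(x):.6f}")
```

Near zero all three agree closely; by $x = 3$ the linear approximation exceeds 1, which the true sigmoid never does.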


Section 12: Using Taylor Series to Compute

Taylor series are not just for analysis - they are how transcendental functions are actually computed in software.

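Computing $\sin(x)$ in hardware. A CPU cannot evaluate $\sin$ analytically. Implementations use a polynomial approximation on a small interval; the Taylor polynomial is the simplest such choice (production libraries typically tune the coefficients further):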

$$\sin(x) \approx x - \frac{x^3}{6} + \frac{x^5}{120} - \frac{x^7}{5040}.$$

The Lagrange remainder tells us the error is bounded by $|x|^9/9!$. For $|x| \leq \pi/4 \approx 0.785$, the error is at most $(0.785)^9/362880 \approx 3 \times 10^{-7}$, comparable to single-precision rounding error (about $10^{-7}$); one more term lowers the bound to $|x|^{11}/11! \approx 2 \times 10^{-9}$. Range reduction (using trigonometric identities to bring $x$ into $[-\pi/4, \pi/4]$) makes this work for all inputs.
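The degree-7 polynomial and its remainder bound can be checked on the reduced interval. This sketch evaluates the polynomial in Horner form (an implementation choice made here, not something the text prescribes), omits range reduction entirely, and measures the worst error on a uniform grid over $[-\pi/4, \pi/4]$; `sin_poly7` is an illustrative name.

```python
import math

def sin_poly7(x):
    """Degree-7 Taylor polynomial of sin, evaluated in Horner form in x^2."""
    x2 = x * x
    return x * (1.0 + x2 * (-1.0 / 6.0 + x2 * (1.0 / 120.0 + x2 * (-1.0 / 5040.0))))

# Lagrange bound |x|^9 / 9! at the edge of the interval.
bound = (math.pi / 4) ** 9 / math.factorial(9)

# Uniform grid on [-pi/4, pi/4]; range reduction is not implemented here.
grid = [-math.pi / 4 + k * (math.pi / 2) / 1000 for k in range(1001)]
max_err = max(abs(sin_poly7(x) - math.sin(x)) for x in grid)
print("max error on grid:", max_err, "Lagrange bound:", bound)
```

The measured maximum error should come out just under the bound, with the worst case at the ends of the interval.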

Numerical stability. Computing $e^x - 1$ for small $x$ is numerically dangerous if done naively: $e^x \approx 1$ for small $x$, and $e^x - 1$ involves catastrophic cancellation (subtracting nearly equal floating-point numbers). The Taylor series gives:

$$e^x - 1 = x + \frac{x^2}{2} + \frac{x^3}{6} + \cdots$$

Computing this sum directly avoids cancellation. This is the idea behind the expm1 function provided by standard numerical libraries.
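The cancellation is visible in double precision. The sketch below compares the naive formula, a short Taylor sum, and Python's `math.expm1`; the helper `expm1_taylor` and the choice of six terms are illustrative, and the Taylor sum is only intended for small $|x|$.

```python
import math

def expm1_taylor(x, n_terms=6):
    """x + x^2/2! + ... using n_terms terms; intended only for small |x|."""
    return sum(x ** n / math.factorial(n) for n in range(1, n_terms + 1))

x = 1e-12
naive = math.exp(x) - 1.0       # subtracts two nearly equal numbers
series = expm1_taylor(x)        # no subtraction, so no cancellation
library = math.expm1(x)

print("naive  :", naive)
print("series :", series)
print("library:", library)
print("relative error of naive:", abs(naive - library) / library)
```

The series and the library value agree to nearly full precision, while the naive result keeps only a few correct significant digits.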

The alternating harmonic series and $\ln 2$. From the Taylor series for $\ln(1+x)$ at $x = 1$:

$$\ln 2 = 1 - \frac{1}{2} + \frac{1}{3} - \frac{1}{4} + \cdots$$

This converges, but slowly (error bounded by $1/(N+1)$ after $N$ terms). To compute $\ln 2$ to 10 decimal places using this series, you need billions of terms. Better approaches exist, but this establishes the connection between the series and the value.
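The slow convergence is easy to measure. A minimal sketch (the function name is illustrative) that sums the first $N$ terms and compares the error against the $1/(N+1)$ bound:

```python
import math

def alternating_harmonic(n_terms):
    """Partial sum 1 - 1/2 + 1/3 - ... with n_terms terms."""
    return sum((-1) ** (n - 1) / n for n in range(1, n_terms + 1))

for n_terms in (10, 1000, 100000):
    err = abs(alternating_harmonic(n_terms) - math.log(2.0))
    print(f"N={n_terms}: error={err:.3e}, bound 1/(N+1)={1.0 / (n_terms + 1):.3e}")
```

Each factor of 100 in the number of terms buys only about two more correct decimal places.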


Summary

| Concept | Definition / Formula | Key insight |
| --- | --- | --- |
| Degree-$n$ Taylor polynomial | $P_n(x) = \sum_{k=0}^{n} \frac{f^{(k)}(a)}{k!}(x-a)^k$ | Matches all derivatives of $f$ up to order $n$ at $a$ |
| Factorial in denominator | $c_k = f^{(k)}(a)/k!$ forced by differentiation | Not a choice - forced by the power rule |
| Taylor series | $\sum_{n=0}^{\infty} \frac{f^{(n)}(a)}{n!}(x-a)^n$ | Infinite-degree approximation |
| Maclaurin series | Taylor series at $a = 0$ | Simplest center |
| $e^x$ series | $\sum x^n/n!$, converges for all $x$ | Its own derivative; converges everywhere |
| $\sin(x)$ series | $\sum (-1)^n x^{2n+1}/(2n+1)!$ | Odd powers only; gives small-angle approx |
| $\cos(x)$ series | $\sum (-1)^n x^{2n}/(2n)!$ | Even powers only |
| Euler’s formula | $e^{ix} = \cos(x) + i\sin(x)$ | From comparing $e^{ix}$ series with $\sin$ and $\cos$ |
| Lagrange remainder | $R_n(x) = \frac{f^{(n+1)}(c)}{(n+1)!}(x-a)^{n+1}$ | Quantifies approximation error |
| Radius of convergence | Limited by nearest complex singularity | Even smooth real functions can fail |
| Non-analytic smooth function | $e^{-1/x^2}$: Taylor series is identically 0 | Taylor series can converge to wrong function |
| Second-order Taylor | $f(\mathbf{x}+\boldsymbol{\delta}) \approx f(\mathbf{x}) + \nabla f \cdot \boldsymbol{\delta} + \frac{1}{2}\boldsymbol{\delta}^T H \boldsymbol{\delta}$ | Foundation of Newton’s method and curvature analysis |

The Taylor series is the tool that makes smooth functions computable, analyzable, and comparable. It says: near any point, a smooth function is a polynomial (approximately). Polynomials are simple. Almost everything in applied mathematics exploits this.

