Integration - Summing Infinitely Many Infinitely Small Pieces
Helpful context:
- Limits & Continuity - What a Function Intends to Do
- Derivatives - The Geometry of Instantaneous Change
You can find the area of a rectangle: width times height. You can find the area of a triangle: half base times height. You can find the area of a circle: $\pi r^2$.
But now someone draws the curve $y = x^2$ and asks: what is the area of the region below this curve, above the $x$-axis, between $x = 0$ and $x = 1$?
The region has no name. No formula in your toolkit applies directly. It is not a triangle - its boundary is curved. It is not a trapezoid. It is not anything you have seen before.
And yet - you can see that the region clearly has some area. It is bounded. It sits there, finite, on the page. The question is not whether the area exists but how to compute it.
This is the area problem, and it is one of the oldest problems in mathematics. Archimedes struggled with it for specific curves around 250 BCE. The answer required waiting nearly two thousand years for Newton and Leibniz to invent calculus - and the answer, when it came, connected areas to slopes in a way nobody expected.
This post builds the theory of integration from scratch: first the intuition of approximation, then the formal definition, then the shocking theorem that ties it all together.
Integration is accumulation. The area problem is the cleanest geometric example, but it is not the only one. Any time a quantity builds up continuously - distance from a varying velocity, work from a varying force, probability from a density function - the underlying operation is the same: you are adding up infinitely many infinitely small contributions. The integral is the mathematical tool that makes this precise. Area under a curve is the simplest case to visualize: imagine the region between the curve and the $x$-axis as being made of infinitely many vertical strips, each with width $dx$ and height $f(x)$. The integral $\int_a^b f(x) dx$ is the total area of all those strips, added up. When you encounter integration in probability (total probability must be 1) or physics (total work done by a force) or information theory (entropy), that is the same idea wearing different clothes.
Section 1: Approximating with Rectangles
If you cannot compute the exact area under $y = x^2$, you can approximate it. And if the approximation improves as you make it finer, the exact area is the limit.
The idea: chop the interval $[0, 1]$ into $n$ equal subintervals of width $\Delta x = 1/n$. Over each subinterval, draw a rectangle whose height is the value of $f(x) = x^2$ at some point in the subinterval. The sum of these rectangle areas approximates the area under the curve.
For $n = 4$ subintervals of width $1/4$, the subintervals are $[0, 1/4]$, $[1/4, 1/2]$, $[1/2, 3/4]$, $[3/4, 1]$.
Left-endpoint sum. Use the left endpoint of each subinterval as the height:
$$L_4 = f(0) \cdot \frac{1}{4} + f\left(\frac{1}{4}\right) \cdot \frac{1}{4} + f\left(\frac{1}{2}\right) \cdot \frac{1}{4} + f\left(\frac{3}{4}\right) \cdot \frac{1}{4}.$$
$$= \left(0 + \frac{1}{16} + \frac{1}{4} + \frac{9}{16}\right) \cdot \frac{1}{4} = \frac{14}{16} \cdot \frac{1}{4} = \frac{14}{64} = \frac{7}{32} \approx 0.219.$$
Right-endpoint sum. Use the right endpoint:
$$R_4 = f\left(\frac{1}{4}\right) \cdot \frac{1}{4} + f\left(\frac{1}{2}\right) \cdot \frac{1}{4} + f\left(\frac{3}{4}\right) \cdot \frac{1}{4} + f(1) \cdot \frac{1}{4}.$$
$$= \left(\frac{1}{16} + \frac{1}{4} + \frac{9}{16} + 1\right) \cdot \frac{1}{4} = \frac{30}{16} \cdot \frac{1}{4} = \frac{30}{64} = \frac{15}{32} \approx 0.469.$$
Midpoint sum. Use the midpoint of each subinterval:
$$M_4 = f\left(\frac{1}{8}\right) \cdot \frac{1}{4} + f\left(\frac{3}{8}\right) \cdot \frac{1}{4} + f\left(\frac{5}{8}\right) \cdot \frac{1}{4} + f\left(\frac{7}{8}\right) \cdot \frac{1}{4}.$$
$$= \left(\frac{1}{64} + \frac{9}{64} + \frac{25}{64} + \frac{49}{64}\right) \cdot \frac{1}{4} = \frac{84}{64} \cdot \frac{1}{4} = \frac{84}{256} = \frac{21}{64} \approx 0.328.$$
The three approximations are $0.219$, $0.469$, $0.328$. They are all over the place for $n = 4$. But as $n$ increases, something happens.
For $n = 100$: $L_{100} \approx 0.328$, $R_{100} \approx 0.338$, $M_{100} \approx 0.333$. For $n = 1000$: $L_{1000} \approx 0.3328$, $R_{1000} \approx 0.3338$, $M_{1000} \approx 0.3333$.
They are converging. To $1/3$.
The exact area under $y = x^2$ from $0$ to $1$ is $\frac{1}{3}$.
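The arithmetic above is easy to reproduce. Here is a minimal Python sketch that computes the left, right, and midpoint sums for $f(x) = x^2$ on $[0, 1]$. The names `riemann_sum` and `square` are my own choices for illustration, not part of any library.

```python
def riemann_sum(f, a, b, n, rule="left"):
    """Riemann sum of f on [a, b] using n equal subintervals."""
    dx = (b - a) / n
    # Position of the sample point inside each subinterval, as a fraction of dx.
    offset = {"left": 0.0, "right": 1.0, "mid": 0.5}[rule]
    return sum(f(a + (k + offset) * dx) for k in range(n)) * dx


def square(x):
    return x * x


for n in (4, 100, 1000):
    sums = [riemann_sum(square, 0.0, 1.0, n, rule) for rule in ("left", "right", "mid")]
    print(n, [round(s, 4) for s in sums])
```

Running it with larger and larger $n$ shows the left sums creeping up and the right sums creeping down toward the same number.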
Discomfort check. Why do the left, right, and midpoint sums all converge to the same value? For a continuous function on a closed interval, any choice of sample points (not just left, right, or midpoint - any point in each subinterval) gives the same limit. This is because continuous functions cannot behave too wildly on small intervals: the variation of $f$ over a subinterval of width $\Delta x$ shrinks to zero as $\Delta x \to 0$. So the difference between the largest and smallest rectangle heights on any subinterval goes to zero, and all the different sums get squeezed together. The formal proof uses uniform continuity: a continuous function on a closed interval is uniformly continuous, so for any $\varepsilon > 0$ a single width $\delta$ works everywhere on $[a, b]$, and once every subinterval is narrower than $\delta$, $f$ varies by less than $\varepsilon$ on each one.
You can verify the limit $1/3$ exactly using the formula for the right-endpoint sum. With $n$ equal subintervals:
$$R_n = \sum_{k=1}^{n} f\left(\frac{k}{n}\right) \cdot \frac{1}{n} = \sum_{k=1}^{n} \frac{k^2}{n^2} \cdot \frac{1}{n} = \frac{1}{n^3} \sum_{k=1}^{n} k^2 = \frac{1}{n^3} \cdot \frac{n(n+1)(2n+1)}{6}.$$
As $n \to \infty$:
$$R_n = \frac{n(n+1)(2n+1)}{6n^3} = \frac{2n^3 + 3n^2 + n}{6n^3} \to \frac{2}{6} = \frac{1}{3}.$$
The area is exactly $\frac{1}{3}$.
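If you want to see the closed form at work without any floating-point noise, exact rational arithmetic does the job. The sketch below uses Python's standard `fractions` module; `right_sum_exact` and `closed_form` are hypothetical helper names. It also checks that the gap $R_n - \frac{1}{3}$ equals $\frac{1}{2n} + \frac{1}{6n^2}$, which follows from splitting the fraction above term by term.

```python
from fractions import Fraction


def right_sum_exact(n):
    """Right-endpoint sum for f(x) = x^2 on [0, 1], as an exact fraction."""
    return sum(Fraction(k, n) ** 2 for k in range(1, n + 1)) * Fraction(1, n)


def closed_form(n):
    """The closed form n(n+1)(2n+1) / (6 n^3) from the derivation above."""
    return Fraction(n * (n + 1) * (2 * n + 1), 6 * n**3)


for n in (4, 10, 100):
    gap = right_sum_exact(n) - Fraction(1, 3)
    # The gap is 1/(2n) + 1/(6n^2), which shrinks to zero as n grows.
    print(n, right_sum_exact(n), closed_form(n), gap)
```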
Section 2: The Definition of the Integral
What we just did informally is the construction of the Riemann integral. Let us now state it carefully.
A partition $P$ of $[a, b]$ is a finite sequence $a = x_0 < x_1 < x_2 < \cdots < x_n = b$. The subintervals are $[x_{i-1}, x_i]$ and their widths are $\Delta x_i = x_i - x_{i-1}$.
For a bounded function $f$ on $[a, b]$, define on each subinterval: $$M_i = \sup_{x \in [x_{i-1}, x_i]} f(x), \qquad m_i = \inf_{x \in [x_{i-1}, x_i]} f(x).$$
The upper Riemann sum is $U(f, P) = \sum_{i=1}^{n} M_i \Delta x_i$ and the lower Riemann sum is $L(f, P) = \sum_{i=1}^{n} m_i \Delta x_i$.
The upper sum overestimates; the lower sum underestimates. For any partition: $L(f, P) \leq \text{true area} \leq U(f, P)$.
Definition (Riemann Integral). The function $f$ is Riemann integrable on $[a, b]$ if:
$$\sup_P L(f, P) = \inf_P U(f, P).$$
When this holds, the common value is the definite integral:
$$\int_a^b f(x) dx = \sup_P L(f, P) = \inf_P U(f, P).$$
Which functions are integrable? Every continuous function on $[a, b]$ is integrable. More broadly, every bounded function with only finitely many discontinuities is integrable. The key condition is that the upper and lower sums can be squeezed together.
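To watch the squeeze happen, here is a small sketch that computes lower and upper sums. It relies on a simplifying assumption: the function is nondecreasing on the interval, so each infimum sits at the left endpoint and each supremum at the right endpoint. For a general function you would have to locate the extrema on every subinterval. The names `lower_upper_sums` and `uniform_partition` are illustrative.

```python
def lower_upper_sums(f, points):
    """Lower and upper sums of a nondecreasing f over the partition `points`.

    Assumption: f is nondecreasing, so on each subinterval the infimum is
    f(left endpoint) and the supremum is f(right endpoint).
    """
    lower = upper = 0.0
    for left, right in zip(points, points[1:]):
        width = right - left
        lower += f(left) * width
        upper += f(right) * width
    return lower, upper


def uniform_partition(a, b, n):
    """The n + 1 equally spaced points a = x_0 < x_1 < ... < x_n = b."""
    return [a + (b - a) * i / n for i in range(n + 1)]


for n in (10, 100, 1000):
    lo, hi = lower_upper_sums(lambda x: x * x, uniform_partition(0.0, 1.0, n))
    print(n, lo, hi, hi - lo)
```

For $f(x) = x^2$ on $[0, 1]$ the gap $U - L$ telescopes to $(f(1) - f(0)) \cdot \frac{1}{n} = \frac{1}{n}$, so refining the partition forces the two sums together.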
But not every bounded function is integrable. Consider the function:
$$f(x) = \begin{cases} 1 & x \text{ rational} \\ 0 & x \text{ irrational} \end{cases}$$
Take the interval $[0, 1]$. Both the rationals and the irrationals are dense, so every subinterval contains points of each kind: $M_i = 1$ and $m_i = 0$ for every subinterval, regardless of how fine the partition is. So $U(f, P) = 1$ and $L(f, P) = 0$ for every partition. The upper and lower sums never meet. This function is not Riemann integrable.
(It is Lebesgue integrable - the Lebesgue integral handles such functions by a different, more powerful construction. But that requires measure theory, which is a separate post.)
Section 3: The Shocking Connection
Here is a fact that should surprise you.
The area under a curve and the slope of a curve seem to be completely different things. One is a geometric quantity: area, accumulated by summing rectangles. The other is an instantaneous rate: slope, measured by a limiting ratio. They were invented to solve different problems. Archimedes worked on areas around 250 BCE. Fermat worked on tangents around 1630. They had nothing to do with each other.
And yet they are inverse operations.
This is the Fundamental Theorem of Calculus. It was discovered independently by Newton and Leibniz in the 1660s-1680s, and it unified two thousand years of separate mathematical struggles in a single relationship.
It comes in two parts.
Fundamental Theorem, Part 1. Let $f$ be continuous on $[a, b]$. Define the function:
$$F(x) = \int_a^x f(t) dt.$$
Then $F$ is differentiable on $(a, b)$, and:
$$F'(x) = f(x).$$
This says: the function that measures the accumulated area from $a$ to $x$ has derivative exactly $f(x)$ - the height of the curve at $x$.
Why is this true? Think about what happens to $F(x)$ when $x$ increases by a small amount $h$:
$$F(x + h) - F(x) = \int_a^{x+h} f(t) dt - \int_a^x f(t) dt = \int_x^{x+h} f(t) dt.$$
The new strip has width $h$ and height approximately $f(x)$ (since $f$ is continuous, it barely changes over the thin interval $[x, x+h]$). So:
$$F(x + h) - F(x) \approx f(x) \cdot h.$$
Dividing by $h$:
$$\frac{F(x+h) - F(x)}{h} \approx f(x).$$
As $h \to 0$, the approximation becomes exact. So $F'(x) = f(x)$.
The argument is not a trick. It is a direct geometric statement: the rate at which area accumulates at position $x$ is exactly the height of the curve at $x$. A thin sliver of width $h$ contributes approximately $f(x) \cdot h$ of area, and dividing by $h$ gives $f(x)$.
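You can test the sliver argument numerically. The sketch below approximates $F(x) = \int_0^x \cos t \, dt$ with a midpoint sum and then forms the difference quotient for shrinking $h$. The choice of $\cos$, the base point $0$, and the helper names are all assumptions made for illustration.

```python
import math


def accumulated_area(f, a, x, n=20_000):
    """Midpoint-rule approximation of F(x), the integral of f from a to x."""
    dx = (x - a) / n
    return sum(f(a + (k + 0.5) * dx) for k in range(n)) * dx


def difference_quotient(f, a, x, h):
    """(F(x + h) - F(x)) / h, which should approach f(x) as h shrinks."""
    return (accumulated_area(f, a, x + h) - accumulated_area(f, a, x)) / h


x = 1.0
for h in (0.1, 0.01, 0.001):
    print(h, difference_quotient(math.cos, 0.0, x, h), math.cos(x))
```

The second column should move toward the third as $h$ gets smaller, which is exactly the statement $F'(x) = f(x)$.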
Discomfort check. Why is the Fundamental Theorem surprising? After all, once you see the sliver argument, it seems obvious. The surprise is historical and conceptual. Integration was developed to compute areas - an accumulation problem. Differentiation was developed to compute slopes - a rate-of-change problem. Accumulation and rate of change feel like completely different things. One is global (you sum over an interval). The other is local (you look at behavior at a single point). Finding that they are inverses is like finding that addition and subtraction are inverses - it seems like it should have been obvious in retrospect, but it was genuinely hard to see before someone pointed it out. The theorem means that every fact about integration has a corresponding fact about differentiation, and vice versa. It doubled the power of both subjects.
Fundamental Theorem, Part 2. Let $f$ be continuous on $[a, b]$ and let $F$ be any function with $F' = f$. Then:
$$\int_a^b f(x) dx = F(b) - F(a).$$
This is the part you will use constantly. To evaluate a definite integral, find a function whose derivative is $f$ (an antiderivative), evaluate it at the endpoints, and subtract.
Example. $\int_0^1 x^2 dx$. We need a function $F$ with $F'(x) = x^2$. Try $F(x) = x^3/3$: then $F'(x) = x^2$. So:
$$\int_0^1 x^2 dx = F(1) - F(0) = \frac{1}{3} - 0 = \frac{1}{3}.$$
The area is exactly $1/3$, confirming our Riemann sum calculation.
Section 4: Antiderivatives
Definition. A function $F$ is an antiderivative of $f$ on an interval if $F'(x) = f(x)$ for all $x$ in that interval.
Finding antiderivatives is the reverse of differentiation. You ask: what function, when differentiated, gives $f$?
Discomfort check. Is an antiderivative unique? No. If $F'(x) = f(x)$, then so does $F(x) + C$ for any constant $C$, since $(F + C)' = F' = f$. The constant $C$ has zero derivative and disappears. Conversely, if $F$ and $G$ are both antiderivatives of $f$ on an interval, then $(F - G)' = f - f = 0$, so $F - G$ is constant on that interval (by the Mean Value Theorem). So any two antiderivatives differ by a constant. The indefinite integral $\int f(x) dx = F(x) + C$ captures this entire family. The constant $C$ is not sloppiness - it is the correct statement that antiderivatives are only unique up to a constant. When you compute a definite integral $F(b) - F(a)$, the constant cancels: $(F(b) + C) - (F(a) + C) = F(b) - F(a)$.
What the indefinite integral means, and why it exists. Integration was defined as accumulation on an interval, producing a number. So what is $\int f(x) dx$ with no bounds? It is not a different kind of integral. It does not compute an area. It is a name for the family of antiderivatives of $f$ - the set of all functions whose derivative is $f$.
The reason it exists is the Fundamental Theorem, Part 2: to evaluate any definite integral $\int_a^b f(x) dx$, you need to find an antiderivative first, then subtract its values at the endpoints. The indefinite integral packages that pre-computation step. Rather than hunting for an antiderivative every time you encounter a new pair of bounds, you find $\int f(x) dx = F(x) + C$ once and keep it. Then $\int_a^b f(x) dx = F(b) - F(a)$ for any $a$ and $b$ you like.
There is a deeper way to see the connection. By FTC Part 1, the function $A(x) = \int_a^x f(t) dt$ is itself an antiderivative of $f$ - the one that starts accumulating from $a$, so $A(a) = 0$. Any other antiderivative differs from $A(x)$ by a constant. Changing the starting point from $a$ to some other base point $c$ gives $A_c(x) = \int_c^x f(t) dt = A(x) + C$ with $C = \int_c^a f(t) dt$ (both functions have the same derivative $f$). So the $+C$ in $\int f(x) dx = F(x) + C$ is not an afterthought. It accounts for, among other things, the choice of where to start accumulating: every starting point picks out one antiderivative and hence one value of $C$. The converse can fail: for $f(x) = 2x$, the antiderivative $x^2 + 1$ is never zero, so it is not the accumulated area from any starting point. The indefinite integral is the whole family at once.
In short: the definite integral is the primary object - it computes a number. The indefinite integral is a computational tool built on top of it - it stores the antiderivative so you can evaluate the definite integral for any bounds without starting from scratch. The notation $\int f(x) dx$ looks like an integral without bounds, but what it means is “a function whose derivative is $f$.”
Standard antiderivatives (read these as the reverse of the derivative table):
| Function $f(x)$ | Antiderivative $F(x)$ |
|---|---|
| $x^n$ ($n \neq -1$) | $\frac{x^{n+1}}{n+1} + C$ |
| $\frac{1}{x}$ | $\ln \lvert x \rvert + C$ |
| $e^x$ | $e^x + C$ |
| $\sin x$ | $-\cos x + C$ |
| $\cos x$ | $\sin x + C$ |
| $\sec^2 x$ | $\tan x + C$ |
| $\frac{1}{\sqrt{1-x^2}}$ | $\arcsin x + C$ |
| $\frac{1}{1+x^2}$ | $\arctan x + C$ |
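A quick way to sanity-check any row of this table is to differentiate the right-hand column numerically and compare with the left. This sketch does so at one sample point per row using a symmetric difference quotient. The sample points and the name `central_difference` are arbitrary choices, and a numerical match at one point is evidence, not proof.

```python
import math

# (integrand f, claimed antiderivative F, sample point inside the domain)
TABLE = [
    (lambda x: x**3, lambda x: x**4 / 4, 1.3),             # x^n with n = 3
    (lambda x: 1 / x, lambda x: math.log(abs(x)), -2.0),   # negative x on purpose
    (math.exp, math.exp, 0.7),
    (math.sin, lambda x: -math.cos(x), 0.7),
    (math.cos, math.sin, 0.7),
    (lambda x: 1 / math.cos(x) ** 2, math.tan, 0.7),       # sec^2 x
    (lambda x: 1 / math.sqrt(1 - x * x), math.asin, 0.3),
    (lambda x: 1 / (1 + x * x), math.atan, 0.7),
]


def central_difference(F, x, h=1e-5):
    """Symmetric difference quotient approximating F'(x)."""
    return (F(x + h) - F(x - h)) / (2 * h)


max_error = max(abs(central_difference(F, x) - f(x)) for f, F, x in TABLE)
print(max_error)
```

The second row is evaluated at a negative $x$ to show why the absolute value in $\ln \lvert x \rvert$ matters.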
Example. Evaluate $\int_0^{\pi} \sin x dx$.
Antiderivative: $F(x) = -\cos x$. Then $F(\pi) - F(0) = -\cos(\pi) - (-\cos 0) = -(-1) - (-1) = 1 + 1 = 2$.
The area under one arch of the sine curve is exactly $2$.
Section 5: Technique - U-Substitution
Not every antiderivative can be read off from the table. We need techniques.
The first is u-substitution - the chain rule, run backward.
Recall: if $y = f(g(x))$, the chain rule gives $\frac{dy}{dx} = f'(g(x)) \cdot g'(x)$. In integral form:
$$\int f'(g(x)) \cdot g'(x) dx = f(g(x)) + C.$$
If you see an integral of the form $\int h(g(x)) \cdot g'(x) dx$, substitute $u = g(x)$, $du = g'(x) dx$:
$$\int h(g(x)) \cdot g'(x) dx = \int h(u) du.$$
The $g'(x) dx$ becomes $du$, and the integral simplifies to one in $u$ alone.
Example 1. $\int 2x \cos(x^2) dx$.
Let $u = x^2$, so $du = 2x dx$.
$$\int \cos(x^2) \cdot 2x dx = \int \cos(u) du = \sin(u) + C = \sin(x^2) + C.$$
Check by differentiating: $\frac{d}{dx}\sin(x^2) = \cos(x^2) \cdot 2x$. Correct.
Example 2. $\int_0^1 x e^{x^2} dx$.
Let $u = x^2$, $du = 2x dx$, so $x dx = du/2$. When $x = 0$, $u = 0$; when $x = 1$, $u = 1$.
$$\int_0^1 x e^{x^2} dx = \int_0^1 e^u \frac{du}{2} = \frac{1}{2}\left[e^u\right]_0^1 = \frac{1}{2}(e - 1).$$
Example 3. $\int \frac{1}{x \ln x} dx$.
Let $u = \ln x$, $du = \frac{1}{x} dx$.
$$\int \frac{1}{x \ln x} dx = \int \frac{1}{u} du = \ln|u| + C = \ln|\ln x| + C.$$
The key skill: recognize when $g'(x)$ is already present (or nearly present, up to a constant factor) as a factor in the integrand. That is the signal to substitute $u = g(x)$.
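A substitution changes the integrand and the bounds but not the value. The sketch below checks this for Example 2 by integrating numerically in $x$ and in $u = x^2$ and comparing both to $\frac{1}{2}(e - 1)$. The `midpoint` helper is a hypothetical name for a plain midpoint Riemann sum.

```python
import math


def midpoint(f, a, b, n=10_000):
    """Midpoint Riemann sum of f on [a, b] with n equal subintervals."""
    dx = (b - a) / n
    return sum(f(a + (k + 0.5) * dx) for k in range(n)) * dx


# Original integral in x, and the transformed integral in u = x^2.
# Here the bounds happen to map to themselves: x = 0 -> u = 0, x = 1 -> u = 1.
in_x = midpoint(lambda x: x * math.exp(x * x), 0.0, 1.0)
in_u = midpoint(lambda u: 0.5 * math.exp(u), 0.0, 1.0)
exact = 0.5 * (math.e - 1)
print(in_x, in_u, exact)
```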
Section 6: Technique - Integration by Parts
The second major technique is integration by parts - the product rule, run backward.
Recall the product rule: $(uv)' = u'v + uv'$. Integrating both sides:
$$uv = \int u'v dx + \int uv' dx.$$
Rearranging:
$$\int uv' dx = uv - \int u'v dx.$$
Or, using differentials $dv = v' dx$, $du = u' dx$:
$$\int u dv = uv - \int v du.$$
This is the integration by parts formula. It transforms one integral into another. The art is choosing which factor to call $u$ and which to call $dv$ so that the new integral $\int v du$ is simpler than the original.
Heuristic. A rough guide for choosing $u$: prefer the function whose derivative simplifies things. Functions that simplify under differentiation: $\ln x$ (its derivative is $1/x$, which is simpler), polynomials (their derivatives eventually become zero), inverse trig functions. Functions that do not simplify: $e^x$ (its derivative is itself), $\sin x$, $\cos x$.
Example 1. $\int x e^x dx$.
Choose $u = x$ (simplifies under differentiation) and $dv = e^x dx$ (easy to integrate). Then $du = dx$ and $v = e^x$.
$$\int x e^x dx = x e^x - \int e^x dx = x e^x - e^x + C = (x-1)e^x + C.$$
Example 2. $\int \ln x dx$.
This seems to have only one factor, but write it as $\int \ln x \cdot 1 dx$. Choose $u = \ln x$, $dv = dx$. Then $du = \frac{1}{x} dx$ and $v = x$.
$$\int \ln x dx = x \ln x - \int x \cdot \frac{1}{x} dx = x \ln x - \int 1 dx = x \ln x - x + C.$$
Example 3. $\int x^2 \sin x dx$ (requires applying integration by parts twice).
First: $u = x^2$, $dv = \sin x dx$. Then $du = 2x dx$, $v = -\cos x$.
$$\int x^2 \sin x dx = -x^2 \cos x + \int 2x \cos x dx.$$
Now apply integration by parts to $\int 2x \cos x dx$: $u = 2x$, $dv = \cos x dx$, $du = 2 dx$, $v = \sin x$.
$$\int 2x \cos x dx = 2x \sin x - \int 2 \sin x dx = 2x \sin x + 2\cos x + C.$$
Combining: $\int x^2 \sin x dx = -x^2 \cos x + 2x \sin x + 2\cos x + C$.
A circular case. Sometimes the integral comes back, and this is useful.
$\int e^x \cos x dx$: Let $u = \cos x$, $dv = e^x dx$.
$$\int e^x \cos x dx = e^x \cos x + \int e^x \sin x dx.$$
Now apply again to $\int e^x \sin x dx$: $u = \sin x$, $dv = e^x dx$.
$$\int e^x \sin x dx = e^x \sin x - \int e^x \cos x dx.$$
Substituting back:
$$\int e^x \cos x dx = e^x \cos x + e^x \sin x - \int e^x \cos x dx.$$
Move the integral to the left side:
$$2\int e^x \cos x dx = e^x(\cos x + \sin x), \quad \text{so} \quad \int e^x \cos x dx = \frac{e^x(\cos x + \sin x)}{2} + C.$$
The integral appeared on both sides, and you solved for it algebraically.
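Sign errors are the usual hazard with integration by parts, so it is worth checking the results. By Part 2 of the Fundamental Theorem, each antiderivative $F$ found in this section should satisfy $F(b) - F(a) = \int_a^b f(x) dx$. The sketch below compares the two sides on $[1, 2]$. The interval and the `midpoint` helper are assumptions chosen for illustration.

```python
import math


def midpoint(f, a, b, n=10_000):
    """Midpoint Riemann sum of f on [a, b] with n equal subintervals."""
    dx = (b - a) / n
    return sum(f(a + (k + 0.5) * dx) for k in range(n)) * dx


# (integrand, antiderivative obtained by parts in the examples above)
RESULTS = [
    (lambda x: x * math.exp(x), lambda x: (x - 1) * math.exp(x)),
    (math.log, lambda x: x * math.log(x) - x),
    (lambda x: x * x * math.sin(x),
     lambda x: -x * x * math.cos(x) + 2 * x * math.sin(x) + 2 * math.cos(x)),
    (lambda x: math.exp(x) * math.cos(x),
     lambda x: math.exp(x) * (math.cos(x) + math.sin(x)) / 2),
]

a, b = 1.0, 2.0
for f, F in RESULTS:
    print(F(b) - F(a), midpoint(f, a, b))
```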
Section 7: Properties of the Definite Integral
The integral satisfies properties that mirror those of sums - because it is the limit of sums.
Linearity: $$\int_a^b [f(x) + g(x)] dx = \int_a^b f(x) dx + \int_a^b g(x) dx,$$ $$\int_a^b c f(x) dx = c \int_a^b f(x) dx.$$
Additivity over intervals: $$\int_a^c f(x) dx = \int_a^b f(x) dx + \int_b^c f(x) dx \quad \text{for any } b \in [a, c].$$
Reversal of limits: $$\int_b^a f(x) dx = -\int_a^b f(x) dx.$$
Comparison: If $f(x) \leq g(x)$ on $[a, b]$, then $\int_a^b f \leq \int_a^b g$.
Bound: $\left|\int_a^b f(x) dx\right| \leq \int_a^b |f(x)| dx \leq M(b-a)$ where $M = \max_{[a,b]} |f|$.
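These properties are not technicalities. They are what makes integration linear and therefore tractable. Linearity is one reason Fourier analysis works: if $f = \sum_{n=1}^{N} c_n \sin(nx)$ is a finite sum, then $\int f = \sum_{n=1}^{N} c_n \int \sin(nx)$ - you can integrate term by term. For an infinite series, term-by-term integration needs an extra convergence condition (uniform convergence is sufficient).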
Section 8: The Geometry of the Integral
The definite integral $\int_a^b f(x) dx$ computes signed area: positive where $f > 0$, negative where $f < 0$.
Example. $\int_0^{2\pi} \sin x dx = [-\cos x]_0^{2\pi} = -\cos(2\pi) + \cos(0) = -1 + 1 = 0$.
The positive area from $0$ to $\pi$ exactly cancels the negative area from $\pi$ to $2\pi$. If you want total area (not signed), compute $\int_0^{\pi} |\sin x| dx + \int_{\pi}^{2\pi} |\sin x| dx = 2 + 2 = 4$.
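The difference between signed and total area shows up immediately in a numerical check. This sketch integrates $\sin x$ and $\lvert \sin x \rvert$ over $[0, 2\pi]$ with a midpoint sum. The `midpoint` helper is an illustrative name.

```python
import math


def midpoint(f, a, b, n=10_000):
    """Midpoint Riemann sum of f on [a, b] with n equal subintervals."""
    dx = (b - a) / n
    return sum(f(a + (k + 0.5) * dx) for k in range(n)) * dx


signed = midpoint(math.sin, 0.0, 2 * math.pi)                   # areas cancel
total = midpoint(lambda x: abs(math.sin(x)), 0.0, 2 * math.pi)  # areas add
print(signed, total)
```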
Average value. The average value of $f$ on $[a, b]$ is:
$$f_{\text{avg}} = \frac{1}{b - a}\int_a^b f(x) dx.$$
The integral is the total; divide by the length to get the average. This mirrors the discrete average: $\frac{1}{n}\sum_{i=1}^{n} f(x_i)$.
The Mean Value Theorem for Integrals says: if $f$ is continuous on $[a, b]$, there exists $c \in [a, b]$ with $f(c) = f_{\text{avg}}$. The function actually attains its average value somewhere.
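To make this concrete, take $f(x) = x^2$ on $[0, 1]$. The average value is $\frac{1}{3}$, and the theorem promises a point $c$ with $c^2 = \frac{1}{3}$, namely $c = 1/\sqrt{3}$. The sketch below finds $c$ by bisection. It assumes $f(a)$ and $f(b)$ lie on opposite sides of the average, which holds for any monotone $f$; the helper names are hypothetical.

```python
def midpoint(f, a, b, n=10_000):
    """Midpoint Riemann sum of f on [a, b] with n equal subintervals."""
    dx = (b - a) / n
    return sum(f(a + (k + 0.5) * dx) for k in range(n)) * dx


def average_value(f, a, b):
    return midpoint(f, a, b) / (b - a)


def find_c(f, a, b, tol=1e-10):
    """Bisection for a point c in [a, b] with f(c) equal to the average value.

    Assumption: f(a) - avg and f(b) - avg have opposite signs.
    """
    avg = average_value(f, a, b)
    lo, hi = a, b
    g_lo = f(lo) - avg
    while hi - lo > tol:
        mid = (lo + hi) / 2
        g_mid = f(mid) - avg
        if (g_mid > 0) == (g_lo > 0):
            lo, g_lo = mid, g_mid   # root is in the right half
        else:
            hi = mid                # root is in the left half
    return (lo + hi) / 2


c = find_c(lambda x: x * x, 0.0, 1.0)
print(average_value(lambda x: x * x, 0.0, 1.0), c)
```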
Section 9: Applications
Area Between Curves
The area between $f$ and $g$ where $f(x) \geq g(x)$ on $[a, b]$:
$$A = \int_a^b [f(x) - g(x)] dx.$$
To find the region between $y = x$ and $y = x^2$: they intersect where $x = x^2$, i.e., $x = 0$ and $x = 1$. On $[0, 1]$, $x \geq x^2$.
$$A = \int_0^1 (x - x^2) dx = \left[\frac{x^2}{2} - \frac{x^3}{3}\right]_0^1 = \frac{1}{2} - \frac{1}{3} = \frac{1}{6}.$$
Work
In physics, work is force times displacement - but only when force is constant. If force varies with position, $F(x)$, then the work to move an object from $a$ to $b$ is:
$$W = \int_a^b F(x) dx.$$
This is the integral in its natural physical form: accumulate infinitesimal contributions $F(x) dx$ over the path.
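As an illustration, consider a spring obeying Hooke's law, $F(x) = kx$. This example is not from the text above: the spring, the constant $k = 50$ N/m, and the stretch of $0.2$ m are all made-up values. The exact work is $\frac{1}{2} k x^2 = 1$ J, and a trapezoid sum reproduces it.

```python
def work(force, a, b, n=1_000):
    """Trapezoid-rule approximation of the work integral of force from a to b."""
    dx = (b - a) / n
    interior = sum(force(a + k * dx) for k in range(1, n))
    return dx * (0.5 * force(a) + interior + 0.5 * force(b))


K = 50.0  # hypothetical spring constant in N/m


def spring_force(x):
    """Hooke's law: force needed to hold the spring at extension x."""
    return K * x


w = work(spring_force, 0.0, 0.2)
print(w, 0.5 * K * 0.2**2)
```

Because the force is linear in $x$, the trapezoids fit it exactly and the only discrepancy is floating-point rounding.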
Expected Values and Probability
This is where integration becomes the foundation of an entire subject.
A probability density function (pdf) is a function $f(x) \geq 0$ with $\int_{-\infty}^{\infty} f(x) dx = 1$. The probability that a random variable $X$ falls in $[a, b]$ is:
$$P(a \leq X \leq b) = \int_a^b f(x) dx.$$
The expected value (mean) of $X$ is:
$$E[X] = \int_{-\infty}^{\infty} x f(x) dx.$$
The variance is:
$$\text{Var}(X) = E[(X - E[X])^2] = \int_{-\infty}^{\infty} (x - \mu)^2 f(x) dx.$$
Every formula in probability for continuous random variables is an integral. The reason we spend time on integration is not just to compute areas. It is that probability theory - and therefore statistics, and therefore virtually all of data science and machine learning - is integration theory.
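Here is a small worked case. Take the density $f(x) = 2x$ on $[0, 1]$ and zero elsewhere. This density is my own toy example, chosen because every integral can be done by hand: total probability $1$, $P(0.5 \leq X \leq 1) = 0.75$, mean $\frac{2}{3}$, variance $\frac{1}{18}$. The sketch checks all four numerically.

```python
def midpoint(f, a, b, n=10_000):
    """Midpoint Riemann sum of f on [a, b] with n equal subintervals."""
    dx = (b - a) / n
    return sum(f(a + (k + 0.5) * dx) for k in range(n)) * dx


def density(x):
    """Hypothetical density f(x) = 2x on [0, 1], zero elsewhere."""
    return 2.0 * x if 0.0 <= x <= 1.0 else 0.0


# The density vanishes outside [0, 1], so integrating over [0, 1] is enough.
total = midpoint(density, 0.0, 1.0)
prob = midpoint(density, 0.5, 1.0)
mean = midpoint(lambda x: x * density(x), 0.0, 1.0)
variance = midpoint(lambda x: (x - mean) ** 2 * density(x), 0.0, 1.0)
print(total, prob, mean, variance)
```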
The condition $\int_{-\infty}^{\infty} f(x) dx = 1$ is the normalization condition. For the normal distribution $f(x) = \frac{1}{\sqrt{2\pi}} e^{-x^2/2}$, verifying that this integral equals $1$ requires evaluating the Gaussian integral $\int_{-\infty}^{\infty} e^{-x^2} dx = \sqrt{\pi}$. This is an improper integral (the limits are infinite), and its evaluation is a beautiful argument we will cover in the next post.
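Until then, a numerical check is reassuring. The sketch below truncates the infinite range to $[-10, 10]$, which is an assumption but a harmless one: beyond $\lvert x \rvert = 10$ the integrand $e^{-x^2}$ is smaller than $10^{-43}$. It is a sanity check, not a substitute for the proof.

```python
import math


def trapezoid(f, a, b, n):
    """Composite trapezoid rule on [a, b] with n equal subintervals."""
    dx = (b - a) / n
    interior = sum(f(a + k * dx) for k in range(1, n))
    return dx * (0.5 * (f(a) + f(b)) + interior)


def standard_normal(x):
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)


# Truncated stand-ins for the integrals over the whole real line.
gauss = trapezoid(lambda x: math.exp(-x * x), -10.0, 10.0, 2_000)
normal_total = trapezoid(standard_normal, -10.0, 10.0, 2_000)
print(gauss, math.sqrt(math.pi), normal_total)
```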
Entropy
The entropy of a continuous probability distribution with density $f$ is:
$$H(f) = -\int_{-\infty}^{\infty} f(x) \ln f(x) dx.$$
This is the continuous analog of the discrete entropy $-\sum p_i \ln p_i$. It measures the uncertainty or spread of the distribution. Higher entropy = more spread out. The convention $0 \ln 0 = 0$ (justified because $\lim_{x \to 0^+} x \ln x = 0$, which you can prove via L'Hôpital's rule) makes the integrand well-defined even where $f = 0$.
In information theory and machine learning, entropy and its relatives (KL divergence, cross-entropy) are integrals, and understanding them requires integration.
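Two standard closed forms make good test cases: the uniform density on an interval of width $w$ has entropy $\ln w$, and the standard normal has entropy $\frac{1}{2}\ln(2\pi e)$. The sketch below approximates the entropy integral with a midpoint sum and applies the $0 \ln 0 = 0$ convention explicitly. The integration ranges and function names are assumptions for illustration.

```python
import math


def entropy(density, a, b, n=20_000):
    """Midpoint-rule approximation of -integral of f ln f over [a, b]."""
    dx = (b - a) / n
    total = 0.0
    for k in range(n):
        p = density(a + (k + 0.5) * dx)
        if p > 0.0:  # convention: 0 * ln 0 = 0, so skip points where f = 0
            total -= p * math.log(p) * dx
    return total


def standard_normal(x):
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)


def uniform_0_2(x):
    """Uniform density on [0, 2]."""
    return 0.5 if 0.0 <= x <= 2.0 else 0.0


h_normal = entropy(standard_normal, -10.0, 10.0)  # range truncated to [-10, 10]
h_uniform = entropy(uniform_0_2, -1.0, 3.0)
print(h_normal, 0.5 * math.log(2 * math.pi * math.e))
print(h_uniform, math.log(2.0))
```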
Section 10: Numerical Integration
Most integrals that arise in practice cannot be computed analytically - there is no closed-form antiderivative. The function $e^{-x^2}$ has no elementary antiderivative (this is a theorem, not a gap in our techniques). Neither do $\sin(x^2)$, $\frac{\sin x}{x}$, or most functions that arise in statistics.
For such integrals, numerical methods approximate the value to any desired precision.
Trapezoid rule. Instead of rectangles, approximate $f$ on each subinterval by a trapezoid (a straight line between the endpoints):
$$\int_a^b f(x) dx \approx \frac{\Delta x}{2} [f(x_0) + 2f(x_1) + 2f(x_2) + \cdots + 2f(x_{n-1}) + f(x_n)].$$
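The total error is $O((\Delta x)^2)$ when $f$ has a bounded second derivative: each subinterval contributes an error of $O((\Delta x)^3)$, and there are $n \propto 1/\Delta x$ of them. Left- and right-endpoint rectangles only achieve $O(\Delta x)$.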
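A minimal sketch of the trapezoid rule, tested on $\int_0^{\pi} \sin x \, dx = 2$. The function name `trapezoid` is my own. Doubling $n$ should cut the error by roughly a factor of four, which is what second-order accuracy means.

```python
import math


def trapezoid(f, a, b, n):
    """Composite trapezoid rule on [a, b] with n equal subintervals."""
    dx = (b - a) / n
    interior = sum(f(a + k * dx) for k in range(1, n))
    # Endpoints get weight 1/2, interior points weight 1 (times dx).
    return dx * (0.5 * (f(a) + f(b)) + interior)


exact = 2.0  # integral of sin x over [0, pi]
for n in (10, 20, 40):
    print(n, abs(trapezoid(math.sin, 0.0, math.pi, n) - exact))
```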
Simpson’s rule. Approximate $f$ on each pair of subintervals by a quadratic (a parabola through three points):
$$\int_a^b f(x) dx \approx \frac{\Delta x}{3} [f(x_0) + 4f(x_1) + 2f(x_2) + 4f(x_3) + \cdots + 4f(x_{n-1}) + f(x_n)].$$
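This formula requires an even number of subintervals $n$. The total error is $O((\Delta x)^4)$ when $f$ has a bounded fourth derivative, even better. For smooth functions, Simpson’s rule converges rapidly.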
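A matching sketch of Simpson's rule, using the same test integral. The name `simpson` is illustrative, and the function insists on an even number of subintervals because the parabolas are fitted over pairs. Doubling $n$ should shrink the error by roughly a factor of sixteen.

```python
import math


def simpson(f, a, b, n):
    """Composite Simpson's rule on [a, b]; n must be even."""
    if n % 2:
        raise ValueError("Simpson's rule needs an even number of subintervals")
    dx = (b - a) / n
    odd = sum(f(a + k * dx) for k in range(1, n, 2))    # weight 4
    even = sum(f(a + k * dx) for k in range(2, n, 2))   # weight 2
    return dx / 3 * (f(a) + 4 * odd + 2 * even + f(b))


exact = 2.0  # integral of sin x over [0, pi]
for n in (10, 20, 40):
    print(n, abs(simpson(math.sin, 0.0, math.pi, n) - exact))
```

Simpson's rule is also exact for cubics, not just quadratics, which is one way to see where the extra order of accuracy comes from.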
Monte Carlo integration. For high-dimensional integrals (which arise constantly in Bayesian inference and statistical mechanics), random sampling is often more efficient than grid-based methods. The idea: if $X$ is uniformly distributed on $[a, b]$, then:
$$\int_a^b f(x) dx = (b-a) \cdot E[f(X)] \approx (b-a) \cdot \frac{1}{N} \sum_{i=1}^{N} f(X_i),$$
where $X_1, \ldots, X_N$ are random samples. The error is $O(1/\sqrt{N})$ regardless of dimension - whereas grid methods suffer the curse of dimensionality. For a 100-dimensional integral, Monte Carlo is often the only feasible method.
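Here is the one-dimensional version as a sketch, using Python's standard `random` module with a fixed seed so the run is repeatable. The name `monte_carlo` and the sample sizes are arbitrary choices. In one dimension this is far less efficient than Simpson's rule; it is shown only to illustrate the estimator.

```python
import math
import random


def monte_carlo(f, a, b, n_samples, rng):
    """Estimate the integral of f on [a, b] as (b - a) times the sample mean of f."""
    total = sum(f(rng.uniform(a, b)) for _ in range(n_samples))
    return (b - a) * total / n_samples


rng = random.Random(0)  # fixed seed for reproducibility
for n_samples in (100, 10_000, 100_000):
    estimate = monte_carlo(math.sin, 0.0, math.pi, n_samples, rng)
    print(n_samples, estimate, abs(estimate - 2.0))
```

The error shrinks slowly and noisily: multiplying the sample count by 100 buys roughly one extra decimal digit, as the $O(1/\sqrt{N})$ rate predicts.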
Section 11: The Rigorous Underpinning
Proof of the Fundamental Theorem, Part 1
Theorem. If $f$ is continuous on $[a, b]$ and $F(x) = \int_a^x f(t) dt$, then $F'(x) = f(x)$.
Proof. We compute $F'(x)$ from the definition:
$$F'(x) = \lim_{h \to 0} \frac{F(x+h) - F(x)}{h} = \lim_{h \to 0} \frac{1}{h}\int_x^{x+h} f(t) dt.$$
Since $f$ is continuous at $x$, for any $\varepsilon > 0$ there exists $\delta > 0$ such that $|t - x| < \delta$ implies $|f(t) - f(x)| < \varepsilon$. For $|h| < \delta$, every $t$ in $[x, x+h]$ (or $[x+h, x]$) satisfies $|t - x| \leq |h| < \delta$, so:
$$\left|\frac{1}{h}\int_x^{x+h} f(t) dt - f(x)\right| = \left|\frac{1}{h}\int_x^{x+h} [f(t) - f(x)] dt\right| \leq \frac{1}{|h|} \cdot \varepsilon \cdot |h| = \varepsilon.$$
Since $\varepsilon$ was arbitrary, $\lim_{h \to 0} \frac{1}{h}\int_x^{x+h} f(t) dt = f(x)$. $\blacksquare$
Proof of the Fundamental Theorem, Part 2
Theorem. If $f$ is continuous on $[a, b]$ and $G$ is any antiderivative of $f$ (so $G' = f$), then:
$$\int_a^b f(x) dx = G(b) - G(a).$$
Proof. Let $F(x) = \int_a^x f(t) dt$. By Part 1, $F' = f$. Since $G$ is also an antiderivative of $f$, $(G - F)' = f - f = 0$ on $(a, b)$. By the Mean Value Theorem (if a function has zero derivative everywhere on an interval, it is constant), $G(x) - F(x) = C$ for some constant $C$.
Therefore: $G(b) - G(a) = [F(b) + C] - [F(a) + C] = F(b) - F(a) = \int_a^b f(t) dt - 0 = \int_a^b f(t) dt$. $\blacksquare$
Summary
| Concept | What it is |
|---|---|
| Riemann sum | Approximation by rectangles; $\sum f(x_i^*) \Delta x_i$ |
| Definite integral | Limit of Riemann sums; $\int_a^b f(x) dx$ |
| Integrability | Supremum of lower sums equals infimum of upper sums |
| FTC Part 1 | $\frac{d}{dx}\int_a^x f(t) dt = f(x)$ |
| FTC Part 2 | $\int_a^b f(x) dx = F(b) - F(a)$ where $F' = f$ |
| Antiderivative | $F$ with $F' = f$; unique up to a constant |
| U-substitution | Chain rule reversed; $u = g(x)$, $du = g'(x) dx$ |
| Integration by parts | Product rule reversed; $\int u dv = uv - \int v du$ |
| Probability | $P(a \leq X \leq b) = \int_a^b f(x) dx$; expected value and entropy are integrals |