You are standing at the edge of a cliff watching a rock fall. You want to know its speed at exactly the two-second mark - not the average speed from second two to second three, not the average over any interval at all. The speed at the precise instant $t = 2$.

Here is the problem. Speed is distance divided by time. But at a single instant there is no interval, no distance traveled. If you plug in the numbers, you get $\frac{0}{0}$. That expression is not just unknown - it is genuinely meaningless. Division by zero is not defined.

And yet, there is clearly something happening at $t = 2$. The rock is moving. It has a speed. We can measure it indirectly: compute the average speed over smaller and smaller intervals around $t = 2$.

If the rock’s position is $s(t) = 4.9t^2$ meters (free fall under gravity), then from $t = 2$ to $t = 2 + h$, the average speed is:

$$\frac{s(2+h) - s(2)}{h} = \frac{4.9(2+h)^2 - 4.9(4)}{h} = \frac{4.9(4 + 4h + h^2) - 19.6}{h} = \frac{19.6h + 4.9h^2}{h} = 19.6 + 4.9h.$$

Now watch what happens as we shrink $h$:

| $h$ | Average speed (m/s) |
|---|---|
| $1$ | $24.5$ |
| $0.1$ | $20.09$ |
| $0.01$ | $19.649$ |
| $0.001$ | $19.6049$ |
| $0.0001$ | $19.60049$ |

These numbers are converging to $19.6$. The function $19.6 + 4.9h$ approaches $19.6$ as $h$ approaches $0$. We cannot set $h = 0$ in the original expression $\frac{s(2+h)-s(2)}{h}$ because that gives $\frac{0}{0}$. But we can ask: what value is this approaching? That is the limit. And the limit is $19.6$ m/s - the instantaneous speed of the rock at $t = 2$.
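The table can be reproduced with a few lines of Python. This is a minimal sketch; the function names `s` and `average_speed` are illustrative choices, not part of any library.

```python
def s(t):
    """Position of the falling rock in meters after t seconds."""
    return 4.9 * t**2


def average_speed(t, h):
    """Average speed over the interval [t, t + h]; undefined for h == 0."""
    return (s(t + h) - s(t)) / h


# Shrink h and watch the average speed settle toward 19.6 m/s.
for h in [1, 0.1, 0.01, 0.001, 0.0001]:
    print(f"h = {h:<7} average speed = {average_speed(2, h):.5f} m/s")
```

Calling `average_speed(2, 0)` raises `ZeroDivisionError`, which is the $\frac{0}{0}$ problem showing up in code: the quotient can be evaluated for every nonzero $h$ but not at $h = 0$ itself.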

This is the entire idea of a limit in one example. We cannot always evaluate a function at a particular input. But we can ask what value the function is approaching as its input approaches that point. The rest of this post makes that idea precise.


Section 1: The Informal Idea

Write $\lim_{x \to a} f(x) = L$. Read this as: “the limit of $f(x)$ as $x$ approaches $a$ is $L$.” What it means:

As $x$ gets closer and closer to $a$ (from either side, but not equal to $a$), the values $f(x)$ get closer and closer to $L$.

Two things deserve emphasis immediately.

The function value at $a$ is irrelevant. We do not care what $f(a)$ equals, whether $f(a)$ is defined, or even whether $f(a) = L$. The limit is entirely about the tendency - where $f$ is heading - not the arrival.

Discomfort check. Why doesn’t the function value at $a$ matter? This feels wrong - surely what a function does at a point is the most important thing. The answer is that limits were invented precisely for situations where the function is not defined at the point of interest. The derivative is defined as $\lim_{h\to 0} \frac{f(x+h)-f(x)}{h}$. At $h = 0$, this expression is $\frac{0}{0}$, which is undefined. We need to know what the expression approaches as $h$ gets close to zero - not what it equals at $h = 0$, which is nothing. The entire power of limits comes from the fact that the value at the point is excluded. If it were included, limits would just be function evaluation, which is far less interesting and far less useful.

Approaching, not reaching. The condition $x \to a$ means $x$ is near $a$ but $x \neq a$. This is the essential feature, captured by the notation $0 < |x - a|$ in the formal definition (the $0 <$ part ensures $x \neq a$).

Three Examples at Increasing Difficulty

Example 1: A well-behaved polynomial. $\lim_{x\to 2} (3x + 1)$.

Substitute: $3(2) + 1 = 7$. The limit is $7$, and indeed $f(2) = 7$. For continuous functions (a concept we will define precisely in Section 6), the limit equals the function value. Polynomials are continuous everywhere, so direct substitution works.

Example 2: A removable singularity. Consider:

$$\lim_{x\to 1} \frac{x^2 - 1}{x - 1}.$$

At $x = 1$: numerator is $0$, denominator is $0$. The function is not defined at $x = 1$. But for $x \neq 1$, we can factor:

$$\frac{x^2 - 1}{x - 1} = \frac{(x-1)(x+1)}{x-1} = x + 1.$$

Since the limit only considers $x \neq 1$, this cancellation is valid. As $x \to 1$, we get $x + 1 \to 2$. So:

$$\lim_{x\to 1} \frac{x^2 - 1}{x - 1} = 2.$$

The graph of $\frac{x^2-1}{x-1}$ looks exactly like the line $y = x+1$, except with a hole (an open circle) at $(1, 2)$. The limit is $2$ even though the function has no value there.

[Figure: the graph of $y = \frac{x^2-1}{x-1}$ looks like the line $y = x+1$, with a hole at $(1, 2)$.]

Example 3: The sine limit. $\lim_{x\to 0} \frac{\sin x}{x}$.

At $x = 0$: $\frac{0}{0}$, undefined. But numerically:

| $x$ | $\frac{\sin x}{x}$ |
|---|---|
| $1.0$ | $0.84147$ |
| $0.5$ | $0.95885$ |
| $0.1$ | $0.99833$ |
| $0.01$ | $0.99998$ |
| $0.001$ | $0.9999998$ |

The values are converging to $1$. We will prove $\lim_{x\to 0} \frac{\sin x}{x} = 1$ rigorously using the Squeeze Theorem in Section 4. This limit is the backbone of every trigonometric derivative.


Section 2: One-Sided Limits

Sometimes a function approaches different values depending on which direction you come from. This requires separate limits from each side.

Definition. The right-hand limit $\lim_{x\to a^+} f(x) = L$ means: as $x$ approaches $a$ from values greater than $a$, $f(x)$ approaches $L$.

Definition. The left-hand limit $\lim_{x\to a^-} f(x) = L$ means: as $x$ approaches $a$ from values less than $a$, $f(x)$ approaches $L$.

Theorem. The two-sided limit $\lim_{x\to a} f(x)$ exists if and only if both one-sided limits exist and are equal:

$$\lim_{x\to a} f(x) = L \iff \lim_{x\to a^-} f(x) = L \quad \text{and} \quad \lim_{x\to a^+} f(x) = L.$$

Example: The sign function.

$$\text{sgn}(x) = \begin{cases} -1 & x < 0 \\ 0 & x = 0 \\ 1 & x > 0 \end{cases}$$

As $x \to 0^+$ (from the right), $\text{sgn}(x) = 1$ for all nearby positive $x$, so $\lim_{x\to 0^+} \text{sgn}(x) = 1$.

As $x \to 0^-$ (from the left), $\text{sgn}(x) = -1$ for all nearby negative $x$, so $\lim_{x\to 0^-} \text{sgn}(x) = -1$.

The one-sided limits exist but are different. Therefore $\lim_{x\to 0} \text{sgn}(x)$ does not exist.

[Figure: graph of $\text{sgn}(x)$ with $\text{sgn}(0) = 0$. The left limit is $-1$, the right limit is $1$, and there is no two-sided limit at $0$.]

Example: The Heaviside step function.

$$H(x) = \begin{cases} 0 & x < 0 \\ 1 & x \geq 0 \end{cases}$$

Here $\lim_{x\to 0^-} H(x) = 0$ and $\lim_{x\to 0^+} H(x) = 1$. The two-sided limit does not exist. This function appears in signal processing and as an idealization of threshold activations in neural networks.

Example: The floor function. $\lfloor x \rfloor$ (greatest integer $\leq x$). At every integer $n$:

$$\lim_{x\to n^-} \lfloor x \rfloor = n - 1, \qquad \lim_{x\to n^+} \lfloor x \rfloor = n.$$

The two-sided limit does not exist at any integer. Between integers, the function is locally constant and the limit exists trivially.
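One-sided behavior is easy to probe numerically. The sketch below samples a function at points $a \pm 10^{-k}$; the helper name `one_sided_samples` and the choice of sample points are assumptions made for illustration. Sampling gives evidence about a one-sided limit, not a proof.

```python
import math


def one_sided_samples(f, a, side, steps=6):
    """Evaluate f at a + side * 10**-k for k = 1..steps.

    side = +1 approaches a from the right, side = -1 from the left.
    """
    return [f(a + side * 10.0**-k) for k in range(1, steps + 1)]


def sgn(x):
    """Sign function: -1, 0, or 1."""
    return (x > 0) - (x < 0)


print("sgn, left of 0:   ", one_sided_samples(sgn, 0.0, -1))
print("sgn, right of 0:  ", one_sided_samples(sgn, 0.0, +1))
print("floor, left of 3: ", one_sided_samples(math.floor, 3.0, -1))
print("floor, right of 3:", one_sided_samples(math.floor, 3.0, +1))
```

Each list is constant, and the left and right lists disagree, matching the one-sided limits computed above.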


Section 3: When Limits Fail to Exist

There are three fundamentally different ways a limit can fail.

Failure 1: One-Sided Limits Disagree

Covered above. The sign function, the Heaviside function, the floor function at integers. The function has a definite value it is heading toward from each side, but those values differ, so there is no single value being approached.

Failure 2: Oscillation Without Settling

Consider $\lim_{x\to 0} \sin\left(\frac{1}{x}\right)$.

As $x \to 0$, the argument $\frac{1}{x} \to \pm\infty$. The function $\sin$ oscillates between $-1$ and $1$ infinitely rapidly. For any candidate limit $L$, there are points arbitrarily close to $0$ where $\sin(1/x) = 1$ and points where $\sin(1/x) = -1$. No value is being approached.

[Figure: graph of $\sin(1/x)$; the oscillations grow infinitely dense as $x \to 0$.]

This is a genuinely pathological behavior. The function is not just jumping - it is thrashing between extremes, refusing to settle anywhere. There is something almost violent about it: the closer you look at the origin, the more chaotic the behavior becomes. Zoom in any amount, and you still see the full range $[-1, 1]$ packed into what remains. The function visits every value in $[-1, 1]$ infinitely often in every neighborhood of $0$. It is not merely discontinuous - it is discontinuous in the worst possible way.
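To make the oscillation concrete, here is a small sketch that builds two sequences of points shrinking to $0$: one where $\sin(1/x) = 1$ and one where $\sin(1/x) = -1$. The names `peak_point` and `trough_point` are illustrative.

```python
import math


def peak_point(k):
    """x > 0 with 1/x = pi/2 + 2*pi*k, so sin(1/x) = 1."""
    return 1.0 / (math.pi / 2 + 2 * math.pi * k)


def trough_point(k):
    """x > 0 with 1/x = 3*pi/2 + 2*pi*k, so sin(1/x) = -1."""
    return 1.0 / (3 * math.pi / 2 + 2 * math.pi * k)


# Both sequences tend to 0, yet the function values stay at +1 and -1.
for k in [1, 10, 100, 1000]:
    xp, xt = peak_point(k), trough_point(k)
    print(f"k = {k:<5} x = {xp:.2e} -> {math.sin(1 / xp):+.6f}   "
          f"x = {xt:.2e} -> {math.sin(1 / xt):+.6f}")
```

Because both sequences converge to $0$ while the function values stay at $+1$ and $-1$ (up to floating-point rounding), no single $L$ can be within a small tolerance of both.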

Discomfort check. What does it actually mean for a limit to “not exist”? The statement $\lim_{x\to a} f(x) = L$ makes a precise claim: for any tolerance $\varepsilon > 0$, there is a neighborhood of $a$ (with $a$ itself excluded) where $f$ stays within $\varepsilon$ of $L$. When we say the limit does not exist, we mean: there is no value $L$ for which this claim is true. For oscillation, any candidate $L$ fails because no neighborhood of $0$ keeps $\sin(1/x)$ within $\varepsilon < 1$ of any fixed target. For jump discontinuities, any candidate $L$ differs from at least one of the two one-sided limits, so it fails for points on that side. The non-existence of a limit is not a gap in our knowledge - it is a property of the function itself.

Failure 3: Unbounded Growth

Consider $\lim_{x\to 0} \frac{1}{x}$.

From the right: as $x \to 0^+$, $\frac{1}{x} \to +\infty$. We write $\lim_{x\to 0^+} \frac{1}{x} = +\infty$.

From the left: as $x \to 0^-$, $\frac{1}{x} \to -\infty$. We write $\lim_{x\to 0^-} \frac{1}{x} = -\infty$.

Since the one-sided limits are $\pm\infty$ (and are not equal, and are not finite), the two-sided limit does not exist.

A subtle point: writing $\lim_{x\to 0^+} \frac{1}{x} = +\infty$ is a convenient shorthand meaning “the function grows without bound as $x$ approaches $0$ from the right.” It is not saying the limit equals some value called infinity. Infinity is not a real number, and limits are defined to be real numbers (or to not exist). Writing $= +\infty$ is a way of describing the manner of failure.

[Figure: graph of $y = 1/x$ with a vertical asymptote at $x = 0$; $y \to +\infty$ as $x \to 0^+$ and $y \to -\infty$ as $x \to 0^-$.]

Section 4: Limit Laws

Computing limits from the definition every time would be painful. Fortunately, limits interact well with arithmetic.

Theorem (Limit Laws). Suppose $\lim_{x\to a} f(x) = L$ and $\lim_{x\to a} g(x) = M$. Then:

  1. $\lim_{x\to a} [f(x) + g(x)] = L + M$
  2. $\lim_{x\to a} [f(x) - g(x)] = L - M$
  3. $\lim_{x\to a} [c \cdot f(x)] = cL$ for any constant $c$
  4. $\lim_{x\to a} [f(x) \cdot g(x)] = LM$
  5. $\lim_{x\to a} \frac{f(x)}{g(x)} = \frac{L}{M}$, provided $M \neq 0$
  6. $\lim_{x\to a} [f(x)]^n = L^n$ for positive integers $n$

These laws reduce limit computation to algebra. To find $\lim_{x\to 2} (x^2 + 3x - 1)$, observe that the limit of a polynomial at any point equals the polynomial value: $4 + 6 - 1 = 9$.
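As a sketch of how the laws combine in this computation, assuming the two basic facts $\lim_{x\to a} x = a$ and $\lim_{x\to a} c = c$ (which the text uses implicitly), the sum and difference laws split the polynomial, the constant-multiple and power laws handle each term, and the result is plain arithmetic:

$$\lim_{x\to 2}(x^2 + 3x - 1) = \lim_{x\to 2} x^2 + \lim_{x\to 2} 3x - \lim_{x\to 2} 1 = \left(\lim_{x\to 2} x\right)^2 + 3\lim_{x\to 2} x - 1 = 2^2 + 3(2) - 1 = 9.$$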

Composition. If $\lim_{x\to a} g(x) = M$ and $f$ is continuous at $M$ (a concept we define in Section 6), then:

$$\lim_{x\to a} f(g(x)) = f\left(\lim_{x\to a} g(x)\right) = f(M).$$

This lets you pass the limit inside continuous outer functions. For example:

$$\lim_{x\to 0} \sqrt{x^2 + 4} = \sqrt{\lim_{x\to 0}(x^2 + 4)} = \sqrt{4} = 2.$$

The Squeeze Theorem

Sometimes we cannot compute a limit directly, but we can trap the function between two simpler ones that we can evaluate. If both bounds converge to the same value, the trapped function has no escape.

Theorem (Squeeze Theorem). If $g(x) \leq f(x) \leq h(x)$ for all $x$ near $a$ (but not necessarily at $a$), and if:

$$\lim_{x\to a} g(x) = L \quad \text{and} \quad \lim_{x\to a} h(x) = L,$$

then $\lim_{x\to a} f(x) = L$.

The function $f$ is squeezed between $g$ and $h$. If both bounds go to $L$, the squeezed function has no choice but to go to $L$ as well.

[Figure: $f$ is squeezed between $g$ and $h$; all three converge to $L$ at $a$.]

Proof of $\lim_{x\to 0} \frac{\sin x}{x} = 1$.

For $0 < x < \frac{\pi}{2}$, comparing areas of three geometric regions in the unit circle gives the squeeze:

$$\cos x \leq \frac{\sin x}{x} \leq 1.$$

Place $O$ = origin, $A = (1, 0)$, $B = (\cos x, \sin x)$ on the unit circle, and $T = (1, \tan x)$ where the tangent to the circle at $A$ meets the ray $OB$ extended. The three regions nest inside each other: triangle $OAB \subset$ sector $OAB \subset$ triangle $OAT$. Their areas are $\frac{\sin x}{2}$, $\frac{x}{2}$, and $\frac{\tan x}{2}$ respectively, giving:

$$\frac{\sin x}{2} \leq \frac{x}{2} \leq \frac{\tan x}{2}.$$

Dividing through by $\frac{\sin x}{2} > 0$ gives $1 \leq \frac{x}{\sin x} \leq \frac{1}{\cos x}$. Inverting (and flipping direction, since all terms are positive) gives $\cos x \leq \frac{\sin x}{x} \leq 1$. As $x \to 0^+$, $\cos x \to 1$, so by squeeze, $\frac{\sin x}{x} \to 1$.

Since $\frac{\sin x}{x}$ is an even function (both $\sin x$ and $x$ change sign together, so the ratio is unchanged under $x \mapsto -x$), the left-hand limit also equals $1$. Therefore:

$$\lim_{x\to 0} \frac{\sin x}{x} = 1.$$
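The squeeze inequality can be spot-checked numerically. This is only a sanity check under floating-point arithmetic, not a substitute for the geometric argument; `squeeze_bounds` is an illustrative name.

```python
import math


def squeeze_bounds(x):
    """Return (cos x, sin(x)/x, 1) for a nonzero x with |x| < pi/2."""
    return math.cos(x), math.sin(x) / x, 1.0


# The middle value is trapped between the outer two, and the gap closes
# as x shrinks. Negative x works too because sin(x)/x and cos(x) are even.
for x in [1.0, 0.5, 0.1, 0.01, -0.01]:
    lower, middle, upper = squeeze_bounds(x)
    print(f"x = {x:+.2f}: {lower:.8f} <= {middle:.8f} <= {upper:.1f}")
```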


Section 5: The Formal $\varepsilon$-$\delta$ Definition

The informal description - “$f(x)$ gets close to $L$ as $x$ gets close to $a$” - is intuitive but imprecise. How close is “close”? What counts as “getting closer”? The $\varepsilon$-$\delta$ definition is the airtight version.

Definition ($\varepsilon$-$\delta$ limit). We say $\lim_{x\to a} f(x) = L$ if:

$$\text{for every } \varepsilon > 0, \text{ there exists } \delta > 0 \text{ such that } 0 < |x - a| < \delta \implies |f(x) - L| < \varepsilon.$$

Unpacking this piece by piece:

  • “For every $\varepsilon > 0$”: think of this as a challenge. Your adversary picks an arbitrarily tight tolerance $\varepsilon$ - the maximum distance from $L$ they will accept.

  • “There exists $\delta > 0$”: you respond with a neighborhood size $\delta$ around $a$.

  • “$0 < |x-a| < \delta$”: if $x$ is within distance $\delta$ of $a$ (but not equal to $a$ - the $0 <$ part is essential).

  • “$|f(x) - L| < \varepsilon$”: then $f(x)$ is within $\varepsilon$ of $L$.

The key is the quantifier order: $\forall \varepsilon \; \exists \delta$. Your $\delta$ can depend on $\varepsilon$ - tighter tolerances may require tighter neighborhoods. But for every tolerance your adversary names, you must have an answer. This adversarial structure is not a quirk of mathematical style. It converts the vague phrase “gets arbitrarily close” into a concrete logical challenge: if you can always win this game - producing a valid $\delta$ for any $\varepsilon$ your adversary chooses - then the limit is $L$. If there is any $\varepsilon$ for which no $\delta$ works, the limit is not $L$; if this happens for every candidate $L$, the limit does not exist.

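[Figure: $x$ within $\delta$ of $a$ (blue band from $a-\delta$ to $a+\delta$) implies $f(x)$ within $\varepsilon$ of $L$ (yellow band from $L-\varepsilon$ to $L+\varepsilon$).]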

Discomfort check. The $\varepsilon$-$\delta$ definition looks convoluted because it is. Cauchy and Weierstrass designed it to be airtight against counterexamples, and the structure of the quantifiers ($\forall \varepsilon \; \exists \delta$, not $\exists \delta \; \forall \varepsilon$) is precisely what prevents infinitesimal hand-waving. The informal idea - $f(x)$ approaches $L$ as $x$ approaches $a$ - is clean but admits edge cases. What about a function that approaches $L$ at one particular sequence of points but misbehaves elsewhere? The $\varepsilon$-$\delta$ definition closes every loophole by requiring the guarantee to hold for all $x$ within the $\delta$-neighborhood, not just along chosen paths. Use the informal version to build intuition. Use the $\varepsilon$-$\delta$ version to write proofs and verify edge cases.

Worked $\varepsilon$-$\delta$ Proof

Claim: $\lim_{x\to 3} (2x + 1) = 7$.

Proof. Given $\varepsilon > 0$, we need to find $\delta > 0$ such that $0 < |x - 3| < \delta$ implies $|(2x+1) - 7| < \varepsilon$.

Simplify the conclusion:

$$|(2x+1) - 7| = |2x - 6| = 2|x - 3|.$$

So we need $2|x-3| < \varepsilon$, that is, $|x-3| < \varepsilon/2$.

Choose $\delta = \varepsilon/2$.

Verification: If $0 < |x-3| < \delta = \varepsilon/2$, then:

$$|(2x+1) - 7| = 2|x-3| < 2 \cdot \frac{\varepsilon}{2} = \varepsilon. \quad \blacksquare$$

The structure of every $\varepsilon$-$\delta$ proof is: scratch work to find $\delta$ in terms of $\varepsilon$, then a clean verification. The scratch work (the manipulation of $|f(x) - L|$) is how you discover $\delta$. The verification is what you write up.

A Harder $\varepsilon$-$\delta$ Proof

Claim: $\lim_{x\to 2} x^2 = 4$.

Proof. Given $\varepsilon > 0$, need $|x^2 - 4| < \varepsilon$ whenever $0 < |x-2| < \delta$.

Factor: $|x^2 - 4| = |x-2||x+2|$.

The $|x-2|$ factor we control directly. We need to bound $|x+2|$. Assume $\delta \leq 1$ (we will impose this constraint on $\delta$ in the final choice). Then $|x-2| < \delta \leq 1$ implies $1 < x < 3$, so $3 < x+2 < 5$, giving $|x+2| < 5$.

Therefore: $|x^2 - 4| = |x-2||x+2| < 5|x-2|$.

To make this less than $\varepsilon$, we need $|x-2| < \varepsilon/5$.

Choose $\delta = \min(1, \varepsilon/5)$. This ensures both $|x+2| < 5$ and $5|x-2| < \varepsilon$. Then:

$$|x^2 - 4| = |x-2||x+2| < \frac{\varepsilon}{5} \cdot 5 = \varepsilon. \quad \blacksquare$$

The trick of assuming $\delta \leq 1$ (or some other convenient bound) to control potentially unbounded factors is standard. You impose a preliminary constraint on $\delta$, do the algebra, then take the minimum of your constraints.
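Both proofs above produce a recipe for $\delta$ in terms of $\varepsilon$, and a recipe can be stress-tested by sampling. The sketch below is a hypothetical helper (`delta_works` is not a library function) that draws random points in the punctured $\delta$-neighborhood and looks for a counterexample. Passing the check is numerical evidence only; the proof is what establishes the limit.

```python
import random


def delta_works(f, a, L, eps, delta, trials=10_000, seed=0):
    """Sample x with 0 < |x - a| < delta and check |f(x) - L| < eps.

    Returns False as soon as a counterexample is found, True otherwise.
    """
    rng = random.Random(seed)
    for _ in range(trials):
        x = a + rng.uniform(-delta, delta)
        if x == a or abs(x - a) >= delta:
            continue  # the definition excludes x = a and the endpoints
        if abs(f(x) - L) >= eps:
            return False
    return True


for eps in [1.0, 0.1, 0.001]:
    linear_ok = delta_works(lambda x: 2 * x + 1, 3.0, 7.0, eps, eps / 2)
    square_ok = delta_works(lambda x: x * x, 2.0, 4.0, eps, min(1.0, eps / 5))
    print(f"eps = {eps}: linear {linear_ok}, square {square_ok}")

# A delta that is too generous gets caught by the sampling.
print(delta_works(lambda x: x * x, 2.0, 4.0, 0.1, 1.0))
```

The last call uses $\delta = 1$ for $\varepsilon = 0.1$, which ignores the $\varepsilon/5$ constraint, so sampled points far from $2$ violate the tolerance.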


Section 6: Continuity

With limits in hand, continuity is immediate - not a new concept, but a relationship between a function’s limit and its value.

Definition. A function $f$ is continuous at $a$ if:

$$\lim_{x\to a} f(x) = f(a).$$

Three requirements are packed into this one equation:

  1. $f(a)$ is defined (the point $a$ is in the domain of $f$).
  2. $\lim_{x\to a} f(x)$ exists (the limit exists from both sides and they agree).
  3. The limit equals the function value.

Continuity says: no surprises at $a$. The value you approach equals the value you arrive at. Continuous functions are the ones where you can substitute directly - the limit is just function evaluation.

A function is continuous on an interval if it is continuous at every point of that interval. It is continuous everywhere (or simply continuous) if it is continuous at every point in its domain.

Types of Discontinuity

When continuity fails, it fails in one of four ways.

1. Removable discontinuity. The limit exists, but either $f(a)$ is not defined, or $f(a) \neq \lim_{x\to a} f(x)$. The hole in the graph of $\frac{x^2-1}{x-1}$ at $x=1$ is a removable discontinuity - the limit is $2$ but the function value is missing. “Removable” because we can repair the discontinuity by redefining $f(a) = L$. The repaired function is continuous.

[Figure: removable discontinuity at $x = 1$; the limit is $2$ but $f(1)$ is undefined.]

2. Jump discontinuity. Both one-sided limits exist but are unequal. The function jumps at $a$. The sign function at $0$, the floor function at integers, the Heaviside step function. No redefinition of the function value can fix this - the function genuinely approaches two different values depending on direction.

3. Infinite (essential) discontinuity. The function blows up. $\frac{1}{x}$ at $0$: one or both one-sided limits are $\pm\infty$. The discontinuity cannot be removed.

4. Oscillatory (essential) discontinuity. The function oscillates without settling. $\sin(1/x)$ at $0$: the limit does not exist, not because of a jump, but because the function visits every value in $[-1, 1]$ in any neighborhood of $0$. The most pathological type.

[Figure: the four types of discontinuity at $a$: removable, jump, infinite, oscillatory.]

The Algebra of Continuous Functions

If $f$ and $g$ are continuous at $a$, then so are:

  • $f + g$ (by the sum law for limits)
  • $f - g$
  • $c \cdot f$ for any constant $c$
  • $f \cdot g$ (by the product law)
  • $f/g$, provided $g(a) \neq 0$

Composition. If $g$ is continuous at $a$ and $f$ is continuous at $g(a)$, then $f \circ g$ is continuous at $a$.

These rules give us the continuous functions for free:

  • All polynomials are continuous everywhere (by the sum and product rules, starting from the fact that $f(x) = x$ and constant functions are continuous).
  • All rational functions $p(x)/q(x)$ are continuous wherever $q(x) \neq 0$.
  • $\sin x$, $\cos x$, $e^x$, $\ln x$ (on $(0, \infty)$) are all continuous on their domains.
  • Compositions of continuous functions are continuous: $\sin(e^x)$, $e^{-x^2}$, $\ln(\cos x)$ (where defined) are all continuous.

Section 7: Continuity on Intervals - The Big Theorems

Two major theorems about continuous functions on closed bounded intervals. These theorems underpin much of calculus and analysis.

Supremum and infimum. The supremum of a set $S \subseteq \mathbb{R}$, written $\sup S$, is the smallest real number that is $\geq$ every element of $S$ - the least upper bound. If $S = (0, 1)$, then $\sup S = 1$ even though $1 \notin S$. The infimum, written $\inf S$, is the largest real number that is $\leq$ every element of $S$ - the greatest lower bound. A fundamental property of $\mathbb{R}$ (the completeness axiom): every nonempty set that is bounded above has a supremum in $\mathbb{R}$. This is what the proofs below rely on.

Intermediate Value Theorem

Theorem (IVT). If $f$ is continuous on the closed interval $[a, b]$ and $N$ is any value strictly between $f(a)$ and $f(b)$, then there exists some $c \in (a, b)$ with $f(c) = N$.

Informally: a continuous function cannot jump from one value to another without passing through everything in between. You cannot get from below zero to above zero without crossing zero.

[Figure: IVT; for $N$ between $f(a)$ and $f(b)$, some $c$ in $(a, b)$ satisfies $f(c) = N$.]

Proof sketch. Assume $f(a) < N < f(b)$ (the other case is symmetric). Let $S = \{x \in [a,b] : f(x) \leq N\}$. This set is nonempty (it contains $a$) and bounded above (by $b$). By completeness of $\mathbb{R}$, it has a supremum $c = \sup S$. By continuity of $f$ and the properties of the supremum, one can show $f(c) = N$. The argument uses completeness essentially - it fails on $\mathbb{Q}$.

Applications:

Root-finding. To show $x^3 - x = 0.5$ has a solution, let $f(x) = x^3 - x$. Then $f(0) = 0 < 0.5$ and $f(2) = 6 > 0.5$. Since $f$ is continuous (polynomial), IVT guarantees some $c \in (0, 2)$ with $f(c) = 0.5$.

Fixed points. If $f: [0,1] \to [0,1]$ is continuous, then $f$ has a fixed point: some $c$ with $f(c) = c$. Apply IVT to $g(x) = f(x) - x$, noting $g(0) \geq 0$ and $g(1) \leq 0$.

An everyday surprise. At any moment, there exist two antipodal points on Earth’s equator with exactly the same temperature. Proof: assume temperature varies continuously around the equator, and define $g(\theta) = T(\theta + 180°) - T(\theta)$ where $T$ is temperature at angle $\theta$. Then $g(0°) = T(180°) - T(0°)$ and $g(180°) = T(360°) - T(180°) = T(0°) - T(180°) = -g(0°)$. If $g(0°) = 0$, we are done. Otherwise $g$ changes sign over the interval $[0°, 180°]$, so by IVT some $c$ satisfies $g(c) = 0$, meaning $T(c) = T(c + 180°)$.

Bisection search. IVT is the engine behind binary root-finding: if $f(a) < 0 < f(b)$, split the interval in half, check the sign at the midpoint, and repeat. Each step halves the error. After $n$ steps you have located a root to within $(b-a)/2^n$ - exponential convergence from a crude initial bound, guaranteed by IVT at every step.
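A minimal sketch of bisection, applied to the equation $x^3 - x = 0.5$ from the root-finding example (rewritten as $x^3 - x - 0.5 = 0$). The function name `bisect_root` and the tolerance are illustrative choices.

```python
def bisect_root(f, a, b, tol=1e-10):
    """Locate a root of a continuous f on [a, b] by repeated halving.

    Requires f(a) and f(b) to have opposite signs (or one of them to be 0),
    which is the hypothesis the IVT needs. Returns (approximate root, steps).
    """
    fa, fb = f(a), f(b)
    if fa == 0:
        return a, 0
    if fb == 0:
        return b, 0
    if fa * fb > 0:
        raise ValueError("f(a) and f(b) must have opposite signs")
    steps = 0
    while b - a > tol:
        m = (a + b) / 2
        fm = f(m)
        if fm == 0:
            return m, steps
        if fa * fm < 0:
            b, fb = m, fm  # sign change is in the left half
        else:
            a, fa = m, fm  # sign change is in the right half
        steps += 1
    return (a + b) / 2, steps


root, steps = bisect_root(lambda x: x**3 - x - 0.5, 0.0, 2.0)
print(f"root ~ {root:.8f} after {steps} halvings")
```

Each pass keeps the half-interval on which the sign change survives, so the IVT hypothesis holds at every step and the bracket width after $n$ steps is $(b-a)/2^n$.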

In ML. A neural network with continuous activations computes a continuous function of its input. By the IVT, if a scalar output takes the values $y_0$ and $y_1$ at two inputs, it takes every value between $y_0$ and $y_1$ somewhere along the line segment joining those inputs.

Extreme Value Theorem

Theorem (EVT). If $f$ is continuous on the closed bounded interval $[a, b]$, then $f$ attains its maximum and minimum values. That is, there exist points $c, d \in [a, b]$ such that $f(c) \leq f(x) \leq f(d)$ for all $x \in [a,b]$.

The theorem needs continuity together with a domain that is closed and bounded (compact). Drop any one of these three conditions and the conclusion can fail.

  • Without continuity: define $f(x) = 1/x$ for $x \in (0,1]$ and $f(0) = 0$. The domain $[0,1]$ is closed and bounded, but $f$ is discontinuous at $0$ and has no maximum - the function escapes to $+\infty$ near $0$.
  • Without boundedness: $f(x) = x$ on $[0, \infty)$ has no maximum - there is always a larger value further out.
  • Without closedness: $f(x) = x$ on $(0, 1)$ attains neither its supremum ($1$) nor infimum ($0$) - both values are approached but the endpoints are missing from the domain.

Each condition is doing real work. Remove any one and the function finds a way to escape.

[Figure: two panels. Closed interval $[a, b]$: maximum attained. Open interval $(a, b)$: supremum never reached. EVT requires a closed, bounded interval.]

Proof sketch. The interval $[a,b]$ is compact, and the continuous image of a compact set is compact (this is where sequential compactness is used). So the image $f([a,b])$ is a nonempty closed bounded subset of $\mathbb{R}$, and such a set contains its supremum and infimum.

Why this matters. The EVT is why we can optimize over compact domains. In machine learning, when we restrict parameters to lie in a closed bounded set, we are guaranteed the loss function attains its minimum somewhere in that set - assuming continuity. Without compactness, a minimum might only be approached asymptotically, never reached.


Section 8: Uniform Continuity

Think of ordinary continuity as a promise made independently at every point: “near here, I won’t change too fast.” Uniform continuity is that same promise, but made once and honored everywhere: “no matter where you are on my domain, I won’t change too fast.” The difference is whether the precision guarantee depends on location.

Standard continuity is a local concept: $f$ is continuous at $a$ if, for each $\varepsilon > 0$, there exists a $\delta > 0$ (depending on both $\varepsilon$ and $a$) such that nearby points give nearby values. Uniform continuity strengthens this by requiring the same $\delta$ to work everywhere.

Definition. $f$ is uniformly continuous on a set $D$ if for every $\varepsilon > 0$ there exists $\delta > 0$ such that for all $x, y \in D$:

$$|x - y| < \delta \implies |f(x) - f(y)| < \varepsilon.$$

The crucial difference: $\delta$ depends only on $\varepsilon$, not on the particular point $x$.

Example: uniform vs. pointwise. $f(x) = x^2$ on all of $\mathbb{R}$ is continuous but not uniformly continuous. Near large values of $x$, a small change in $x$ causes a large change in $x^2$ (the slope is $2x$, which is unbounded). To keep $|f(x) - f(y)| < \varepsilon$, you need $\delta$ to shrink as $x$ grows. No single $\delta$ works everywhere.

On the bounded interval $[-10, 10]$, however, $f(x) = x^2$ is uniformly continuous. The slope is bounded by $20$, so $\delta = \varepsilon/20$ works everywhere on the interval.
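The shrinking $\delta$ can be computed explicitly for $f(x) = x^2$ on all of $\mathbb{R}$. For $x \geq 0$ the worst case is $y = x + \delta$, and solving $(x+\delta)^2 - x^2 = \varepsilon$ gives $\delta = \sqrt{x^2 + \varepsilon} - x$. The sketch below evaluates that formula; `largest_delta` is an illustrative name, and the formula is derived here rather than taken from the text.

```python
import math


def largest_delta(x, eps):
    """Largest delta with |y - x| < delta  =>  |y**2 - x**2| < eps, for x >= 0.

    Equal to sqrt(x**2 + eps) - x, written in an algebraically equivalent
    form that avoids cancellation when x is large.
    """
    return eps / (math.sqrt(x * x + eps) + x)


eps = 0.1
for x in [0, 1, 10, 100, 1000]:
    print(f"x = {x:<5} largest delta = {largest_delta(x, eps):.3e}")
```

The printed values decrease toward $0$ as $x$ grows, which is why no single $\delta$ serves every point of $\mathbb{R}$. On $[-10, 10]$ the points $x$ and $y$ are both confined to the interval, so $|x + y| \leq 20$ and the fixed choice $\delta = \varepsilon/20$ suffices.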

Theorem (Heine-Cantor). Every function that is continuous on a closed bounded interval $[a,b]$ is uniformly continuous on $[a,b]$.

This is a non-trivial result - it uses compactness of $[a,b]$. The proof: if $f$ were not uniformly continuous, we could find sequences $x_n, y_n$ with $|x_n - y_n| \to 0$ but $|f(x_n) - f(y_n)| \geq \varepsilon_0 > 0$. Compactness lets us extract a convergent subsequence; continuity gives the contradiction.

A stronger condition than uniform continuity is Lipschitz continuity. A function $f$ is Lipschitz continuous with constant $K$ on a set $D$ if for all $x, y \in D$:

$$|f(x) - f(y)| \leq K|x - y|.$$

This says the function cannot change faster than $K$ times the input change - the constant $K$ is a global bound on the slope. Every Lipschitz function is uniformly continuous (choose $\delta = \varepsilon/K$), but not conversely: $f(x) = \sqrt{x}$ on $[0,1]$ is uniformly continuous but not Lipschitz at $0$ since its slope grows without bound there.
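A quick way to see the failure of the Lipschitz condition for $\sqrt{x}$ is to compute secant slopes $|f(x) - f(y)| / |x - y|$ with one endpoint at $0$. The helper `secant_slope` below is a hypothetical name for this ratio.

```python
import math


def secant_slope(f, x, y):
    """Absolute slope of the secant line through (x, f(x)) and (y, f(y))."""
    return abs(f(x) - f(y)) / abs(x - y)


# For sqrt, the slope from 0 to x equals 1/sqrt(x) and is unbounded.
for x in [1e-2, 1e-4, 1e-8]:
    print(f"sqrt: slope between 0 and {x:.0e} is {secant_slope(math.sqrt, x, 0.0):.1f}")

# For x**2 on [-10, 10], secant slopes never exceed 20.
print("square:", secant_slope(lambda t: t * t, 10.0, 9.0))
```

No constant $K$ can bound the first family of slopes, while $K = 20$ bounds every secant slope of $x^2$ on $[-10, 10]$.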

Why uniform continuity matters:

  • Numerical analysis. Integration algorithms work by approximating $f$ with step functions. Uniform continuity guarantees the approximation is uniformly good, not just good at particular points.
  • Function approximation. Continuous functions on a closed bounded interval (which are uniformly continuous by Heine-Cantor) can be approximated uniformly by polynomials (Weierstrass approximation theorem).
  • Differential equations. Existence theorems for ODEs often require Lipschitz continuity.

Section 9: Why This Matters in Machine Learning

The language of limits and continuity is not abstract preparation for some future application. It is the direct foundation of how learning algorithms work.

Gradient-Based Optimization Requires Continuity

The canonical training loop in supervised learning minimizes a loss function $\mathcal{L}(\theta)$ by gradient descent:

$$\theta_{t+1} = \theta_t - \eta \nabla_\theta \mathcal{L}(\theta_t).$$

For the gradient $\nabla_\theta \mathcal{L}$ to exist, $\mathcal{L}$ must be differentiable, and a differentiable function is necessarily continuous: if the limit defining the derivative exists at a point, then $\mathcal{L}(\theta + h) - \mathcal{L}(\theta) \to 0$ as $h \to 0$, which is exactly continuity at that point. This is not circular - it is a chain of necessity.

A discontinuous loss function does not have a well-defined gradient at its discontinuities. If the optimizer steps into a region of discontinuity, the gradient signal becomes meaningless.
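The update rule is short enough to write out in full. This is a minimal sketch on a made-up one-parameter loss $\mathcal{L}(\theta) = (\theta - 3)^2$; the loss, the learning rate $\eta = 0.1$, and the step count are all assumptions chosen for illustration.

```python
def loss(theta):
    """Hypothetical smooth loss with its minimum at theta = 3."""
    return (theta - 3.0) ** 2


def grad(theta):
    """Derivative of the loss above."""
    return 2.0 * (theta - 3.0)


def gradient_descent(theta0, eta=0.1, steps=100):
    """Apply theta <- theta - eta * grad(theta) a fixed number of times."""
    theta = theta0
    for _ in range(steps):
        theta = theta - eta * grad(theta)
    return theta


theta_final = gradient_descent(0.0)
print(f"theta = {theta_final:.8f}, loss = {loss(theta_final):.2e}")
```

For this loss each step multiplies the distance to the minimizer by $1 - 2\eta = 0.8$, so the iterates converge geometrically. That clean behavior depends on the gradient being defined and informative at every iterate.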

ReLU: Continuous But Not Differentiable

The Rectified Linear Unit is:

$$\text{ReLU}(x) = \max(0, x) = \begin{cases} 0 & x \leq 0 \\ x & x > 0 \end{cases}$$

Is ReLU continuous? Check the three conditions at $x = 0$:

  1. $\text{ReLU}(0) = 0$ is defined.
  2. $\lim_{x\to 0^-} \text{ReLU}(x) = 0$ and $\lim_{x\to 0^+} \text{ReLU}(x) = 0$.
  3. Both equal $\text{ReLU}(0) = 0$.

Yes - ReLU is continuous everywhere.

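Is ReLU differentiable at $0$? The left-hand derivative is $\lim_{h\to 0^-} \frac{\text{ReLU}(h) - \text{ReLU}(0)}{h} = \lim_{h\to 0^-} \frac{0 - 0}{h} = 0$ and the right-hand derivative is $\lim_{h\to 0^+} \frac{\text{ReLU}(h) - \text{ReLU}(0)}{h} = \lim_{h\to 0^+} \frac{h - 0}{h} = 1$. They disagree, so ReLU is not differentiable at $0$.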
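The two one-sided difference quotients can be evaluated directly. A small sketch, with `relu` and `difference_quotient` as illustrative names:

```python
def relu(x):
    """Rectified linear unit."""
    return max(0.0, x)


def difference_quotient(f, x, h):
    """Slope of f between x and x + h (h may be negative, but not zero)."""
    return (f(x + h) - f(x)) / h


for h in [1e-1, 1e-3, 1e-6]:
    left = difference_quotient(relu, 0.0, -h)
    right = difference_quotient(relu, 0.0, h)
    print(f"h = {h:.0e}: from the left {left}, from the right {right}")
```

The quotient from the left is zero for every $h$ (Python may display it as `-0.0`), and the quotient from the right is one, so the two-sided limit of the difference quotient does not exist at $0$.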

In practice, we assign the derivative at $0$ to be either $0$ or $1$ (any value in $[0, 1]$ is a valid subgradient). This works because the set $\{0\}$ has measure zero - the probability that a floating-point neuron hits exactly $0$ is essentially nil, so the undefined derivative almost never affects training. ReLU’s success shows that differentiability everywhere is not actually required for gradient descent to work; continuity plus almost-everywhere differentiability is enough.

Lipschitz Continuity and Training Stability

Recall from Section 8: a function is Lipschitz continuous with constant $K$ if $|f(x) - f(y)| \leq K|x - y|$ for all $x, y$ in its domain. The constant $K$ is a global bound on the rate of change - the function cannot change faster than $K$ times the input change. In ML, this bound has several concrete consequences:

  • Gradient explosion. The Jacobian of a layer is the matrix of partial derivatives of its outputs with respect to its inputs. Its singular values measure how much the matrix stretches input vectors: if you feed every unit-length input vector through the matrix, the largest and smallest singular values are the lengths of the longest and shortest output vectors. If the largest singular value exceeds $1$, the layer can amplify signals, and repeated multiplication during backpropagation can make gradients grow exponentially with depth. Spectral normalization (dividing each weight matrix by its largest singular value) caps stretching at $1$, enforcing a Lipschitz bound of $1$ per layer.

  • Generalization. A network with a small Lipschitz constant is stable under input perturbations: two inputs that are close produce outputs that are close. This stability is often associated with better generalization.

  • Wasserstein GANs. The Wasserstein-1 distance (also called earth mover’s distance) between two probability distributions measures the minimum cost of transporting mass from one distribution to the other - intuitively, how much “work” it takes to reshape one distribution into the other. A key mathematical result (Kantorovich-Rubinstein duality) shows this distance can be computed by maximizing the expected difference $\mathbb{E}_p[f] - \mathbb{E}_q[f]$ over all functions $f$ that are $1$-Lipschitz. This is why the discriminator in a Wasserstein GAN must be $1$-Lipschitz: the discriminator is the function being maximized, and the Lipschitz constraint is baked into the definition of the distance itself. In practice the constraint is enforced only approximately, for example by weight clipping (the original WGAN) or by a gradient penalty (WGAN-GP).

  • Bounded gradients give a Lipschitz bound. If $f$ is differentiable with bounded gradient on a convex domain, then $|f(x) - f(y)| \leq \sup|\nabla f| \cdot |x - y|$ (by the mean value theorem), so $K = \sup|\nabla f|$ is a Lipschitz constant.

Sigmoid Has Well-Defined Limits at Infinity

The sigmoid function:

$$\sigma(x) = \frac{1}{1 + e^{-x}}.$$

As $x \to +\infty$, $e^{-x} \to 0$, so $\sigma(x) \to \frac{1}{1+0} = 1$.

As $x \to -\infty$, $e^{-x} \to +\infty$, so the denominator grows without bound and $\sigma(x) \to 0$.

These are horizontal asymptotes. Sigmoid is continuous everywhere, differentiable everywhere, and bounded between $0$ and $1$ - all desirable properties for an output activation in binary classification.

The derivative of sigmoid is $\sigma'(x) = \sigma(x)(1 - \sigma(x))$. As $x \to +\infty$, $\sigma(x) \to 1$, and as $x \to -\infty$, $\sigma(x) \to 0$; in both cases one of the two factors vanishes, so $\sigma'(x) \to 0$. This is the mathematical source of the vanishing gradient problem: in deep networks, gradients are multiplied across layers. If a layer’s activations are saturated (inputs far from $0$, where sigmoid is close to its limits), its derivative is near zero, and that near-zero factor drives the gradient of earlier layers toward zero as well. The limits at infinity tell you exactly where and why this happens.
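To put numbers on the saturation argument, here is a small sketch. It deliberately ignores weight matrices and tracks only the product of sigmoid derivatives along a chain of layers; the function `chained_gradient` and the chosen pre-activation values are illustrative assumptions, not part of any real network.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    """sigma'(x) = sigma(x) * (1 - sigma(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

def chained_gradient(pre_activations):
    """Product of sigmoid derivatives along a chain of layers (weights ignored)."""
    g = 1.0
    for z in pre_activations:
        g *= sigmoid_grad(z)
    return g

peak = sigmoid_grad(0.0)                            # 0.25, the largest value sigma' takes
saturated = sigmoid_grad(10.0)                      # roughly 4.5e-5
unsaturated_chain = chained_gradient([0.0] * 10)    # 0.25 ** 10, roughly 9.5e-7
saturated_chain = chained_gradient([10.0] * 10)     # many orders of magnitude smaller
```

Even in the best case the factor per layer is at most $0.25$, so ten layers already shrink the gradient by about a factor of a million; saturation makes it far worse.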

The softmax function generalizes sigmoid to multiple classes. Its continuity everywhere is what guarantees that small changes in logits produce small changes in probabilities - a crucial stability property.

Discontinuous Activations Cause Problems

Hard threshold activations - $f(x) = \mathbf{1}[x > 0]$ - are discontinuous at $0$. The derivative is $0$ everywhere except at $0$, where it does not exist. Gradient descent provides no signal: everywhere the gradient is $0$ (or undefined), and the optimizer has no direction to move. This is why the perceptron learning rule required a separate update mechanism, and why smooth or piecewise-linear activations dominated the deep learning revolution.
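A quick numerical check of this claim, using central differences as a stand-in for the derivative. The sample points and step sizes are arbitrary choices for illustration.

```python
import math

def step(x):
    """Hard threshold activation: 1 if x > 0, else 0."""
    return 1.0 if x > 0 else 0.0

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def central_difference(f, x, h=1e-4):
    """Symmetric difference quotient (f(x + h) - f(x - h)) / (2h)."""
    return (f(x + h) - f(x - h)) / (2.0 * h)

points = [-2.0, -0.5, 0.5, 2.0]
step_slopes = [central_difference(step, x) for x in points]        # all exactly 0.0
sigmoid_slopes = [central_difference(sigmoid, x) for x in points]  # all positive

# At the jump the quotient equals 1 / (2h): it grows as h shrinks, so no limit exists.
jump_quotients = [central_difference(step, 0.0, h) for h in (1e-1, 1e-2, 1e-3)]
```

Away from $0$ the step function gives the optimizer nothing to work with, while sigmoid always returns a usable (if sometimes tiny) slope.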


Section 10: Limits at Infinity

We have discussed limits as $x \to a$ for finite $a$. Limits as $x \to \pm\infty$ describe asymptotic behavior - what a function is heading toward in the long run, once finite effects die out. This matters in ML: sigmoid and tanh saturate because they approach their limits as inputs grow large, and saturated neurons produce near-zero gradients, which is the mechanism behind the vanishing gradient problem.

Definition. $\lim_{x\to\infty} f(x) = L$ means: for every $\varepsilon > 0$, there exists $M > 0$ such that $x > M \implies |f(x) - L| < \varepsilon$.

Same game: your adversary names a tolerance $\varepsilon$, you respond with a threshold $M$ beyond which $f$ stays within $\varepsilon$ of $L$.

Horizontal asymptotes. If $\lim_{x\to\infty} f(x) = L$, then $y = L$ is a horizontal asymptote of $f$.

Example. $\lim_{x\to\infty} \frac{x^2 + 1}{x^2 + 2}$. Divide numerator and denominator by $x^2$:

$$\frac{x^2+1}{x^2+2} = \frac{1 + 1/x^2}{1 + 2/x^2} \to \frac{1 + 0}{1 + 0} = 1.$$
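The $\varepsilon$-$M$ game can be played explicitly for this example. Since $|f(x) - 1| = \frac{1}{x^2+2}$, the error drops below $\varepsilon$ as soon as $x^2 > 1/\varepsilon - 2$. The sketch below turns that into a threshold; `threshold_for` and `respects_tolerance` are hypothetical helper names, and the spot-check only samples a few points rather than proving anything.

```python
import math

def f(x):
    return (x * x + 1.0) / (x * x + 2.0)

def threshold_for(eps):
    """Return an M > 0 such that x > M implies |f(x) - 1| < eps.

    Uses |f(x) - 1| = 1 / (x^2 + 2), which is below eps once x^2 > 1/eps - 2.
    """
    m_squared = 1.0 / eps - 2.0
    if m_squared <= 0.0:
        return 1.0  # eps >= 1/2: any positive M works, since 1 / (x^2 + 2) < 1/2
    return math.sqrt(m_squared)

def respects_tolerance(eps, offsets=(0.01, 1.0, 100.0)):
    """Spot-check the definition at a few points beyond the threshold."""
    M = threshold_for(eps)
    return all(abs(f(M + d) - 1.0) < eps for d in offsets)

M_for_one_percent = threshold_for(0.01)   # sqrt(98), a little under 10
```

Smaller tolerances force larger thresholds, but a threshold always exists, which is exactly what the definition demands.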

Example. $\lim_{x\to\infty} \frac{3x^3 - 2x}{5x^3 + 1}$. The dominant terms are the highest-degree terms, so divide numerator and denominator by $x^3$:

$$\frac{3x^3 - 2x}{5x^3 + 1} = \frac{3 - 2/x^2}{5 + 1/x^3} \to \frac{3}{5}.$$

General rule: for rational functions, the limit at $\infty$ is the ratio of leading coefficients if the degrees match, $0$ if the numerator has lower degree, and $\pm\infty$ if the numerator has higher degree.
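The general rule is mechanical enough to write down as a function. In this sketch a polynomial is represented by its coefficient list from highest degree to lowest, which is a convention chosen here for illustration; the function name is hypothetical.

```python
import math

def rational_limit_at_infinity(num, den):
    """Limit as x -> +infinity of p(x) / q(x).

    num and den list coefficients from highest degree to lowest, with
    nonzero leading entries, e.g. 3x^3 - 2x is [3, 0, -2, 0].
    """
    deg_num, deg_den = len(num) - 1, len(den) - 1
    if deg_num == deg_den:
        return num[0] / den[0]          # ratio of leading coefficients
    if deg_num < deg_den:
        return 0.0                      # denominator dominates
    # Numerator dominates: the sign follows the ratio of leading coefficients.
    return math.inf if num[0] / den[0] > 0 else -math.inf

first_example = rational_limit_at_infinity([1, 0, 1], [1, 0, 2])          # (x^2+1)/(x^2+2)
second_example = rational_limit_at_infinity([3, 0, -2, 0], [5, 0, 0, 1])  # (3x^3-2x)/(5x^3+1)
```

The two calls reproduce the worked examples above, returning $1$ and $\frac{3}{5}$ respectively.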

Vertical asymptotes. If $\lim_{x\to a^+} f(x) = \pm\infty$ or $\lim_{x\to a^-} f(x) = \pm\infty$, then $x = a$ is a vertical asymptote.


Section 11: The Rigorous Underpinning

This section consolidates the formal definitions and major proof sketches, connecting the intuitive arguments above to rigorous real analysis.

Formal Definition of Continuity on a Domain

$f: D \to \mathbb{R}$ is continuous on $D$ if $f$ is continuous at every point $a \in D$. In terms of sequences:

Theorem (Sequential characterization). $f$ is continuous at $a$ if and only if for every sequence $(x_n)$ in the domain with $x_n \to a$, we have $f(x_n) \to f(a)$.

This characterization is often easier to use in abstract settings. It converts limit problems into sequence problems, where the arsenal of sequence convergence theorems applies.
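The sequential characterization is also a practical way to expose a discontinuity: find one sequence $x_n \to a$ whose images do not converge to $f(a)$. The sketch below uses the sequence $x_n = 1/n$ at $a = 0$ as an illustrative choice. Note the asymmetry: one bad sequence disproves continuity, but checking a few good sequences never proves it.

```python
def step(x):
    """Hard threshold: 1 if x > 0, else 0."""
    return 1.0 if x > 0 else 0.0

def square(x):
    return x * x

a = 0.0
xs = [a + 1.0 / n for n in range(1, 10001)]   # x_n = 1/n converges to a from the right

# Along this sequence step(x_n) is always 1, but step(a) is 0.
step_gap = abs(step(xs[-1]) - step(a))

# Along the same sequence square(x_n) heads to square(a) = 0.
square_gap = abs(square(xs[-1]) - square(a))
```

Here `step_gap` stays at $1$ no matter how far along the sequence you go, so the step function fails the sequential test at $0$, while `square_gap` shrinks toward $0$.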

The Topological Characterization

There is a more elegant formulation that generalizes to arbitrary topological spaces.

Theorem. $f: \mathbb{R} \to \mathbb{R}$ is continuous if and only if for every open set $V \subseteq \mathbb{R}$, the preimage $f^{-1}(V) = \{x : f(x) \in V\}$ is open.

This is not just a restatement - it is a different way of seeing continuity. The $\varepsilon$-$\delta$ definition talks about distances. The topological definition talks about open sets. They are equivalent in metric spaces, but the topological version works in spaces where there is no natural notion of distance - for example, function spaces, or spaces of probability distributions.

In ML: the space of neural network functions with the topology of uniform convergence has precisely this flavor. Continuity of functionals (like the loss function viewed as a function of the network parameters) is what ensures that small perturbations to parameters produce small changes in the loss.

Why Completeness of $\mathbb{R}$ Is Essential

Both the IVT and EVT use completeness of the real numbers. The IVT proof takes a supremum (which exists in $\mathbb{R}$ because of completeness). The EVT proof uses sequential compactness of $[a,b]$ (which holds in $\mathbb{R}$ because bounded sequences have convergent subsequences - the Bolzano-Weierstrass theorem, which is equivalent to completeness).

On the rationals $\mathbb{Q}$, both theorems fail. Consider $f(x) = x^2 - 2$. On $\mathbb{Q}$: $f(1) = -1 < 0$ and $f(2) = 2 > 0$. By a hypothetical IVT, there would be $c \in (1,2)$ with $f(c) = 0$. But $c = \sqrt{2}$ is irrational - it does not exist in $\mathbb{Q}$. Completeness is what fills the gaps. The real numbers were essentially constructed in the 19th century to fill exactly these holes: every equation that geometry demands a solution for (the diagonal of a unit square, the circumference of a unit circle) has a real number answer, even if no fraction can express it.
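The gap in $\mathbb{Q}$ can be seen computationally. The sketch below runs bisection on $f(x) = x^2 - 2$ using Python's exact `fractions.Fraction` arithmetic, so every point it visits is a genuine rational number. The function name `rational_bisection` and the step count are choices made for this illustration.

```python
from fractions import Fraction

def f(x):
    return x * x - 2

def rational_bisection(lo, hi, steps):
    """Bisect [lo, hi] in exact rational arithmetic, assuming f(lo) < 0 < f(hi)."""
    for _ in range(steps):
        mid = (lo + hi) / 2
        value = f(mid)
        if value == 0:
            return mid, mid      # would mean a rational root exists
        if value < 0:
            lo = mid
        else:
            hi = mid
    return lo, hi

lo, hi = rational_bisection(Fraction(1), Fraction(2), 40)
width = hi - lo                  # exactly 1 / 2**40
```

The bracket shrinks to width $2^{-40}$ with $f(\text{lo}) < 0 < f(\text{hi})$ throughout, but the early return never fires: the endpoints close in on a point that is simply not there in $\mathbb{Q}$.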

Lipschitz and Differentiability

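Continuous differentiability implies local Lipschitz. If $f$ is continuously differentiable near $a$, then for any $\varepsilon > 0$, $f$ is Lipschitz on a small enough neighborhood of $a$ with constant $|f'(a)| + \varepsilon$: continuity of $f'$ keeps $|f'| \leq |f'(a)| + \varepsilon$ near $a$, and the mean value theorem turns that derivative bound into a Lipschitz bound. Differentiability at the single point $a$ is not enough: $f(x) = x^2 \sin(1/x^2)$ (with $f(0) = 0$) is differentiable everywhere, yet its derivative is unbounded near $0$, so it is not Lipschitz on any neighborhood of $0$.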

Lipschitz implies uniform continuity. If $|f(x) - f(y)| \leq K|x-y|$ for all $x, y$ in a domain, then $f$ is uniformly continuous on that domain (choose $\delta = \varepsilon/K$).

Uniform continuity implies continuity. Immediate (the $\delta$ from uniform continuity works as the $\delta$ for each individual point).

So the hierarchy is: Lipschitz $\Rightarrow$ uniform continuity $\Rightarrow$ continuity. None of these implications reverses in general.

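The Cantor function (devil’s staircase) is uniformly continuous but not Lipschitz. $f(x) = \sqrt{x}$ on $[0,1]$ is uniformly continuous but not Lipschitz, because the slope is unbounded near $0$. For the second implication, $f(x) = x^2$ on $\mathbb{R}$ is continuous but not uniformly continuous: a fixed gap $\delta$ between inputs produces arbitrarily large gaps between outputs as $x$ grows.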
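The failure of the Lipschitz condition for $\sqrt{x}$ shows up directly in its difference quotients at $0$: $\frac{|\sqrt{x} - \sqrt{0}|}{|x - 0|} = \frac{1}{\sqrt{x}}$, which no constant $K$ can bound. A short sketch, with $x^2$ on $[0,1]$ as a Lipschitz contrast (the sample points are arbitrary):

```python
import math

def slope_from_zero(f, x):
    """Difference quotient |f(x) - f(0)| / |x - 0|."""
    return abs(f(x) - f(0.0)) / abs(x)

xs = [10.0 ** -k for k in range(1, 7)]   # 0.1, 0.01, ..., 1e-6

# For sqrt the quotient equals 1 / sqrt(x): about 3.16, 10, 31.6, 100, 316, 1000.
sqrt_slopes = [slope_from_zero(math.sqrt, x) for x in xs]

# For x^2 the quotient equals x itself, which stays bounded on [0, 1].
square_slopes = [slope_from_zero(lambda t: t * t, x) for x in xs]
```

Any candidate Lipschitz constant $K$ is eventually beaten by `sqrt_slopes` once $x < 1/K^2$.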

Proof of the Intermediate Value Theorem

Theorem. If $f:[a,b]\to\mathbb{R}$ is continuous and $f(a) < N < f(b)$, then there exists $c \in (a,b)$ with $f(c) = N$.

Proof. Let $S = \{x \in [a,b] : f(x) \leq N\}$. Then $S$ is nonempty ($a \in S$) and bounded above by $b$. Let $c = \sup S$.

We claim $f(c) = N$. By continuity of $f$ at $c$, given $\varepsilon > 0$, there exists $\delta > 0$ with $|x-c| < \delta \implies |f(x) - f(c)| < \varepsilon$.

Since $c = \sup S$, for any $\delta > 0$ there exists $x \in S$ with $c - \delta < x \leq c$. For such $x$: $f(x) \leq N$. Taking $x \to c$, by continuity $f(c) \leq N$.

Next, $c < b$: we have just shown $f(c) \leq N$, while $f(b) > N$, so $c \neq b$. Because $c$ is an upper bound of $S$, every $x \in (c, b]$ lies outside $S$, which means $f(x) > N$. Taking $x \to c^+$, by continuity $f(c) \geq N$.

Therefore $f(c) = N$. $\blacksquare$
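The proof above is non-constructive: it names $c$ as a supremum without saying how to find it. Bisection is the usual computational companion. It is a different argument from the supremum proof (for a non-monotone $f$ it may locate a different crossing than $\sup S$), but it relies on the same hypotheses. The function name, tolerance, and test function below are assumptions made for this sketch.

```python
def bisect_for_value(f, a, b, target, tol=1e-12):
    """Find c in [a, b] with f(c) close to target.

    Assumes f is continuous on [a, b] and f(a) < target < f(b).
    Maintains the invariant f(lo) <= target < f(hi).
    """
    lo, hi = a, b
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if f(mid) <= target:
            lo = mid      # mid satisfies f(mid) <= target, like the points of S
        else:
            hi = mid      # mid satisfies f(mid) > target
    return (lo + hi) / 2.0

def cubic(x):
    return x ** 3 + x

# cubic(0) = 0 < 5 < 10 = cubic(2), so the IVT guarantees a solution in (0, 2).
c = bisect_for_value(cubic, 0.0, 2.0, 5.0)
```

Continuity is what makes the shrinking bracket meaningful: without it, the invariant $f(\text{lo}) \leq N < f(\text{hi})$ could hold forever with no point in between where $f$ equals $N$.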


Summary

| Concept | Informal meaning | Formal condition |
|---|---|---|
| $\lim_{x\to a} f(x) = L$ | $f(x)$ approaches $L$ near $a$ | $\forall\varepsilon>0, \exists\delta>0: 0<\lvert x-a\rvert<\delta \Rightarrow \lvert f(x)-L\rvert<\varepsilon$ |
| Continuity at $a$ | No jumps, holes, or blowups | $\lim_{x\to a} f(x) = f(a)$ |
| Removable discontinuity | Hole in the graph | Limit exists, $f(a)$ is wrong or missing |
| Jump discontinuity | The function jumps | One-sided limits disagree |
| Infinite discontinuity | Function blows up | One-sided limit is $\pm\infty$ |
| IVT | Cannot skip values | Continuous on $[a,b]$ implies hits every intermediate value |
| EVT | Maximum is attained | Continuous on $[a,b]$ implies attains max and min |
| Uniform continuity | One $\delta$ for all points | $\delta$ depends only on $\varepsilon$, not on $x$ |
| Lipschitz | Bounded rate of change | $\lvert f(x)-f(y)\rvert \leq K\lvert x-y\rvert$ |

The limit is the operative concept in all of calculus. Derivatives are limits of difference quotients. Integrals are limits of Riemann sums. Series are limits of partial sums. Continuity is a relationship between a limit and a value. Everything downstream in analysis, geometry, and applied mathematics traces back to the $\varepsilon$-$\delta$ machine built here.

