Universal Approximation - Why Neural Networks Can Fit Anything
Helpful context:
- Neural Networks & Perceptrons - Function Approximation, Layer by Layer
- Backpropagation - The Circuit That Runs Gradients Backward
Here is a provocative question: can a neural network represent any function? At first glance this sounds trivially true. You could imagine a network that just memorizes every input-output pair in a lookup table. But the real question is subtler: can a compact neural network with a fixed, manageable number of parameters represent smooth functions well, to arbitrary precision, even on inputs it has never seen? This is the question that the universal approximation theorems actually answer, and the answer has meaningful caveats that explain a great deal about why neural networks succeed and fail in practice.
Think about what you want from a model. You have some unknown function $f: \mathbb{R}^d \to \mathbb{R}$, say the relationship between a patient’s blood markers and disease severity. You want to approximate $f$ using a neural network. The approximation question has two parts: the expressibility question (is the function class of neural networks rich enough to contain a good approximation to $f$?) and the learnability question (will gradient descent, on finite data, find that approximation?). Universal approximation theorems address expressibility only. They say nothing about learnability, and conflating the two is one of the most common misunderstandings in the field.
The theorems also reveal a deep tension between width and depth: a single wide hidden layer can approximate anything, but a deep network with the same total number of parameters can represent exponentially more functions. Understanding why gives you real intuition about why modern deep networks are designed the way they are.
The Cybenko-Hornik Theorem
The classical result is due to Cybenko (1989) for sigmoid activations and Hornik, Stinchcombe, and White (1991) for the general case.
Theorem. Let $\sigma$ be any continuous, bounded, non-constant function (e.g., sigmoid, tanh). Let $K \subset \mathbb{R}^d$ be any compact (closed and bounded) set. Then for any continuous function $f: K \to \mathbb{R}$ and any $\epsilon > 0$, there exists a two-layer neural network $\hat{f}$ with a finite number of hidden units such that:
$$\sup_{x \in K} |f(x) - \hat{f}(x)| < \epsilon$$
The network has the form $\hat{f}(x) = \sum_{k=1}^{N} v_k \sigma(w_k \cdot x + b_k)$ for some weights $v_k, w_k, b_k$.
This is a striking result. One hidden layer, with enough neurons and any non-polynomial activation, can approximate any continuous function on any compact domain to any desired precision. The function class of two-layer networks is dense in the space of continuous functions under the uniform norm.
The proof idea for sigmoid activations uses the fact that sigmoid can approximate indicator functions of intervals, and any continuous function can be approximated by a sum of weighted indicator functions (a Riemann sum). Each hidden unit is a shifted, scaled sigmoid that approximates an indicator function for a different region, and the output weights $v_k$ carry the function’s value in that region.
The non-polynomial condition. The theorem fails for polynomial activations. A single hidden layer with polynomial activation has outputs that are polynomials in $x$, and polynomials cannot uniformly approximate all continuous functions on a compact set (they can approximate, but the degree must grow without bound). The condition “non-polynomial” is equivalent to “the closure of the function class of one-hidden-layer networks is dense in $C(K)$.” ReLU is piecewise linear, not polynomial, and does satisfy this condition, though the original proof assumed bounded activations.
Why “Universal” Is Weaker Than It Sounds
The Cybenko-Hornik theorem is existential, not constructive. It says such a network with $N$ neurons exists, but provides no bound on $N$, no algorithm for finding the weights, and no guarantee that gradient descent will converge to it.
The number of neurons required can be astronomical. For general continuous functions, the required width grows at least as fast as the dimension of the function space you are approximating. On a $d$-dimensional domain, approximating a function at $\epsilon$ precision can require $\Omega(\epsilon^{-d})$ neurons, even for smooth functions. This is the curse of dimensionality: the number of neurons grows exponentially with dimension. Universal approximation theorems do not resolve this; they only say that some finite $N$ exists.
A related subtlety: the theorem proves uniform approximation on a compact set, but the weights needed to achieve it might be exponentially large. Large weights create deep-learning-specific problems: they are far from initialization, gradients may be very large or vanishing, and optimization may not find them.
The practical takeaway is: if your network cannot fit the training data at all, it is likely too small or constrained. Universal approximation assures you that in principle enough capacity exists. But fitting training data does not imply generalization, and universal approximation says nothing about generalization.
Width vs. Depth: Exponential Separation
The classical UAT concerns wide networks (one hidden layer, unlimited neurons). A separate and arguably more practically relevant question is: what can depth give you that width alone cannot?
The answer, formalized in a series of results from the 2010s, is an exponential separation. There exist function classes that can be represented exactly by a deep network of polynomial size but require exponentially many neurons in a shallow (one-hidden-layer) network.
XOR and parity. The simplest example is the parity function on $d$ bits: $f(x_1, \ldots, x_d) = x_1 \oplus x_2 \oplus \cdots \oplus x_d$. A two-layer network with sigmoid activations needs $\Omega(2^d)$ hidden neurons to represent parity exactly. A network with $O(d)$ layers and $O(d)$ neurons per layer can represent parity with $O(d^2)$ total parameters, an exponential reduction. Depth allows the network to compose simple operations hierarchically, computing XOR of pairs, then XOR of those results, and so on, rather than computing everything from scratch at one shot.
Functions with compositional structure. Many real-world functions have natural compositional structure. An image classifier might compute edge detectors, combine them into texture detectors, combine those into shape detectors, and combine those into category detectors. This is precisely what deep networks do. A shallow network representing the same function must “unroll” this hierarchy, at exponential cost.
Telgarsky (2016) proved a clean separation: there exists a class of functions that a deep ReLU network with $O(k^3)$ units can represent that any shallow ReLU network requires $\Omega(2^k)$ units to approximate to constant precision. The depth-efficient functions are those with many rapid oscillations, which a deep network achieves by composing piecewise linear functions to create exponentially many pieces.
Barron’s Theorem and the Curse of Dimensionality
The dimension dependence in naive universal approximation bounds is severe. But Barron’s theorem (1993) shows that for a specific natural class of smooth functions, one-hidden-layer networks are free of the curse.
Let $f$ be a function whose Fourier transform $\hat{f}(\omega)$ satisfies:
$$C_f = \int_{\mathbb{R}^d} |\omega| |\hat{f}(\omega)| d\omega < \infty$$
This is a condition on the Fourier spectrum: the function cannot have too much energy at high frequencies. Intuitively, $C_f < \infty$ means $f$ is smooth: it does not oscillate wildly.
Barron’s theorem states: for any such $f$ and any probability distribution $\mu$ on the domain, there exists a one-hidden-layer network $\hat{f}$ with $n$ hidden neurons such that:
$$\int (f(x) - \hat{f}(x))^2 d\mu(x) \leq \frac{2 C_f^2}{n}$$
The approximation error is $O(1/n)$, independent of the input dimension $d$. A network with 1,000 neurons achieves the same approximation rate on a 1,000-dimensional function as on a 2-dimensional function, as long as $C_f$ is controlled.
This is a remarkable dimension-freeness result. It explains why neural networks can work in high-dimensional spaces where classical approximation theory (polynomial approximation, kernel methods with bounded bandwidth) degrades exponentially. Smooth functions, in the sense of bounded Fourier spectrum, can be represented efficiently by shallow networks regardless of dimension. The catch is that $C_f$ may itself scale with dimension for some functions, restoring dimensional dependence in those cases.
ReLU Networks and Piecewise Linear Functions
The Cybenko-Hornik theorem requires bounded activations. ReLU is unbounded, and the original proofs do not directly apply. A separate line of results characterizes what ReLU networks represent.
ReLU networks are piecewise linear functions. A single ReLU unit $\max(0, w \cdot x + b)$ partitions $\mathbb{R}^d$ into two half-spaces and is linear on each. A network of ReLU units is a composition of such operations. By the closure of piecewise linear functions under composition and linear combination, the output of any ReLU network is a piecewise linear function of the input. The number of linear pieces can grow exponentially with depth and polynomially with width.
Conversely, every continuous piecewise linear function on a compact domain is exactly representable by a ReLU network. This makes ReLU networks a complete language for the class of continuous piecewise linear functions.
The UAT for ReLU follows: since any continuous function on a compact domain can be uniformly approximated by piecewise linear functions (think: triangular hat functions at increasing density), and since every piecewise linear function is a ReLU network, ReLU networks are universal approximators. The approximation is exact for piecewise linear targets (not just approximate), which makes ReLU networks particularly well-suited to representing functions that are themselves piecewise linear, such as value functions in reinforcement learning.
What UAT Does Not Tell You
The universal approximation theorems establish expressibility. They say nothing about three things that determine whether a neural network actually works in practice.
Optimization. Even if the right weights exist, will gradient descent find them? The loss landscape of a neural network is non-convex. It has saddle points, flat regions, and local minima (though recent theory shows that for sufficiently wide networks, most local minima are global). Finding the optimal weights is a separate algorithmic question that UAT entirely ignores.
Generalization. Even if the network fits training data, will it generalize to new data? A neural network can memorize $n$ data points with $O(n)$ parameters (Zhang et al., 2017 showed this empirically; the network achieves zero training error on randomly labeled data). Memorization is consistent with universal approximation but is the opposite of generalization. Learning theory asks when and why networks generalize, and the answer involves implicit regularization through gradient descent, over-parameterization theory, and function class complexity.
Sample efficiency. UAT says a network of size $N$ can approximate $f$ to precision $\epsilon$, but does not say how many training examples you need to estimate those $N$ parameters reliably. Estimation error grows with model size; there is a tradeoff between approximation error (reduced by larger networks) and estimation error (reduced by more data). The bias-variance tradeoff reappears in this guise.
The practical heuristic. Universal approximation gives you one actionable rule: if your model is unable to achieve low training error even after extensive optimization, suspect that the model class is too restricted. This means underpowered (too few parameters), too constrained (wrong inductive biases for the problem), or using an activation function that restricts the function class. UAT says that with enough capacity, training error should be achievable. If it is not, look at capacity first.
Summary
| Theorem / Concept | Statement | Key caveat |
|---|---|---|
| Cybenko-Hornik (UAT) | One hidden layer + non-polynomial activation approximates any continuous function on compact domain | Existential; $N$ can be exponentially large; no optimization guarantee |
| Depth separation | Deep networks can represent some functions exponentially more efficiently than shallow ones | Applies to specific function classes (oscillatory, compositional) |
| Barron’s theorem | For functions with bounded Fourier spectrum, $n$ neurons achieve $O(1/n)$ error independent of dimension | Requires bounded spectral norm $C_f$; $C_f$ may scale with $d$ |
| ReLU expressiveness | ReLU networks are exactly the class of continuous piecewise linear functions | Other activations represent smoother function classes |
| Memorization (Zhang et al.) | A network can memorize $n$ random labels with $O(n)$ parameters | Memorization $\ne$ generalization; UAT is about expressibility, not generalization |
Read next: