The Perceptron

The perceptron is the simplest model of a biological neuron. It computes a weighted sum of inputs and fires if the sum exceeds a threshold:

$$\hat{y} = \text{sign}(\mathbf{w}^T \mathbf{x} + b), \quad \text{sign}(z) = \begin{cases} +1 & z \geq 0 \\ -1 & z < 0 \end{cases}.$$

The Rosenblatt update rule. Given a misclassified example $(\mathbf{x}_i, y_i)$, update the weights and bias as:

$$\mathbf{w} \leftarrow \mathbf{w} + y_i \mathbf{x}_i, \quad b \leftarrow b + y_i.$$

If $y_i = +1$ and the example was misclassified ($\mathbf{w}^T \mathbf{x}_i + b < 0$), this nudges $\mathbf{w}$ in the direction of $\mathbf{x}_i$, increasing the score. The update is symmetric for $y_i = -1$.
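
A minimal NumPy sketch of this training loop (the toy dataset and epoch cap below are illustrative assumptions, not from the text):

```python
import numpy as np

def perceptron_train(X, y, max_epochs=100):
    """Rosenblatt perceptron. X has shape (n, d); labels y are in {-1, +1}."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            # A score of exactly zero is treated as a mistake for simplicity.
            if yi * (w @ xi + b) <= 0:
                w += yi * xi   # nudge w toward x_i (or away from it, if y_i = -1)
                b += yi
                mistakes += 1
        if mistakes == 0:      # a full pass with no mistakes: done
            break
    return w, b

# Toy linearly separable data (illustrative only).
X = np.array([[2.0, 1.0], [1.0, 3.0], [-1.0, -2.0], [-2.0, 1.0]])
y = np.array([1, 1, -1, -1])
w, b = perceptron_train(X, y)
print(np.sign(X @ w + b))   # [ 1.  1. -1. -1.], matching y
```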

Theorem (Perceptron convergence). If the data are linearly separable with margin $\gamma > 0$ and all inputs satisfy $\|\mathbf{x}_i\| \leq R$, then the perceptron algorithm makes at most $\left(\frac{R}{\gamma}\right)^2$ mistakes before converging to a separating hyperplane.

The separability assumption is essential: if the data are not linearly separable, the algorithm never converges and the mistake bound no longer applies. This limitation motivated the development of multi-layer networks.

The XOR Problem

XOR is the canonical example that a single perceptron cannot solve. The truth table is:

$$\begin{array}{cc|c} x_1 & x_2 & x_1 \oplus x_2 \\ 0 & 0 & 0 \\ 0 & 1 & 1 \\ 1 & 0 & 1 \\ 1 & 1 & 0 \end{array}$$

No hyperplane separates the points $(0,0), (1,1)$ (output 0) from $(0,1), (1,0)$ (output 1) - the XOR function is not linearly separable. Minsky and Papert formalized this limitation in 1969, temporarily dampening enthusiasm for perceptrons.
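
The failure can be observed empirically with the perceptron update from above (labels mapped to $\pm 1$); the loop below never completes a mistake-free pass:

```python
import numpy as np

# XOR inputs, with outputs mapped to the labels {-1, +1}.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y = np.array([-1, 1, 1, -1])

w, b = np.zeros(2), 0.0
for epoch in range(1000):
    mistakes = 0
    for xi, yi in zip(X, y):
        if yi * (w @ xi + b) <= 0:     # misclassified: apply the update rule
            w, b = w + yi * xi, b + yi
            mistakes += 1
    if mistakes == 0:
        break
print("mistakes in last epoch:", mistakes)   # stays positive: no separating hyperplane exists
```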

The fix: add one hidden layer. A two-layer network can represent XOR by computing intermediate features that are linearly separable.

Multi-Layer Perceptrons

A multi-layer perceptron (MLP) with one hidden layer of $m$ units computes:

$$\mathbf{h} = \sigma(W_1 \mathbf{x} + \mathbf{b}_1), \quad \hat{y} = W_2 \mathbf{h} + \mathbf{b}_2,$$

where $W_1 \in \mathbb{R}^{m \times d}$, $\mathbf{b}_1 \in \mathbb{R}^m$, $W_2 \in \mathbb{R}^{1 \times m}$, $\mathbf{b}_2 \in \mathbb{R}$, and $\sigma$ is a nonlinear activation applied elementwise. Without $\sigma$, the composition $W_2(W_1 \mathbf{x} + \mathbf{b}_1) + \mathbf{b}_2$ collapses to a single affine map - hidden layers without nonlinearities add no expressiveness.

For $L$-layer networks, the computation is

$$\mathbf{h}^{(0)} = \mathbf{x}, \quad \mathbf{h}^{(\ell)} = \sigma(W_\ell \mathbf{h}^{(\ell-1)} + \mathbf{b}_\ell) \quad \text{for } \ell = 1,\ldots,L-1, \quad \hat{y} = W_L \mathbf{h}^{(L-1)} + \mathbf{b}_L.$$
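
These equations translate directly into code; a minimal NumPy sketch, assuming ReLU as the hidden activation (the layer sizes below are arbitrary):

```python
import numpy as np

def mlp_forward(x, weights, biases, sigma=lambda z: np.maximum(0.0, z)):
    """L-layer MLP forward pass; sigma is applied to every layer except the last."""
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = sigma(W @ h + b)                  # h^(l) = sigma(W_l h^(l-1) + b_l)
    return weights[-1] @ h + biases[-1]       # y_hat = W_L h^(L-1) + b_L

# Illustrative shapes: d = 4 inputs, two hidden layers of width 8, scalar output.
rng = np.random.default_rng(0)
Ws = [rng.standard_normal(shape) for shape in [(8, 4), (8, 8), (1, 8)]]
bs = [np.zeros(m) for m in [8, 8, 1]]
print(mlp_forward(rng.standard_normal(4), Ws, bs))
```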

Universal Approximation Theorem

Theorem (Cybenko 1989; Hornik 1991). Let $\sigma$ be a continuous, non-constant, bounded, and monotone-increasing function (e.g., the sigmoid). Then for any continuous function $f: [0,1]^d \to \mathbb{R}$ and any $\varepsilon > 0$, there exist $m \in \mathbb{N}$, weights $W_1, W_2$, and biases $\mathbf{b}_1, \mathbf{b}_2$ such that

$$\sup_{\mathbf{x} \in [0,1]^d} \left|f(\mathbf{x}) - W_2 \sigma(W_1 \mathbf{x} + \mathbf{b}_1) - \mathbf{b}_2\right| < \varepsilon.$$

A single hidden layer of sufficient width can approximate any continuous function on a compact domain to arbitrary precision.
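
One can get a numerical feel for this statement (though it is not the theorem's construction): fix random hidden-layer weights and fit only the output layer by least squares. The target function, hidden width, and weight scales below are arbitrary illustrative choices:

```python
import numpy as np

# Target: an arbitrary continuous function on [0, 1].
f = lambda t: np.sin(2 * np.pi * t) + 0.5 * np.cos(6 * np.pi * t)

rng = np.random.default_rng(0)
m = 200                                    # hidden width (illustrative)
x = np.linspace(0.0, 1.0, 500)
W1 = rng.standard_normal(m) * 20.0         # random input weights, held fixed
b1 = rng.uniform(-20.0, 20.0, size=m)      # random biases, held fixed

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
H = sigmoid(np.outer(x, W1) + b1)          # (500, m) matrix of hidden features

# Fit only the output layer: minimize ||H w2 - f(x)||^2 over w2.
w2, *_ = np.linalg.lstsq(H, f(x), rcond=None)
print("max abs error:", np.max(np.abs(H @ w2 - f(x))))
```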

Important caveats. The theorem guarantees existence but says nothing about:

  • how large $m$ needs to be (could be exponentially large in $d$),
  • how to find the weights (no constructive proof),
  • generalization from finite training data.

Universal approximation justifies neural networks as a hypothesis class, but it does not explain why deep networks work so well in practice.

Activation Functions

The choice of $\sigma$ has major practical implications.

Sigmoid: $\sigma(z) = 1/(1+e^{-z})$. Historically the default. Saturates at both extremes ($\sigma(z) \approx 0$ for $z \ll 0$, $\sigma(z) \approx 1$ for $z \gg 0$), causing vanishing gradients: $\sigma'(z) \approx 0$ in saturation zones, so gradients don’t flow through many layers.

Tanh: $\tanh(z) = (e^z - e^{-z})/(e^z + e^{-z})$. Zero-centered (unlike sigmoid), which helps optimization. Still saturates and suffers vanishing gradients.

ReLU: $\text{ReLU}(z) = \max(0, z)$. Derivative is 1 for $z > 0$ and 0 for $z < 0$ - no saturation in the positive half. Induces sparse activations: at any given input, roughly half the neurons output zero. Computationally cheap. However, a neuron whose pre-activation is negative for every input receives zero gradient and can stop updating permanently - the “dying ReLU” problem.

Leaky ReLU: $\max(\alpha z, z)$ for small $\alpha > 0$ (e.g., $\alpha = 0.01$). Keeps a small gradient for negative inputs, preventing dead neurons.

GELU (Gaussian Error Linear Unit): $\text{GELU}(z) = z \cdot \Phi(z)$, where $\Phi$ is the standard normal CDF. Smooth approximation of ReLU; default in transformers (BERT, GPT). Has the appealing interpretation of stochastically masking each unit based on its own magnitude.
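
For reference, the activations above written in NumPy (SciPy's erf is assumed available for the exact GELU; tanh is simply np.tanh), plus a quick look at sigmoid saturation:

```python
import numpy as np
from scipy.special import erf   # used for the exact Gaussian CDF in GELU

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    return np.maximum(0.0, z)

def leaky_relu(z, alpha=0.01):
    return np.where(z > 0, z, alpha * z)

def gelu(z):
    # z * Phi(z), with Phi the standard normal CDF expressed via erf.
    return 0.5 * z * (1.0 + erf(z / np.sqrt(2.0)))

# Saturation in action: sigmoid'(z) = sigmoid(z) * (1 - sigmoid(z)) collapses
# toward zero for large |z| - the vanishing-gradient issue described above.
z = np.array([-10.0, -2.0, 0.0, 2.0, 10.0])
s = sigmoid(z)
print(np.round(s * (1.0 - s), 6))
```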

Depth Versus Width

A useful informal argument for the power of depth: each layer can compute a new set of features as compositions of the previous layer’s features, so an $L$-layer network of width $m$ can represent functions that a shallow network would need exponentially many more units to match - depth enables hierarchical representation.

More formally, there exist functions computable by $L$-layer networks of polynomial size that require exponential size in any $(L-1)$-layer network (Telgarsky, 2016). Depth provides genuine computational advantages, not just parameter efficiency.
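
The idea behind such separations can be sketched concretely (this follows the standard tent-map presentation of Telgarsky's argument; the code is an illustration, not the formal proof): the $k$-fold composition of the tent map $t(x) = 2\min(x, 1-x)$ has $2^{k-1}$ peaks on $[0,1]$ yet needs only two ReLU units per layer, while a single-hidden-layer ReLU network of width $m$ is piecewise linear with at most $m+1$ pieces and so needs width on the order of $2^k$ to match it.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def tent(x):
    # Two ReLU units compute the tent map on [0, 1]:
    # t(x) = 2x on [0, 1/2] and 2(1 - x) on [1/2, 1].
    return 2.0 * relu(x) - 4.0 * relu(x - 0.5)

k = 4
x = np.linspace(0.0, 1.0, 1001)
y = x
for _ in range(k):              # k stacked tent layers: y = t^k(x)
    y = tent(y)

# t^k crosses the level 1/2 at 2^k points; a width-m single-hidden-layer
# ReLU net has at most m + 1 linear pieces, so matching this shape needs
# width on the order of 2^k.
crossings = int(np.sum(np.diff(np.sign(y - 0.5)) != 0))
print("level-1/2 crossings of t^k:", crossings)   # 16 = 2^4
```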

In practice, deep but narrow networks often outperform wide but shallow ones at the same parameter count, because depth allows hierarchical feature reuse.

Examples

XOR with one hidden layer. Set $W_1 = \begin{pmatrix} 1 & 1 \\ 1 & 1 \end{pmatrix}$, $\mathbf{b}_1 = \begin{pmatrix} 0 \\ -1 \end{pmatrix}$ and use ReLU. The hidden units compute $h_1 = \text{ReLU}(x_1 + x_2)$ and $h_2 = \text{ReLU}(x_1 + x_2 - 1)$. Then $W_2 = (1, -2)$, $b_2 = 0$ gives $\hat{y} = h_1 - 2h_2$, which correctly outputs 0 for $(0,0)$, 1 for $(0,1)$, 1 for $(1,0)$, and 0 for $(1,1)$.
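
A quick numerical check of this construction (a sketch using the weights stated above):

```python
import numpy as np

relu = lambda z: np.maximum(0.0, z)

W1 = np.array([[1.0, 1.0],
               [1.0, 1.0]])
b1 = np.array([0.0, -1.0])
W2 = np.array([1.0, -2.0])
b2 = 0.0

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    h = relu(W1 @ np.array(x, dtype=float) + b1)
    print(x, "->", W2 @ h + b2)   # prints 0.0, 1.0, 1.0, 0.0: the XOR truth table
```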

Neural nets as composed maps. An $L$-layer MLP is a composition $f = f_L \circ f_{L-1} \circ \cdots \circ f_1$, where each $f_\ell(\mathbf{h}) = \sigma(W_\ell \mathbf{h} + \mathbf{b}_\ell)$ is an affine map followed by a pointwise nonlinearity. The power of the network comes entirely from the interleaving of these two operations. Remove the nonlinearities, and the composition collapses to a single affine map; add them back, and the network gains the ability to fold, bend, and stretch the input space in complex ways.

