Neural Networks & Perceptrons - Function Approximation, Layer by Layer
Helpful context:
- Logistic Regression - Classification Dressed Up as Regression
- Linear Transformations - Geometry Encoded as Arithmetic
Four points. $(0,0) \to 0$. $(1,0) \to 1$. $(0,1) \to 1$. $(1,1) \to 0$. This is XOR - output 1 when exactly one input is 1, output 0 otherwise.
No straight line separates the 0-labeled points from the 1-labeled points. Draw it: the 0s sit at opposite corners of a square, the 1s at the other two corners. Any line that separates one 0 from the 1s will include the other 0 in the wrong region. A single linear model - no matter how you choose its weights - cannot solve XOR.
In 1969, Minsky and Papert proved this formally and used it to argue that the entire neural network research program was fundamentally limited. Funding dried up. The first “AI winter” began.
What they didn’t emphasize: connect two linear models in a specific way, and suddenly XOR is solvable. Chain enough of them together and you get GPT.
The Perceptron
Frank Rosenblatt invented the perceptron in 1958. It is a single artificial neuron:
$$\text{output} = \text{sign}(w^T x + b)$$
The model takes an input vector $x \in \mathbb{R}^n$, computes a weighted sum, adds a bias, and thresholds: output $+1$ if the result is positive, $-1$ if negative. It’s a linear classifier - the decision boundary is the hyperplane $\{x : w^T x + b = 0\}$.
Training uses the perceptron learning rule. For each misclassified example $(x^{(i)}, y^{(i)})$ where $y^{(i)} \in \{-1, +1\}$:
$$w \leftarrow w + y^{(i)} x^{(i)}$$
The intuition: if the perceptron predicted $-1$ but the true label is $+1$, adding $x^{(i)}$ to $w$ makes $w^T x^{(i)}$ larger, nudging the prediction toward $+1$. If it predicted $+1$ but the truth is $-1$, subtracting $x^{(i)}$ nudges it back.
Perceptron Convergence Theorem. If the training data is linearly separable - if there exists a hyperplane that perfectly separates the positive from negative examples - then the perceptron learning rule converges to a separating hyperplane in a finite number of steps. The bound depends on the margin (how wide the separation is) and the magnitude of the examples, but it is always finite.
This is a clean theoretical guarantee. The algorithm works when the problem is solvable by a linear classifier.
The problem: XOR is not linearly separable. The perceptron cannot learn it, ever, no matter how long you run it. Minsky and Papert proved this and generalized the argument to a wide class of functions. The conclusion felt devastating: if a single neuron can’t do XOR, what hope does the whole approach have?
The answer required one more idea.
Multi-Layer Perceptrons: The Core Idea
Compose multiple layers of linear-then-nonlinear operations. The first layer transforms the input into a new representation. The second layer classifies that representation. The key insight: the first layer can transform the input into a space where the problem is linearly separable - even if it isn’t in the original space.
Concretely, a two-layer network (one hidden layer, one output layer) computes:
$$h = \sigma(W_1 x + b_1), \qquad \hat{y} = W_2 h + b_2$$
where $\sigma$ is a nonlinear activation function applied elementwise, $W_1 \in \mathbb{R}^{d_1 \times n}$ and $W_2 \in \mathbb{R}^{d_2 \times d_1}$ are weight matrices, and $b_1, b_2$ are bias vectors.
The hidden layer $h$ is a new representation of the input. The output layer is a linear classifier on top of this representation. The nonlinearity $\sigma$ is what makes the first layer more than just another linear transformation.
Solving XOR. Let’s do this explicitly. Inputs $x = (x_1, x_2)$, labels $y = x_1 \oplus x_2$.
Use two hidden neurons with sigmoid activation. Set:
$$W_1 = \begin{pmatrix} 1 & 1 \\ 1 & 1 \end{pmatrix}, \quad b_1 = \begin{pmatrix} -0.5 \\ -1.5 \end{pmatrix}$$
$$W_2 = \begin{pmatrix} 1 & -2 \end{pmatrix}, \quad b_2 = 0$$
For $(x_1, x_2) = (0, 0)$: $W_1 x + b_1 = (-0.5, -1.5)$. After sigmoid: $h \approx (0.38, 0.18)$. Output: $W_2 h + b_2 \approx 0.38 - 0.36 = 0.02 \approx 0$. Correct.
For $(x_1, x_2) = (1, 1)$: $W_1 x + b_1 = (0.5 - 0.5 + 1 - 0.5, 0.5 - 1.5 + 1 - 1.5) = (1.5 - 0.5, …) $. Let’s think more carefully. $W_1 x = (x_1 + x_2, x_1 + x_2) = (2, 2)$. Add $b_1$: $(1.5, 0.5)$. After sigmoid: $h \approx (0.82, 0.62)$. Output: $0.82 - 2(0.62) = 0.82 - 1.24 = -0.42 \approx 0$. Correct.
The general principle is what matters: a two-layer network can separate XOR by first learning a new feature representation where the classes become linearly separable, then doing linear classification on that representation. Adding a hidden layer breaks the linearity constraint.
Discomfort check. This might feel like magic. If one layer can’t do it, why can two? The nonlinearity is the key - without it, composing linear transformations gives you another linear transformation. Two linear layers with no activation is just $W_2(W_1 x + b_1) + b_2 = (W_2 W_1)x + (W_2 b_1 + b_2)$, which is still linear. The activation function $\sigma$ is what allows the network to “bend” the feature space. With enough bending, any shape of decision boundary becomes possible.
Activation Functions
The nonlinearity $\sigma$ is crucial. Different choices have different properties.
Sigmoid: $\sigma(x) = \frac{1}{1+e^{-x}}$. Output in $(0, 1)$. Nice probabilistic interpretation. But saturates at large $|x|$: when $|x|$ is large, $\sigma'(x) \approx 0$. During backpropagation, the gradient gets multiplied by $\sigma'(x)$ at each layer. In a deep network with sigmoid activations, products of tiny gradients converge to zero - the signal vanishes before it reaches the early layers. This is the vanishing gradient problem.
Tanh: $\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}$. Output in $(-1, 1)$. Better than sigmoid because it’s zero-centered (outputs centered around 0, which helps optimization), but still saturates. Same vanishing gradient problem in deep networks.
ReLU (Rectified Linear Unit): $\text{ReLU}(x) = \max(0, x)$.
This is the default activation in modern deep learning. Why?
First, no saturation for positive inputs: $\text{ReLU}'(x) = 1$ for $x > 0$. Gradients pass through unchanged rather than being multiplied by a small number. Deep networks can be trained.
Second, computationally trivial: just a max operation. Compare this to the exponentials required for sigmoid and tanh.
Third, empirically excellent: networks trained with ReLU typically converge faster and to better solutions than those with sigmoid or tanh.
The cost: ReLU is zero for $x < 0$, and its gradient is exactly zero there. A neuron whose pre-activation is always negative gets zero gradient and never updates - it’s “dead.” This is the dead ReLU problem.
Leaky ReLU: $\text{LeakyReLU}(x) = \max(\alpha x, x)$ for small $\alpha$ (e.g., 0.01). Allows a small gradient for negative inputs, keeping neurons alive.
GELU (Gaussian Error Linear Unit): $\text{GELU}(x) = x \cdot \Phi(x)$ where $\Phi$ is the standard normal CDF. Smooth approximation of ReLU. Used in BERT, GPT, and most modern transformers.
SiLU/Swish: $\text{SiLU}(x) = x \cdot \sigma(x)$. Similar to GELU, smooth, self-gated. Empirically strong for large models.
The choice of activation is often a hyperparameter, but ReLU is the default starting point for most architectures.
Universal Approximation
Here is one of the most striking theorems in all of machine learning.
Universal Approximation Theorem (Hornik, 1991). Let $\sigma$ be any continuous, non-constant, bounded activation function. Then for any continuous function $f: [0,1]^n \to \mathbb{R}$ and any $\varepsilon > 0$, there exists a single hidden layer network with $N$ neurons (for some finite $N$) that approximates $f$ to within $\varepsilon$ everywhere on $[0,1]^n$.
In plain terms: a one-hidden-layer network with enough neurons can approximate any continuous function on a compact set, to arbitrary precision. Neural networks are universal function approximators.
This is remarkable. It says that linear regression’s fundamental limitation - it can only represent linear functions - is completely resolved by adding a single hidden layer with enough units.
Three important caveats:
It’s an existence result. The theorem says such a network exists; it doesn’t say how to find the weights. The weights might be found by gradient descent, or they might not - the theorem is silent on this.
“Enough neurons” can be enormous. For some functions, the required $N$ is exponentially large in the input dimension. Depth (many layers) is often far more parameter-efficient than width (one enormous layer). Deep networks can represent functions that shallow networks require exponentially more neurons to represent.
It doesn’t address generalization. A universal approximator on a compact set might fit training data perfectly while failing on test data. The theorem says nothing about overfitting, sample complexity, or statistical guarantees. Those require different analysis.
The theorem is a theoretical foundation, not a practical recipe. But it tells you the fundamental question isn’t “can neural networks represent this function?” (they can) - it’s “can we find the weights efficiently?”
Architecture Vocabulary
Before going further, establish the vocabulary for describing networks.
Input layer: The raw features $x \in \mathbb{R}^n$. Not a computational layer - just the input.
Hidden layer: Any intermediate layer between input and output. A network with $L$ hidden layers has depth $L+1$ (counting the output layer). “Deep” networks have many hidden layers.
Output layer: Produces $\hat{y}$. For regression: a linear layer (no activation). For binary classification: a sigmoid. For multi-class classification: a softmax.
Width: Number of neurons in a layer. Often constant across hidden layers, or tapered.
Depth: Number of layers. A “shallow” network has one or two hidden layers. Modern transformers have dozens to hundreds.
Parameters: All trainable quantities - weights and biases. A layer that maps $d_{\text{in}}$ inputs to $d_{\text{out}}$ outputs has $d_{\text{in}} \cdot d_{\text{out}} + d_{\text{out}}$ parameters ($W$ and $b$).
Total parameters for a network with layers of width $d_0, d_1, \ldots, d_L$:
$$\text{params} = \sum_{l=1}^L (d_{l-1} + 1) \cdot d_l$$
GPT-3 has $d_0 = 12288$ (token dimension) and 175 billion total parameters across all weight matrices. This is not unusual for modern large language models.
Connection to Linear Algebra
Zooming out, a neural network is a composition of operations. What are they?
One layer computes:
$$h = \sigma(Wx + b)$$
This is a linear transformation ($Wx + b$) followed by an elementwise nonlinearity ($\sigma$). The linear transformation stretches, rotates, and reflects the input. The nonlinearity “bends” the space - it introduces curvature that linear transformations cannot.
A deep network is a composition:
$$\hat{y} = W_L \sigma(W_{L-1} \sigma(\cdots \sigma(W_1 x + b_1) \cdots) + b_{L-1}) + b_L$$
Each pair (linear + nonlinear) transforms the representation. The sequence of transformations progressively reshapes the input into something that the final linear layer can classify.
Without the nonlinearities, the entire stack collapses. Composing linear transformations is just matrix multiplication:
$$W_L(W_{L-1}(\cdots(W_1 x + b_1)\cdots)) = (W_L W_{L-1} \cdots W_1) x + \text{const}$$
This is a single linear transformation. A 100-layer network with no activations is equivalent to a single linear layer. The depth is meaningless. The nonlinearity is what makes depth useful.
Discomfort check. “Neural networks are inspired by the brain.” Barely. A biological neuron receives inputs from thousands of other neurons via dendrites, integrates them with complex nonlinear dynamics, generates action potentials with timing that encodes information, modulates its behavior based on neuromodulators, and forms synapses that change on multiple timescales. An artificial neuron computes a dot product and applies a fixed scalar function. The biological analogy is a historical artifact from when researchers were trying to sell their ideas to funding agencies. It is not a useful guide to understanding or designing modern neural networks. Think of them as layered function composition - that’s what they are.
Summary
| Concept | Details |
|---|---|
| Perceptron | $\text{sign}(w^Tx + b)$; linear classifier |
| Convergence theorem | Converges in finite steps if data is linearly separable |
| Perceptron’s limit | Cannot learn XOR (not linearly separable) |
| MLP | Hidden layers transform features; output layer classifies |
| Why nonlinearity | Without it, composition of layers = single linear layer |
| Sigmoid | Output $(0,1)$; saturates; vanishing gradient in deep nets |
| Tanh | Output $(-1,1)$; zero-centered; still saturates |
| ReLU | $\max(0,x)$; no saturation for $x>0$; default choice |
| Universal approx. | One hidden layer + enough neurons ≈ any continuous function |
| Parameters | $\sum_l (d_{l-1}+1) \cdot d_l$ across all layers |
XOR broke the first generation of neural network research. Multi-layer networks repaired that break by introducing hidden layers that transform the input into a new representation. The nonlinearity at each layer is what allows the transformation to be non-trivial. Universal approximation says this is not a limitation - one hidden layer can approximate anything. The real questions are about efficiency (how many neurons), trainability (can we find the weights), and generalization (does fitting the training set help on new data).
Those questions are answered by what comes next: gradient descent and backpropagation.
Read next: