Take a 256x256 grayscale image and feed it into a fully connected neural network. The input layer alone has 65,536 neurons, one per pixel. If the first hidden layer has 1,000 units, that layer alone has about 65.5 million weights. For a color image that number triples. Training such a network requires enormous amounts of data and compute, and the weight count grows with the number of pixels, that is, quadratically with the image's side length. But the deeper problem is not computational: fully connected layers cannot exploit what we know about images. A cat detector trained on images where cats appear in the center will fail when a cat appears in the corner, because different input neurons encode different spatial positions, and the weights learned for one position carry no information about another. The network must independently re-learn the concept of a cat for every position in the image, which is both wasteful and brittle.
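
The arithmetic behind those figures can be checked with a few lines of Python. This is only a sketch: the helper name `dense_layer_params` is made up for illustration, and it counts one bias per hidden unit in addition to the weights.

```python
def dense_layer_params(height, width, channels, hidden_units):
    """Weights plus biases for one fully connected layer over a flattened image."""
    inputs = height * width * channels
    return inputs * hidden_units + hidden_units  # one bias per hidden unit


gray = dense_layer_params(256, 256, 1, 1000)
rgb = dense_layer_params(256, 256, 3, 1000)
print(gray)  # 65537000
print(rgb)   # 196609000
```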

The key structural property of images that fully connected networks ignore is translation invariance: the identity of an object does not depend on where it appears in the image. A whisker is a whisker whether it is at pixel (20, 40) or pixel (180, 200). This means that a useful feature detector, something that recognizes a horizontal edge, or a circular arc, or a texture, should apply the same computation everywhere in the image. The idea of reusing the same computation across space, rather than learning a different function for each location, is the founding insight of convolutional neural networks.

The second structural property is locality: nearby pixels are far more correlated with each other than distant ones. The patch of pixels around a point tells you almost everything about that point; pixels on the opposite side of the image tell you almost nothing about it locally. This means a feature detector does not need to look at the entire image to do its job. A 3x3 filter that examines a small neighborhood and asks whether it looks like an edge is a more efficient and more generalizable computation than a global function of all 65,536 pixels. CNNs are the architecture that operationalizes both of these observations simultaneously.


Convolution as a Sliding Filter

The core operation in a CNN is the discrete convolution. Given an input feature map $f$ (e.g., a single-channel image) and a filter $g$ (e.g., a 3x3 matrix of learnable weights), the output at position $(i, j)$ is:

$$(f * g)[i, j] = \sum_{m} \sum_{n} f[i+m, j+n] \cdot g[m, n]$$

Here $m$ and $n$ range over the spatial extent of the filter. The filter $g$ is applied at every position $(i, j)$ in the input, and the same weights are used at every position. This is weight sharing. A network with a 3x3 filter has 9 parameters (plus a bias) for that filter, regardless of the size of the input image. Compare this to a fully connected layer over a 256x256 image, which has 65,536 parameters per output unit.
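
As a rough illustration of the formula, the sketch below slides a filter over a small image stored as nested Python lists, with stride 1 and no padding. The function name `correlate2d_valid` and the hand-written edge filter are assumptions made for this example; in a real network the filter weights would be learned.

```python
def correlate2d_valid(image, kernel):
    """Slide `kernel` over `image` (lists of lists), stride 1, no padding."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    # The same kernel weights are reused at every (i, j): weight sharing.
    return [
        [
            sum(
                image[i + m][j + n] * kernel[m][n]
                for m in range(kh)
                for n in range(kw)
            )
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


# A 4x4 image with a vertical step edge between columns 1 and 2.
image = [[0, 0, 1, 1] for _ in range(4)]
# Hypothetical filter that responds to left-to-right increases in brightness.
vertical_edge = [[-1, 0, 1] for _ in range(3)]

response = correlate2d_valid(image, vertical_edge)
print(response)  # [[3, 3], [3, 3]]
```

The filter has 9 weights whether the image is 4x4 or 256x256; only the number of positions it visits changes.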

Note: the operation above is technically cross-correlation (no flipping of the filter). In signal processing, true convolution flips the filter before sliding it. For neural networks the distinction is irrelevant because the filter weights are learned and a flipped filter is equivalent to an unflipped filter with reordered weights. Both are called convolution in the ML literature.
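
To make the flipping remark concrete, the following sketch evaluates both operations at a single position on a 3x3 patch. The helper names are hypothetical, and the kernel is deliberately asymmetric so that the two operations give different answers.

```python
def flip(kernel):
    """Rotate a 2D kernel by 180 degrees (reverse the rows and the columns)."""
    return [row[::-1] for row in kernel[::-1]]


def correlate_at(patch, kernel):
    """Cross-correlation at one position: elementwise product, then sum."""
    return sum(p * k for prow, krow in zip(patch, kernel) for p, k in zip(prow, krow))


def convolve_at(patch, kernel):
    """True convolution at the same position: the kernel is indexed backwards."""
    size = len(kernel)
    return sum(
        patch[m][n] * kernel[size - 1 - m][size - 1 - n]
        for m in range(size)
        for n in range(size)
    )


patch = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
kernel = [[1, 0, 0], [0, 0, 0], [0, 0, -1]]  # asymmetric on purpose

print(correlate_at(patch, kernel))        # -8
print(convolve_at(patch, kernel))         # 8
print(correlate_at(patch, flip(kernel)))  # 8
```

Convolving with a kernel gives the same result as cross-correlating with its flipped version, so a learned filter can absorb the flip into its weights.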

Valid vs. same padding. When a 3x3 filter is applied at the edges of an image, it would extend beyond the image boundary. There are two standard conventions. In “valid” padding, the filter is only applied where it fits entirely within the input. A 256x256 input convolved with a 3x3 filter in valid mode produces a 254x254 output, shrinking by one pixel on each edge. In “same” padding, the input is padded with zeros around the border so that the output has the same spatial dimensions as the input. Same padding is the default in most modern architectures because it preserves spatial dimensions across layers and avoids the progressive shrinkage that would otherwise require tracking exact output sizes.
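
The size bookkeeping for the two conventions can be sketched as follows. The helper names are illustrative, and the example assumes a stride of 1 and an odd filter size, for which "same" padding adds `(kernel_size - 1) // 2` zeros on each side.

```python
def zero_pad(image, pad):
    """Surround a 2D list with `pad` rows and columns of zeros."""
    width = len(image[0]) + 2 * pad
    padded = [[0] * width for _ in range(pad)]
    for row in image:
        padded.append([0] * pad + list(row) + [0] * pad)
    padded.extend([0] * width for _ in range(pad))
    return padded


def valid_output_size(input_size, kernel_size):
    """Output length along one axis when the filter must fit entirely inside."""
    return input_size - kernel_size + 1


image = [[1] * 256 for _ in range(256)]
print(valid_output_size(256, 3))          # 254

padded = zero_pad(image, pad=(3 - 1) // 2)  # "same" padding for a 3x3 filter
print(len(padded), len(padded[0]))        # 258 258
print(valid_output_size(len(padded), 3))  # 256
```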


Stride and Pooling

Two mechanisms for reducing spatial resolution are stride and pooling.

A stride of $s$ means the filter advances $s$ pixels at a time instead of one. A 3x3 filter with stride 2 on a 256x256 input, with one pixel of zero padding on each side, produces a 128x128 output, halving the spatial dimensions (without padding the output would be 127x127). Strided convolutions are computationally cheaper and reduce the number of activations that downstream layers must process.
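
Output sizes under stride follow the standard formula $\lfloor (n + 2p - k) / s \rfloor + 1$ for input size $n$, filter size $k$, stride $s$, and $p$ zeros of padding on each side. A minimal sketch, with a hypothetical helper name and symmetric padding assumed:

```python
def conv_output_size(input_size, kernel_size, stride=1, padding=0):
    """Output length along one axis: floor((n + 2p - k) / s) + 1."""
    return (input_size + 2 * padding - kernel_size) // stride + 1


print(conv_output_size(256, 3, stride=2, padding=1))  # 128
print(conv_output_size(256, 3, stride=2, padding=0))  # 127
print(conv_output_size(256, 3, stride=1, padding=1))  # 256
```

The exact result at stride 2 depends on the padding convention, which is why the padding is stated explicitly here.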

Max pooling is a separate non-parametric downsampling operation. A 2x2 max pool over a feature map replaces each 2x2 non-overlapping block of activations with the maximum value in that block. This halves spatial resolution while retaining the strongest activation in each region. The intuition is that exact position matters less than whether a feature is present somewhere in a neighborhood. Average pooling takes the mean instead, but max pooling is more common because it better preserves strong local detections.
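
A small sketch of 2x2 pooling on a nested-list feature map follows. The function name `pool_2x2` and its `reduce_fn` parameter are assumptions for this example, and even height and width are assumed so the blocks tile the map exactly.

```python
def pool_2x2(feature_map, reduce_fn=max):
    """Reduce each non-overlapping 2x2 block to one value (even sizes assumed)."""
    pooled = []
    for i in range(0, len(feature_map), 2):
        row = []
        for j in range(0, len(feature_map[0]), 2):
            block = [
                feature_map[i][j], feature_map[i][j + 1],
                feature_map[i + 1][j], feature_map[i + 1][j + 1],
            ]
            row.append(reduce_fn(block))
        pooled.append(row)
    return pooled


def mean(values):
    return sum(values) / len(values)


feature_map = [
    [1, 3, 0, 0],
    [2, 9, 0, 1],
    [0, 0, 5, 4],
    [0, 7, 6, 8],
]
print(pool_2x2(feature_map))        # [[9, 1], [7, 8]]
print(pool_2x2(feature_map, mean))  # [[3.75, 0.25], [1.75, 5.75]]
```

In the top-left block the strong activation 9 survives max pooling intact, while average pooling dilutes it to 3.75.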

Both operations reduce spatial resolution deliberately. Early CNN architectures used aggressive pooling (2x2 stride-2 after every layer); modern architectures often prefer strided convolutions, which are learnable, over fixed pooling operations.


Feature Maps and Multiple Filters

A single filter detects one specific pattern. To detect many different patterns simultaneously, each convolutional layer applies $K$ filters in parallel. Each filter produces one output channel, so a layer with $K$ filters produces an output volume of shape $H' \times W' \times K$ regardless of the number of input channels. These output channels are called feature maps.

In the first convolutional layer applied to an RGB image (3 input channels), each filter has shape $3 \times 3 \times 3$: 3x3 spatial extent and 3 input channels. The filter produces a single 2D output feature map. With $K$ such filters, the layer output is $H' \times W' \times K$.

In subsequent layers, the filters operate on all the feature maps from the previous layer simultaneously, combining spatial and channel information. A filter at layer $\ell$ of spatial size $3 \times 3$ applied to $C$ input channels has $9C$ parameters (plus bias), and the output is again a single feature map.
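
The parameter counts described above can be tallied directly. In this sketch the helper name and the filter counts (64 and 128) are hypothetical values chosen for illustration, not taken from any particular architecture.

```python
def conv_layer_params(kernel_size, in_channels, num_filters):
    """Each filter has kernel_size * kernel_size * in_channels weights plus one bias."""
    weights_per_filter = kernel_size * kernel_size * in_channels
    return num_filters * (weights_per_filter + 1)


# First layer on an RGB image: each filter is 3x3x3.
first = conv_layer_params(kernel_size=3, in_channels=3, num_filters=64)
# A later layer whose filters span all 64 feature maps from the layer before.
second = conv_layer_params(kernel_size=3, in_channels=64, num_filters=128)

print(first)   # 1792
print(second)  # 73856
```

Neither count depends on the height or width of the input, only on the filter size and the channel counts.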

Different filters in the same layer learn to detect different features. In networks trained on natural images, the first-layer filters tend to learn oriented edge detectors and color blob detectors. Second-layer filters combine first-layer features into textures and junctions. Higher-layer filters detect increasingly abstract structures: shapes, object parts, and eventually semantic categories. This hierarchy of features is not designed; it emerges from training on labeled images.


The Receptive Field

The receptive field of a neuron in layer $\ell$ is the set of input pixels that can influence its activation. A neuron in the first convolutional layer using a 3x3 filter sees a 3x3 region of the input. A neuron in the second layer sees a 3x3 region of the first layer’s feature maps, but each of those first-layer neurons already sees a 3x3 region of the input. So a second-layer neuron effectively sees a 5x5 region of the input. After $k$ layers of 3x3 convolutions, the receptive field of each neuron in that layer is $(2k+1) \times (2k+1)$.

This means deep networks see large regions of the input without requiring large filters. A network with 50 layers of 3x3 convolutions has neurons with a 101x101 receptive field, enough to see most of a typical image. Each filter along the way has only $9C$ weights, so a chain of 50 of them totals $50 \times 9 \times C$, compared with $101^2 \times C$ for a single filter covering the same region directly. The combination of many small filters with deep stacking is what makes modern CNNs efficient.

Strided convolutions and pooling expand the receptive field faster than stride-1 convolutions. A 3x3 filter with stride 2 at layer $\ell$ sees a 3x3 region of the layer below, but that layer below operates at double the spatial resolution, so the effective receptive field in the original input grows faster.
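
The receptive-field figures in this section can be reproduced with a short recurrence: each layer adds `(kernel_size - 1)` steps, where a step is the spacing in input pixels between neighboring units at that depth, and each stride multiplies that spacing. The function below is an illustrative sketch with a made-up name; it assumes square filters and ignores padding effects at the image border.

```python
def receptive_field(layers):
    """Receptive field per side, in input pixels, after a stack of layers.

    `layers` is a list of (kernel_size, stride) pairs, first layer first.
    """
    field = 1  # a single input pixel
    jump = 1   # spacing in input pixels between adjacent units at this depth
    for kernel_size, stride in layers:
        field += (kernel_size - 1) * jump
        jump *= stride
    return field


print(receptive_field([(3, 1)] * 2))   # 5
print(receptive_field([(3, 1)] * 50))  # 101
print(receptive_field([(3, 1)] * 3))   # 7
print(receptive_field([(3, 2)] * 3))   # 15
```

Three stride-2 layers already cover 15 pixels per side, compared with 7 for three stride-1 layers.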


The Architecture Progression: LeNet to ResNet

LeNet-5 (LeCun, 1989-1998) was the first practical CNN, applied to handwritten digit recognition (MNIST). It used two convolutional layers with average pooling and three fully connected layers, with tanh activations throughout. The entire network had roughly 60,000 parameters. The convolutional-pooling-convolutional-pooling-FC structure it introduced is still the template for basic CNNs.

AlexNet (Krizhevsky, Sutskever, Hinton, 2012) won the ImageNet challenge by a large margin and launched the modern era of deep learning. It had 60 million parameters across 5 convolutional layers and 3 fully connected layers. The key innovations were: ReLU activations (reported to train about 6x faster than tanh in the authors' small-scale comparison), dropout for regularization (applied to the FC layers), data augmentation, and GPU training. AlexNet showed that depth and scale, combined with modern hardware, could produce qualitatively better representations than anything available before.

VGGNet (Simonyan, Zisserman, 2014) systematized the insight that many small filters stack better than few large ones. VGG-16 used only 3x3 filters throughout (1x1 filters appear only in another configuration from the same paper) and stacked 13 convolutional layers plus 3 FC layers. Two stacked 3x3 convolutions have the same receptive field as one 5x5 convolution but use $2 \times 9 = 18$ weights per input-output channel pair versus $25$, and gain an extra non-linearity between them. VGGNet was simple and regular: uniform 3x3 convolutions, max pooling to halve resolution, doubling channels after each pooling up to 512. Its extreme simplicity made it highly influential as a baseline.
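
The 18-versus-25 comparison scales up when every layer maps many channels to many channels. The sketch below assumes a hypothetical width of 256 channels in and out of each layer and ignores biases.

```python
def stacked_conv_weights(kernel_size, num_layers, channels):
    """Weights (no biases) in `num_layers` conv layers mapping `channels` to `channels`."""
    return num_layers * kernel_size * kernel_size * channels * channels


channels = 256  # hypothetical layer width, chosen only for illustration
two_3x3 = stacked_conv_weights(3, num_layers=2, channels=channels)
one_5x5 = stacked_conv_weights(5, num_layers=1, channels=channels)

print(two_3x3)            # 1179648
print(one_5x5)            # 1638400
print(two_3x3 / one_5x5)  # 0.72
```

The ratio is 18/25 regardless of the channel count, so the saving holds at any width.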

ResNet (He, Zhang, Ren, Sun, 2015) solved the problem of training very deep networks. Prior to ResNet, networks deeper than about 20 layers would see training error actually increase with depth, a phenomenon called the degradation problem. This is not overfitting (it happens on training data too); it is an optimization failure. Vanishing gradients are the intuitive culprit, although the ResNet authors observed degradation even in networks using batch normalization, where gradient magnitudes stay healthy, so the cause is less clear-cut than that. ResNet introduced residual connections (also called skip connections), which we examine in the next section.


Residual Connections

The degradation problem has a simple theoretical resolution: a deeper network should never do worse than a shallower one, because the extra layers could at minimum learn the identity function and pass activations through unchanged. But learning the identity function with standard weight layers is not easy: gradient-based optimization starting from random initialization must discover that the best thing to do is output the input unchanged, which requires all the weights in the layer to line up in a specific way.

ResNet reframes the problem. Instead of learning $H(x)$ (the desired output of a block of layers), let the layers learn the residual $F(x) = H(x) - x$, so that the output of the block is $F(x) + x$. If the identity is optimal, the network only needs to learn $F(x) = 0$, which is far easier than learning the identity mapping from scratch.

Concretely, a residual block computes:

$$\text{output} = \text{ReLU}(F(x) + x)$$

where $F(x)$ typically consists of two or three convolutional layers with batch normalization and ReLU activations between them, and the $+x$ is a direct shortcut connection that bypasses the convolutional layers entirely. If the spatial dimensions or channel counts differ between $x$ and $F(x)$, a 1x1 convolution is applied to $x$ to match dimensions.
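
A toy version of the block, operating on a plain vector rather than a feature map, shows why the identity is easy to represent. Everything here is a simplified sketch: `residual_fn` stands in for the convolutional layers $F$, and the projection for mismatched shapes is left out.

```python
def relu(values):
    return [max(0.0, v) for v in values]


def residual_block(x, residual_fn):
    """Compute ReLU(F(x) + x) for a vector x; `residual_fn` plays the role of F."""
    fx = residual_fn(x)
    if len(fx) != len(x):
        raise ValueError("F(x) and x must have matching shapes; project x first")
    return relu([f + xi for f, xi in zip(fx, x)])


def zero_residual(x):
    """The easiest thing for F to learn: output zero everywhere."""
    return [0.0] * len(x)


x = [0.5, 2.0, 1.0]  # non-negative, as it would be after an earlier ReLU
print(residual_block(x, zero_residual))  # [0.5, 2.0, 1.0]
```

With $F(x) = 0$ the block passes non-negative activations through unchanged, which is the identity behavior the plain layers struggled to find.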

Writing $y = F(x) + x$ for the sum just before the final ReLU, the gradient of the loss with respect to $x$ decomposes as:

$$\frac{\partial \mathcal{L}}{\partial x} = \frac{\partial \mathcal{L}}{\partial y} \cdot \left(1 + \frac{\partial F}{\partial x}\right)$$

The $+1$ term means that gradients always have a direct path back through the shortcut, regardless of how small $\partial F / \partial x$ becomes. This keeps gradients from vanishing as they pass through the residual blocks. ResNet-152 (152 layers) trained successfully where plain networks had degraded beyond roughly 20 layers, and the principle of residual or skip connections is now standard in virtually every modern architecture.
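
The effect of the $+1$ term compounds across blocks. The sketch below is a deliberately simplified scalar model: each block's $\partial F / \partial x$ is a fixed hypothetical value of 0.1, and the ReLU is ignored, so the numbers illustrate the trend rather than any real network.

```python
def plain_chain_gradient(derivatives):
    """Gradient through stacked layers with no shortcuts: a product of dF/dx terms."""
    grad = 1.0
    for d in derivatives:
        grad *= d
    return grad


def residual_chain_gradient(derivatives):
    """Gradient through stacked residual blocks: each factor becomes (1 + dF/dx)."""
    grad = 1.0
    for d in derivatives:
        grad *= 1.0 + d
    return grad


derivatives = [0.1] * 20  # hypothetical small per-block values of dF/dx

print(plain_chain_gradient(derivatives))     # on the order of 1e-20
print(residual_chain_gradient(derivatives))  # roughly 6.7
```

Without shortcuts the signal shrinks by twenty orders of magnitude over 20 blocks; with them it stays at a usable scale.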


Modern Alternatives: Vision Transformers

The CNN architecture encodes inductive biases: the assumption of translation equivariance (same computation everywhere) and locality (small filters see small neighborhoods). These biases are very useful when data is limited because they constrain the hypothesis space in a way that matches natural images. A CNN with 5 million parameters can learn useful representations from ImageNet because it does not have to re-learn translation invariance from scratch.

Vision Transformers (ViT, Dosovitskiy et al., 2020) largely abandon these inductive biases. An image is split into fixed-size patches (16x16 pixels each in the original configuration), each treated as a “token”, and a standard Transformer encoder processes the sequence of patch embeddings, using self-attention to compute dependencies between all pairs of patches. ViT makes no assumption about spatial locality beyond the patch itself: every patch attends to every other patch, and the model must learn from data that nearby patches tend to be more related.
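
The patch-splitting step can be sketched without any deep learning library. The example assumes a hypothetical 224x224 single-channel image (the input size is not specified in the text above) and only produces the flattened patches; the learned linear embedding and position information that a real ViT adds are omitted.

```python
def patchify(image, patch_size):
    """Split a 2D list into non-overlapping square patches, each flattened row by row."""
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image size must be divisible by patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patches.append([
                image[top + r][left + c]
                for r in range(patch_size)
                for c in range(patch_size)
            ])
    return patches


# Hypothetical 224x224 grayscale image whose pixel values encode their position.
image = [[row * 224 + col for col in range(224)] for row in range(224)]
tokens = patchify(image, 16)

print(len(tokens), len(tokens[0]))  # 196 256
print(len(tokens) ** 2)             # 38416 patch pairs for self-attention
```

The number of attention pairs grows with the square of the token count, which is where the quadratic cost of self-attention comes from.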

This means ViT is less data-efficient than CNNs at small scales. Trained on ImageNet alone (1.2M images), ViT underperforms ResNet. But trained on much larger datasets (JFT-300M, 300 million images), ViT outperforms CNNs by a significant margin. The inductive biases of CNNs are useful when data is scarce; with enough data, the more flexible Transformer architecture can discover richer patterns.

Modernized CNNs like ConvNeXt and EfficientNet show that the gap is not categorical: convolutional networks updated with modern training techniques and design choices (longer training, better augmentation, depthwise separable convolutions) can match ViT at comparable data and compute. The empirical consensus is that the choice depends on available data and compute: CNNs remain competitive in low-data regimes and on edge hardware; ViT-style models dominate at scale.


Summary

| Architecture | Key idea | Parameters | Limitation |
| --- | --- | --- | --- |
| Fully Connected | Each unit connects to all inputs | Quadratic in input size | No spatial structure, scales poorly |
| CNN (LeNet/AlexNet) | Shared filters, local connectivity | Efficient, $O(\text{filter size} \times C)$ per layer | Loses global context; hand-designed pooling schedule |
| VGGNet | Uniform 3x3 filters, doubled channels after pooling | Many FC parameters | Very large due to FC layers |
| ResNet | Residual connections for gradient flow | Efficient at depth | Still limited global receptive field per layer |
| Vision Transformer (ViT) | Self-attention over image patches, no inductive bias | Scales well | Data-hungry; quadratic attention cost in sequence length |
