Helpful context:


Here is a fact that should be deeply uncomfortable: a two-layer neural network with a million parameters trained on ten thousand data points achieves near-zero training loss and generalizes well to held-out data. Classical statistical learning theory says this should not happen. VC dimension grows with the number of parameters; a model with a million parameters fitted to ten thousand points has enormous capacity to memorize. Rademacher complexity bounds predict that the gap between training and test error should be catastrophic. And yet, in practice, it is not. The network generalizes. This is the generalization puzzle for overparameterized neural networks, and it has occupied theorists for a decade.

The standard answer invokes implicit regularization: gradient descent, by some mechanism, finds solutions that generalize despite there being infinitely many zero-training-loss solutions available. But this is an explanation by name, not by mechanism. What exactly does gradient descent find? Why should the minimum-norm solution found by gradient descent generalize at all? Neural Tangent Kernel (NTK) theory provides a rigorous answer for a specific regime - infinite width - by showing that in that regime, training a neural network is exactly equivalent to kernel regression with a specific kernel determined by the network architecture. This connection transforms a poorly-understood nonlinear optimization problem into a well-understood linear one, and reveals why overparameterized networks can generalize: they are implicitly doing kernel regression, and kernel regression has classical theoretical guarantees.

The catch is that the NTK regime - infinite width, lazy training - is not actually the regime in which modern deep learning operates. Real networks are finite-width, learn features, and do things that NTK theory cannot explain. The value of NTK is not that it describes practice, but that it provides a theoretical anchor: a solvable limit where generalization is understood, which illuminates by contrast what feature learning adds beyond that limit.


Neural Networks as Functions of Parameters

Consider a neural network $f(x; \theta)$ where $x \in \mathbb{R}^d$ is the input and $\theta \in \mathbb{R}^p$ is the parameter vector. At initialization, $\theta_0$ is drawn from some distribution (e.g., iid Gaussian with variance $1/n$ per layer, where $n$ is layer width). Under gradient descent with squared loss and a dataset $\{(x_i, y_i)\}_{i=1}^N$, the parameters evolve as:

$$\dot{\theta}(t) = -\nabla_\theta \mathcal{L}(\theta(t)) = -\sum_{i=1}^N (f(x_i; \theta) - y_i) \nabla_\theta f(x_i; \theta).$$

Now linearize $f$ around the initial parameters:

$$f(x; \theta) \approx f(x; \theta_0) + \nabla_\theta f(x; \theta_0)^\top (\theta - \theta_0).$$

This is a linear model in $(\theta - \theta_0)$, with features $\nabla_\theta f(x; \theta_0)$ - the gradient of the network output with respect to the parameters, evaluated at initialization. In this linearized regime, the function $f$ evolves linearly during training, and the dynamics are completely determined by the matrix of inner products:

$$K_{ij} = \left\langle \nabla_\theta f(x_i; \theta_0),; \nabla_\theta f(x_j; \theta_0) \right\rangle = \sum_{k=1}^p \frac{\partial f(x_i; \theta_0)}{\partial \theta_k} \frac{\partial f(x_j; \theta_0)}{\partial \theta_k}.$$

This is the Neural Tangent Kernel (NTK) matrix, evaluated at initialization. It is an $N \times N$ Gram matrix measuring how similar two inputs are in “Jacobian space” - how similarly a small parameter perturbation moves the network’s output at $x_i$ versus $x_j$.


Training Dynamics Under the NTK

In continuous-time gradient descent on the squared loss $\mathcal{L} = \frac{1}{2}|f(X;\theta) - y|^2$, the function values $u(t) = f(X;\theta(t))$ evolve as:

$$\dot{u}(t) = -K(\theta(t)) (u(t) - y)$$

where $K(\theta(t)){ij} = \langle \nabla\theta f(x_i;\theta), \nabla_\theta f(x_j;\theta) \rangle$ is the NTK matrix at time $t$. If $K$ were constant throughout training, this would be a linear ODE:

$$\dot{u}(t) = -K_0 (u(t) - y)$$

with solution $u(t) - y = e^{-K_0 t}(u(0) - y)$. The residual decays exponentially, and if $K_0$ is positive definite, $u(t) \to y$ as $t \to \infty$: training loss converges to zero. The rate of convergence for each “direction” in function space is governed by the eigenvalues of $K_0$: directions corresponding to large eigenvalues converge fast; directions corresponding to small eigenvalues converge slowly or not at all.

This is exactly kernel regression. The solution to which $u(t)$ converges is:

$$f^* = K_0 (K_0 + 0 \cdot I)^{-1} y = y$$

(perfect interpolation on training data, if $K_0$ is invertible). For a new test point $x_*$, the prediction is:

$$f(x_) = k(x_, X) K_0^{-1} y$$

where $k(x_, X) = [\langle \nabla_\theta f(x_;\theta_0), \nabla_\theta f(x_i;\theta_0)\rangle]_i$ is the vector of kernel evaluations between the test point and training points. This is exactly the kernel regression prediction with kernel $K_0$.


The Infinite-Width Result (Jacot et al., 2018)

The linearization above is an approximation for finite networks. The remarkable theorem of Jacot, Gabriel, and Hongler (2018) shows that for fully-connected networks with appropriate initialization (weights drawn iid with variance $1/n$ per layer), as the width $n \to \infty$:

  1. At initialization, the NTK $K(\theta_0)$ converges in probability to a deterministic kernel $K^\infty$ (the “infinite-width NTK”).
  2. During training, the NTK $K(\theta(t))$ remains close to $K^\infty$ for all $t$ simultaneously.

Result 2 is the key. It says that in the infinite-width limit, the NTK does not evolve during training - even as the parameters $\theta(t)$ change continuously, the Gram matrix of Jacobians remains approximately constant. Training a neural network in this regime is exactly equivalent to kernel gradient descent with kernel $K^\infty$.

Why does the NTK freeze? Intuitively: with $p \to \infty$ parameters and a fixed dataset of $N$ points, the gradient $\nabla_\theta f(x;\theta)$ is a vector in $\mathbb{R}^p$. The update $\dot{\theta} = -\nabla_\theta \mathcal{L}$ moves $\theta$ by $O(1/\sqrt{p})$ in each coordinate (the gradient is bounded and spread over $p$ directions). Each individual weight barely moves. Since the NTK is determined by the Jacobian, and the Jacobian at any point changes by $O(1/\sqrt{p})$ per step, the NTK changes by $O(1/\sqrt{p}) \to 0$ as $p \to \infty$. The function $f$ changes dramatically (it goes from random initialization to perfectly fitting the data), but individual parameters move imperceptibly.

This is why the regime is called “lazy training” or the “kernel regime” - the weights are too numerous and too small for any individual weight change to matter, but the collective effect of all their tiny changes implements exactly the kernel regression update.


The NTK for Common Architectures

The infinite-width NTK has a closed-form recursive structure. For a fully-connected network with $L$ layers, activation function $\phi$, and input normalization, the NTK is:

$$K^{(L)}(x, x') = K^{(L-1)}(x, x') + \Theta^{(L-1)}(x, x') \cdot \dot{K}^{(L)}(x, x')$$

where $\Theta^{(L-1)}$ is the (L-1)-layer NTK, $K^{(L)}$ is the corresponding NNGP (Neural Network Gaussian Process) kernel, and $\dot{K}^{(L)}(x, x') = \mathbb{E}_{(u,v) \sim \mathcal{N}(0, \Sigma)}[\phi'(u)\phi'(v)]$ involves the derivative of the activation. For ReLU activations, these expectations can be computed in closed form using the arc-cosine kernel formula.

For the NNGP kernel at layer $l$:

$$K^{(l)}(x, x') = \mathbb{E}_{w \sim \mathcal{N}(0,I)}\left[\phi\left(w^\top x\right)\phi\left(w^\top x'\right)\right]$$

which for ReLU gives $K^{(l)}(x, x') = \frac{|x||x'|}{2\pi}(\sin\theta + (\pi - \theta)\cos\theta)$ where $\theta = \arccos(x^\top x' / |x||x'|)$. The NTK for convolutional networks also has a recursive form, replacing dot products with spatial averages over patches.


Finite-Width Corrections and Feature Learning

The NTK analysis holds exactly only in the strict $n \to \infty$ limit. For finite $n$, two things change:

The NTK evolves. The NTK at training step $t$ differs from the initial NTK by $O(1/\sqrt{n})$ per step. For small networks, this deviation accumulates substantially over training. The network does not behave like fixed kernel regression; instead, the effective kernel changes as features are learned.

Feature learning. In the NTK regime, each neuron’s weights change by $O(1/n)$ over all of training - imperceptibly. The neurons are not reorganized; they are simply computing slightly different linear combinations of the same fixed random features from initialization. In finite-width networks with large learning rates or when mean-field scaling is used instead of NTK scaling, neurons can change by $O(1)$ amounts - they genuinely reorganize their representations. This is called feature learning or the “mean-field” or “muP” regime (Maximal Update Parameterization, Yang et al., 2022).

Feature learning is believed to be necessary for several practically important phenomena:

  • Transfer learning: fine-tuning a pretrained model on a new task works because the learned features from pretraining are useful for the new task. In the NTK regime, features cannot change, so fine-tuning is simply kernel regression with a fixed kernel - much less flexible.
  • Generalization on structured data: on tasks with low-dimensional structure (e.g., learning a function that depends on only a few input features), finite-width networks can discover this structure by reorganizing neurons. The NTK kernel is isotropic and cannot concentrate computation on relevant directions.
  • Depth advantage: empirically, deep networks generalize better than shallow kernel methods with the NTK kernel, suggesting that depth enables feature learning that transcends the NTK.

What NTK Explains About Generalization

Despite its limitations, NTK theory provides the first clean theoretical explanation for why overparameterized networks generalize.

In the infinite-width NTK regime, the trained network is exactly the minimum-norm interpolant of the training data in the reproducing kernel Hilbert space (RKHS) of $K^\infty$. The minimum-norm interpolant minimizes $|f|_{\mathcal{H}}$ subject to $f(x_i) = y_i$ for all training points. For kernel regression, classical theory gives bounds on the generalization error:

$$\mathbb{E}\left[(f(x) - y)^2\right] \leq \frac{|f^*|_{\mathcal{H}}^2}{n} \cdot \text{(complexity term)}$$

where $|f^*|_{\mathcal{H}}$ is the RKHS norm of the true function. If the true function has small RKHS norm under $K^\infty$ - if it is “simple” in the sense defined by the architecture’s NTK - then the minimum-norm interpolant generalizes, even with more parameters than data points.

This gives a concrete mechanism for implicit regularization: gradient descent on overparameterized networks converges to the minimum-norm solution in the NTK RKHS (or an approximation thereof for finite width). The implicit prior is that the target function should be simple relative to the network’s NTK. For natural tasks (image classification, language), this is plausibly satisfied because the architecture’s inductive biases align with the structure of the data.


Summary

Concept NTK regime (infinite width) Feature learning regime (finite width)
Weight changes during training $O(1/\sqrt{p}) \to 0$ $O(1)$
NTK evolution Frozen (constant) Evolves with training
Training is equivalent to Kernel regression with $K^\infty$ Complex nonlinear optimization
Transfer learning Not possible (fixed kernel) Works (features reused)
Depth advantage Encoded in fixed $K^\infty$ Additional via feature reorganization
Theory Fully rigorous Partially understood
Concept Definition
NTK matrix $K_{ij} = \langle \nabla_\theta f(x_i;\theta_0), \nabla_\theta f(x_j;\theta_0) \rangle$ - Gram matrix of Jacobians
Lazy training Regime where NTK is approximately constant during training
Infinite-width limit $n \to \infty$; NTK converges to deterministic $K^\infty$ and freezes
NNGP kernel Kernel induced by an infinite-width network at initialization (Gaussian process)
Kernel regression prediction $f(x_) = k(x_, X) K^{-1} y$
Minimum-norm interpolant Solution with smallest RKHS norm that fits training data exactly
Feature learning Finite-width regime where neurons reorganize; beyond NTK predictions
muP scaling Maximal Update Parameterization; rescales weights so that all layers learn features at the same rate
Generalization in NTK Bounded by $|f^*|_{\mathcal{H}}^2 / n$; holds if target has small RKHS norm

Read next: