Scaling Laws - More Compute, More Data, More Predictably Better
Helpful context:
- Transformers From First Principles - Why Attention Changed Everything
- Statistics - Turning Data Into Defensible Claims
In 2020, OpenAI published a paper showing that language model performance follows precise power laws in three variables: the number of model parameters, the number of training tokens, and the total compute budget. The curves were smooth and predictable across several orders of magnitude. This meant you could, in principle, train a tiny model and extrapolate how well a model 1,000× larger would perform - before spending the compute to build it.
Two years later, DeepMind published “Chinchilla” and delivered an uncomfortable conclusion: GPT-3, then the most capable language model in the world, was significantly undertrained. The scaling laws implied that for the same compute budget, a smaller model trained on more data would outperform it. The field had been building the wrong models.
These results have become some of the most practically consequential findings in machine learning. They don’t tell you how to build a smarter architecture. They tell you how to spend your compute budget - a question that costs billions of dollars to answer empirically.
What Is a Scaling Law?
A scaling law is an empirical relationship of the form:
$$L \propto X^{-\alpha}$$
where $L$ is the model’s loss on a held-out evaluation set, $X$ is some resource (parameters, data, compute), and $\alpha$ is a positive exponent fitted from data.
This is a power law. On a log-log plot, it’s a straight line. The remarkable thing is not that loss decreases as you add resources - of course it does - but that it does so smoothly, predictably, and over many orders of magnitude, without obvious kinks or discontinuities.
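To see why a power law is a straight line on a log-log plot, take logs of both sides: $\log L = \log k - \alpha \log X$. The sketch below checks this numerically. The constants `k` and `alpha` are made-up illustrative values, not fitted ones.

```python
import math

def power_law_loss(x, k=10.0, alpha=0.1):
    """Loss under L = k * x**(-alpha); k and alpha are illustrative values."""
    return k * x ** (-alpha)

def log_log_slopes(xs, k=10.0, alpha=0.1):
    """Slope between consecutive points after taking log10 of both axes."""
    pts = [(math.log10(x), math.log10(power_law_loss(x, k, alpha))) for x in xs]
    return [(y2 - y1) / (x2 - x1) for (x1, y1), (x2, y2) in zip(pts, pts[1:])]

# Resource values spanning six orders of magnitude.
xs = [10.0 ** e for e in range(3, 10)]
slopes = log_log_slopes(xs)
print(slopes)  # every slope equals -alpha, up to floating-point error
```

A constant slope across the whole range is exactly what "no kinks or discontinuities" means in practice.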
The Kaplan et al. (2020) paper measured three such relationships:
$$L(N) \propto N^{-0.076}$$
$$L(D) \propto D^{-0.095}$$
$$L(C) \propto C^{-0.048}$$
Here $N$ is the number of non-embedding parameters, $D$ is the number of training tokens, and $C$ is compute in FLOPs. These were measured on autoregressive language models (GPT-style, next-token prediction) evaluated on held-out text.
Reading the Exponents
Let’s translate these numbers into something concrete.
Parameters: $L(N) \propto N^{-0.076}$. Every doubling of parameters multiplies loss by $2^{-0.076} \approx 0.949$ - a 5.1% reduction. To halve the loss from parameters alone, you’d need $N$ to increase by a factor of $2^{1/0.076} \approx 9,000\times$.
Data: $L(D) \propto D^{-0.095}$. Every doubling of training tokens multiplies loss by $2^{-0.095} \approx 0.936$ - a 6.4% reduction. Marginally more efficient than parameters at this scale.
Compute: $L(C) \propto C^{-0.048}$. Compute combines parameters and data (processing more of either costs more FLOPs), so this exponent describes the efficient frontier: if you split each budget increase between model size and data as well as possible, a doubling of compute buys you about a 3.3% loss reduction.
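The arithmetic behind these percentages is short enough to check directly. A minimal sketch, assuming a pure power law with no irreducible-loss offset; the function names are just for illustration.

```python
# Exponents quoted in the text for parameters, data, and compute.
KAPLAN_EXPONENTS = {"parameters": 0.076, "data": 0.095, "compute": 0.048}

def loss_multiplier(scale_factor, alpha):
    """Factor by which loss changes when a resource grows by scale_factor."""
    return scale_factor ** (-alpha)

def scale_needed(loss_ratio, alpha):
    """Resource multiplier needed to multiply the loss by loss_ratio (< 1)."""
    return loss_ratio ** (-1.0 / alpha)

for name, alpha in KAPLAN_EXPONENTS.items():
    reduction = 1.0 - loss_multiplier(2.0, alpha)
    print(f"{name}: doubling cuts loss by {100 * reduction:.1f}%")

# How much bigger must the model get to halve the loss on its own?
halving_factor = scale_needed(0.5, KAPLAN_EXPONENTS["parameters"])
print(f"parameters needed to halve loss: about {halving_factor:,.0f}x")
```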
Discomfort check. A 5% improvement in loss doesn’t sound impressive. But loss is cross-entropy, measured in nats, and perplexity is its exponential, $e^L$, so small changes in loss compound. On a baseline of $\ln(100) \approx 4.6$ nats, an absolute reduction of just 0.05 nats takes perplexity from 100 to about 95, and a 5% relative reduction (0.23 nats) takes it to about 79. More importantly, these relationships hold over 6+ orders of magnitude. Small exponents on large multipliers produce very large absolute improvements. A 10,000× increase in parameters - which is the difference between a 1M-parameter toy model and a 10B-parameter production model - divides the loss by roughly $10000^{0.076} \approx 2.0$.
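Because perplexity is $e^L$, it matters whether a "5% improvement" means 0.05 nats or 5% of the loss. A small sketch comparing the two readings on the same baseline (the baseline of perplexity 100 is the one used above):

```python
import math

def perplexity(loss_nats):
    """Perplexity is the exponential of the cross-entropy loss in nats."""
    return math.exp(loss_nats)

baseline_loss = math.log(100.0)  # about 4.6 nats, i.e. perplexity 100

# Reading 1: the loss drops by an absolute 0.05 nats.
after_absolute = perplexity(baseline_loss - 0.05)

# Reading 2: the loss drops by 5% of its value (about 0.23 nats).
after_relative = perplexity(baseline_loss * 0.95)

print(f"absolute 0.05 nats: perplexity {after_absolute:.1f}")
print(f"relative 5%:        perplexity {after_relative:.1f}")
```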
These laws were measured with the implicit assumption that $N$ and $D$ are in balance - specifically, that there’s enough training data to realize the parameter count’s potential. What happens when they’re out of balance? That’s where Chinchilla comes in.
The Compute Budget Problem
Here’s the key question the scaling laws raise: given that you have a fixed compute budget $C$ (measured in FLOPs), how should you allocate it between model size $N$ and training tokens $D$?
You can’t choose both independently. Compute, parameters, and tokens are connected. A rough estimate of the training compute is:
$$C \approx 6ND$$
The factor of 6 comes from the forward pass (approximately $2ND$ FLOPs) plus the backward pass (approximately $4ND$ FLOPs, since gradients for both activations and weights are computed). This is approximate but correct to within a factor of 2 for most architectures.
Given a budget of $C$ FLOPs, you choose $N$ and $D$ subject to $6ND = C$. What’s optimal?
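The constraint is easy to put into code. A minimal sketch of the $6ND$ estimate; the 1B-parameter, 20B-token run is a hypothetical example, not a real model.

```python
def training_flops(n_params, n_tokens):
    """Approximate training compute: 2ND forward plus 4ND backward."""
    forward = 2 * n_params * n_tokens
    backward = 4 * n_params * n_tokens
    return forward + backward

def tokens_for_budget(flops, n_params):
    """Tokens you can afford at a fixed budget once the model size is chosen."""
    return flops / (6 * n_params)

# Hypothetical run: 1B parameters trained on 20B tokens.
budget = training_flops(1e9, 20e9)
print(f"budget: {budget:.2e} FLOPs")

# Doubling the model at the same budget halves the affordable tokens.
print(f"tokens for a 2B model: {tokens_for_budget(budget, 2e9):.2e}")
```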
Kaplan et al. (2020) argued: scale parameters much more aggressively than data. Their fits suggested that larger models are more sample-efficient, so most of any increase in compute should go into model size (roughly $N_{\text{opt}} \propto C^{0.73}$), with training tokens growing far more slowly. This thinking shaped GPT-3: 175 billion parameters trained on 300 billion tokens.
Chinchilla: A Correction
Hoffmann et al. (2022), “Training Compute-Optimal Large Language Models,” revisited this question with a more careful experimental design. They trained hundreds of models of varying sizes and token counts, measured their losses, and fitted the optimal allocation directly.
Their finding: for compute-optimal training, model parameters and training tokens should scale in roughly equal proportion. Specifically:
$$D_{\text{optimal}} \approx 20 \cdot N$$
For 175B parameters (GPT-3 scale), the Chinchilla formula says you should use roughly $20 \times 175 \times 10^9 = 3.5$ trillion training tokens. GPT-3 used 300 billion - about 12× too few.
To illustrate: Chinchilla itself was a 70B-parameter model trained on 1.4 trillion tokens, using roughly the same compute budget as Gopher (280B parameters, 300B tokens). Chinchilla matched or exceeded Gopher on almost every benchmark, with a model one quarter the size.
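The numbers in this section can be checked with the 20-tokens-per-parameter rule and the $6ND$ estimate. A sketch, using only the model sizes and token counts quoted above:

```python
TOKENS_PER_PARAM = 20  # the rule-of-thumb ratio quoted above

def chinchilla_tokens(n_params):
    """Compute-optimal token count under the 20:1 rule of thumb."""
    return TOKENS_PER_PARAM * n_params

def training_flops(n_params, n_tokens):
    """Approximate training compute, C = 6ND."""
    return 6 * n_params * n_tokens

# GPT-3: how far short of the rule was its 300B-token run?
gpt3_shortfall = chinchilla_tokens(175e9) / 300e9
print(f"GPT-3 tokens shortfall: about {gpt3_shortfall:.1f}x")

# Gopher and Chinchilla: similar compute, very different allocation.
gopher_flops = training_flops(280e9, 300e9)
chinchilla_flops = training_flops(70e9, 1.4e12)
print(f"Gopher:     {gopher_flops:.2e} FLOPs")
print(f"Chinchilla: {chinchilla_flops:.2e} FLOPs")
```

Under this estimate the two budgets agree to within about 15%, which is what "same compute budget" means here.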
The practical consequence: the “bigger is better” intuition was right about the direction but wrong about the rate. You should increase data roughly as fast as you increase parameters.
A More Precise Statement
Hoffmann et al. fit the following functional form for the loss:
$$L(N, D) = \frac{A}{N^\alpha} + \frac{B}{D^\beta} + L_{\infty}$$
where $L_\infty$ is the irreducible loss (the minimum achievable, even with infinite model and data), and $A$, $B$, $\alpha$, $\beta$ are fitted constants. Their best estimates: $\alpha \approx 0.34$, $\beta \approx 0.28$.
Notice that parameters and data now have separate exponents. The cross-entropy loss decreases as a sum of two power-law terms - one in $N$, one in $D$ - plus a floor. This functional form allows you to compute the optimal allocation directly: minimize $L(N, D)$ subject to $6ND = C$.
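The functional form is simple to evaluate. The sketch below assumes the constants $A = 406.4$, $B = 410.7$ and $L_\infty = 1.69$; these are recalled from Hoffmann et al. and are not given in this article, so treat them as assumptions. The two configurations compared are the Gopher-like and Chinchilla-like allocations discussed above.

```python
def chinchilla_loss(n_params, n_tokens,
                    A=406.4, B=410.7, alpha=0.34, beta=0.28, L_inf=1.69):
    """Parametric loss L(N, D) = A/N^alpha + B/D^beta + L_inf.

    A, B and L_inf are assumed values recalled from Hoffmann et al. (2022).
    """
    return A / n_params ** alpha + B / n_tokens ** beta + L_inf

gopher_like = chinchilla_loss(280e9, 300e9)      # large model, fewer tokens
chinchilla_like = chinchilla_loss(70e9, 1.4e12)  # smaller model, more tokens

print(f"280B params, 300B tokens: predicted loss {gopher_like:.3f}")
print(f"70B params, 1.4T tokens:  predicted loss {chinchilla_like:.3f}")
```

With these constants the smaller, longer-trained configuration comes out ahead because the data term dominates the loss of the larger one.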
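Substituting $D = C/(6N)$ and setting the derivative with respect to $N$ to zero (a calculus exercise) gives the condition $\alpha A N^{-\alpha} = \beta B D^{-\beta}$, which solves to:
$$N_{\text{opt}} = G\left(\frac{C}{6}\right)^{\beta/(\alpha+\beta)}, \qquad D_{\text{opt}} = \frac{1}{G}\left(\frac{C}{6}\right)^{\alpha/(\alpha+\beta)}, \qquad G = \left(\frac{\alpha A}{\beta B}\right)^{1/(\alpha+\beta)}$$
Because $\alpha$ and $\beta$ are close, both $N_{\text{opt}}$ and $D_{\text{opt}}$ grow roughly as $\sqrt{C}$, so the tokens-per-parameter ratio changes only slowly with the budget. The widely quoted ratio of roughly 20:1 (tokens per parameter) is a rule of thumb summarizing the paper’s estimates at the compute budgets it studied; the exact value depends on the fitted constants and on $C$. This is the number people mean by the “Chinchilla constant.”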
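The constrained minimization can also be solved in closed form and checked numerically. Substituting $D = C/(6N)$ into the loss and setting the derivative to zero gives $N_{\text{opt}} = \left(\frac{\alpha A}{\beta B}\right)^{1/(\alpha+\beta)} (C/6)^{\beta/(\alpha+\beta)}$. The sketch below again assumes $A = 406.4$, $B = 410.7$ and $L_\infty = 1.69$, values recalled from Hoffmann et al. rather than stated in this article.

```python
def chinchilla_loss(n_params, n_tokens,
                    A=406.4, B=410.7, alpha=0.34, beta=0.28, L_inf=1.69):
    """Parametric loss; A, B and L_inf are assumed values (see lead-in)."""
    return A / n_params ** alpha + B / n_tokens ** beta + L_inf

def optimal_allocation(flops, A=406.4, B=410.7, alpha=0.34, beta=0.28):
    """Closed-form minimizer of the parametric loss subject to 6ND = flops."""
    k = flops / 6.0                                   # the product N * D
    g = (alpha * A / (beta * B)) ** (1.0 / (alpha + beta))
    n_opt = g * k ** (beta / (alpha + beta))
    d_opt = k / n_opt                                 # stay on the constraint
    return n_opt, d_opt

budget = 5.88e23  # 6 * 70e9 * 1.4e12, the Chinchilla-scale budget
n_opt, d_opt = optimal_allocation(budget)
best_loss = chinchilla_loss(n_opt, d_opt)

# Sanity check: sliding along the constraint in either direction
# should never give a lower loss than the closed-form optimum.
for shift in (0.5, 0.9, 1.1, 2.0):
    n = n_opt * shift
    assert chinchilla_loss(n, budget / (6.0 * n)) >= best_loss

print(f"N_opt = {n_opt:.2e}, D_opt = {d_opt:.2e}, ratio = {d_opt / n_opt:.0f}")
```

With these particular constants the optimum at this budget sits at a tokens-per-parameter ratio well above 20, which is a good reason to treat 20:1 as a rough guide and not as an exact constant.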
What Scaling Laws Predict - and Don’t
Scaling laws predict the cross-entropy loss of language models on held-out text. This is a well-defined, continuous quantity. It is not a direct measure of capability on any specific task.
This creates a puzzle. In practice, as models scale, their capabilities on tasks like arithmetic, multi-step reasoning, and code generation don’t increase smoothly - they improve slowly for a long time, then appear to jump. These have been called emergent capabilities: abilities that are essentially absent at small scale and clearly present at large scale, with a sharp threshold in between.
Examples documented in the literature:
- Modular arithmetic: models fail entirely below roughly 50B parameters, then succeed.
- 3-digit addition: flat near zero, then sharply correct above a threshold.
- Chain-of-thought reasoning: absent in smaller models, appears in models above ~100B parameters.
Whether these are truly discontinuous phase transitions or artifacts of measurement (the metrics jump sharply even when the underlying probability shifts smoothly) is actively debated. Schaeffer et al. (2023) showed that some emergent capabilities appear continuous when measured with smoother metrics. But the practical observation stands: smooth loss improvements don’t always predict when a new capability will appear.
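The measurement-artifact argument can be illustrated with a toy model. Suppose per-token accuracy improves smoothly with scale, but the benchmark only awards credit when every token of a 10-token answer is right. The sketch assumes token errors are independent, which is a simplification made here for illustration and not a claim from the cited paper.

```python
def exact_match_rate(per_token_accuracy, answer_length):
    """Chance that all tokens are correct, assuming independent token errors."""
    return per_token_accuracy ** answer_length

# Per-token accuracy rising in even steps from 0.50 to 0.95.
per_token = [0.50 + 0.05 * i for i in range(10)]
exact = [exact_match_rate(p, 10) for p in per_token]

for p, em in zip(per_token, exact):
    print(f"per-token {p:.2f} -> exact match {em:.3f}")
```

The per-token column moves in even steps while the exact-match column stays near zero for most of the range and then climbs steeply, which looks like a threshold even though nothing discontinuous happened underneath.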
Discomfort check. Scaling laws are empirically fitted to a specific training paradigm: transformer architecture, next-token prediction, a particular data distribution, the Adam optimizer. They tell you how to optimally allocate resources given your current setup. They say nothing about whether a different architecture (e.g., state-space models), a different objective (e.g., RLHF), or a different data mix might change the constants or the functional form entirely. Architecture innovations like the transformer’s replacement of LSTMs weren’t predicted by any scaling law - they changed the game. Scaling laws describe the rules of the current game, not whether a better game exists.
Inference-Optimal vs. Training-Optimal Models
The Chinchilla analysis optimizes for training: given a fixed FLOPs budget for training, what model should you train?
But there’s a separate question: given a model you’ll run millions of times in production, what should it look like?
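Inference cost scales with model size. A 70B-parameter model costs roughly 10× more per token to run than a 7B-parameter model. If you’re serving millions of requests per day, cumulative inference cost can overtake the one-time training cost: at roughly $2N$ FLOPs per generated token against $6ND$ for training, the crossover comes once you have served about $3D$ tokens.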
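A back-of-the-envelope comparison of the two costs, assuming inference takes about $2N$ FLOPs per token (the forward-pass share of the $6ND$ estimate). This ignores attention cost over long contexts, batching, and hardware utilization, so it is only a rough guide. The 70B model trained on 1.4T tokens is used as an example deployment.

```python
def training_flops(n_params, n_tokens):
    """One-time training cost, C = 6ND."""
    return 6 * n_params * n_tokens

def inference_flops(n_params, tokens_served):
    """Cumulative serving cost at roughly 2N FLOPs per token (assumption)."""
    return 2 * n_params * tokens_served

def breakeven_tokens(n_train_tokens):
    """Tokens served at which inference cost equals training cost: 2N*T = 6N*D."""
    return 3 * n_train_tokens

n_params, n_train_tokens = 70e9, 1.4e12
crossover = breakeven_tokens(n_train_tokens)
print(f"inference matches training after {crossover:.1e} served tokens")

# Per-token cost ratio between a 70B and a 7B model.
ratio = inference_flops(70e9, 1) / inference_flops(7e9, 1)
print(f"70B vs 7B per-token cost: {ratio:.0f}x")
```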
This leads to the “inference-optimal” training regime, sometimes called “overtraining”: take a model smaller than Chinchilla-optimal for your compute budget, and train it on more tokens than Chinchilla recommends. The model won’t reach the minimum possible loss for that compute budget, but it will be smaller and cheaper to run, and its quality per inference FLOP will be better.
LLaMA (Touvron et al., 2023) explicitly adopted this philosophy. LLaMA-7B was trained on 1 trillion tokens - far more than the ~140B tokens the Chinchilla ratio recommends for a 7B model. The payoff, as reported in the paper: LLaMA-13B outperforms the 175B-parameter GPT-3 on most benchmarks while being small enough to run on a single GPU.
The tradeoff:
| Regime | Model size | Training tokens | Best use case |
|---|---|---|---|
| Chinchilla-optimal | Larger | Fewer | Single run, research experiments |
| Inference-optimal | Smaller | Many more | High-volume production deployment |
Both are valid - they optimize different objectives.
Test-Time Compute Scaling
Training scaling laws describe a one-time cost. But there’s a separate axis: how much compute you spend at inference time.
For a given model, you can spend more test-time compute by:
- Generating multiple candidate outputs and selecting the best (majority voting, best-of-N).
- Using chain-of-thought prompting, which has the model write out intermediate reasoning (“think”) before answering.
- Running a search procedure over intermediate reasoning steps.
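The first option in the list above has a simple idealized model. The sketch assumes samples are independent, that best-of-N has a perfect verifier to pick out a correct answer, and that majority voting is over a question with only two possible answers. Real systems violate all three assumptions to some degree.

```python
from math import comb

def best_of_n_success(p, n):
    """Chance at least one of n independent samples is correct
    (assumes a perfect verifier selects it)."""
    return 1.0 - (1.0 - p) ** n

def majority_vote_success(p, n):
    """Chance the majority of n samples is correct (n odd, two possible answers)."""
    return sum(comb(n, k) * p ** k * (1.0 - p) ** (n - k)
               for k in range(n // 2 + 1, n + 1))

for n in (1, 4, 16):
    print(f"best-of-{n} at p=0.2: {best_of_n_success(0.2, n):.3f}")

for n in (1, 5, 15):
    print(f"majority of {n} at p=0.6: {majority_vote_success(0.6, n):.3f}")
```

Note the asymmetry: with a verifier, extra samples help even when each sample is usually wrong, while majority voting only helps when a single sample is right more often than not.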
Recent work (Snell et al., 2024 and related papers) suggests that test-time compute has its own scaling law: more inference compute leads to predictable improvement on reasoning tasks. The OpenAI o1 model and DeepSeek-R1 are built around this idea - they trade inference compute (generating long chains of thought) for quality on hard problems.
This opens a second dimension of resource allocation: not just “how big should I train?” but “how much should I think at inference time?” The two may be substitutable: a smaller model that thinks longer may match a larger model thinking quickly. Where the crossover lies is an active research question.
Practical Implications
If you’re building ML systems, the scaling laws give you three actionable guidelines:
Rule 1: Use the Chinchilla formula to size your training run. For a compute budget $C$ FLOPs, set $N \approx \sqrt{C/120}$ parameters and $D \approx 20N$ tokens. (The 120 comes from $6 \times 20$.)
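Rule 1 as a function, assuming the 20:1 rule of thumb and $C \approx 6ND$. The example budget is the one implied by a 70B-parameter model trained on 1.4T tokens.

```python
import math

def chinchilla_size(flops, tokens_per_param=20.0):
    """Solve 6 * N * (tokens_per_param * N) = flops for N, then D."""
    n_params = math.sqrt(flops / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

budget = 6 * 70e9 * 1.4e12  # about 5.9e23 FLOPs
n_params, n_tokens = chinchilla_size(budget)
print(f"N = {n_params:.2e} parameters, D = {n_tokens:.2e} tokens")
```

The `tokens_per_param` argument is exposed so the same function can size a deliberately overtrained model by passing a larger ratio.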
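Rule 2: If you care about inference cost, overtrain a smaller model. For example, take a model half the Chinchilla-optimal size and double the training tokens. Training compute $C \approx 6ND$ is unchanged, each generated token costs about half as much, and because loss is fairly flat near the optimum, quality stays close to that of the compute-optimal model.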
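The trade in Rule 2 can be sketched with the parametric loss from earlier. The constants $A = 406.4$, $B = 410.7$ and $L_\infty = 1.69$ are assumed values recalled from Hoffmann et al., and inference cost is taken as roughly $2N$ FLOPs per token, so the output is indicative only.

```python
def chinchilla_loss(n_params, n_tokens,
                    A=406.4, B=410.7, alpha=0.34, beta=0.28, L_inf=1.69):
    """Parametric loss; A, B and L_inf are assumed values (see lead-in)."""
    return A / n_params ** alpha + B / n_tokens ** beta + L_inf

def describe(n_params, n_tokens):
    """Training cost, per-token inference cost, and predicted loss for one run."""
    return {
        "train_flops": 6 * n_params * n_tokens,
        "flops_per_token": 2 * n_params,  # rough forward-pass estimate
        "loss": chinchilla_loss(n_params, n_tokens),
    }

baseline = describe(70e9, 1.4e12)     # sized by the 20:1 rule
overtrained = describe(35e9, 2.8e12)  # half the size, twice the tokens

for name, run in (("baseline", baseline), ("overtrained", overtrained)):
    print(f"{name}: train {run['train_flops']:.2e} FLOPs, "
          f"{run['flops_per_token']:.1e} FLOPs/token, loss {run['loss']:.3f}")
```

Under these assumptions the two runs cost the same to train and land at nearly the same predicted loss, while the smaller one is half as expensive to serve.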
Rule 3: Predict before you train. Train a series of small models at varying sizes and token counts, measure their losses, fit the scaling curve, and extrapolate to your target size. This is how large training runs are typically sized before the compute is committed.
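A minimal sketch of Rule 3: fit a straight line in log-log space to small-model results, then extrapolate. The "measurements" here are synthetic, generated from a known power law with a 1% wobble, so the example only demonstrates the mechanics. A real fit would usually also include an irreducible-loss term and data-size dependence.

```python
import math

def fit_power_law(xs, losses):
    """Least-squares line through (log x, log loss); returns (k, alpha)
    for the model loss = k * x**(-alpha)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(v) for v in losses]
    mean_x, mean_y = sum(lx) / len(lx), sum(ly) / len(ly)
    slope = (sum((a - mean_x) * (b - mean_y) for a, b in zip(lx, ly))
             / sum((a - mean_x) ** 2 for a in lx))
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope

def predict(x, k, alpha):
    """Extrapolated loss at resource level x."""
    return k * x ** (-alpha)

# Synthetic small-model results (hypothetical sizes, made-up noise).
true_k, true_alpha = 20.0, 0.076
sizes = [1e6, 3e6, 1e7, 3e7, 1e8]
wobble = [1.01, 0.99, 1.00, 1.01, 0.99]
losses = [true_k * n ** (-true_alpha) * w for n, w in zip(sizes, wobble)]

k, alpha = fit_power_law(sizes, losses)
target = 1e10  # 100x beyond the largest model in the fit
predicted = predict(target, k, alpha)
actual = true_k * target ** (-true_alpha)
print(f"fitted alpha = {alpha:.4f}, predicted loss = {predicted:.3f}, "
      f"true loss = {actual:.3f}")
```

Even 1% noise on five points shifts the fitted exponent slightly, and that error grows with the extrapolation distance, which is why more points over a wider range are worth the extra small runs.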
Summary
| Relationship | Formula | Implication |
|---|---|---|
| Loss vs. parameters | $L \propto N^{-0.076}$ | 2× params → 5% better loss |
| Loss vs. data | $L \propto D^{-0.095}$ | 2× tokens → 6.4% better loss |
| Loss vs. compute | $L \propto C^{-0.048}$ | 2× FLOPs → 3% better loss |
| Chinchilla optimal ratio | $D \approx 20N$ | Scale data and params equally |
| Training cost | $C \approx 6ND$ | Compute = 6 × params × tokens |
Scaling laws are perhaps the clearest empirical regularity in deep learning. They tell you that improvement is predictable, and they tell you how to achieve the most improvement per dollar. They don’t tell you when new capabilities will emerge, and they don’t tell you what architectural innovations might change the rules. But within the current paradigm, they are the closest thing to a physics of language models.