Scaling Laws
What Scaling Laws Are
Language model training loss follows smooth, predictable power laws in model size, dataset size, and compute budget. This empirical regularity - characterised systematically by Kaplan et al. (2020) at OpenAI and refined by Hoffmann et al. (2022) at DeepMind - has become the primary framework for deciding how to allocate resources when training large language models. The central question scaling laws answer is: given a fixed compute budget $C$, how should you distribute it between training a larger model on fewer tokens and training a smaller model on more tokens?
The Kaplan et al. Power Laws
Kaplan et al. fit test loss as a power law in each axis independently, holding the others large:
$$L(N) \approx \left(\frac{N_c}{N}\right)^{\alpha_N}, \quad L(D) \approx \left(\frac{D_c}{D}\right)^{\alpha_D}$$
with empirically estimated exponents $\alpha_N \approx 0.076$ and $\alpha_D \approx 0.095$. The key observation is that both curves are smooth power laws over many orders of magnitude - there are no cliffs or phase transitions, just diminishing returns that are entirely predictable. This predictability is practically valuable: you can extrapolate loss at $10\times$ the parameters from runs at $1\times$ and $3\times$.
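As a minimal illustration of that extrapolation, the sketch below fits the exponent from two hypothetical small runs and predicts the loss one decade up in model size. The model sizes and loss values are invented for illustration, not measurements from any real training run.

```python
import numpy as np

# Hypothetical losses at two small model sizes (illustrative numbers only).
N_small = np.array([1e8, 3e8])    # parameters
L_small = np.array([3.60, 3.31])  # observed test loss at each size

# L(N) = (N_c / N)^alpha_N  =>  log L is linear in log N with slope -alpha_N.
slope, intercept = np.polyfit(np.log(N_small), np.log(L_small), deg=1)
alpha_N = -slope
N_c = np.exp(intercept / alpha_N)

# Extrapolate one order of magnitude beyond the larger run.
N_big = 3e9
L_pred = (N_c / N_big) ** alpha_N
print(f"alpha_N ≈ {alpha_N:.3f}, predicted loss at {N_big:.0e} params ≈ {L_pred:.2f}")
```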
For the compute budget, using the approximation $C \approx 6ND$ (the forward pass costs roughly 2 FLOPs per parameter per token - one multiply and one add per weight - and the backward pass costs roughly twice the forward, giving 6 FLOPs per parameter per token), Kaplan et al. found that compute-optimal training satisfies approximately $N \propto C^{0.73}$ and $D \propto C^{0.27}$. This recommends scaling model size much faster than training tokens - an asymmetry that turned out to be partially wrong.
The Chinchilla Scaling Laws
Hoffmann et al. (2022) ran a more systematic sweep over $(N, D)$ combinations at each of several fixed compute budgets, rather than varying $N$ and $D$ one at a time. They fit a combined loss function:
$$L(N, D) = E + \frac{A}{N^\alpha} + \frac{B}{D^\beta}$$
where $E$ is the irreducible entropy of the data (the best possible loss on the training distribution), $A/N^\alpha$ captures the contribution from model size, and $B/D^\beta$ captures the contribution from data. Their estimates: $\alpha \approx 0.34$, $\beta \approx 0.28$, with $E$, $A$, $B$ fit from data.
Minimising $L(N, D)$ subject to the compute constraint $C = 6ND$ (equivalently, $D = C / 6N$) and differentiating with respect to $N$ gives $N_{\text{opt}} \propto C^{\beta/(\alpha+\beta)}$ and $D_{\text{opt}} \propto C^{\alpha/(\alpha+\beta)}$; with $\alpha \approx \beta$, both exponents are close to $0.5$, yielding the Chinchilla-optimal allocation:
$$N_{\text{opt}} \propto C^{0.5}, \quad D_{\text{opt}} \propto C^{0.5}$$
Both model size and training tokens should scale equally with compute. The practical consequence is the widely cited rule of thumb: train on approximately 20 tokens per parameter. GPT-3 (175B parameters, 300B tokens, $\approx 1.7$ tokens/parameter) was dramatically undertrained by this criterion. Chinchilla-70B, trained on 1.4 trillion tokens ($\approx 20$ tokens/parameter), matched or exceeded GPT-3 on most benchmarks with roughly $2.5\times$ fewer parameters.
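The sketch below works through this allocation numerically. The exponent formula follows from the derivation above; the 20-tokens-per-parameter figure is taken as the quoted rule of thumb rather than re-derived from the fitted constants.

```python
# Compute-optimal split implied by L(N, D) = E + A/N^alpha + B/D^beta with C = 6ND.
# Setting dL/dN = 0 gives N_opt ∝ C^(beta/(alpha+beta)) and D_opt ∝ C^(alpha/(alpha+beta));
# because alpha ≈ beta, both exponents come out close to 0.5.
alpha, beta = 0.34, 0.28
a = beta / (alpha + beta)   # exponent on C for N_opt
b = alpha / (alpha + beta)  # exponent on C for D_opt
print(f"N_opt ~ C^{a:.2f}, D_opt ~ C^{b:.2f}")

# Applying the ~20 tokens/parameter rule of thumb at GPT-3's budget (C ≈ 3.15e23 FLOPs):
# with D = 20N and C = 6ND, N = sqrt(C / 120).
C = 3.15e23
N_opt = (C / 120) ** 0.5
D_opt = 20 * N_opt
print(f"N_opt ≈ {N_opt / 1e9:.0f}B parameters, D_opt ≈ {D_opt / 1e12:.1f}T tokens")
```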
Kaplan vs. Hoffmann: Where They Disagreed
The disagreement stems from a methodological difference. Kaplan et al. trained every model for a fixed token budget with the same learning-rate schedule, rather than matching the schedule to each run's length and training toward convergence; this flatters larger models, which make more progress per step at a fixed step budget. Hoffmann et al. matched the learning-rate schedule to each $(N, D)$ pair, trained each configuration to near-convergence, and swept the full frontier. When the full frontier is swept, the optimal $N/D$ ratio shifts dramatically toward more data: model size and tokens should scale together, not independently.
Compute, Model Size, and Training Tokens
The compute relationship $C \approx 6ND$ deserves a brief derivation. A forward pass through a linear layer with an $m \times n$ weight matrix costs approximately $2mn$ FLOPs per token (one multiply and one add per weight). For a Transformer with $n_{\text{layers}}$ layers, hidden dimension $d_{\text{model}}$, and $d_{ff} = 4d_{\text{model}}$, the dominant per-token forward costs are:
- Attention projections: $4 \times 2 d_{\text{model}}^2 \times n_{\text{layers}} = 8 d_{\text{model}}^2 n_{\text{layers}}$
- FFN projections: $2 \times 2 \times 4 d_{\text{model}}^2 \times n_{\text{layers}} = 16 d_{\text{model}}^2 n_{\text{layers}}$
Total forward pass: approximately $24 d_{\text{model}}^2 n_{\text{layers}}$ FLOPs per token. Since the parameter count of this architecture is $N \approx 12 d_{\text{model}}^2 n_{\text{layers}}$, the forward pass costs $\approx 2N$ FLOPs per token. The backward pass costs approximately $2\times$ the forward pass, giving $6N$ FLOPs per training token and $C \approx 6ND$ for a full training run of $D$ tokens.
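A short sketch of this accounting, using GPT-3-like dimensions (96 layers, $d_{\text{model}} = 12288$) purely as an illustrative configuration:

```python
def training_flops(n_layers: int, d_model: int, n_tokens: float) -> float:
    """Approximate training FLOPs, counting only the dominant matrix multiplies.

    Assumes d_ff = 4 * d_model and ignores attention-score FLOPs, embeddings,
    and normalisation layers, matching the approximation in the text.
    """
    attn_proj = 8 * d_model**2 * n_layers     # Q, K, V and output projections
    ffn_proj = 16 * d_model**2 * n_layers     # two d_model <-> 4*d_model matmuls
    forward_per_token = attn_proj + ffn_proj  # ≈ 24 * d_model^2 * n_layers ≈ 2N
    return 3 * forward_per_token * n_tokens   # backward ≈ 2x forward, so 3x in total

# Cross-check against C ≈ 6ND: with N ≈ 12 * d_model^2 * n_layers, both sides agree.
n_layers, d_model, D = 96, 12288, 300e9
N_approx = 12 * d_model**2 * n_layers
print(f"layer-by-layer: {training_flops(n_layers, d_model, D):.2e} FLOPs")
print(f"6*N*D:          {6 * N_approx * D:.2e} FLOPs")
```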
Emergent Abilities
Several capabilities appear to be absent at smaller model scales and then present at larger scales - a phenomenon Wei et al. (2022) called emergent abilities. Examples include few-shot arithmetic, chain-of-thought reasoning, and certain instruction-following behaviours. These appear as discontinuous jumps on standard benchmarks plotted against model scale.
The interpretation is contested. Schaeffer et al. (2023) argue that emergent abilities are largely an artifact of benchmark and metric choice: on non-linear or discontinuous metrics (such as exact-match accuracy on a task that transitions from near-0% to near-100% across a narrow compute range), smooth underlying capability gains can appear as sharp transitions. When the same models are evaluated on continuous metrics (e.g., log probability of the correct answer), the discontinuity often disappears. The debate remains open, but practitioners should treat discontinuous capability claims with healthy scepticism and prefer continuous evaluation metrics where possible.
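The toy calculation below illustrates the metric-choice argument: a smoothly improving per-token accuracy, raised to the power of the answer length, produces an exact-match curve that looks like an abrupt jump. The scale axis and the accuracy curve are invented for illustration.

```python
import numpy as np

# If per-token accuracy p improves smoothly with scale, exact-match accuracy on a
# k-token answer behaves roughly like p**k, which can look like a sudden "emergent"
# transition even though the underlying capability improves gradually.
scale = np.logspace(0, 6, 7)      # arbitrary units of model scale (illustrative)
p = 1 - 0.5 * scale ** -0.25      # smooth, power-law-like gain in per-token accuracy
k = 10                            # answer length in tokens
exact_match = p ** k

for s, pt, em in zip(scale, p, exact_match):
    print(f"scale={s:9.0f}  per-token acc={pt:.3f}  exact-match={em:.3f}")
```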
Compute-Optimal vs. Inference-Optimal Training
Chinchilla-optimal models minimise training loss for a fixed compute budget. But for deployed models, inference cost often dominates training cost - particularly when a model will serve millions of requests. A smaller model costs less per inference token. This motivates inference-optimal training: deliberately train a smaller model on many more tokens than Chinchilla-optimal, accepting higher training compute in exchange for a model that is cheaper to serve.
The Llama series (Touvron et al., 2023) explicitly adopts this strategy. Llama 1 7B is trained on 1 trillion tokens - roughly $143$ tokens/parameter, versus the Chinchilla-optimal $\approx 20$. The resulting model is not compute-optimal during training, but it achieves quality competitive with much larger models at a fraction of the inference cost. This trade-off makes sense whenever:
$$\text{(extra training compute)} < \text{(inference compute savings across all future queries)}$$
For models that will be queried billions of times, the breakeven point favours significant over-training of small models.
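A rough way to estimate where that breakeven sits is sketched below, assuming training costs about $6ND$ FLOPs, inference costs about $2N$ FLOPs per generated token, and - crucially - that the two configurations reach comparable quality (an assumption for illustration, not something the scaling fit guarantees). The model sizes are illustrative, not a quality-matched pair from any paper.

```python
def breakeven_tokens(n_small: float, d_small: float,
                     n_large: float, d_large: float) -> float:
    """Generated tokens needed before over-training the small model pays off.

    Training cost ≈ 6*N*D FLOPs; inference cost ≈ 2*N FLOPs per generated token.
    Assumes the two configurations are of comparable quality.
    """
    extra_training = 6 * n_small * d_small - 6 * n_large * d_large
    saving_per_token = 2 * (n_large - n_small)
    return max(extra_training, 0.0) / saving_per_token

# Illustrative comparison: a 7B model over-trained on 2T tokens versus a 13B model
# trained at ~20 tokens/parameter (260B tokens).
tokens = breakeven_tokens(7e9, 2e12, 13e9, 260e9)
print(f"breakeven after ≈ {tokens:.1e} generated tokens")
```

Under these assumptions the breakeven lands in the trillions of generated tokens - a volume that heavily used deployments can plausibly reach, which is why the inequality above often favours over-training.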
Examples
Chinchilla vs. GPT-3. GPT-3 has 175B parameters trained on 300B tokens ($C \approx 3.15 \times 10^{23}$ FLOPs); under the 20-tokens-per-parameter rule, the Chinchilla-optimal allocation of that budget would be roughly 50B parameters trained on about 1T tokens. Chinchilla-70B itself, trained on 1.4T tokens ($\approx 20$ tokens/parameter, $C \approx 5.9 \times 10^{23}$ FLOPs, a little under twice GPT-3's budget), achieves lower perplexity on The Pile, outperforms GPT-3 on MMLU (67.5% vs. 43.9%), and runs at roughly $2.5\times$ lower inference cost. The parameters-versus-data trade-off resolves strongly in favour of more data, validating the Hoffmann et al. correction to Kaplan et al.
Why Llama models are trained beyond Chinchilla-optimal. Llama 2 7B uses 2 trillion training tokens ($\approx 286$ tokens/parameter). At this scale, the model’s training loss is well below what it would have been at the Chinchilla-optimal 140B-token budget, but the extra compute spent getting there is not compute-optimal in the Chinchilla sense. It is justified by the economics of inference at scale: once the model is deployed, the cost per query scales with model size, not with training budget. From an inference-economics perspective, serving an over-trained 7B model costs roughly a tenth as much per generated token as serving a 70B model trained at 20 tokens/parameter, even though the larger model's training run was the more compute-efficient route to a given loss.
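As a back-of-the-envelope check on this argument - same FLOP accounting as above, and with no claim that the two models are of equal quality - lifetime compute for the two deployment choices can be compared as a function of tokens served:

```python
# Lifetime compute = training + inference, with training ≈ 6*N*D FLOPs and
# inference ≈ 2*N FLOPs per generated token. Illustrative accounting only.
def lifetime_flops(n_params: float, train_tokens: float, served_tokens: float) -> float:
    return 6 * n_params * train_tokens + 2 * n_params * served_tokens

for served in (1e12, 1e13, 1e14):
    small = lifetime_flops(7e9, 2e12, served)     # over-trained 7B (Llama-2-7B-style)
    large = lifetime_flops(70e9, 1.4e12, served)  # 70B at ~20 tokens/parameter
    print(f"served={served:.0e}: 7B total={small:.2e} FLOPs, 70B total={large:.2e} FLOPs")
```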