Dive into the equations that governed the last five years of AI progress, and the fresh questions they raise now that pure scaling is hitting walls.
Kaplan et al. (2020) showed that LLM loss follows smooth power laws in parameters, data, and compute. Hoffmann et al. (2022, the Chinchilla paper) showed that under a fixed compute budget, you want to balance parameters (N) and training tokens (D) rather than scaling parameters alone. These two papers shaped over a hundred billion dollars of capex.
Loss L as a function of compute C behaves approximately like L(C) ≈ a · C^(-α), where a and α are constants fitted to training runs. Analogous power laws hold separately for parameters N and data D, each with its own exponent. The exponent α is small (around 0.05 to 0.1), so multiplying compute by a factor k only multiplies loss by k^(-α): large increases in compute buy modest loss drops.
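To make the small exponent concrete, here is a minimal sketch of the power law above. The function names and the constant a = 1.0 are illustrative assumptions, not fitted values from either paper; only the range of α comes from the text.

```python
def loss(compute, a=1.0, alpha=0.05):
    """Power-law loss L(C) = a * C**(-alpha). a and alpha are illustrative."""
    return a * compute ** (-alpha)


def compute_multiplier(loss_ratio, alpha):
    """Factor by which compute must grow to multiply loss by loss_ratio.

    Solves k**(-alpha) = loss_ratio for k.
    """
    return loss_ratio ** (-1.0 / alpha)


for alpha in (0.05, 0.1):
    # Relative loss after a 10x increase in compute.
    ten_x_ratio = loss(10.0, alpha=alpha) / loss(1.0, alpha=alpha)
    # Compute factor needed to cut the loss in half.
    halving_factor = compute_multiplier(0.5, alpha)
    print(f"alpha={alpha}: 10x compute -> loss x{ten_x_ratio:.3f}, "
          f"halving loss needs {halving_factor:.3g}x compute")
```

Under this simple form, a tenfold compute increase trims loss by roughly 10 to 20 percent, and halving the loss takes about a thousandfold (α = 0.1) to a millionfold (α = 0.05) more compute. Real fits usually include an irreducible loss term, which this sketch leaves out.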
```python
# Rough Chinchilla-optimal scaling
def compute_optimal(flops, tokens_per_param=20.0):
    # Rule of thumb from Hoffmann et al. (2022):
    #   training compute  C ≈ 6 * N * D
    #   optimal tokens    D ≈ 20 * N
    # Substituting gives  N ≈ sqrt(C / (6 * 20))
    N = (flops / (6 * tokens_per_param)) ** 0.5
    D = flops / (6 * N)
    return N, D

N, D = compute_optimal(1e24)  # 1e24 FLOPs budget
print(f"Params: {N:.2e}, Tokens: {D:.2e}")
# Around 20 tokens per parameter
```

The rule of thumb that corrected the over-parameterized era.

| Era | Model | Parameters | Tokens per param |
|---|---|---|---|
| Pre-Chinchilla | GPT-3 | 175B | ~1.7 |
| Chinchilla paper | Chinchilla 70B | 70B | ~20 |
| Llama 2 era | Llama 2 70B | 70B | ~30 |
| Llama 3 / frontier | Llama 3 70B | 70B | ~210 |
| Over-training era | Small dense models | 1-8B | hundreds+ |
Many modern models deliberately over-train on data well past the Chinchilla-optimal ratio. The reason: inference cost dominates total cost once a model is deployed to millions of users. A smaller, over-trained model is cheaper to serve and still captures most of the quality.
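The trade-off can be sketched with back-of-the-envelope arithmetic. The snippet below assumes the common approximations of about 6·N·D FLOPs for training and about 2·N FLOPs per generated or processed token at inference; the two model configurations are hypothetical examples, and the sketch says nothing about whether they reach equal quality.

```python
def training_flops(params, train_tokens):
    """Approximate training cost: C ≈ 6 * N * D (assumed rule of thumb)."""
    return 6.0 * params * train_tokens


def inference_flops(params, served_tokens):
    """Approximate serving cost: about 2 * N FLOPs per token (assumed)."""
    return 2.0 * params * served_tokens


def total_flops(params, train_tokens, served_tokens):
    """Lifetime cost: training plus serving."""
    return training_flops(params, train_tokens) + inference_flops(params, served_tokens)


def breakeven_served_tokens(big, small):
    """Served tokens at which the small model's lifetime cost matches the big one's.

    big and small are (params, train_tokens) tuples. A non-positive result
    means the small model is cheaper from the first served token.
    """
    extra_training = training_flops(*small) - training_flops(*big)
    saved_per_token = 2.0 * (big[0] - small[0])
    return extra_training / saved_per_token


# Hypothetical configurations: a Chinchilla-style 70B model versus an
# over-trained 8B model.
big = (70e9, 1.4e12)    # 70B params, 1.4T training tokens (~20 tokens/param)
small = (8e9, 15e12)    # 8B params, 15T training tokens (~1900 tokens/param)

tokens = breakeven_served_tokens(big, small)
print(f"Break-even after about {tokens:.2e} served tokens")
```

With these numbers the over-trained model costs more to train, but it repays the difference after roughly a trillion served tokens, a volume a widely deployed model can exceed. Past that point every additional token served widens the gap in its favor.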
By 2024-2025, pure pretraining scaling showed diminishing returns. Spending additional compute at inference time (having the model think longer, sample many answers, verify, backtrack) unlocked further gains. OpenAI's o1 and o3 series, Anthropic's extended thinking, and DeepSeek's R1 all build on this second axis: scaling inference compute rather than pretraining compute alone.
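One of the simplest forms of "sample many answers" is majority voting over independent samples. The toy simulation below is only a sketch of that idea and not how any of the systems named above work internally: `sample_answer` is a hypothetical stand-in for a model call, and the 40% per-sample accuracy and the answer strings are made-up values.

```python
import random
from collections import Counter

CORRECT = "42"


def sample_answer(rng, p_correct=0.4, wrong_answers=("40", "41", "43", "44")):
    """Hypothetical stand-in for one model sample.

    Returns the correct answer with probability p_correct, otherwise one of
    several wrong answers chosen uniformly at random.
    """
    if rng.random() < p_correct:
        return CORRECT
    return rng.choice(wrong_answers)


def majority_vote(rng, n_samples):
    """Draw n_samples answers and return the most common one."""
    votes = Counter(sample_answer(rng) for _ in range(n_samples))
    return votes.most_common(1)[0][0]


def accuracy(n_samples, trials=2000, seed=0):
    """Fraction of trials in which the majority answer is correct."""
    rng = random.Random(seed)
    hits = sum(majority_vote(rng, n_samples) == CORRECT for _ in range(trials))
    return hits / trials


for n in (1, 5, 25):
    print(f"{n:>2} samples per question -> accuracy {accuracy(n):.2f}")
```

Because the wrong answers are spread across several options while the correct one is concentrated, accuracy climbs as the number of samples grows, at the price of proportionally more inference compute per question. The effect depends on that assumption: if the model's errors cluster on a single wrong answer, voting helps far less.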
The exponent is small. The money spent chasing it is not.
— An infra lead at a frontier lab
The big idea: scaling laws made AI progress predictable. The current frontier is learning where the curves bend, and whether new algorithms can steepen them again.
8 questions · take it digitally for instant feedback at tendril.neural-forge.io/learn/quiz/end-creators-scaling-and-compute-optimal-training
What is the main idea of "Scaling Laws and Compute-Optimal Training"?
Which concept is most central to "Scaling Laws and Compute-Optimal Training"?
Which use of AI fits this topic best?
What should a careful learner remember about "Two axes now"?
You want to use AI after this lesson. What is the safest next step?
How should AI output about scaling laws be treated?
Name one way to verify an AI answer about scaling laws.
Which action would help you apply "Scaling Laws and Compute-Optimal Training" responsibly?