Lesson 5 of 21 · 16 min
Scaling Laws and Compute-Optimal Training
Dive into the equations that governed the last five years of AI progress, and the fresh questions they raise now that pure scaling is hitting walls.
Lesson map
What this lesson covers
Learning path
The main moves in order
1. From Empirical Curve to Strategic Doctrine
2. Scaling laws
3. Chinchilla
4. FLOPs
Section 1
From Empirical Curve to Strategic Doctrine
Kaplan et al. (2020) showed that LLM loss follows smooth power laws in parameters, data, and compute. Hoffmann et al. (2022, the Chinchilla paper) showed that under a compute budget, you want to balance N and D rather than scaling parameters alone. These two papers shaped over a hundred billion dollars of capex.
The Kaplan power law
Loss L as a function of compute C behaves approximately as L(C) ≈ a · C^(-α) for fitted constants a and α; analogous power laws hold separately for parameter count N and dataset size D, each with its own exponent. The exponent α is small (roughly 0.05 to 0.1), so large increases in compute buy only modest drops in loss.
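To get a feel for how flat this curve is, here is a quick sketch. The value α = 0.07 is illustrative, picked from the middle of the range above, not a fitted constant:

```python
def loss(compute, a=1.0, alpha=0.07):
    # Illustrative Kaplan-style power law: L(C) = a * C^(-alpha)
    return a * compute ** -alpha

# How much does a 10x compute increase help?
ratio = loss(10 * 1e21) / loss(1e21)
print(f"10x compute -> loss multiplied by {ratio:.3f}")
# With alpha = 0.07, ten times the budget cuts loss by only ~15%
```

Because the ratio depends only on the multiplier (10^(-α)), the same ~15% applies at every scale, which is exactly why each successive gain costs an order of magnitude more.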
The rule of thumb that corrected the over-parameterized era.
# Rough Chinchilla-optimal scaling
def compute_optimal(flops):
    # Hoffmann et al. (2022): roughly 20 optimal tokens per parameter,
    # with training compute C ≈ 6 * N * D. Substituting D = 20 * N
    # gives C ≈ 120 * N^2, so N ≈ sqrt(C / 120).
    N = (flops / 120) ** 0.5
    D = flops / (6 * N)
    return N, D

N, D = compute_optimal(1e24)  # 1e24 FLOPs budget
print(f"Params: {N:.2e}, Tokens: {D:.2e}")
# Around 20 tokens per parameter

Why Chinchilla mattered
Compare the options
| Era | Model | Parameters | Tokens per param |
|---|---|---|---|
| Pre-Chinchilla | GPT-3 | 175B | ~1.7 |
| Chinchilla paper | Chinchilla 70B | 70B | ~20 |
| Llama 2 era | Llama 2 70B | 70B | ~30 |
| Llama 3 / frontier | Llama 3 70B | 70B | ~210 |
| Over-training era | Small dense models | 1-8B | hundreds+ |
Many modern models deliberately over-train on data well past the Chinchilla-optimal ratio. The reason: inference cost dominates total cost once a model is deployed to millions of users. A smaller, over-trained model is cheaper to serve and still captures most of the quality.
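The tradeoff can be sketched with back-of-envelope FLOP counts. The ~6ND training and ~2N-per-token inference approximations are standard; the serving volume below is a made-up number for illustration:

```python
def lifetime_flops(n_params, train_tokens, served_tokens):
    # ~6 FLOPs per parameter per training token,
    # ~2 FLOPs per parameter per token generated at inference
    return 6 * n_params * train_tokens + 2 * n_params * served_tokens

served = 1e13  # hypothetical: 10T tokens served over the deployment's life

big   = lifetime_flops(70e9, 1.4e12, served)  # Chinchilla-optimal 70B
small = lifetime_flops(8e9, 15e12, served)    # heavily over-trained 8B

print(f"70B lifetime: {big:.2e} FLOPs, 8B lifetime: {small:.2e} FLOPs")
```

At this serving volume the over-trained 8B model costs less in total FLOPs than the 70B model, even though it trained on ten times as many tokens; the larger the serving volume, the more the balance tilts toward small.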
The pivot to test-time compute
By 2024-2025, pure pretraining scaling showed diminishing returns. Spending the same compute at inference — having the model think longer, sample many answers, verify, backtrack — unlocked further gains. OpenAI's o1 and o3 series, Anthropic's extended thinking, and DeepSeek's R1 all embody this inference-compute scaling law.
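One simple form of inference-compute scaling is best-of-N sampling: draw several candidate answers and keep the one a verifier scores highest. A minimal sketch, where `sample_answer` and `score` are stand-ins for a stochastic model call and a learned verifier, not real APIs:

```python
import random

def sample_answer(prompt):
    # Stand-in for a stochastic model call (temperature > 0)
    return f"answer-{random.randint(0, 9)}"

def score(prompt, answer):
    # Stand-in for a learned verifier or reward model
    return random.random()

def best_of_n(prompt, n=8):
    # Spending more inference compute (larger n) raises the expected
    # score of the returned answer
    candidates = [sample_answer(prompt) for _ in range(n)]
    return max(candidates, key=lambda a: score(prompt, a))

print(best_of_n("What is 7 * 8?", n=8))
```

Production systems layer more on top (chains of thought, backtracking, majority voting), but the core economics are the same: quality becomes a knob you turn with inference FLOPs rather than parameters.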
Bottlenecks beyond FLOPs
- Data: the stock of public high-quality text has largely been scraped already; multimodal, synthetic, and licensed data are filling the gap.
- Energy: frontier training runs are approaching gigawatt-scale power demands.
- Talent: the real throttle on how many frontier labs can exist.
- Alignment: as capability scales, so does the cost of shaping model behavior.
What this means for your work
1. If you fine-tune, start with a compute-optimal base, not the largest
2. Benchmarks at one size often do not predict another size
3. Budget test-time compute explicitly in your product design
4. Watch for regimes where smaller, smarter beats bigger, dumber
“The exponent is small. The money spent chasing it is not.”
The big idea: scaling laws made AI progress predictable. The current frontier is learning where the curves bend, and whether new algorithms can steepen them again.
Related lessons
Keep going
Creators · 35 min
How Chatbot Arena Works
The world's most influential 'leaderboard' for AI is not a test — it is humans voting blindly. Here is how that works.
Creators · 38 min
Benchmark Contamination
When the test questions quietly end up in the training data, scores lie. Here is how it happens and how to catch it.
Creators · 35 min
Multimodal Benchmarks
Evaluating models that see, hear, and read at once requires new kinds of tests. Here are the ones that matter.
