Scaling Laws: Why Bigger Worked
The past decade of AI progress came from a simple, ruthless law: more compute plus more data yields predictable improvement. Here is the math behind it.
Lesson map
What this lesson covers, in order:
1. The Law That Changed Everything
2. Scaling laws
3. Compute
4. Parameters
Section 1
The Law That Changed Everything
In 2020, researchers at OpenAI published a paper, “Scaling Laws for Neural Language Models,” showing that language model performance follows a predictable curve: add more parameters, more data, and more compute, and loss falls along a smooth power law. The industry has been chasing that curve ever since.
The three dials
- Parameters (N): the number of weights in the model
- Data (D): how many tokens the model trains on
- Compute (C): total FLOPs spent during training, roughly 6 × N × D (see the sketch below)
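To make "smooth and predictable" concrete, here is a minimal Python sketch of the parametric loss fit from the 2022 Chinchilla paper (covered just below): L(N, D) = E + A/N^α + B/D^β. The constants (E ≈ 1.69, A ≈ 406.4, B ≈ 410.7, α ≈ 0.34, β ≈ 0.28) are the fits reported in that paper; treat the exact outputs as illustrative, not predictions for any specific model.

```python
# Parametric loss fit from the 2022 Chinchilla paper:
#   L(N, D) = E + A / N**alpha + B / D**beta
# Constants are the paper's reported fits; outputs are illustrative only.
E, A, B = 1.69, 406.4, 410.7
ALPHA, BETA = 0.34, 0.28

def loss(n_params, n_tokens):
    """Predicted training loss for a model with n_params weights
    trained on n_tokens tokens."""
    return E + A / n_params**ALPHA + B / n_tokens**BETA

# Turn one dial at a time: each 10x in parameters shaves a smooth,
# predictable slice off the loss (data held fixed at 1T tokens).
for n in (1e9, 1e10, 1e11):
    print(f"N = {n:.0e}, D = 1e12 -> predicted loss {loss(n, 1e12):.3f}")
```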
The Chinchilla correction
The 2022 Chinchilla paper from DeepMind showed that earlier models were undertrained: the compute-optimal ratio is roughly 20 tokens of training data per parameter. Many older models had too many parameters and too little data, wasting compute.
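Here is a minimal sketch of how the 20:1 rule pins down a whole training run. Substituting D = 20·N into C ≈ 6·N·D gives N = √(C/120). The budget below is set to Chinchilla's approximate training compute (6 · 70e9 · 1.4e12 ≈ 5.9 × 10²³ FLOPs) just to show the arithmetic recovers its published shape.

```python
import math

TOKENS_PER_PARAM = 20      # Chinchilla's rule-of-thumb data:parameter ratio
FLOPS_PER_PARAM_TOKEN = 6  # standard approximation: C ≈ 6 · N · D

def chinchilla_optimal(compute_flops):
    """Split a training-compute budget into a compute-optimal
    parameter count N and token count D, assuming D = 20·N."""
    n = math.sqrt(compute_flops / (FLOPS_PER_PARAM_TOKEN * TOKENS_PER_PARAM))
    return n, TOKENS_PER_PARAM * n

# Chinchilla's approximate budget: 6 · 70e9 · 1.4e12 ≈ 5.9e23 FLOPs
n, d = chinchilla_optimal(5.9e23)
print(f"~{n / 1e9:.0f}B parameters, ~{d / 1e12:.1f}T tokens")
```

Run it and you get roughly 70B parameters and 1.4T tokens, which is exactly the Chinchilla row in the table below.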
Compare the options
| Model | Parameters | Training tokens |
|---|---|---|
| GPT-3 (2020) | 175B | 300B |
| Chinchilla (2022) | 70B | 1.4T |
| Llama 3 (2024) | 70B | 15T |
| Modern frontier (2025-2026) | variable | tens of trillions |
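One way to read the table is to divide tokens by parameters and compare against the 20:1 rule. A quick sketch, with the numbers copied from the rows above:

```python
# Tokens-per-parameter ratios for the rows in the table above
models = {
    "GPT-3 (2020)":      (175e9, 300e9),
    "Chinchilla (2022)": (70e9, 1.4e12),
    "Llama 3 (2024)":    (70e9, 15e12),
}

for name, (params, tokens) in models.items():
    print(f"{name}: ~{tokens / params:.0f} tokens per parameter")
```

GPT-3 lands near 2 tokens per parameter, far below the 20:1 line, which is the undertraining Chinchilla diagnosed. Llama 3 sits far above it at roughly 214: deliberately past Chinchilla-optimal, trading extra training compute for a smaller model that is cheaper to serve.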
Diminishing but not stopping
Each scaling bump costs exponentially more money: training GPT-4 reportedly cost over a hundred million dollars. The returns are real but expensive. That is why inference efficiency, mixture-of-experts architectures, and better data now matter as much as pure scale.
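To put toy numbers on "diminishing but not stopping," here is a sketch combining the two pieces above: allocate each budget Chinchilla-style, then evaluate the fitted loss curve. Same illustrative constants as before.

```python
import math

# Same illustrative Chinchilla fits as in the earlier sketch
E, A, B, ALPHA, BETA = 1.69, 406.4, 410.7, 0.34, 0.28

for c in (1e23, 1e24, 1e25):
    n = math.sqrt(c / 120)   # compute-optimal params (C ≈ 6·N·D, D = 20·N)
    d = 20 * n               # compute-optimal tokens
    loss = E + A / n**ALPHA + B / d**BETA
    print(f"C = {c:.0e} FLOPs -> predicted loss {loss:.3f}")
```

Each successive 10x of compute buys a smaller absolute drop in predicted loss. The curve keeps falling, but every step down gets more expensive.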
Limits and open questions
- Public text data is running out; synthetic and multimodal data are filling the gap
- Energy and GPU supply now bottleneck progress
- Pure scaling may be hitting a ceiling without new algorithmic ideas
- Test-time compute (thinking longer at inference) is the new frontier
“The bitter lesson is that general methods that leverage computation are ultimately the most effective.” — after Rich Sutton, “The Bitter Lesson” (2019)
The big idea: AI's recent leap was largely the result of executing the scaling recipe at industrial scale. Knowing the recipe demystifies both the progress and its current bottlenecks.
