Chinchilla Scaling Laws: How Much Data Does an AI Model Need?
Chinchilla showed that compute-optimal models scale data and parameters together; the rule has shifted with inference economics.
Lesson map
What this lesson covers
Learning path
The main moves in order
1. The premise
2. Scaling laws
3. Chinchilla
4. Compute-optimal
Section 1
The premise
DeepMind's Chinchilla paper showed that roughly 20 training tokens per parameter is compute-optimal: for a fixed training budget, that ratio minimizes loss. Yet Llama-3 trained its 70B model on 15 trillion tokens, over 200 tokens per parameter, because for a widely deployed model inference compute, not training compute, dominates lifecycle cost.
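To make the trade-off concrete, here is a minimal sketch of how a training budget splits under the Chinchilla rule. It assumes the standard approximation C ≈ 6ND for total training FLOPs and the roughly-20-tokens-per-parameter optimum; the function name and the example budget are illustrative, not from this lesson.

```python
def chinchilla_optimal(flops_budget: float, tokens_per_param: float = 20.0):
    """Split a training-FLOP budget into (parameters, tokens).

    Assumes C ~= 6 * N * D (training FLOPs) and D ~= tokens_per_param * N,
    so N = sqrt(C / (6 * tokens_per_param)). Both are rules of thumb.
    """
    n_params = (flops_budget / (6.0 * tokens_per_param)) ** 0.5
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# Chinchilla itself: a ~5.9e23 FLOP budget recovers ~70B params, ~1.4T tokens.
n, d = chinchilla_optimal(5.9e23)
print(f"params ~ {n:.1e}, tokens ~ {d:.1e}")  # params ~ 7.0e10, tokens ~ 1.4e12
```

Llama-3's 15 trillion tokens on 70B parameters works out to about 214 tokens per parameter, roughly ten times the Chinchilla ratio: the model is deliberately "overtrained" so that a smaller network, cheaper to serve, reaches a given quality.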
What scaling laws do well here
- Predict loss as a function of parameters and tokens (see the sketch after this list)
- Guide pretraining budgets across model sizes
- Help right-size models for known compute budgets
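As a sketch of the first point: the Chinchilla paper fits pretraining loss with the parametric form L(N, D) = E + A/N^α + B/D^β. The constants below are the fit reported in Hoffmann et al. (2022); exact values vary with the fitting method and corpus, so treat this as illustrative rather than definitive.

```python
# Parametric loss fit from the Chinchilla paper (Hoffmann et al., 2022):
#   L(N, D) = E + A / N**alpha + B / D**beta
# E is the irreducible loss; the other terms shrink with more params / tokens.
E, A, B = 1.69, 406.4, 410.7
ALPHA, BETA = 0.34, 0.28

def predicted_loss(n_params: float, n_tokens: float) -> float:
    """Predicted pretraining loss; says nothing about downstream-task quality."""
    return E + A / n_params**ALPHA + B / n_tokens**BETA

# The same 70B model at Chinchilla-ratio tokens vs Llama-3-scale tokens:
print(predicted_loss(70e9, 1.4e12))  # ~1.94
print(predicted_loss(70e9, 15e12))   # ~1.86
```

The extra 13.6T tokens buys only a modest predicted loss drop, which is exactly the regime where serving costs, not the loss curve, decide the trade.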
What scaling laws cannot do
- Capture data quality differences across pretraining corpora
- Predict downstream-task performance as cleanly as loss
- Account for fine-tuning and RLHF effects on final quality
Related lessons
Keep going
Builders · 30 min
Scaling Laws: Why Bigger Worked
The past decade of AI progress came from a simple, ruthless law: more compute and more data yield predictable improvements. Here is the math behind it.
Creators · 50 min
Scaling Laws and Compute-Optimal Training
Dive into the equations that governed the last five years of AI progress, and the fresh questions they raise now that pure scaling is hitting walls.
Creators · 32 min
GPT-3 and the Scaling Laws
In 2020, a 175-billion-parameter model and a parallel paper on scaling laws redefined what bigger could mean.
