Scaling Laws: Why Bigger Worked
The past decade of AI progress came from a simple, ruthless law: more compute plus more data yields predictable improvement. Here is the math behind it.
Lesson map
What this lesson covers, in order:
1. The Law That Changed Everything
2. Scaling laws
3. Compute
4. Parameters
Section 1
The Law That Changed Everything
In 2020, researchers at OpenAI published a paper, “Scaling Laws for Neural Language Models,” showing that language model performance follows a predictable curve: add more parameters, more data, and more compute, and loss falls along a smooth power law. The industry has been chasing that curve ever since.
The three dials
- Parameters (N): the number of weights in the model
- Data (D): how many tokens the model trains on
- Compute (C): total FLOPs spent during training, roughly 6 × N × D (see the sketch below)
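To make "smooth and predictable" concrete, here is a minimal Python sketch of the parametric loss fit from the 2022 Chinchilla paper (covered just below): L(N, D) = E + A/N^α + B/D^β. The constants (E ≈ 1.69, A ≈ 406.4, B ≈ 410.7, α ≈ 0.34, β ≈ 0.28) are the fits reported in that paper; treat the exact outputs as illustrative, not predictions for any specific model.

```python
# Parametric loss fit from the 2022 Chinchilla paper:
#   L(N, D) = E + A / N**alpha + B / D**beta
# Constants are the paper's reported fits; outputs are illustrative only.
E, A, B = 1.69, 406.4, 410.7
ALPHA, BETA = 0.34, 0.28

def loss(n_params, n_tokens):
    """Predicted training loss for a model with n_params weights
    trained on n_tokens tokens."""
    return E + A / n_params**ALPHA + B / n_tokens**BETA

# Turn one dial at a time: each 10x in parameters shaves a smooth,
# predictable slice off the loss (data held fixed at 1T tokens).
for n in (1e9, 1e10, 1e11):
    print(f"N = {n:.0e}, D = 1e12 -> predicted loss {loss(n, 1e12):.3f}")
```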
The Chinchilla correction
The 2022 Chinchilla paper from DeepMind showed that earlier models were undertrained: the compute-optimal ratio is roughly 20 tokens of training data per parameter. Many older models had too many parameters and too little data, wasting compute.
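Here is a minimal sketch of how the 20:1 rule pins down a whole training run. Substituting D = 20·N into C ≈ 6·N·D gives N = √(C/120). The budget below is set to Chinchilla's approximate training compute (6 · 70e9 · 1.4e12 ≈ 5.9 × 10²³ FLOPs) just to show the arithmetic recovers its published shape.

```python
import math

TOKENS_PER_PARAM = 20      # Chinchilla's rule-of-thumb data:parameter ratio
FLOPS_PER_PARAM_TOKEN = 6  # standard approximation: C ≈ 6 · N · D

def chinchilla_optimal(compute_flops):
    """Split a training-compute budget into a compute-optimal
    parameter count N and token count D, assuming D = 20·N."""
    n = math.sqrt(compute_flops / (FLOPS_PER_PARAM_TOKEN * TOKENS_PER_PARAM))
    return n, TOKENS_PER_PARAM * n

# Chinchilla's approximate budget: 6 · 70e9 · 1.4e12 ≈ 5.9e23 FLOPs
n, d = chinchilla_optimal(5.9e23)
print(f"~{n / 1e9:.0f}B parameters, ~{d / 1e12:.1f}T tokens")
```

Run it and you get roughly 70B parameters and 1.4T tokens, which is exactly the Chinchilla row in the table below.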
Compare the options
| Model | Parameters | Training tokens |
|---|---|---|
| GPT-3 (2020) | 175B | 300B |
| Chinchilla (2022) | 70B | 1.4T |
| Llama 3 (2024) | 70B | 15T |
| Modern frontier (2025-2026) | variable | tens of trillions |
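One way to read the table is to divide tokens by parameters and compare against the 20:1 rule. A quick sketch, with the numbers copied from the rows above:

```python
# Tokens-per-parameter ratios for the rows in the table above
models = {
    "GPT-3 (2020)":      (175e9, 300e9),
    "Chinchilla (2022)": (70e9, 1.4e12),
    "Llama 3 (2024)":    (70e9, 15e12),
}

for name, (params, tokens) in models.items():
    print(f"{name}: ~{tokens / params:.0f} tokens per parameter")
```

GPT-3 lands near 2 tokens per parameter, far below the 20:1 line, which is the undertraining Chinchilla diagnosed. Llama 3 sits far above it at roughly 214: deliberately past Chinchilla-optimal, trading extra training compute for a smaller model that is cheaper to serve.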
Diminishing but not stopping
Each scaling bump costs exponentially more money: training GPT-4 reportedly cost over a hundred million dollars. The returns are real but expensive. That is why inference efficiency, mixture-of-experts architectures, and better data now matter as much as pure scale.
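To put toy numbers on "diminishing but not stopping," here is a sketch combining the two pieces above: allocate each budget Chinchilla-style, then evaluate the fitted loss curve. Same illustrative constants as before.

```python
import math

# Same illustrative Chinchilla fits as in the earlier sketch
E, A, B, ALPHA, BETA = 1.69, 406.4, 410.7, 0.34, 0.28

for c in (1e23, 1e24, 1e25):
    n = math.sqrt(c / 120)   # compute-optimal params (C ≈ 6·N·D, D = 20·N)
    d = 20 * n               # compute-optimal tokens
    loss = E + A / n**ALPHA + B / d**BETA
    print(f"C = {c:.0e} FLOPs -> predicted loss {loss:.3f}")
```

Each successive 10x of compute buys a smaller absolute drop in predicted loss. The curve keeps falling, but every step down gets more expensive.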
Limits and open questions
- Public text data is running out; synthetic and multimodal data are filling the gap
- Energy and GPU supply now bottleneck progress
- Pure scaling may be hitting a ceiling without new algorithmic ideas
- Test-time compute (thinking longer at inference) is the new frontier
“The bitter lesson is that general methods that leverage computation are ultimately the most effective.” — after Rich Sutton, “The Bitter Lesson” (2019)
The big idea: AI's recent leap was largely the result of executing the scaling recipe at industrial scale. Knowing the recipe demystifies both the progress and its current bottlenecks.
