The past decade of AI progress came from a simple, ruthless law: more compute and more data buy predictable improvements. Here is the math behind it.
In 2020, researchers at OpenAI published "Scaling Laws for Neural Language Models" (Kaplan et al.), showing that language model performance follows a predictable curve: add more parameters, more data, and more compute, and loss falls in a smooth, mathematical way. The industry has been chasing that curve ever since.
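A sketch of the functional form from that paper shows what "smooth and mathematical" means here (the constants are the paper's approximate fitted values for parameter count, quoted for flavor rather than precision):

$$
L(N) \approx \left(\frac{N_c}{N}\right)^{\alpha_N}, \qquad \alpha_N \approx 0.076, \quad N_c \approx 8.8 \times 10^{13}
$$

Doubling $N$ multiplies loss by $2^{-0.076} \approx 0.95$: about a 5% drop per doubling, small on its own but compounding across many doublings. Similar power laws hold for dataset size and compute.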
The 2022 Chinchilla paper from DeepMind showed that earlier models were undertrained: for a fixed compute budget, the compute-optimal ratio is roughly 20 tokens of training data per parameter. Many older models had too many parameters and too little data, wasting compute; the table below and the short sketch after it make the ratio concrete.
| Model | Parameters | Training tokens |
|---|---|---|
| GPT-3 (2020) | 175B | 300B |
| Chinchilla (2022) | 70B | 1.4T |
| Llama 3 (2024) | 70B | 15T |
| Modern frontier (2025-2026) | variable | tens of trillions |
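A minimal sketch in Python checks those ratios, using two standard approximations: training compute $C \approx 6ND$ FLOPs and the Chinchilla rule of thumb $D \approx 20N$. The parameter and token counts come straight from the table; the FLOP figures are back-of-envelope estimates, not reported training budgets:

```python
# Chinchilla rule-of-thumb arithmetic.
# Approximations: training compute C ~= 6 * N * D FLOPs,
# and compute-optimal data D ~= 20 * N tokens.

def tokens_per_param(params: float, tokens: float) -> float:
    """Tokens of training data per model parameter."""
    return tokens / params

def train_flops(params: float, tokens: float) -> float:
    """Rough training compute via the common C ~= 6*N*D approximation."""
    return 6 * params * tokens

models = {
    "GPT-3 (2020)":      (175e9, 300e9),
    "Chinchilla (2022)": (70e9, 1.4e12),
    "Llama 3 (2024)":    (70e9, 15e12),
}

for name, (n, d) in models.items():
    ratio = tokens_per_param(n, d)
    verdict = "at or past the 20:1 rule" if ratio >= 20 else "undertrained by the 20:1 rule"
    print(f"{name}: {ratio:.1f} tokens/param ({verdict}), ~{train_flops(n, d):.1e} FLOPs")
```

Running it flags GPT-3 at roughly 1.7 tokens per parameter, well under the 20:1 rule, while Chinchilla lands at exactly 20. Note Llama 3's roughly 214:1 ratio: training far past the compute-optimal point deliberately trades extra training compute for a smaller model that is cheaper to serve, which previews the efficiency point below.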
Each step along the curve costs multiplicatively more money: because loss falls as a power law, a fixed improvement always demands a constant multiple of compute. Going from GPT-3 to GPT-4 reportedly cost over a hundred million dollars. The returns are real but expensive, which is why inference efficiency, mixture of experts, and better data now matter as much as pure scale.
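To make the mixture-of-experts point concrete, here is a toy calculation; every number in it is hypothetical rather than taken from any real model. With top-k routing, each token activates only a few experts, so the compute paid per token grows far more slowly than the total parameter count:

```python
# Toy mixture-of-experts arithmetic (hypothetical numbers, not a real model).
# With top-k routing, each token activates k of E experts, so per-token
# compute scales with active parameters, not total parameters.

dense_params  = 10e9   # hypothetical shared parameters (attention, embeddings)
expert_params = 5e9    # hypothetical parameters per expert feed-forward block
num_experts   = 16
top_k         = 2      # experts activated per token

total_params  = dense_params + num_experts * expert_params  # what you store
active_params = dense_params + top_k * expert_params        # what each token pays for

print(f"total:  {total_params / 1e9:.0f}B parameters")
print(f"active: {active_params / 1e9:.0f}B per token "
      f"({active_params / total_params:.0%} of total)")
```

In this toy setup the model stores 90B parameters but each token only pays for 20B, about 22% of the total. That gap between stored capacity and per-token compute is exactly the kind of efficiency lever the paragraph above points at.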
> The bitter lesson is that general methods that leverage computation are ultimately the most effective.
>
> — Rich Sutton
The big idea: AI's recent leap was largely the result of executing the scaling recipe at industrial scale. Knowing the recipe demystifies both the progress and its current bottlenecks.
15 questions · take it digitally for instant feedback at tendril.neural-forge.io/learn/quiz/end-builders-scaling-laws