The past decade of AI progress came from a simple, ruthless law: more compute and more data buy predictable improvements. Here is the math behind it.
In 2020, researchers at OpenAI published "Scaling Laws for Neural Language Models" (Kaplan et al.), showing that language model performance follows a predictable curve: add more parameters, more data, and more compute, and loss falls in a smooth, mathematical way. The industry has been chasing that curve ever since.
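A sketch of the functional form from that paper shows what "smooth and mathematical" means here (the constants are the paper's approximate fitted values for parameter count, quoted for flavor rather than precision):

$$
L(N) \approx \left(\frac{N_c}{N}\right)^{\alpha_N}, \qquad \alpha_N \approx 0.076, \quad N_c \approx 8.8 \times 10^{13}
$$

Doubling $N$ multiplies loss by $2^{-0.076} \approx 0.95$: about a 5% drop per doubling, small on its own but compounding across many doublings. Similar power laws hold for dataset size and compute.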
The 2022 Chinchilla paper from DeepMind showed that earlier models were undertrained: for a fixed compute budget, the compute-optimal ratio is roughly 20 tokens of training data per parameter. Many older models had too many parameters and too little data, wasting compute; the table below and the short sketch after it make the ratio concrete.
| Model | Parameters | Training tokens |
|---|---|---|
| GPT-3 (2020) | 175B | 300B |
| Chinchilla (2022) | 70B | 1.4T |
| Llama 3 (2024) | 70B | 15T |
| Modern frontier (2025-2026) | variable | tens of trillions |
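A minimal sketch in Python checks those ratios, using two standard approximations: training compute $C \approx 6ND$ FLOPs and the Chinchilla rule of thumb $D \approx 20N$. The parameter and token counts come straight from the table; the FLOP figures are back-of-envelope estimates, not reported training budgets:

```python
# Chinchilla rule-of-thumb arithmetic.
# Approximations: training compute C ~= 6 * N * D FLOPs,
# and compute-optimal data D ~= 20 * N tokens.

def tokens_per_param(params: float, tokens: float) -> float:
    """Tokens of training data per model parameter."""
    return tokens / params

def train_flops(params: float, tokens: float) -> float:
    """Rough training compute via the common C ~= 6*N*D approximation."""
    return 6 * params * tokens

models = {
    "GPT-3 (2020)":      (175e9, 300e9),
    "Chinchilla (2022)": (70e9, 1.4e12),
    "Llama 3 (2024)":    (70e9, 15e12),
}

for name, (n, d) in models.items():
    ratio = tokens_per_param(n, d)
    verdict = "at or past the 20:1 rule" if ratio >= 20 else "undertrained by the 20:1 rule"
    print(f"{name}: {ratio:.1f} tokens/param ({verdict}), ~{train_flops(n, d):.1e} FLOPs")
```

Running it flags GPT-3 at roughly 1.7 tokens per parameter, well under the 20:1 rule, while Chinchilla lands at exactly 20. Note Llama 3's roughly 214:1 ratio: training far past the compute-optimal point deliberately trades extra training compute for a smaller model that is cheaper to serve, which previews the efficiency point below.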
Each step along the curve costs multiplicatively more money: because loss falls as a power law, a fixed improvement always demands a constant multiple of compute. Going from GPT-3 to GPT-4 reportedly cost over a hundred million dollars. The returns are real but expensive, which is why inference efficiency, mixture of experts, and better data now matter as much as pure scale.
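To make the mixture-of-experts point concrete, here is a toy calculation; every number in it is hypothetical rather than taken from any real model. With top-k routing, each token activates only a few experts, so the compute paid per token grows far more slowly than the total parameter count:

```python
# Toy mixture-of-experts arithmetic (hypothetical numbers, not a real model).
# With top-k routing, each token activates k of E experts, so per-token
# compute scales with active parameters, not total parameters.

dense_params  = 10e9   # hypothetical shared parameters (attention, embeddings)
expert_params = 5e9    # hypothetical parameters per expert feed-forward block
num_experts   = 16
top_k         = 2      # experts activated per token

total_params  = dense_params + num_experts * expert_params  # what you store
active_params = dense_params + top_k * expert_params        # what each token pays for

print(f"total:  {total_params / 1e9:.0f}B parameters")
print(f"active: {active_params / 1e9:.0f}B per token "
      f"({active_params / total_params:.0%} of total)")
```

In this toy setup the model stores 90B parameters but each token only pays for 20B, about 22% of the total. That gap between stored capacity and per-token compute is exactly the kind of efficiency lever the paragraph above points at.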
> The bitter lesson is that general methods that leverage computation are ultimately the most effective.
>
> — Rich Sutton
The big idea: AI's recent leap was largely the result of executing the scaling recipe at industrial scale. Knowing the recipe demystifies both the progress and its current bottlenecks.
15 questions · take it digitally for instant feedback at tendril.neural-forge.io/learn/quiz/end-builders-scaling-laws