In 2020, a 175-billion-parameter model and a parallel paper on scaling laws redefined what "bigger" could mean.
In May 2020, OpenAI published "Language Models are Few-Shot Learners." GPT-3 was a Transformer with 175 billion parameters, more than a hundred times larger than GPT-2's 1.5 billion, trained on hundreds of billions of tokens drawn from Common Crawl, books, and Wikipedia.
The loss scales as a power-law with model size, dataset size, and the amount of compute used for training, with some trends spanning more than seven orders of magnitude.
— Kaplan et al., 2020
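To make the power-law claim concrete, here is a minimal sketch of the model-size trend, assuming the approximate constants Kaplan et al. report for the loss-versus-parameters fit (alpha_N ≈ 0.076, N_c ≈ 8.8 × 10^13 non-embedding parameters); the exact numbers are illustrative, since the shape of the curve is the point.

```python
# Sketch of the Kaplan et al. (2020) model-size scaling law:
#     L(N) = (N_c / N) ** alpha_N
# Constants are the paper's approximate fitted values, used here
# for illustration only.
ALPHA_N = 0.076   # fitted exponent for the model-size trend
N_C = 8.8e13      # fitted scale constant (non-embedding parameters)

def predicted_loss(n_params: float) -> float:
    """Predicted cross-entropy loss (nats/token) at n_params parameters."""
    return (N_C / n_params) ** ALPHA_N

for name, n in [("GPT-2 (1.5B)", 1.5e9), ("GPT-3 (175B)", 1.75e11)]:
    print(f"{name}: predicted loss ~ {predicted_loss(n):.2f} nats/token")
```

Because the relationship is a power law, every tenfold increase in parameters multiplies the predicted loss by the same constant factor, 10^(-0.076) ≈ 0.84, which is why the trend looked like it would keep rewarding scale.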
The big idea: GPT-3 plus scaling laws turned AI into a bet on scale. For a while, the bet paid off relentlessly. Whether it continues to pay off is one of the central questions of current AI research.
15 questions · take it digitally for instant feedback at tendril.neural-forge.io/learn/quiz/end-history-gpt3-scaling-creators