In 2020, a 175-billion-parameter model and a parallel paper on scaling laws redefined what "bigger" could mean.
In May 2020, OpenAI published "Language Models are Few-Shot Learners." GPT-3 was a Transformer with 175 billion parameters, more than a hundred times larger than GPT-2's 1.5 billion, trained on hundreds of billions of tokens drawn from Common Crawl, books, and Wikipedia.
The loss scales as a power-law with model size, dataset size, and the amount of compute used for training, with some trends spanning more than seven orders of magnitude.
— Kaplan et al., 2020
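To make the power-law claim concrete, here is a minimal sketch of the model-size trend, assuming the approximate constants Kaplan et al. report for the loss-versus-parameters fit (alpha_N ≈ 0.076, N_c ≈ 8.8 × 10^13 non-embedding parameters); the exact numbers are illustrative, since the shape of the curve is the point.

```python
# Sketch of the Kaplan et al. (2020) model-size scaling law:
#     L(N) = (N_c / N) ** alpha_N
# Constants are the paper's approximate fitted values, used here
# for illustration only.
ALPHA_N = 0.076   # fitted exponent for the model-size trend
N_C = 8.8e13      # fitted scale constant (non-embedding parameters)

def predicted_loss(n_params: float) -> float:
    """Predicted cross-entropy loss (nats/token) at n_params parameters."""
    return (N_C / n_params) ** ALPHA_N

for name, n in [("GPT-2 (1.5B)", 1.5e9), ("GPT-3 (175B)", 1.75e11)]:
    print(f"{name}: predicted loss ~ {predicted_loss(n):.2f} nats/token")
```

Because the relationship is a power law, every tenfold increase in parameters multiplies the predicted loss by the same constant factor, 10^(-0.076) ≈ 0.84, which is why the trend looked like it would keep rewarding scale.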
The big idea: GPT-3 plus scaling laws turned AI into a bet on scale. For a while, the bet paid off relentlessly. Whether it continues to pay off is one of the central questions of current AI research.
15 questions · take it digitally for instant feedback at tendril.neural-forge.io/learn/quiz/end-history-gpt3-scaling-creators