The Mind-Boggling Scale of Modern Training Data
When we say trillions of tokens, we mean it. Let's make these numbers feel real with comparisons you can actually picture.
Section 1
Numbers So Big They Stop Meaning Anything
Modern LLMs are trained on trillions of tokens. That number is meaningless until you put it next to something familiar. Let's try.
Compare the options
| Measure | Size | Comparison |
|---|---|---|
| One book | ~100k tokens | About 75k words |
| All of English Wikipedia | ~4 billion tokens | ~40,000 books |
| GPT-3 training data (2020) | ~500 billion tokens | ~5 million books |
| Llama 3 training data (2024) | ~15 trillion tokens | ~150 million books |
| Human lifetime reading | ~1 billion tokens | ~10,000 books |
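The table is just division. A quick back-of-the-envelope check, assuming the same ~100,000 tokens per book used above (the dataset sizes are the table's figures, not independently measured):

```python
# Token-to-book conversions for the table above.
# Assumption: ~100,000 tokens per book (~75,000 words).
TOKENS_PER_BOOK = 100_000

datasets = {
    "English Wikipedia":        4e9,
    "GPT-3 training (2020)":    500e9,
    "Llama 3 training (2024)":  15e12,
    "Lifetime of heavy reading": 1e9,
}

for name, tokens in datasets.items():
    books = tokens / TOKENS_PER_BOOK
    print(f"{name}: {books:,.0f} books")
```

Running this reproduces the right-hand column: Wikipedia lands around 40,000 books, while Llama 3's corpus is roughly 150 million.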
The scaling law that changed everything
In 2020, a paper by Kaplan et al. showed that a language model's loss falls as a predictable power law in dataset size, parameter count, and compute: double the data and the model gets measurably better; double the parameters and it improves again. That result launched the race to build bigger datasets and bigger models, a race we are still in.
Have we run out of data?
Researchers at Epoch AI estimate that high-quality English text on the public web is around 10 to 20 trillion tokens. Llama 3 used 15 trillion. GPT-5 class models may need more. The field is genuinely running into a ceiling, which is why synthetic data and multimodal (image, video, audio) scaling have become huge research areas.
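Those two numbers already tell the story. Comparing Llama 3's 15 trillion tokens against the Epoch AI range quoted above (the range is their estimate, not a measurement):

```python
# How much of the estimated public-web ceiling has already been used?
# 10-20 trillion tokens is the Epoch AI estimate quoted in the text.
ceiling_low, ceiling_high = 10e12, 20e12
used = 15e12  # Llama 3 pretraining tokens

print(f"Fraction of low estimate used:  {used / ceiling_low:.0%}")   # 150%
print(f"Fraction of high estimate used: {used / ceiling_high:.0%}")  # 75%
```

Under the low estimate, a single 2024 model already exceeded the entire supply of high-quality public English text; even under the high estimate, three quarters of it is gone. That is the ceiling the field is hitting.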
The big idea: the datasets are genuinely at civilization scale now. The next leap will not come from more scraping, but from smarter filtering and new modalities.