Modern LLMs are trained on trillions of tokens. That number is meaningless until you put it next to something familiar, so let's make it feel real with comparisons you can actually picture.
| Measure | Size | Comparison |
|---|---|---|
| One book | ~100k tokens | About 75k words |
| All of English Wikipedia | ~4 billion tokens | About 40,000 books |
| GPT-3 training data (2020) | ~500 billion tokens | About 5 million books |
| Llama 3 training data (2024) | ~15 trillion tokens | About 150 million books |
| Human lifetime reading | ~1 billion tokens | About 10,000 books |
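The book comparisons are plain division, and a minimal sketch makes the arithmetic easy to check. It assumes the table's first row holds on average (about 100k tokens and 75k words per book); the constant and function names are illustrative, and real books and tokenizers vary widely.

```python
# Assumptions taken from the table's first row; rough averages, not measurements.
TOKENS_PER_BOOK = 100_000
WORDS_PER_TOKEN = 0.75  # implied by ~100k tokens being about 75k words

# Approximate token counts quoted in the table.
CORPORA = {
    "All of English Wikipedia": 4_000_000_000,
    "GPT-3 training data (2020)": 500_000_000_000,
    "Llama 3 training data (2024)": 15_000_000_000_000,
    "Human lifetime reading": 1_000_000_000,
}


def tokens_to_books(tokens: int) -> float:
    """Convert a token count into an equivalent number of average-length books."""
    return tokens / TOKENS_PER_BOOK


def tokens_to_words(tokens: int) -> float:
    """Convert a token count into an approximate English word count."""
    return tokens * WORDS_PER_TOKEN


for name, tokens in CORPORA.items():
    print(f"{name}: about {tokens_to_books(tokens):,.0f} books")
```

Changing `TOKENS_PER_BOOK` shifts every comparison proportionally, which is why these figures are best read as orders of magnitude rather than exact counts.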
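In 2020, a paper by Kaplan et al. showed that model performance improves predictably as data and parameters grow: the model's loss falls along a smooth power-law curve. Double the data and the model gets a little better. Double the parameters and it gets a little better again, with diminishing returns at each step. This launched the race to build bigger datasets and bigger models, a race we are still in.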
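To see what "scales predictably" means in numbers, here is a small sketch of a power law of the form loss ∝ x^(-alpha). The exponents are illustrative, chosen to be roughly the size reported by Kaplan et al. (about 0.095 for data and 0.076 for parameters). The function name is hypothetical, and the calculation assumes the other resource is not the bottleneck.

```python
# Illustrative exponents, roughly the size reported by Kaplan et al. (2020).
# Treat them as assumptions for this sketch, not as exact constants.
ALPHA_DATA = 0.095
ALPHA_PARAMS = 0.076


def loss_ratio(scale_factor: float, alpha: float) -> float:
    """Return new_loss / old_loss when one resource is multiplied by
    scale_factor, assuming loss follows a power law x ** (-alpha)."""
    if scale_factor <= 0:
        raise ValueError("scale_factor must be positive")
    return scale_factor ** -alpha


for label, alpha in (("data", ALPHA_DATA), ("parameters", ALPHA_PARAMS)):
    for factor in (2, 10, 100):
        drop = (1 - loss_ratio(factor, alpha)) * 100
        print(f"{factor:>3}x {label}: loss falls by about {drop:.1f}%")
```

Under these assumptions, doubling the data trims the loss by roughly 6% and doubling the parameters by roughly 5%. Each step is a small gain, but it holds steadily across many orders of magnitude, which is what made scaling up an attractive bet.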
Researchers at Epoch AI have estimated that the high-quality English text on the public web amounts to roughly 10 to 20 trillion tokens. Llama 3 alone used about 15 trillion, and GPT-5 class models may need more. The field is running into a ceiling on human-written text, which is why synthetic data and multimodal scaling (image, video, audio) have become major research areas.
The big idea: training datasets are now at civilization scale. The next leap is unlikely to come from more scraping. It is more likely to come from smarter filtering and new modalities.
6 questions · take it digitally for instant feedback at tendril.neural-forge.io/learn/quiz/end-data-scale-of-modern-datasets
What is the main idea of "The Mind-Boggling Scale of Modern Training Data"?
Which concept is most central to "The Mind-Boggling Scale of Modern Training Data"?
What should a careful learner remember about "Tokens vs. words"?
You want to use AI after this lesson. What is the safest next step?
How should AI output about tokens be treated?
Name one way to verify an AI answer about tokens.