The Mind-Boggling Scale of Modern Training Data
When we say trillions of tokens, we mean it. Let's make these numbers feel real with comparisons you can actually picture.
Section 1
Numbers So Big They Stop Meaning Anything
Modern LLMs are trained on trillions of tokens. That number is meaningless until you put it next to something familiar. Let's try.
Compare the options
| Measure | Size | Comparison |
|---|---|---|
| One book | ~100k tokens | About 75k words |
| All of English Wikipedia | ~4 billion tokens | ~40,000 books |
| GPT-3 training data (2020) | ~500 billion tokens | ~5 million books |
| Llama 3 training data (2024) | ~15 trillion tokens | ~150 million books |
| Human lifetime reading | ~1 billion tokens | ~10,000 books |
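The table is just division. A quick back-of-the-envelope check, assuming the same ~100,000 tokens per book used above (the dataset sizes are the table's figures, not independently measured):

```python
# Token-to-book conversions for the table above.
# Assumption: ~100,000 tokens per book (~75,000 words).
TOKENS_PER_BOOK = 100_000

datasets = {
    "English Wikipedia":        4e9,
    "GPT-3 training (2020)":    500e9,
    "Llama 3 training (2024)":  15e12,
    "Lifetime of heavy reading": 1e9,
}

for name, tokens in datasets.items():
    books = tokens / TOKENS_PER_BOOK
    print(f"{name}: {books:,.0f} books")
```

Running this reproduces the right-hand column: Wikipedia lands around 40,000 books, while Llama 3's corpus is roughly 150 million.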
The scaling law that changed everything
In 2020, a paper by Kaplan et al. showed that a language model's loss falls as a predictable power law in dataset size, parameter count, and compute: double the data and the model gets measurably better; double the parameters and it improves again. That result launched the race to build bigger datasets and bigger models, a race we are still in.
Have we run out of data?
Researchers at Epoch AI estimate that high-quality English text on the public web is around 10 to 20 trillion tokens. Llama 3 used 15 trillion. GPT-5 class models may need more. The field is genuinely running into a ceiling, which is why synthetic data and multimodal (image, video, audio) scaling have become huge research areas.
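Those two numbers already tell the story. Comparing Llama 3's 15 trillion tokens against the Epoch AI range quoted above (the range is their estimate, not a measurement):

```python
# How much of the estimated public-web ceiling has already been used?
# 10-20 trillion tokens is the Epoch AI estimate quoted in the text.
ceiling_low, ceiling_high = 10e12, 20e12
used = 15e12  # Llama 3 pretraining tokens

print(f"Fraction of low estimate used:  {used / ceiling_low:.0%}")   # 150%
print(f"Fraction of high estimate used: {used / ceiling_high:.0%}")  # 75%
```

Under the low estimate, a single 2024 model already exceeded the entire supply of high-quality public English text; even under the high estimate, three quarters of it is gone. That is the ceiling the field is hitting.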
The big idea: the datasets are genuinely at civilization scale now. The next leap will not come from more scraping, but from smarter filtering and new modalities.