Where Training Data Actually Comes From
You cannot understand modern AI without understanding its diet. Let's map where the data comes from, how it gets cleaned, and what that diet means for how models behave.
Lesson map
What this lesson covers
Learning path
The main moves in order
1. Data is the real fuel
2. Training data
3. Web scraping
4. Datasets
Section 1
Data Is the Real Fuel
The weights get the credit, but the data does the work. Change nothing about the architecture and swap in better data, and the model gets noticeably smarter. That is why top labs spend huge sums on data curation.
Where text comes from
- Common Crawl: a free, massive scrape of the public web
- Wikipedia and other reference wikis
- Books: public domain plus commercial libraries
- Academic papers and preprints
- Code repositories on GitHub and elsewhere
- Synthetic data: text generated by other AI models
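These sources are not mixed evenly: labs assign each source a sampling weight when drawing training batches. A toy sketch of that idea (the weights below are made up for illustration, not any lab's actual mix):

```python
# Toy sketch: draw training documents from a weighted mix of sources.
# The MIX weights are illustrative, not a real lab's recipe.
import random

MIX = {"web": 0.6, "books": 0.15, "code": 0.15, "wikipedia": 0.1}

def sample_sources(n, seed=0):
    # Draw n source labels, each with probability proportional to its weight.
    rng = random.Random(seed)
    names = list(MIX)
    weights = [MIX[name] for name in names]
    return rng.choices(names, weights=weights, k=n)
```

With these weights, a large batch will contain roughly six web documents for every Wikipedia article, which is why the quality of the web crawl dominates the quality of the model.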
The cleaning pipeline
Raw web data is a mess. Duplicates, spam, broken HTML, hateful text. Before training, teams spend months filtering and deduplicating. A rule of thumb: at least half of any raw crawl gets tossed.
1. Deduplicate near-identical pages
2. Filter by language and quality heuristics
3. Strip HTML and normalize Unicode
4. Remove PII and known toxic content
5. Up-sample high-value domains like science and textbooks
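A minimal sketch of a few of these steps (exact hashing-based dedup, a crude length filter, HTML stripping, up-sampling). Real pipelines use fuzzy deduplication, trained quality classifiers, and dedicated PII scrubbers; everything below is simplified for illustration:

```python
# Toy cleaning pipeline: strip HTML, normalize, filter, dedup, up-sample.
# Heuristics and domain weights are illustrative only.
import hashlib
import html
import re
import unicodedata

def clean_corpus(pages, upsample=None):
    # pages: (domain, raw_html) pairs; upsample maps domain -> repeat count
    upsample = upsample or {"science": 2, "textbook": 2}
    seen, out = set(), []
    for domain, raw in pages:
        # Strip HTML tags, then normalize Unicode and collapse whitespace.
        text = re.sub(r"<[^>]+>", " ", html.unescape(raw))
        text = " ".join(unicodedata.normalize("NFKC", text).split())
        # Crude quality heuristic: drop very short documents.
        if len(text.split()) < 5:
            continue
        # Exact dedup on a hash of the lowercased text.
        digest = hashlib.sha256(text.lower().encode()).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        # Up-sample high-value domains by repeating them in the output.
        out.extend([text] * upsample.get(domain, 1))
    return out
```

Even this toy version shows the shape of the real thing: most of the code is about throwing data away, and only the last line is about keeping more of the good stuff.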
RLHF: humans in the loop
After pretraining, companies hire people to rate model outputs. Better responses are rewarded. Worse ones are penalized. This method, called reinforcement learning from human feedback, is what turns a raw text predictor into a helpful assistant.
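The core mathematical move is small: a reward model scores each response, and a pairwise preference model (commonly Bradley-Terry in the RLHF literature) converts the score gap into a probability that humans prefer one response over the other. A toy sketch, not a real training loop:

```python
# Toy sketch of the preference step in RLHF: the probability that
# response A beats response B is a sigmoid of the reward gap.
import math

def preference_probability(reward_a, reward_b):
    # Equal rewards -> 0.5; a larger gap -> a more confident preference.
    return 1 / (1 + math.exp(-(reward_a - reward_b)))
```

Training the reward model means nudging scores so that responses humans preferred get higher rewards, which pushes these probabilities toward the observed human choices.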
Why the diet matters
- Biases in data become biases in the model
- Coverage gaps mean the model is weak in some languages or topics
- Recency is limited by the training cutoff date
- Synthetic data can cause a feedback loop if overused
“Garbage in, garbage out is the oldest rule in computing, and AI did not escape it.”
The big idea: an AI is a reflection of the text it was raised on. Understanding the data makes model behavior much less mysterious.
