Where Training Data Actually Comes From
You cannot understand modern AI without understanding its diet. Let's map where the data comes from, how it gets cleaned, and what that diet means for how models behave.
Lesson map
What this lesson covers
Learning path
The main moves in order
1. Data is the real fuel
2. Training data
3. Web scraping
4. Datasets
Section 1
Data Is the Real Fuel
The weights get the credit, but the data does the work. Change nothing about the architecture and swap in better data, and the model gets noticeably smarter. That is why top labs spend huge sums on data curation.
Where text comes from
- Common Crawl: a free, massive scrape of the public web
- Wikipedia and other reference wikis
- Books: public domain plus commercial libraries
- Academic papers and preprints
- Code repositories on GitHub and elsewhere
- Synthetic data: text generated by other AI models
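These sources are not mixed evenly: labs assign each source a sampling weight when drawing training batches. A toy sketch of that idea (the weights below are made up for illustration, not any lab's actual mix):

```python
# Toy sketch: draw training documents from a weighted mix of sources.
# The MIX weights are illustrative, not a real lab's recipe.
import random

MIX = {"web": 0.6, "books": 0.15, "code": 0.15, "wikipedia": 0.1}

def sample_sources(n, seed=0):
    # Draw n source labels, each with probability proportional to its weight.
    rng = random.Random(seed)
    names = list(MIX)
    weights = [MIX[name] for name in names]
    return rng.choices(names, weights=weights, k=n)
```

With these weights, a large batch will contain roughly six web documents for every Wikipedia article, which is why the quality of the web crawl dominates the quality of the model.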
The cleaning pipeline
Raw web data is a mess. Duplicates, spam, broken HTML, hateful text. Before training, teams spend months filtering and deduplicating. A rule of thumb: at least half of any raw crawl gets tossed.
1. Deduplicate near-identical pages
2. Filter by language and quality heuristics
3. Strip HTML and normalize Unicode
4. Remove PII and known toxic content
5. Up-sample high-value domains like science and textbooks
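A minimal sketch of a few of these steps (exact hashing-based dedup, a crude length filter, HTML stripping, up-sampling). Real pipelines use fuzzy deduplication, trained quality classifiers, and dedicated PII scrubbers; everything below is simplified for illustration:

```python
# Toy cleaning pipeline: strip HTML, normalize, filter, dedup, up-sample.
# Heuristics and domain weights are illustrative only.
import hashlib
import html
import re
import unicodedata

def clean_corpus(pages, upsample=None):
    # pages: (domain, raw_html) pairs; upsample maps domain -> repeat count
    upsample = upsample or {"science": 2, "textbook": 2}
    seen, out = set(), []
    for domain, raw in pages:
        # Strip HTML tags, then normalize Unicode and collapse whitespace.
        text = re.sub(r"<[^>]+>", " ", html.unescape(raw))
        text = " ".join(unicodedata.normalize("NFKC", text).split())
        # Crude quality heuristic: drop very short documents.
        if len(text.split()) < 5:
            continue
        # Exact dedup on a hash of the lowercased text.
        digest = hashlib.sha256(text.lower().encode()).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        # Up-sample high-value domains by repeating them in the output.
        out.extend([text] * upsample.get(domain, 1))
    return out
```

Even this toy version shows the shape of the real thing: most of the code is about throwing data away, and only the last line is about keeping more of the good stuff.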
RLHF: humans in the loop
After pretraining, companies hire people to rate model outputs. Better responses are rewarded. Worse ones are penalized. This method, called reinforcement learning from human feedback, is what turns a raw text predictor into a helpful assistant.
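The core mathematical move is small: a reward model scores each response, and a pairwise preference model (commonly Bradley-Terry in the RLHF literature) converts the score gap into a probability that humans prefer one response over the other. A toy sketch, not a real training loop:

```python
# Toy sketch of the preference step in RLHF: the probability that
# response A beats response B is a sigmoid of the reward gap.
import math

def preference_probability(reward_a, reward_b):
    # Equal rewards -> 0.5; a larger gap -> a more confident preference.
    return 1 / (1 + math.exp(-(reward_a - reward_b)))
```

Training the reward model means nudging scores so that responses humans preferred get higher rewards, which pushes these probabilities toward the observed human choices.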
Why the diet matters
- Biases in data become biases in the model
- Coverage gaps mean the model is weak in some languages or topics
- Recency is limited by the training cutoff date
- Synthetic data can cause a feedback loop if overused
“Garbage in, garbage out is the oldest rule in computing, and AI did not escape it.”
The big idea: an AI is a reflection of the text it was raised on. Understanding the data makes model behavior much less mysterious.
