Lesson 215 of 1570
Quality Filtering: Separating Signal From Noise
The raw web is 99 percent garbage. Filtering it down to the 1 percent worth training on is one of the highest-leverage steps in modern AI.
Lesson map
What this lesson covers
Learning path
The main moves in order
1. Most of the Internet Is Garbage
2. Quality filtering
3. Perplexity
4. Heuristics
Section 1
Most of the Internet Is Garbage
If you scrape the open web, most of what you get is spam, SEO-generated junk, parked domains full of ads, auto-translated machine output, and boilerplate. Training a model on this raw mess makes a worse model than training on a smaller but cleaner subset.
Common filter techniques
1. Language ID (keep only target languages)
2. Length filters (drop pages under 50 words)
3. Character-level rules (drop pages with too much punctuation)
4. Repetition filters (drop pages where one line repeats 50 times)
5. Perplexity filters (use a small LM to score how natural the text looks)
6. Classifier-based filters (train a model on Wikipedia vs. random web)
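The rule-based filters above (length, punctuation, repetition) can be sketched in a few lines. This is a minimal illustration, not any lab's production pipeline; the threshold values here are assumptions chosen for readability.

```python
from collections import Counter

def passes_heuristics(text, min_words=50, max_punct_ratio=0.3, max_line_repeat=50):
    """Cheap rule-based filters. Thresholds are illustrative, not from a real pipeline."""
    words = text.split()
    if len(words) < min_words:  # length filter: drop very short pages
        return False
    # character-level rule: too much punctuation suggests spam or markup debris
    punct = sum(1 for c in text if not c.isalnum() and not c.isspace())
    if punct / max(len(text), 1) > max_punct_ratio:
        return False
    # repetition filter: one line repeated many times suggests boilerplate
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if lines and Counter(lines).most_common(1)[0][1] >= max_line_repeat:
        return False
    return True

print(passes_heuristics("word " * 100))  # True: long enough, clean text
print(passes_heuristics("too short"))    # False: under 50 words
```

Real pipelines chain dozens of rules like these; each one is trivially cheap, which is why heuristics run first, before any model-based scoring.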
The perplexity trick
Train a small language model on a known-good corpus like Wikipedia. Run it on any web page and compute its perplexity, a measure of how surprising the text is. Random spam has very high perplexity (unpredictable garbage). Genuine writing has low perplexity. Keep only the low-perplexity pages.
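The idea can be sketched with a toy unigram model standing in for the real small LM (production systems typically use something like a KenLM n-gram model). All corpora and thresholds here are made up for illustration.

```python
import math
from collections import Counter

def train_unigram(corpus_tokens, smoothing=1.0):
    """Toy unigram LM with add-one smoothing, standing in for a real small LM."""
    counts = Counter(corpus_tokens)
    total = sum(counts.values())
    vocab = len(counts) + 1  # +1 slot for unseen tokens
    def logprob(token):
        return math.log((counts.get(token, 0) + smoothing) / (total + smoothing * vocab))
    return logprob

def perplexity(text, logprob):
    """exp of the average negative log-likelihood per token."""
    tokens = text.lower().split()
    if not tokens:
        return float("inf")
    avg_nll = -sum(logprob(t) for t in tokens) / len(tokens)
    return math.exp(avg_nll)

# "Known-good" corpus (in practice: Wikipedia, not one sentence)
good = "the cat sat on the mat and the dog slept by the door".split()
lm = train_unigram(good)

clean = "the dog sat by the door"
spam = "buy cheap zzz pills xxx now click"
print(perplexity(clean, lm) < perplexity(spam, lm))  # True: clean text scores lower
```

The filter is then just a threshold: keep pages whose perplexity under the known-good model falls below some cutoff tuned on held-out data.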
Classifier filters
Take a sample labeled as high-quality (curated lists, Wikipedia, reference books) and another sample of random web text. Train a binary classifier. Then run it on the full corpus and keep the pages it labels high-quality. GPT-3 famously used this approach with a simple logistic regression.
A GPT-3 style quality classifier
# Pseudocode for a simple quality classifier
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import LogisticRegression

# Labels: 1 = high quality, 0 = random web
X_train = [...]  # list of documents
y_train = [...]  # 1s for curated, 0s for random

vec = HashingVectorizer(n_features=2**18)
clf = LogisticRegression()
clf.fit(vec.transform(X_train), y_train)

# Score a new document; keep pages above a chosen threshold
score = clf.predict_proba(vec.transform(['some new page']))[0, 1]
if score > 0.7:
    keep_this_page()
The big idea: quality filtering is where small labs and big labs actually differ. Anyone can download Common Crawl. Not everyone can filter it well.
