Quality Filtering: Separating Signal From Noise
Most of the raw web is garbage. Filtering it down to the small fraction worth training on is one of the highest-leverage steps in building modern AI models.
Builders · AI Foundations · ~17 min read
Most of the Internet Is Garbage
If you scrape the open web, most of what you get is spam, SEO-generated junk, parked domains full of ads, machine-translated text, and boilerplate. Training on this raw mess typically produces a worse model than training on a smaller but cleaner subset.
Common filter techniques
- Language ID (keep only target languages)
- Length filters (drop pages under 50 words)
- Character-level rules (drop pages with too much punctuation)
- Repetition filters (drop pages where one line repeats 50 times)
- Perplexity filters (use a small LM to score how natural the text looks)
- Classifier-based filters (train a model on Wikipedia vs. random web)
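The rule-based filters in this list fit in a few lines of Python. The sketch below is a minimal illustration, not a production pipeline: the function name `passes_heuristics` and the 0.3 punctuation ratio are assumptions made for the example, while the 50-word and 50-repeat thresholds come from the list. Language ID is left out because it normally relies on a separate pretrained model.

```python
import string
from collections import Counter


def passes_heuristics(text, min_words=50, max_punct_ratio=0.3, max_line_repeats=50):
    """Return True if a page survives three simple rule-based filters.

    All thresholds are illustrative defaults and should be tuned per corpus.
    """
    # Length filter: drop pages with too few whitespace-separated words.
    if len(text.split()) < min_words:
        return False

    # Character-level rule: drop pages dominated by ASCII punctuation.
    non_space = [c for c in text if not c.isspace()]
    punct = sum(1 for c in non_space if c in string.punctuation)
    if non_space and punct / len(non_space) > max_punct_ratio:
        return False

    # Repetition filter: drop pages where one non-empty line repeats too often.
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if lines and max(Counter(lines).values()) >= max_line_repeats:
        return False

    return True


# Toy pages for illustration only.
clean_page = " ".join(["word"] * 60)
spam_page = "\n".join(["Click here to subscribe now"] * 50)
```

Cheap checks like these usually run first, so the more expensive model-based filters only see pages that already look plausible.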
The perplexity trick
Train a small language model on a known-good corpus like Wikipedia. Run it on any web page and compute the page's perplexity, a measure of how surprised the model is by the text. Random spam tends to score very high (unpredictable garbage), while genuine writing scores lower. Keep only the pages whose perplexity falls below a chosen threshold.
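To make the idea concrete, here is a minimal sketch that swaps the small language model for a character-bigram model with add-one smoothing. That stand-in is an assumption chosen so the example runs with only the standard library; real pipelines use stronger models. The class name `CharBigramLM` and the toy training corpus are hypothetical.

```python
import math
from collections import Counter


class CharBigramLM:
    """Character-bigram language model with add-one (Laplace) smoothing."""

    def __init__(self, corpus):
        self.pair_counts = Counter(zip(corpus, corpus[1:]))
        self.prev_counts = Counter(corpus[:-1])
        # Vocabulary size, plus one slot shared by all unseen characters.
        self.vocab_size = len(set(corpus)) + 1

    def perplexity(self, text):
        """exp of the average negative log-probability per predicted character."""
        if len(text) < 2:
            return float("inf")
        log_prob = 0.0
        for prev, cur in zip(text, text[1:]):
            p = (self.pair_counts[(prev, cur)] + 1) / (
                self.prev_counts[prev] + self.vocab_size
            )
            log_prob += math.log(p)
        return math.exp(-log_prob / (len(text) - 1))


# Toy "known-good" corpus; a real one would be far larger.
corpus = (
    "the quick brown fox jumps over the lazy dog. the cat sat on the mat. "
    "a small language model predicts the next character from the previous one. "
) * 20
lm = CharBigramLM(corpus)

natural = lm.perplexity("the cat sat on the mat.")
gibberish = lm.perplexity("xq9#zzk@@vv!!qz0x")
print(f"natural: {natural:.1f}  gibberish: {gibberish:.1f}")
```

The model assigns much higher perplexity to the gibberish string than to the natural sentence. Where to put the cutoff between them is a tuning decision that depends on the corpus.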
Classifier filters
Take a sample labeled as high-quality (curated lists, Wikipedia, reference books) and another sample of random web text. Train a binary classifier. Then run it on the full corpus and keep the pages it labels high-quality. GPT-3 famously used this approach with a simple logistic regression.
A GPT-3 style quality classifier
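    # Sketch of a simple quality classifier (toy data for illustration)
    from sklearn.feature_extraction.text import HashingVectorizer
    from sklearn.linear_model import LogisticRegression

    # Labels: 1 = high quality (curated), 0 = random web.
    # Two toy documents so the sketch runs; real training needs many thousands.
    X_train = [
        "The mitochondrion is an organelle found in most eukaryotic cells.",
        "BUY CHEAP PILLS NOW!!! click here click here click here",
    ]
    y_train = [1, 0]

    # Hash words into a fixed-size sparse feature vector, then fit the classifier.
    vec = HashingVectorizer(n_features=2**18)
    clf = LogisticRegression()
    clf.fit(vec.transform(X_train), y_train)

    # Score a new document: probability that it belongs to the high-quality class.
    score = clf.predict_proba(vec.transform(["some new page"]))[0, 1]
    keep = score > 0.7  # the cutoff is a tunable choice
    print("keep" if keep else "drop", round(score, 3))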
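A hard cutoff such as 0.7 is only one way to use the score. The GPT-3 paper describes keeping documents stochastically instead: a document is kept when a Pareto-distributed draw exceeds 1 minus its score, with a reported shape parameter of 9, so that mostly high-scoring pages survive but a few lower-scoring ones still get through. The sketch below illustrates that rule using only the standard library; the function names `keep_document` and `keep_rate` are hypothetical.

```python
import random


def keep_document(score, alpha=9.0, rng=random):
    """Stochastic keep rule: keep when a Pareto draw exceeds 1 - score.

    random.paretovariate returns values >= 1, so subtracting 1 gives a
    draw >= 0, matching the distribution that numpy.random.pareto samples.
    """
    draw = rng.paretovariate(alpha) - 1.0
    return draw > 1.0 - score


def keep_rate(score, trials=10_000, seed=0):
    """Estimate the fraction of documents kept at a given classifier score."""
    rng = random.Random(seed)
    kept = sum(keep_document(score, rng=rng) for _ in range(trials))
    return kept / trials


for s in (0.5, 0.9, 1.0):
    print(f"score={s:.1f}  keep rate={keep_rate(s):.3f}")
```

With these settings the keep rate climbs steeply as the score approaches 1, which is the intended effect: the filter stays strict without discarding every document the classifier is unsure about.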
The big idea: quality filtering is where small labs and big labs actually differ. Anyone can download Common Crawl. Not everyone can filter it well.
