Lesson 215 of 1570
Quality Filtering: Separating Signal From Noise
The raw web is 99 percent garbage. Filtering it down to the 1 percent worth training on is one of the highest-leverage steps in modern AI.
Lesson map
What this lesson covers
Learning path
The main moves in order
1. Most of the Internet Is Garbage
2. Quality filtering
3. Perplexity
4. Heuristics
Section 1
Most of the Internet Is Garbage
If you scrape the open web, most of what you get is spam, SEO-generated junk, parked domains full of ads, auto-translated machine output, and boilerplate. Training a model on this raw mess makes a worse model than training on a smaller but cleaner subset.
Common filter techniques
1. Language ID (keep only target languages)
2. Length filters (drop pages under 50 words)
3. Character-level rules (drop pages with too much punctuation)
4. Repetition filters (drop pages where one line repeats 50 times)
5. Perplexity filters (use a small LM to score how natural the text looks)
6. Classifier-based filters (train a model on Wikipedia vs. random web)
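The rule-based filters above (length, punctuation, repetition) can be sketched in a few lines. This is a minimal illustration, not any lab's production pipeline; the threshold values here are assumptions chosen for readability.

```python
from collections import Counter

def passes_heuristics(text, min_words=50, max_punct_ratio=0.3, max_line_repeat=50):
    """Cheap rule-based filters. Thresholds are illustrative, not from a real pipeline."""
    words = text.split()
    if len(words) < min_words:  # length filter: drop very short pages
        return False
    # character-level rule: too much punctuation suggests spam or markup debris
    punct = sum(1 for c in text if not c.isalnum() and not c.isspace())
    if punct / max(len(text), 1) > max_punct_ratio:
        return False
    # repetition filter: one line repeated many times suggests boilerplate
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if lines and Counter(lines).most_common(1)[0][1] >= max_line_repeat:
        return False
    return True

print(passes_heuristics("word " * 100))  # True: long enough, clean text
print(passes_heuristics("too short"))    # False: under 50 words
```

Real pipelines chain dozens of rules like these; each one is trivially cheap, which is why heuristics run first, before any model-based scoring.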
The perplexity trick
Train a small language model on a known-good corpus like Wikipedia. Run it on any web page and compute its perplexity, a measure of how surprising the text is. Random spam has very high perplexity (unpredictable garbage). Genuine writing has low perplexity. Keep only the low-perplexity pages.
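The idea can be sketched with a toy unigram model standing in for the real small LM (production systems typically use something like a KenLM n-gram model). All corpora and thresholds here are made up for illustration.

```python
import math
from collections import Counter

def train_unigram(corpus_tokens, smoothing=1.0):
    """Toy unigram LM with add-one smoothing, standing in for a real small LM."""
    counts = Counter(corpus_tokens)
    total = sum(counts.values())
    vocab = len(counts) + 1  # +1 slot for unseen tokens
    def logprob(token):
        return math.log((counts.get(token, 0) + smoothing) / (total + smoothing * vocab))
    return logprob

def perplexity(text, logprob):
    """exp of the average negative log-likelihood per token."""
    tokens = text.lower().split()
    if not tokens:
        return float("inf")
    avg_nll = -sum(logprob(t) for t in tokens) / len(tokens)
    return math.exp(avg_nll)

# "Known-good" corpus (in practice: Wikipedia, not one sentence)
good = "the cat sat on the mat and the dog slept by the door".split()
lm = train_unigram(good)

clean = "the dog sat by the door"
spam = "buy cheap zzz pills xxx now click"
print(perplexity(clean, lm) < perplexity(spam, lm))  # True: clean text scores lower
```

The filter is then just a threshold: keep pages whose perplexity under the known-good model falls below some cutoff tuned on held-out data.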
Classifier filters
Take a sample labeled as high-quality (curated lists, Wikipedia, reference books) and another sample of random web text. Train a binary classifier. Then run it on the full corpus and keep the pages it labels high-quality. GPT-3 famously used this approach with a simple logistic regression.
A GPT-3 style quality classifier
# Pseudocode for a simple quality classifier
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import LogisticRegression

# Labels: 1 = high quality, 0 = random web
X_train = [...]  # list of documents
y_train = [...]  # 1s for curated, 0s for random

vec = HashingVectorizer(n_features=2**18)
clf = LogisticRegression()
clf.fit(vec.transform(X_train), y_train)

# Score a new document; keep pages above a chosen threshold
score = clf.predict_proba(vec.transform(['some new page']))[0, 1]
if score > 0.7:
    keep_this_page()
The big idea: quality filtering is where small labs and big labs actually differ. Anyone can download Common Crawl. Not everyone can filter it well.
