Quality Filtering: Separating Signal From Noise

The raw web is 99 percent garbage. Filtering it down to the 1 percent worth training on is one of the highest-leverage steps in modern AI.

28 min · Reviewed 2026

Most of the Internet Is Garbage

If you scrape the open web, most of what you get is spam, SEO-generated junk, parked domains full of ads, machine-translated text, and boilerplate. Training a model on this raw mess typically produces a worse model than training on a smaller but cleaner subset.

Common filter techniques

Language ID (keep only target languages)
Length filters (drop pages under 50 words)
Character-level rules (drop pages with too much punctuation)
Repetition filters (drop pages where one line repeats 50 times)
Perplexity filters (use a small LM to score how natural the text looks)
Classifier-based filters (train a model on Wikipedia vs. random web)
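
The rule-based filters in this list are cheap enough to run first. Here is a minimal sketch of three of them (length, punctuation, repetition) using only the Python standard library. The 50-word and 50-repeat limits come from the list above. The 0.3 punctuation ratio and the function name passes_heuristics are assumptions made for illustration, not values from any particular pipeline.

```python
import string
from collections import Counter

MIN_WORDS = 50          # length filter: drop pages under 50 words
MAX_PUNCT_RATIO = 0.3   # assumed cutoff for "too much punctuation"
MAX_LINE_REPEATS = 50   # repetition filter: one line repeated 50 times


def passes_heuristics(text: str) -> bool:
    """Return True if the page survives all three rule-based filters."""
    # Length filter
    words = text.split()
    if len(words) < MIN_WORDS:
        return False

    # Character-level rule: share of non-whitespace characters that are punctuation
    non_space = [c for c in text if not c.isspace()]
    punct = sum(1 for c in non_space if c in string.punctuation)
    if punct / len(non_space) > MAX_PUNCT_RATIO:
        return False

    # Repetition filter: count how often the most common non-empty line appears
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if lines and max(Counter(lines).values()) >= MAX_LINE_REPEATS:
        return False

    return True


# Hypothetical pages for illustration
pages = {
    "plain": " ".join(["word"] * 60),
    "short": "too short",
    "repeated": "\n".join(["buy now today"] * 50),
}
kept = [name for name, text in pages.items() if passes_heuristics(text)]
print(kept)
```

Thresholds like these are usually tuned by inspecting a sample of what each rule drops, since a cutoff that works for news articles may throw away code or poetry.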

The perplexity trick

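Train a small language model on a known-good corpus like Wikipedia. Run it on any web page and compute the page's perplexity, a measure of how surprising the text is to the model. Spam and garbled text tend to have very high perplexity because they look nothing like the reference corpus, while fluent writing tends to have low perplexity. Keep the low-perplexity pages, but inspect both ends of the range: highly repetitive boilerplate can also score low, and good text written in a style unlike the reference corpus can score high.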
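
To make the idea concrete, here is a toy sketch that uses a unigram model with add-one smoothing as the "small language model". Perplexity is computed as the exponential of the average negative log-probability per word. The three-sentence reference corpus and the function names are assumptions for illustration only. A real pipeline would use a much stronger model trained on a large corpus.

```python
import math
from collections import Counter


def train_unigram(corpus: list[str]) -> tuple[Counter, int]:
    """Count lowercase whitespace-separated words in the reference corpus."""
    counts = Counter(w for doc in corpus for w in doc.lower().split())
    return counts, sum(counts.values())


def perplexity(text: str, counts: Counter, total: int) -> float:
    """exp of the mean negative log-probability per word, add-one smoothed."""
    words = text.lower().split()
    if not words:
        return float("inf")
    vocab = len(counts) + 1  # one extra slot shared by all unseen words
    log_prob = sum(math.log((counts[w] + 1) / (total + vocab)) for w in words)
    return math.exp(-log_prob / len(words))


# Toy stand-in for a known-good corpus (assumption, not real data)
reference = [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "a cat and a dog sat on the mat",
]
counts, total = train_unigram(reference)

natural = perplexity("the cat sat on the rug", counts, total)
spam = perplexity("zxqv click!!! win$$$ free-pills zzzz", counts, total)
print(natural < spam)  # the spam-like string is more surprising to the model
```

A unigram model ignores word order, so it only captures vocabulary mismatch. Models that condition on previous words also penalize scrambled or ungrammatical text.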

Classifier filters

Take a sample labeled as high-quality (curated lists, Wikipedia, reference books) and another sample of random web text. Train a binary classifier. Then run it on the full corpus and keep the pages it labels high-quality. GPT-3 famously used this approach with a simple logistic regression.

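    # Sketch of a simple quality classifier (toy training data, for illustration only)
    from sklearn.feature_extraction.text import HashingVectorizer
    from sklearn.linear_model import LogisticRegression

    # Labels: 1 = high quality (curated), 0 = random web
    X_train = [
        "The mitochondrion is an organelle found in most eukaryotic cells.",
        "BUY CHEAP PILLS NOW!!! click here click here click here",
    ]
    y_train = [1, 0]

    # Hash word counts into a fixed-size sparse feature space (no vocabulary to fit)
    vec = HashingVectorizer(n_features=2**18)
    clf = LogisticRegression()
    clf.fit(vec.transform(X_train), y_train)

    # Score a new document: probability that it belongs to the high-quality class
    score = clf.predict_proba(vec.transform(["some new page"]))[0, 1]
    keep = score > 0.7  # illustrative threshold; tune it on held-out data

A GPT-3 style quality classifier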
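
The fixed 0.7 cutoff is a simplification. The GPT-3 paper describes a noisier rule: keep a document when np.random.pareto(alpha) > 1 - document_score, with alpha = 9, so that mostly high-scoring pages survive but some low-scoring ones still get through. The sketch below reproduces that rule with the standard library. The function name keep_document and the example scores are assumptions for illustration.

```python
import random

ALPHA = 9  # shape parameter reported in the GPT-3 paper


def keep_document(score: float, rng: random.Random) -> bool:
    """Stochastic keep rule: higher classifier scores are kept more often."""
    # random.paretovariate(a) - 1 has the same distribution as np.random.pareto(a)
    return rng.paretovariate(ALPHA) - 1.0 > 1.0 - score


rng = random.Random(0)
trials = 10_000
high_rate = sum(keep_document(0.9, rng) for _ in range(trials)) / trials
low_rate = sum(keep_document(0.1, rng) for _ in range(trials)) / trials
print(f"kept at score 0.9: {high_rate:.3f}, kept at score 0.1: {low_rate:.3f}")
```

The appeal of a randomized rule is that the filtered corpus is not limited to text that closely resembles the positive training set.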

The big idea: quality filtering is where small labs and big labs actually differ. Anyone can download Common Crawl. Not everyone can filter it well.

End-of-lesson check

8 questions · take it digitally for instant feedback at tendril.neural-forge.io/learn/quiz/end-data-quality-filtering

What is the main idea of "Quality Filtering: Separating Signal From Noise"?
1. The raw web is 99 percent garbage. Filtering it down to the 1 percent worth training on is one of the highest-leverage steps in modern AI.
2. Use AI as the final authority for the whole decision
3. Avoid checking the answer once it sounds polished
4. Focus only on speed instead of judgment

Which concept is most central to "Quality Filtering: Separating Signal From Noise"?
1. perplexity
2. quality filtering
3. heuristics
4. quality filter

Which use of AI fits this topic best?
1. Let the AI decide what matters without your review
2. Use the answer before checking whether it fits the situation
3. Language ID (keep only target languages)
4. Use the first answer without checking it

What should a careful learner remember about "Classic result"?
1. Use AI to draft or organize ideas about quality filtering, then verify before acting.
2. Skip the context so the tool can guess faster
3. Treat the output as private even after sharing it online
4. Use the answer without checking the source

You want to use AI after this lesson. What is the safest next step?
1. Act immediately because the AI answer is written clearly
2. Use the AI answer as a draft, then check it against a reliable source.
3. Hide uncertainty so the final answer looks cleaner
4. Use private or sensitive details before checking permission

How should AI output about quality filtering be treated?
1. As proof that no other source is needed
2. As a replacement for context, consent, or expert review
3. As a draft or helper output that still needs human judgment and verification
4. As something that becomes correct when it sounds confident

Name one way to verify an AI answer about quality filtering.

Which action would help you apply "Quality Filtering: Separating Signal From Noise" responsibly?
1. Use the tool to avoid thinking through the tradeoff
2. Keep going even if the output conflicts with a trusted source
3. Use the first answer without checking it
4. Length filters (drop pages under 50 words)

