The raw web is 99 percent garbage. Filtering it down to the 1 percent worth training on is one of the highest-leverage steps in modern AI.
If you scrape the open web, most of what you get is spam, SEO-generated junk, parked domains full of ads, auto-translated machine output, and boilerplate. Training a model on this raw mess makes a worse model than training on a smaller but cleaner subset.
One approach is perplexity filtering. Train a small language model on a known-good corpus like Wikipedia, then run it on each web page and compute the page's perplexity, a measure of how surprised the model is by the text. Random spam tends to have very high perplexity because it looks nothing like the reference corpus; genuine writing tends to score lower. Keep only the pages below a chosen perplexity threshold.
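The perplexity filter described above can be sketched in a few lines. This is a toy illustration, not a production recipe: it assumes a word-level unigram model with add-one smoothing as a stand-in for the "small language model", and the names `UnigramModel`, `keep_low_perplexity`, and the threshold value are hypothetical. A real pipeline would use a stronger model (for example an n-gram or small neural model) trained on a much larger reference corpus.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer; a deliberately crude assumption."""
    return text.lower().split()


class UnigramModel:
    """Add-one smoothed unigram model, a toy stand-in for a small LM."""

    def __init__(self, corpus):
        self.counts = Counter(tok for doc in corpus for tok in tokenize(doc))
        self.total = sum(self.counts.values())
        # Reserve one extra vocabulary slot for words never seen in training.
        self.vocab_size = len(self.counts) + 1

    def log_prob(self, token):
        return math.log((self.counts[token] + 1) / (self.total + self.vocab_size))

    def perplexity(self, text):
        """exp of the average negative log-probability per token."""
        tokens = tokenize(text)
        if not tokens:
            return float("inf")
        avg_nll = -sum(self.log_prob(t) for t in tokens) / len(tokens)
        return math.exp(avg_nll)


def keep_low_perplexity(pages, model, threshold):
    """Return only the pages whose perplexity is below the threshold."""
    return [p for p in pages if model.perplexity(p) < threshold]


# Tiny made-up "known-good" corpus, for illustration only.
reference = [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "a cat and a dog are animals",
]
model = UnigramModel(reference)

clean = "the cat sat on the rug"
spam = "buy cheap pills click now"
kept = keep_low_perplexity([clean, spam], model, threshold=20.0)
print(kept)
```

With this toy corpus the spam line consists entirely of unseen words, so it scores a higher perplexity than the in-domain sentence and is dropped. The threshold of 20.0 is arbitrary; in practice it would be chosen by inspecting the perplexity distribution over a sample of the crawl.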
A second approach is classifier-based filtering. Take a sample labeled as high-quality (curated lists, Wikipedia, reference books) and another sample of random web text, and train a binary classifier to tell them apart. Then run it on the full corpus and keep the pages it scores as high-quality. GPT-3 famously used this approach with a simple logistic regression classifier.
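# A minimal, runnable sketch of a simple quality classifier (toy data for illustration)
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import LogisticRegression

# Labels: 1 = high quality (curated), 0 = random web
X_train = [
    "The mitochondrion is an organelle that produces most of the cell's energy.",
    "The treaty was signed in 1648 and ended the Thirty Years' War.",
    "CLICK HERE best cheap deals buy now free free free",
    "win prizes now!!! casino bonus click click click",
]
y_train = [1, 1, 0, 0]

vec = HashingVectorizer(n_features=2**18)  # stateless, so no fit step is needed
clf = LogisticRegression()
clf.fit(vec.transform(X_train), y_train)

# Score a new document: column 1 is the probability of the high-quality class
score = clf.predict_proba(vec.transform(["some new page"]))[0, 1]
THRESHOLD = 0.7  # cutoff from the original example; tune it on held-out data
if score > THRESHOLD:
    print("keep this page")

A GPT-3 style quality classifier

The big idea: quality filtering is where small labs and big labs actually differ. Anyone can download Common Crawl. Not everyone can filter it well.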