The raw web is overwhelmingly garbage. Filtering it down to the small fraction worth training on is one of the highest-leverage steps in modern AI.
If you scrape the open web, most of what you get is spam, SEO-generated junk, parked domains full of ads, auto-translated machine output, and boilerplate. Training a model on this raw mess makes a worse model than training on a smaller but cleaner subset.
One approach is perplexity filtering. Train a small language model on a known-good corpus like Wikipedia. Run it on any web page and compute the page's perplexity, a measure of how surprising the model finds the text. Random spam has very high perplexity (unpredictable garbage). Writing that resembles the reference corpus has low perplexity. Keep only the low-perplexity pages.
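The perplexity filter described above can be sketched in a few lines. This is a minimal illustration, not a production recipe: it assumes a word-level unigram model with add-one smoothing as a stand-in for the "small language model", and the reference sentences, page texts, and `THRESHOLD` value are all made up for the example. Real pipelines typically use a stronger n-gram or neural model trained on a far larger reference corpus.

```python
import math
from collections import Counter


def train_unigram(reference_docs):
    """Count word frequencies in the trusted reference corpus."""
    counts = Counter()
    for doc in reference_docs:
        counts.update(doc.lower().split())
    return counts


def perplexity(text, counts):
    """Per-word perplexity under an add-one-smoothed unigram model."""
    words = text.lower().split()
    if not words:
        return float("inf")  # nothing to score; treat as maximally surprising
    total = sum(counts.values())
    vocab = len(counts) + 1  # one extra slot shared by all unseen words
    log_prob = sum(math.log((counts[w] + 1) / (total + vocab)) for w in words)
    return math.exp(-log_prob / len(words))


# Hypothetical stand-in for a known-good corpus such as Wikipedia
reference = [
    "the river flows through the valley and into the sea",
    "the city is the capital of the region and a centre of trade",
]
model = train_unigram(reference)

# Hypothetical pages to score
pages = {
    "prose": "the capital of the region is a city in the valley",
    "spam": "buy cheap pills click now free bonus winner casino",
}
scores = {name: perplexity(text, model) for name, text in pages.items()}

THRESHOLD = 20.0  # assumed cutoff; in practice tuned on held-out data
kept = [name for name, s in scores.items() if s < THRESHOLD]
```

Here the page built from words the model has seen scores below the cutoff, while the spam page, made entirely of unseen words, scores well above it. The cutoff is the main design choice: set it too low and you discard good text in styles the reference corpus lacks, set it too high and spam slips through.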
A second approach is to train a classifier. Take one sample labeled as high-quality (curated lists, Wikipedia, reference books) and another sample of random web text, then train a binary classifier to tell them apart. Run it on the full corpus and keep the pages it scores as high-quality. GPT-3 famously used this approach with a simple logistic regression.
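```python
# Sketch of a simple quality classifier
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import LogisticRegression

# Labels: 1 = high quality, 0 = random web.
# Placeholder documents so the sketch runs; substitute real samples.
X_train = [
    "The mitochondrion is an organelle found in most eukaryotic cells.",
    "The treaty was signed in 1648 and ended the war.",
    "CLICK HERE best deals cheap cheap cheap buy now!!!",
    "win free prizes casino bonus click click click",
]
y_train = [1, 1, 0, 0]  # 1s for curated, 0s for random

# Hashing needs no fitted vocabulary, so transform() works directly
vec = HashingVectorizer(n_features=2**18)
clf = LogisticRegression()
clf.fit(vec.transform(X_train), y_train)

# Score a new document: probability of the high-quality class
score = clf.predict_proba(vec.transform(["some new page"]))[0, 1]
if score > 0.7:
    print("keep this page")
```

A GPT-3 style quality classifier

The big idea: quality filtering is where small labs and big labs actually differ. Anyone can download Common Crawl. Not everyone can filter it well.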
8 questions · take it digitally for instant feedback at tendril.neural-forge.io/learn/quiz/end-data-quality-filtering
What is the main idea of "Quality Filtering: Separating Signal From Noise"?
Which concept is most central to "Quality Filtering: Separating Signal From Noise"?
Which use of AI fits this topic best?
What should a careful learner remember about "Classic result"?
You want to use AI after this lesson. What is the safest next step?
How should AI output about quality filtering be treated?
Name one way to verify an AI answer about quality filtering.
Which action would help you apply "Quality Filtering: Separating Signal From Noise" responsibly?