Deduplication: Why Repeats Hurt Models
If the same paragraph appears a million times in your training data, your model will memorize it. Deduplication quietly makes AI better.
Lesson map
The main moves, in order:
1. The Internet Repeats Itself
2. Deduplication
3. Memorization
4. MinHash
Section 1
The Internet Repeats Itself
The raw internet is full of copies. News articles get syndicated across hundreds of sites. License text appears in every open-source repo. A viral tweet gets quoted ten thousand times. If you train on raw Common Crawl without removing duplicates, your model sees the Wikipedia article on France not once, but a thousand times.
Research that proved it matters
A 2021 paper, Deduplicating Training Data Makes Language Models Better by Lee et al., showed that models trained on deduplicated data emit memorized training text about 10x less often, while matching or slightly improving on the quality of models trained on the raw data. Every major lab now deduplicates.
Flavors of duplication
Compare the options
| Type | Example | How to detect |
|---|---|---|
| Exact | The same file twice | Hash the bytes (see the sketch after this table) |
| Near | Two news sites with one word different | MinHash + Jaccard similarity |
| Near (paraphrased) | Article rewritten with synonyms | Embedding similarity |
| Semantic | Two explanations of the same concept | Harder, requires LLMs |
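For the exact case, a content hash is all the machinery you need. A minimal sketch; the sample documents and the keep-first-copy policy are illustrative choices:

```python
import hashlib

def content_hash(text):
    # Identical bytes produce identical digests; any change produces a new one.
    return hashlib.sha256(text.encode('utf-8')).hexdigest()

docs = ['same article', 'same article', 'a different article']
seen, unique = set(), []
for doc in docs:
    h = content_hash(doc)
    if h not in seen:  # keep only the first copy of each exact duplicate
        seen.add(h)
        unique.append(doc)

print(unique)  # ['same article', 'a different article']
```

Change one byte and the hash changes completely, so this catches only exact copies. The near and semantic rows need the heavier machinery below.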
MinHash, the workhorse
MinHash is a clever algorithm that estimates how similar two documents are without comparing their full contents. Break each document into n-grams (say, 5-word shingles), hash every shingle under many different hash functions, and keep only the minimum value from each function. The fraction of minimums two documents share estimates the Jaccard similarity of their shingle sets. Paired with locality-sensitive hashing (LSH), which buckets similar signatures together so you never compare every pair, it finds near-duplicates across a trillion-token dataset in hours instead of centuries.
MinHash-LSH deduplication in Python

```python
from datasketch import MinHash, MinHashLSH

def minhash_for(text):
    # Build a 128-permutation MinHash signature from the document's word set.
    # (Production pipelines hash word shingles, e.g. 5-grams, not single words.)
    mh = MinHash(num_perm=128)
    for word in set(text.lower().split()):
        mh.update(word.encode('utf-8'))
    return mh

# Index signatures so pairs above the estimated Jaccard threshold become candidates.
# Threshold 0.7 rather than 0.8: the two cat sentences below share 3 of 4 words,
# a true Jaccard of 0.75, which a 0.8 threshold could miss.
lsh = MinHashLSH(threshold=0.7, num_perm=128)

docs = ['cats are fuzzy', 'cats are quite fuzzy', 'dogs bark loudly']
for i, text in enumerate(docs):
    lsh.insert(f'doc_{i}', minhash_for(text))

# Find near-duplicates of doc 0: doc_0 itself and, in all likelihood, doc_1.
print(lsh.query(minhash_for(docs[0])))
```
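The snippet above hashes single words to keep the example tiny. A minimal sketch of the shingled variant the prose describes, still using datasketch; the 5-word window and the sample sentences are illustrative, not canon:

```python
from datasketch import MinHash

def shingled_minhash(text, k=5, num_perm=128):
    # Hash every k-word shingle instead of individual words, so word order matters.
    words = text.lower().split()
    shingles = {' '.join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}
    mh = MinHash(num_perm=num_perm)
    for s in shingles:
        mh.update(s.encode('utf-8'))
    return mh

a = shingled_minhash('news wires syndicate the same story to hundreds of outlets every day')
b = shingled_minhash('news wires syndicate the same story to thousands of outlets every day')
# Estimated Jaccard similarity of the two shingle sets. One changed word removes
# every shingle that overlaps it, so even this near-identical pair scores well
# below 1.0. That is why thresholds for shingled MinHash are tuned carefully.
print(a.jaccard(b))
```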
The big idea: more unique data beats more total data. Deduplication is one of the quietest, most effective steps in modern AI.
Related lessons
Keep going
- The Turing Test and Its Discontents (Builders · 22 min): The imitation game became famous, but most AI researchers now think it measures the wrong thing.
- NotebookLM: Turning Your Notes Into a Study Buddy (Builders · 28 min): Google's NotebookLM lets you upload textbooks, lectures, and notes, then chat with them. This is the most underrated study tool of 2026.
- Quality Filtering: Separating Signal From Noise (Builders · 28 min): The raw web is 99 percent garbage. Filtering it down to the 1 percent worth training on is one of the highest-leverage steps in modern AI.
