If the same paragraph appears a million times in your training data, your model will memorize it. Deduplication quietly makes AI better.
The raw internet is full of copies. News articles get syndicated across hundreds of sites. License text appears in every open-source repo. A viral tweet gets quoted ten thousand times. If you train on raw Common Crawl without removing duplicates, your model sees the Wikipedia article on France not once, but a thousand times.
A 2021 paper, "Deduplicating Training Data Makes Language Models Better" by Lee et al., showed that aggressive deduplication cuts the rate at which models regurgitate memorized training text by roughly 10x while slightly improving model quality. Every major lab now deduplicates.
| Type | Example | How to detect |
|---|---|---|
| Exact | The same file twice | Hash the bytes |
| Near | Two news sites with one word different | MinHash + Jaccard similarity |
| Near (paraphrased) | Article rewritten with synonyms | Embedding similarity |
| Semantic | Two explanations of the same concept | Harder, requires LLMs |
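The first row of the table is the easy case: exact duplicates fall out of a single pass that hashes each document's bytes and keeps only the first copy. A minimal sketch (the function name `exact_dedup` is ours, not from any library):

```python
import hashlib

def exact_dedup(docs):
    """Keep only the first occurrence of each byte-identical document."""
    seen = set()
    unique = []
    for text in docs:
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(text)
    return unique

docs = ["MIT License text", "an original article", "MIT License text"]
print(exact_dedup(docs))  # the license text survives only once
```

This catches the same file appearing twice, but misses the two news sites that differ by one word; that is where MinHash comes in.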
MinHash is a clever algorithm that estimates the Jaccard similarity of two documents from small, fixed-size signatures; paired with locality-sensitive hashing (LSH), it surfaces near-duplicate candidates without comparing every pair of documents. Break each document into n-grams (say, 5-word shingles), hash each shingle under many different hash functions, keep the minimum per function, and compare signatures: the fraction of matching minimums estimates the Jaccard similarity. This is what lets you find near-duplicates across a trillion-token dataset in hours instead of centuries.
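To build intuition before reaching for a library, here is a from-scratch sketch of the estimate. The names `shingles`, `minhash_signature`, and `estimated_jaccard` are ours for illustration, and Python's built-in `hash` stands in for a proper hash family, so exact values vary between runs:

```python
def shingles(text, n=2):
    """Break text into overlapping n-word shingles."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def minhash_signature(items, num_perm=256):
    """For each of num_perm hash functions, keep the minimum hash over all items."""
    return [min(hash((seed, item)) for item in items) for seed in range(num_perm)]

def estimated_jaccard(sig_a, sig_b):
    """The fraction of matching minimums estimates the Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

a = minhash_signature(shingles("the cat sat on the mat"))
b = minhash_signature(shingles("the cat sat on the rug"))
c = minhash_signature(shingles("dogs bark loudly at night"))
print(round(estimated_jaccard(a, b), 2))  # high: only one word differs
print(round(estimated_jaccard(a, c), 2))  # near zero: unrelated text
```

The key property: for any single hash function, the probability that two sets share the same minimum equals their Jaccard similarity, so averaging over many hash functions converges on the true value.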
MinHash-LSH deduplication in Python:

```python
from datasketch import MinHash, MinHashLSH

def minhash_for(text):
    mh = MinHash(num_perm=128)
    for word in set(text.lower().split()):
        mh.update(word.encode('utf-8'))
    return mh

lsh = MinHashLSH(threshold=0.8, num_perm=128)
docs = ['cats are fuzzy', 'cats are quite fuzzy', 'dogs bark loudly']
for i, text in enumerate(docs):
    lsh.insert(f'doc_{i}', minhash_for(text))

# Find near-duplicates of doc 0
print(lsh.query(minhash_for(docs[0])))
```

The big idea: more unique data beats more total data. Deduplication is one of the quietest, most effective steps in modern AI.