If the same paragraph appears a million times in your training data, your model will memorize it. Deduplication quietly makes AI better.
The raw internet is full of copies. News articles get syndicated across hundreds of sites. License text appears in every open-source repo. A viral tweet gets quoted ten thousand times. If you train on raw Common Crawl without removing duplicates, your model sees the Wikipedia article on France not once, but a thousand times.
A 2021 paper by Lee et al., "Deduplicating Training Data Makes Language Models Better," showed that models trained on deduplicated data emit memorized training text about ten times less often, and reach the same or better accuracy in fewer training steps. Deduplication is now a standard step in large-scale training-data pipelines.
| Type | Example | How to detect |
|---|---|---|
| Exact | The same file twice | Hash the bytes |
| Near | Two news sites with one word different | MinHash + Jaccard similarity |
| Near (paraphrased) | Article rewritten with synonyms | Embedding similarity |
| Semantic | Two explanations of the same concept | Harder, requires LLMs |
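The first row of the table, exact duplicates caught by hashing the bytes, can be sketched in a few lines. This is a minimal illustration, not a production pipeline: the function name `exact_dedup` and the choice of SHA-256 are assumptions made for the example, and real pipelines often normalize whitespace or casing before hashing.

```python
import hashlib

def exact_dedup(docs):
    """Keep the first occurrence of each byte-identical document."""
    seen = set()      # digests of documents already kept
    unique = []
    for doc in docs:
        digest = hashlib.sha256(doc.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique

docs = ["cats are fuzzy", "dogs bark loudly", "cats are fuzzy", "Cats are fuzzy"]
print(exact_dedup(docs))
# ['cats are fuzzy', 'dogs bark loudly', 'Cats are fuzzy']
```

Note that "Cats are fuzzy" survives: a single changed byte produces a different hash, which is why near-duplicates need the similarity-based methods in the later rows.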
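MinHash is an algorithm that estimates the Jaccard similarity of two documents (the overlap between their sets of n-grams) without comparing the documents in full. Break each document into n-grams (say, 5-word shingles), hash them with many different hash functions, and keep the minimum value from each function as the document's signature. The fraction of positions where two signatures agree approximates the documents' Jaccard similarity. Paired with locality-sensitive hashing (LSH), which buckets similar signatures together so you never have to compare every pair, it lets you find near-duplicates across a trillion-token dataset in hours instead of centuries.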
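To make the shingle, hash, and keep-the-minimum steps concrete, here is a from-scratch sketch using only the standard library. Everything in it is illustrative: the helper names (`shingles`, `minhash_signature`, `estimate_jaccard`), the use of seeded BLAKE2b as a stand-in for a family of hash functions, and the 3-word shingles in the demo (chosen only because the sample sentences are short) are assumptions, not how any particular library implements MinHash.

```python
import hashlib

def shingles(text, n=5):
    """Return the set of n-word shingles in a document."""
    words = text.lower().split()
    if len(words) < n:
        return {" ".join(words)}
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard(a, b):
    """Exact Jaccard similarity of two non-empty sets."""
    return len(a & b) / len(a | b)

def _hash(seed, shingle):
    # Salting with the seed simulates a different hash function per seed.
    data = f"{seed}:{shingle}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")

def minhash_signature(shingle_set, num_hashes=128):
    """One minimum hash value per simulated hash function."""
    return [min(_hash(seed, s) for s in shingle_set) for seed in range(num_hashes)]

def estimate_jaccard(sig_a, sig_b):
    """Fraction of signature positions that agree."""
    matches = sum(x == y for x, y in zip(sig_a, sig_b))
    return matches / len(sig_a)

a = shingles("the quick brown fox jumps over the lazy dog", n=3)
b = shingles("the quick brown fox leaps over the lazy dog", n=3)

print("exact Jaccard:    ", jaccard(a, b))  # 4 shared shingles out of 10 -> 0.4
print("MinHash estimate: ", estimate_jaccard(minhash_signature(a), minhash_signature(b)))
```

The estimate will not match the exact value precisely; with 128 hash functions it typically lands within a few hundredths. The payoff is that each document is reduced to a fixed-size signature, so comparing two documents costs the same no matter how long they are.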
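```python
from datasketch import MinHash, MinHashLSH

def minhash_for(text):
    # Build a 128-permutation signature from the document's unique words.
    # (A toy simplification: real pipelines hash multi-word shingles.)
    mh = MinHash(num_perm=128)
    for word in set(text.lower().split()):
        mh.update(word.encode('utf-8'))
    return mh

# Index every document; queries return keys that are likely above the threshold.
lsh = MinHashLSH(threshold=0.8, num_perm=128)
docs = ['cats are fuzzy', 'cats are quite fuzzy', 'dogs bark loudly']
for i, text in enumerate(docs):
    lsh.insert(f'doc_{i}', minhash_for(text))

# Find near-duplicates of doc 0. The result always includes doc_0 itself.
# doc_1 shares 3 of 4 unique words (Jaccard 0.75), just under the 0.8
# threshold, so it may or may not appear: LSH is an approximate filter.
print(lsh.query(minhash_for(docs[0])))
```

MinHash-LSH deduplication in Python

The big idea: more unique data beats more total data. Deduplication is one of the quietest, most effective steps in modern AI.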
6 questions · take it digitally for instant feedback at tendril.neural-forge.io/learn/quiz/end-data-deduplication
What is the main idea of "Deduplication: Why Repeats Hurt Models"?
Which concept is most central to "Deduplication: Why Repeats Hurt Models"?
What should a careful learner remember about "Why this is bad"?
You want to use AI after this lesson. What is the safest next step?
How should AI output about deduplication be treated?
Name one way to verify an AI answer about deduplication.