Deduplication: Why Repeats Hurt Models

If the same paragraph appears a million times in your training data, your model will memorize it. Deduplication quietly makes AI better.

25 min · Reviewed 2026

The Internet Repeats Itself

The raw internet is full of copies. News articles get syndicated across hundreds of sites. License text appears in every open-source repo. A viral tweet gets quoted ten thousand times. If you train on raw Common Crawl without removing duplicates, your model may see the same popular page, such as the Wikipedia article on France, many times over instead of once.

Research that proved it matters

A 2021 paper, Deduplicating Training Data Makes Language Models Better by Lee et al., showed that models trained on deduplicated data emit memorized training text about ten times less often and reach the same or better accuracy in fewer training steps. Deduplication is now a standard step when building large training sets.

Two flavors of duplication

Type	Example	How to detect
Exact	The same file twice	Hash the bytes
Near	Two news sites with one word different	MinHash + Jaccard similarity
Near (paraphrased)	Article rewritten with synonyms	Embedding similarity
Semantic	Two explanations of the same concept	Harder, requires LLMs
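
The first two rows of the table can be sketched with the Python standard library. This is an illustration rather than a production pipeline: the helper names (exact_key, shingles, jaccard) and the 3-word shingle size are assumptions chosen to keep the tiny example readable.

```python
import hashlib

def exact_key(text: str) -> str:
    """Fingerprint for exact-duplicate detection: hash the raw bytes."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def shingles(text: str, n: int = 3) -> set:
    """Set of overlapping n-word shingles (n=3 is an arbitrary choice here)."""
    words = text.lower().split()
    if len(words) < n:
        return {tuple(words)}
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard(a: set, b: set) -> float:
    """Jaccard similarity: size of the intersection over size of the union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

a = "the quick brown fox jumps over the lazy dog"
b = "the quick brown fox leaps over the lazy dog"

# Exact hashing misses the pair: one changed word changes the whole hash.
print(exact_key(a) == exact_key(b))        # False
# Shingle overlap still shows that the two texts share content.
print(jaccard(shingles(a), shingles(b)))   # 0.4
```

A single changed word costs this nine-word sentence more than half of its shingles. In a full-length article the same change would barely move the score, which is why near-duplicate detection is usually run on whole documents with a similarity threshold.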

MinHash, the workhorse

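MinHash is a clever algorithm that estimates the Jaccard similarity of two documents from small fixed-size signatures instead of the full text. Break each document into n-grams (say, 5-word shingles), hash them with many different hash functions, keep the minimum value under each function, and compare the two lists of minimums: the fraction of positions that match estimates the similarity. Paired with locality-sensitive hashing (LSH), which only compares documents whose signatures land in the same bucket, it finds near-duplicates across very large datasets without comparing every pair of documents.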
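
To make those steps concrete, here is a from-scratch sketch using only the standard library. It is not how any particular library implements MinHash: using seeded blake2b digests as the family of hash functions, the signature length of 128, and the function names are all assumptions made for illustration.

```python
import hashlib

NUM_HASHES = 128  # signature length; an assumed value for this sketch

def shingles(text, n=5):
    """Set of overlapping n-word shingles, joined back into strings."""
    words = text.lower().split()
    if len(words) < n:
        return {" ".join(words)}
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def _hash(seed, shingle):
    """One member of a family of hash functions, selected by seed."""
    data = f"{seed}:{shingle}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")

def signature(text):
    """MinHash signature: the minimum hash value under each hash function."""
    sh = shingles(text)
    return [min(_hash(seed, s) for s in sh) for seed in range(NUM_HASHES)]

def estimate_similarity(sig_a, sig_b):
    """Fraction of positions where the minimums agree; estimates Jaccard."""
    matches = sum(x == y for x, y in zip(sig_a, sig_b))
    return matches / len(sig_a)

doc_a = ("news articles get syndicated across hundreds of sites and "
         "license text appears in every open source repo")
doc_b = ("news articles get syndicated across thousands of sites and "
         "license text appears in every open source repo")
doc_c = "a viral tweet gets quoted ten thousand times on the raw internet"

sig_a, sig_b, sig_c = signature(doc_a), signature(doc_b), signature(doc_c)
print(estimate_similarity(sig_a, sig_b))  # near-duplicate pair
print(estimate_similarity(sig_a, sig_c))  # unrelated pair
```

The true Jaccard similarity of the first pair is 8/18, about 0.44, and the estimate should land near that value; more hash functions tighten the estimate at the cost of a longer signature.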

from datasketch import MinHash, MinHashLSH

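def minhash_for(text):
    # Build a 128-value MinHash signature for one document.
    mh = MinHash(num_perm=128)
    # Toy example: hash individual words. For real documents,
    # hash word n-grams (shingles) instead.
    for word in set(text.lower().split()):
        mh.update(word.encode('utf-8'))
    return mh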

lsh = MinHashLSH(threshold=0.8, num_perm=128)
docs = ['cats are fuzzy', 'cats are quite fuzzy', 'dogs bark loudly']

for i, text in enumerate(docs):
    lsh.insert(f'doc_{i}', minhash_for(text))

# Find near-duplicates of doc 0
print(lsh.query(minhash_for(docs[0])))

MinHash-LSH deduplication in Python
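
The threshold in the example above is approximate because LSH splits each signature into bands and treats two documents as candidates when any one band matches exactly. The sketch below evaluates the standard banding formula, 1 - (1 - s^r)^b, for two hypothetical ways of splitting a 128-value signature; the split that datasketch actually picks for a given threshold may differ.

```python
def candidate_probability(s: float, bands: int, rows: int) -> float:
    """Probability that two documents with Jaccard similarity s
    agree on every row of at least one band."""
    return 1.0 - (1.0 - s ** rows) ** bands

def approx_threshold(bands: int, rows: int) -> float:
    """Rule-of-thumb similarity at which the probability curve rises steeply."""
    return (1.0 / bands) ** (1.0 / rows)

# Two hypothetical splits of a 128-value signature (bands * rows == 128).
for bands, rows in [(32, 4), (16, 8)]:
    print(f"bands={bands}, rows={rows}, "
          f"approx threshold={approx_threshold(bands, rows):.2f}")
    for s in (0.3, 0.5, 0.75, 0.9):
        p = candidate_probability(s, bands, rows)
        print(f"  similarity {s:.2f} -> candidate probability {p:.3f}")
```

More rows per band push the threshold higher and make the cutoff sharper. Pairs whose similarity sits close to the threshold, like doc 0 and doc 1 in the toy example (word-level Jaccard 0.75 against a threshold of 0.8), may or may not be returned by a query.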

The big idea: more unique data beats more total data. Deduplication is one of the quietest, most effective steps in modern AI.

End-of-lesson check

15 questions · take it digitally for instant feedback at tendril.neural-forge.io/learn/quiz/end-data-deduplication

What is the core idea behind "Deduplication: Why Repeats Hurt Models"?
1. If the same paragraph appears a million times in your training data, your model will memorize it. Deduplication quietly makes AI better.
2. GPTBot (OpenAI, launched August 2023) — respects robots.txt
3. If it is real, ask whether your analysis goal can tolerate extreme values
4. Reconciling different vocabularies (USA vs. US vs. United States)
Which term best describes a foundational idea in "Deduplication: Why Repeats Hurt Models"?
1. MinHash
2. deduplication
3. LSH
4. memorization
A learner studying Deduplication: Why Repeats Hurt Models would need to understand which concept?
1. deduplication
2. LSH
3. MinHash
4. memorization
Which of these is directly relevant to Deduplication: Why Repeats Hurt Models?
1. deduplication
2. MinHash
3. memorization
4. LSH
What is the key insight about "Why this is bad" in the context of Deduplication: Why Repeats Hurt Models?
1. Duplicated training data teaches models to memorize exact strings.
2. GPTBot (OpenAI, launched August 2023) — respects robots.txt
3. If it is real, ask whether your analysis goal can tolerate extreme values
4. Reconciling different vocabularies (USA vs. US vs. United States)
What is the recommended tip about "Build your mental model" in the context of Deduplication: Why Repeats Hurt Models?
1. GPTBot (OpenAI, launched August 2023) — respects robots.txt
2. AI isn't magic — it's pattern recognition at scale. The more you understand how it works, the more effectively you can u…
3. If it is real, ask whether your analysis goal can tolerate extreme values
4. Reconciling different vocabularies (USA vs. US vs. United States)
Which statement accurately describes an aspect of Deduplication: Why Repeats Hurt Models?
1. GPTBot (OpenAI, launched August 2023) — respects robots.txt
2. If it is real, ask whether your analysis goal can tolerate extreme values
3. The raw internet is full of copies. News articles get syndicated across hundreds of sites. License text appears in every open-source repo.
4. Reconciling different vocabularies (USA vs. US vs. United States)
What does working with Deduplication: Why Repeats Hurt Models typically involve?
1. GPTBot (OpenAI, launched August 2023) — respects robots.txt
2. If it is real, ask whether your analysis goal can tolerate extreme values
3. Reconciling different vocabularies (USA vs. US vs. United States)
4. A 2021 paper, Deduplicating Training Data Makes Language Models Better by Lee et al.
Which of the following is true about Deduplication: Why Repeats Hurt Models?
1. MinHash is a clever algorithm that estimates the Jaccard similarity of two documents from small fixed-size signatures.
2. GPTBot (OpenAI, launched August 2023) — respects robots.txt
3. If it is real, ask whether your analysis goal can tolerate extreme values
4. Reconciling different vocabularies (USA vs. US vs. United States)
Which best describes the scope of "Deduplication: Why Repeats Hurt Models"?
1. It is unrelated to foundations workflows
2. It focuses on If the same paragraph appears a million times in your training data, your model will memorize it. De
3. It applies only to the opposite beginner tier
4. It was deprecated in 2024 and no longer relevant
Which section heading best belongs in a lesson about Deduplication: Why Repeats Hurt Models?
1. GPTBot (OpenAI, launched August 2023) — respects robots.txt
2. If it is real, ask whether your analysis goal can tolerate extreme values
3. Research that proved it matters
4. Reconciling different vocabularies (USA vs. US vs. United States)
Which section heading best belongs in a lesson about Deduplication: Why Repeats Hurt Models?
1. GPTBot (OpenAI, launched August 2023) — respects robots.txt
2. If it is real, ask whether your analysis goal can tolerate extreme values
3. Reconciling different vocabularies (USA vs. US vs. United States)
4. Two flavors of duplication
Which section heading best belongs in a lesson about Deduplication: Why Repeats Hurt Models?
1. MinHash, the workhorse
2. GPTBot (OpenAI, launched August 2023) — respects robots.txt
3. If it is real, ask whether your analysis goal can tolerate extreme values
4. Reconciling different vocabularies (USA vs. US vs. United States)
Which of the following is a concept covered in Deduplication: Why Repeats Hurt Models?
1. MinHash
2. deduplication
3. LSH
4. memorization
Which of the following is a concept covered in Deduplication: Why Repeats Hurt Models?
1. deduplication
2. LSH
3. MinHash
4. memorization

Tendril · Builders · AI Foundations