Deduplication: Why Repeats Hurt Models
If the same paragraph appears a million times in your training data, your model will memorize it. Deduplication quietly makes AI better.
Lesson map
The main moves, in order:
1. The Internet Repeats Itself
2. Deduplication
3. Memorization
4. MinHash
Section 1
The Internet Repeats Itself
The raw internet is full of copies. News articles get syndicated across hundreds of sites. License text appears in every open-source repo. A viral tweet gets quoted ten thousand times. If you train on raw Common Crawl without removing duplicates, your model sees the Wikipedia article on France not once, but a thousand times.
Research that proved it matters
A 2021 paper, Deduplicating Training Data Makes Language Models Better by Lee et al., showed that models trained on deduplicated data emit memorized training text about 10x less often, while matching or slightly improving on the quality of models trained on the raw data. Every major lab now deduplicates.
Flavors of duplication
Compare the options
| Type | Example | How to detect |
|---|---|---|
| Exact | The same file twice | Hash the bytes (see the sketch after this table) |
| Near | Two news sites with one word different | MinHash + Jaccard similarity |
| Near (paraphrased) | Article rewritten with synonyms | Embedding similarity |
| Semantic | Two explanations of the same concept | Harder, requires LLMs |
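For the exact case, a content hash is all the machinery you need. A minimal sketch; the sample documents and the keep-first-copy policy are illustrative choices:

```python
import hashlib

def content_hash(text):
    # Identical bytes produce identical digests; any change produces a new one.
    return hashlib.sha256(text.encode('utf-8')).hexdigest()

docs = ['same article', 'same article', 'a different article']
seen, unique = set(), []
for doc in docs:
    h = content_hash(doc)
    if h not in seen:  # keep only the first copy of each exact duplicate
        seen.add(h)
        unique.append(doc)

print(unique)  # ['same article', 'a different article']
```

Change one byte and the hash changes completely, so this catches only exact copies. The near and semantic rows need the heavier machinery below.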
MinHash, the workhorse
MinHash is a clever algorithm that estimates how similar two documents are without comparing their full contents. Break each document into n-grams (say, 5-word shingles), hash every shingle under many different hash functions, and keep only the minimum value from each function. The fraction of minimums two documents share estimates the Jaccard similarity of their shingle sets. Paired with locality-sensitive hashing (LSH), which buckets similar signatures together so you never compare every pair, it finds near-duplicates across a trillion-token dataset in hours instead of centuries.
MinHash-LSH deduplication in Python

```python
from datasketch import MinHash, MinHashLSH

def minhash_for(text):
    # Build a 128-permutation MinHash signature from the document's word set.
    # (Production pipelines hash word shingles, e.g. 5-grams, not single words.)
    mh = MinHash(num_perm=128)
    for word in set(text.lower().split()):
        mh.update(word.encode('utf-8'))
    return mh

# Index signatures so pairs above the estimated Jaccard threshold become candidates.
# Threshold 0.7 rather than 0.8: the two cat sentences below share 3 of 4 words,
# a true Jaccard of 0.75, which a 0.8 threshold could miss.
lsh = MinHashLSH(threshold=0.7, num_perm=128)

docs = ['cats are fuzzy', 'cats are quite fuzzy', 'dogs bark loudly']
for i, text in enumerate(docs):
    lsh.insert(f'doc_{i}', minhash_for(text))

# Find near-duplicates of doc 0: doc_0 itself and, in all likelihood, doc_1.
print(lsh.query(minhash_for(docs[0])))
```
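The snippet above hashes single words to keep the example tiny. A minimal sketch of the shingled variant the prose describes, still using datasketch; the 5-word window and the sample sentences are illustrative, not canon:

```python
from datasketch import MinHash

def shingled_minhash(text, k=5, num_perm=128):
    # Hash every k-word shingle instead of individual words, so word order matters.
    words = text.lower().split()
    shingles = {' '.join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}
    mh = MinHash(num_perm=num_perm)
    for s in shingles:
        mh.update(s.encode('utf-8'))
    return mh

a = shingled_minhash('news wires syndicate the same story to hundreds of outlets every day')
b = shingled_minhash('news wires syndicate the same story to thousands of outlets every day')
# Estimated Jaccard similarity of the two shingle sets. One changed word removes
# every shingle that overlaps it, so even this near-identical pair scores well
# below 1.0. That is why thresholds for shingled MinHash are tuned carefully.
print(a.jaccard(b))
```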
The big idea: more unique data beats more total data. Deduplication is one of the quietest, most effective steps in modern AI.
Related lessons
Keep going
- The Turing Test and Its Discontents (Builders · 22 min): The imitation game became famous, but most AI researchers now think it measures the wrong thing.
- NotebookLM: Turning Your Notes Into a Study Buddy (Builders · 28 min): Google's NotebookLM lets you upload textbooks, lectures, and notes, then chat with them. This is the most underrated study tool of 2026.
- Quality Filtering: Separating Signal From Noise (Builders · 28 min): The raw web is 99 percent garbage. Filtering it down to the 1 percent worth training on is one of the highest-leverage steps in modern AI.
