Lesson 252 of 2116
Benchmark Contamination
When the test questions quietly end up in the training data, scores lie. Here is how it happens and how to catch it.
Lesson map
What this lesson covers
Learning path
The main moves in order
1. The Test That Was On the Internet
2. Contamination
3. Data leakage
4. Train-test overlap
Section 1
The Test That Was On the Internet
Modern language models train on scraped web data. If a benchmark has been public for a few years, the test questions — and often the answers — are almost certainly in the training set. That is contamination, also called data leakage or train-test overlap.
How contamination happens
1. A benchmark paper is posted on arXiv with example questions.
2. Discussion of the benchmark appears in blog posts, Stack Overflow threads, and forums.
3. Training-data scrapers pick up those pages.
4. The model later sees identical or near-identical questions at evaluation time.
Detection methods
- N-gram search: check whether the test prompt appears verbatim in the training data
- Membership inference: probe whether the model "recognizes" having seen the item before
- Permutation tests: scramble the problem; if the score drops sharply, the model was memorizing rather than solving
- Canary strings: benchmark creators embed unique hidden phrases so later contamination can be detected
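The first detection method above can be sketched in a few lines. This is a minimal, word-level version of an n-gram overlap check; real pipelines run over tokenized corpus shards, and the function names and the toy corpus here are illustrative assumptions, not any lab's actual tooling.

```python
# Sketch of n-gram contamination detection: what fraction of a test item's
# n-grams also appear somewhere in the training corpus?

def ngrams(text, n=8):
    """Return the set of word-level n-grams in `text`."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def overlap_fraction(test_item, training_corpus, n=8):
    """Fraction of the test item's n-grams found verbatim in the corpus."""
    item_grams = ngrams(test_item, n)
    if not item_grams:
        return 0.0
    corpus_grams = ngrams(training_corpus, n)
    hits = sum(1 for g in item_grams if g in corpus_grams)
    return hits / len(item_grams)

# Toy example: the test item appears verbatim inside the "training" text,
# so every 5-gram matches and the item is flagged as contaminated.
corpus = "the quick brown fox jumps over the lazy dog near the river bank"
item = "the quick brown fox jumps over the lazy dog"
print(overlap_fraction(item, corpus, n=5))  # → 1.0
```

A threshold on this fraction (or on the single longest shared n-gram) is what reports like the quote below typically mean by an item appearing "in an n-gram form" in the training corpus.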
Mitigations
Compare the options
| Approach | How it helps | Limitation |
|---|---|---|
| Hold-out test sets | Never published publicly | Cannot be independently verified |
| Continuously refreshed benchmarks | New items added over time | Requires eval infrastructure |
| Expert-written tests (GPQA) | Designed to be Google-proof | Expensive to create |
| Benchmark decontamination | Filter training data | Only works if benchmark is known in advance |
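The decontamination row in the table is the same n-gram idea run in reverse: instead of checking test items against training data after the fact, you filter training documents before training. A minimal sketch, assuming plain-text documents and a word-level 13-gram threshold (a common choice in published contamination analyses; the exact value and these function names are assumptions):

```python
# Sketch of benchmark decontamination: drop any training document that shares
# a long n-gram with a known benchmark item. Only works if the benchmark is
# known before training, which is the limitation noted in the table.

def word_ngrams(text, n):
    """Return the set of word-level n-grams in `text`."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def decontaminate(training_docs, benchmark_items, n=13):
    """Keep only training docs sharing no n-gram with any benchmark item."""
    blocked = set()
    for item in benchmark_items:
        blocked |= word_ngrams(item, n)
    return [doc for doc in training_docs
            if not (word_ngrams(doc, n) & blocked)]

# Toy example with a short n for readability: the first doc contains a
# benchmark 3-gram and is dropped; the second survives.
docs = ["alpha beta gamma delta", "one two three four"]
clean = decontaminate(docs, ["alpha beta gamma"], n=3)
print(clean)  # → ['one two three four']
```

Note the asymmetry with detection: detection can be done by anyone with access to the training data at any time, while decontamination has to happen before training and only covers benchmarks the trainer knew about.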
> “We evaluated the contamination of our evaluation sets... and found that up to 15% of the benchmark instances appeared in an n-gram form in the training corpus.”
The big idea: contamination is the silent corruption of AI benchmarks. Newer, private, or expert-written tests are the trustworthy anchors.
