Lesson 252 of 2116
Benchmark Contamination
When the test questions quietly end up in the training data, scores lie. Here is how it happens and how to catch it.
Lesson map
What this lesson covers
Learning path
The main moves in order
1. The Test That Was On the Internet
2. Contamination
3. Data leakage
4. Train-test overlap
Section 1
The Test That Was On the Internet
Modern language models train on scraped web data. If a benchmark has been public for a few years, the test questions — and often the answers — are almost certainly in the training set. That is contamination, also called data leakage or train-test overlap.
How contamination happens
1. A benchmark paper is posted on arXiv with example questions.
2. Discussion of the benchmark appears in blog posts, Stack Overflow threads, and forums.
3. Training-data scrapers pick up those pages.
4. The model later sees identical or near-identical questions at evaluation time.
Detection methods
- N-gram search: check whether the test prompt appears verbatim in the training data
- Membership inference: probe whether the model "recognizes" having seen the item before
- Permutation tests: scramble the problem; if the score drops sharply, the model was memorizing rather than solving
- Canary strings: benchmark creators embed unique hidden phrases so later contamination can be detected
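The first detection method above can be sketched in a few lines. This is a minimal, word-level version of an n-gram overlap check; real pipelines run over tokenized corpus shards, and the function names and the toy corpus here are illustrative assumptions, not any lab's actual tooling.

```python
# Sketch of n-gram contamination detection: what fraction of a test item's
# n-grams also appear somewhere in the training corpus?

def ngrams(text, n=8):
    """Return the set of word-level n-grams in `text`."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def overlap_fraction(test_item, training_corpus, n=8):
    """Fraction of the test item's n-grams found verbatim in the corpus."""
    item_grams = ngrams(test_item, n)
    if not item_grams:
        return 0.0
    corpus_grams = ngrams(training_corpus, n)
    hits = sum(1 for g in item_grams if g in corpus_grams)
    return hits / len(item_grams)

# Toy example: the test item appears verbatim inside the "training" text,
# so every 5-gram matches and the item is flagged as contaminated.
corpus = "the quick brown fox jumps over the lazy dog near the river bank"
item = "the quick brown fox jumps over the lazy dog"
print(overlap_fraction(item, corpus, n=5))  # → 1.0
```

A threshold on this fraction (or on the single longest shared n-gram) is what reports like the quote below typically mean by an item appearing "in an n-gram form" in the training corpus.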
Mitigations
Compare the options
| Approach | How it helps | Limitation |
|---|---|---|
| Hold-out test sets | Never published publicly | Cannot be independently verified |
| Continuously refreshed benchmarks | New items added over time | Requires eval infrastructure |
| Expert-written tests (GPQA) | Designed to be Google-proof | Expensive to create |
| Benchmark decontamination | Filter training data | Only works if benchmark is known in advance |
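The decontamination row in the table is the same n-gram idea run in reverse: instead of checking test items against training data after the fact, you filter training documents before training. A minimal sketch, assuming plain-text documents and a word-level 13-gram threshold (a common choice in published contamination analyses; the exact value and these function names are assumptions):

```python
# Sketch of benchmark decontamination: drop any training document that shares
# a long n-gram with a known benchmark item. Only works if the benchmark is
# known before training, which is the limitation noted in the table.

def word_ngrams(text, n):
    """Return the set of word-level n-grams in `text`."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def decontaminate(training_docs, benchmark_items, n=13):
    """Keep only training docs sharing no n-gram with any benchmark item."""
    blocked = set()
    for item in benchmark_items:
        blocked |= word_ngrams(item, n)
    return [doc for doc in training_docs
            if not (word_ngrams(doc, n) & blocked)]

# Toy example with a short n for readability: the first doc contains a
# benchmark 3-gram and is dropped; the second survives.
docs = ["alpha beta gamma delta", "one two three four"]
clean = decontaminate(docs, ["alpha beta gamma"], n=3)
print(clean)  # → ['one two three four']
```

Note the asymmetry with detection: detection can be done by anyone with access to the training data at any time, while decontamination has to happen before training and only covers benchmarks the trainer knew about.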
> “We evaluated the contamination of our evaluation sets... and found that up to 15% of the benchmark instances appeared in an n-gram form in the training corpus.”
The big idea: contamination is the silent corruption of AI benchmarks. Newer, private, or expert-written tests are the trustworthy anchors.
