Benchmark Contamination
When the test questions quietly end up in the training data, scores lie. Here is how it happens and how to catch it.
Creators · AI Foundations · ~23 min read
The Test That Was On the Internet
Modern language models train on scraped web data. If a benchmark has been public for a few years, its test questions, and often the answers, have very likely been scraped into the training set. That is contamination, also called data leakage or train-test overlap.
How contamination happens
- 1. A benchmark paper is posted on arXiv with examples
- 2. Discussion of the benchmark appears in blog posts, Stack Overflow, forums
- 3. Training data scrapers pick up those pages
- 4. The model later sees identical or near-identical questions at evaluation time
Detection methods
- N-gram search: check whether the test prompt appears verbatim in the training data
- Membership inference: test whether the model behaves as if it has already seen the item
- Permutation tests: scramble the problem (for example, reorder the answer options); if the score drops sharply, the model was probably memorizing
- Canary strings: benchmark creators embed hidden phrases so that any copy of the benchmark in a training corpus can be detected
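The n-gram search can be sketched in a few lines of Python. This is a minimal illustration, not any lab's actual pipeline: the function names, the word-level tokenization, and the default window of 8 words are all assumptions chosen for readability.

```python
import re

def word_ngrams(text, n):
    """Return the set of lowercase word n-grams in text.

    Texts shorter than n words yield an empty set, so they are never flagged.
    """
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def contaminated_items(benchmark_items, training_docs, n=8):
    """Return benchmark items sharing at least one n-gram with any training doc."""
    train_grams = set()
    for doc in training_docs:
        train_grams |= word_ngrams(doc, n)
    return [item for item in benchmark_items
            if word_ngrams(item, n) & train_grams]

def contamination_rate(benchmark_items, training_docs, n=8):
    """Fraction of benchmark items flagged as contaminated (0.0 if no items)."""
    if not benchmark_items:
        return 0.0
    flagged = contaminated_items(benchmark_items, training_docs, n)
    return len(flagged) / len(benchmark_items)

# Toy data, invented for this example.
benchmark = [
    "What is the capital of France and which river flows through it?",
    "How many prime numbers are there below one hundred?",
]
training = [
    "Forum post: What is the capital of France and which river flows "
    "through it? Answer: Paris, the Seine.",
    "Unrelated text about cooking pasta with tomato sauce and basil leaves.",
]
rate = contamination_rate(benchmark, training)
```

A smaller `n` catches more paraphrased leaks but also flags common phrases; a larger `n` is stricter. Real corpora are far too large to hold in one Python set, so this only shows the logic.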
Mitigations
Compare the options
| Approach | How it helps | Limitation |
|---|---|---|
| Hold-out test sets | Never published publicly | Cannot be independently verified |
| Continuously refreshed benchmarks | New items added over time | Requires eval infrastructure |
| Expert-written tests (GPQA) | Designed to be Google-proof | Expensive to create |
| Benchmark decontamination | Filter training data | Only works if benchmark is known in advance |
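Benchmark decontamination from the last row can be sketched as a filter over training documents. This is a rough sketch under stated assumptions: the canary value `BENCHMARK-CANARY-0000` is made up for illustration, and the function names and 8-word window are hypothetical choices, not a standard.

```python
import re

# Hypothetical canary string; real benchmarks publish their own unique marker.
CANARY = "BENCHMARK-CANARY-0000"

def word_ngrams(text, n):
    """Return the set of lowercase word n-grams in text."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def decontaminate(training_docs, benchmark_items, canary=CANARY, n=8):
    """Keep only training docs with no canary and no n-gram shared with the benchmark."""
    bench_grams = set()
    for item in benchmark_items:
        bench_grams |= word_ngrams(item, n)
    kept = []
    for doc in training_docs:
        if canary in doc:
            continue  # document carries the benchmark's marker
        if word_ngrams(doc, n) & bench_grams:
            continue  # document quotes a benchmark item verbatim
        kept.append(doc)
    return kept

# Toy data, invented for this example.
questions = ["What is the capital of France and which river flows through it?"]
docs = [
    "Dataset dump BENCHMARK-CANARY-0000 with held-out test items inside.",
    "Quiz thread: what is the capital of France and which river flows "
    "through it? I think Paris.",
    "A recipe for sourdough bread with a long overnight fermentation step.",
]
clean_docs = decontaminate(docs, questions)
```

The filter can only remove what it is told to look for, which is the limitation the table names: a benchmark released after the training data was collected cannot be filtered out this way.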
“We evaluated the contamination of our evaluation sets and found that up to 15% of the benchmark instances appeared in an n-gram form in the training corpus.”
The big idea: contamination is the silent corruption of AI benchmarks. Newer, private, or expert-written tests are the more trustworthy anchors.
