When the test questions quietly end up in the training data, scores lie. Here is how it happens and how to catch it.
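Modern language models train on scraped web data. If a benchmark has been public for a few years, its test questions, and often the answers, are very likely somewhere in that training data. That is contamination, also called data leakage or train-test overlap. A contaminated model may have seen the test before taking it, so a high score can reflect memorization rather than ability.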
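One common way to look for train-test overlap is to check whether word sequences from a benchmark item also appear in the training text. The sketch below is a minimal illustration, not any lab's actual pipeline: the function names `word_ngrams` and `contamination_rate`, the toy data, and the choice of n are all assumptions made for the example.

```python
import re


def word_ngrams(text, n):
    """Return the set of word-level n-grams in text (lowercased, punctuation dropped)."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def contamination_rate(benchmark_items, training_docs, n=8):
    """Return (fraction flagged, flagged items).

    An item is flagged if it shares at least one n-gram with any training doc.
    n=8 is an illustrative default, not a standard.
    """
    if not benchmark_items:
        return 0.0, []
    train_grams = set()
    for doc in training_docs:
        train_grams |= word_ngrams(doc, n)
    flagged = [item for item in benchmark_items if word_ngrams(item, n) & train_grams]
    return len(flagged) / len(benchmark_items), flagged


# Toy data, invented for this example.
benchmark = [
    "What is the capital city of France and when was it founded?",
    "Which planet has the most moons in the solar system?",
]
training = [
    "Forum post: what is the capital city of France and when was it founded? Answer: Paris.",
    "A recipe for sourdough bread with a long cold ferment.",
]

rate, flagged = contamination_rate(benchmark, training, n=5)
print(f"{rate:.0%} of items flagged")  # prints "50% of items flagged"
```

Exact n-gram matching only catches near-verbatim copies. A paraphrased or translated question would pass this check unflagged, so a low rate here is weak evidence of a clean test set.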
| Approach | How it helps | Limitation |
|---|---|---|
| Hold-out test sets | Test items are never released publicly, so they cannot be scraped | Results cannot be independently verified |
| Continuously refreshed benchmarks | New items are added over time, after existing models were trained | Requires ongoing eval infrastructure |
| Expert-written tests (e.g., GPQA) | Designed to be Google-proof | Expensive to create |
| Benchmark decontamination | Filters benchmark items out of the training data | Only works if the benchmark is known in advance |
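The decontamination row can be made concrete with a small filter that drops any training document sharing a long word sequence with a known benchmark item. This is a sketch under stated assumptions: `decontaminate` and `word_ngrams` are hypothetical helper names, the data is invented, and real pipelines differ in tokenization and thresholds.

```python
import re


def word_ngrams(text, n):
    """Return the set of word-level n-grams in text (lowercased, punctuation dropped)."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def decontaminate(training_docs, benchmark_items, n=8):
    """Split training docs into (kept, dropped).

    A doc is dropped if it shares any n-gram with a known benchmark item.
    n=8 is an illustrative default, not a standard.
    """
    bench_grams = set()
    for item in benchmark_items:
        bench_grams |= word_ngrams(item, n)
    kept, dropped = [], []
    for doc in training_docs:
        if word_ngrams(doc, n) & bench_grams:
            dropped.append(doc)
        else:
            kept.append(doc)
    return kept, dropped


# Toy data, invented for this example.
known_benchmark = ["Which planet has the most moons in the solar system?"]
scraped_docs = [
    "Quiz dump: which planet has the most moons in the solar system? Saturn.",
    "Notes on planetary formation and orbital resonance.",
]

kept, dropped = decontaminate(scraped_docs, known_benchmark, n=5)
print(len(kept), "kept,", len(dropped), "dropped")  # prints "1 kept, 1 dropped"
```

The filter can only remove items from benchmarks it is given, which is the limitation in the table: a benchmark published after the training data was filtered, or one the filter's authors never listed, is not protected.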
We evaluated the contamination of our evaluation sets and found that up to 15% of the benchmark instances appeared in an n-gram form in the training corpus.
— Touvron et al., Llama 2 technical report (2023)
The big idea: contamination is the silent corruption of AI benchmarks. Newer, private, or expert-written tests are the more trustworthy anchors, though each comes with its own limitation.
8 questions · take it digitally for instant feedback at tendril.neural-forge.io/learn/quiz/end-creators-benchmark-contamination
What is the main idea of "Benchmark Contamination"?
Which concept is most central to "Benchmark Contamination"?
Which use of AI fits this topic best?
What should a careful learner remember about "Even 'private' tests leak"?
You want to use AI after this lesson. What is the safest next step?
How should AI output about contamination be treated?
Name one way to verify an AI answer about contamination.
Which action would help you apply "Benchmark Contamination" responsibly?