When the test questions quietly end up in the training data, scores lie. Here is how it happens and how to catch it.
Modern language models train on scraped web data. If a benchmark has been public for a few years, its test questions, and often the answers, are very likely somewhere in that training data. When that happens, a high score can reflect memorization rather than ability. This is contamination, also called data leakage or train-test overlap.
| Approach | How it helps | Limitation |
|---|---|---|
| Hold-out test sets | Test items are never published | Results cannot be independently verified |
| Continuously refreshed benchmarks | New items are added over time | Requires ongoing eval infrastructure |
| Expert-written tests (e.g. GPQA) | Designed to be Google-proof | Expensive to create |
| Benchmark decontamination | Benchmark items are filtered out of the training data | Only works if the benchmark is known in advance |
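The decontamination row can be made concrete with a minimal sketch. It assumes a simple word-level n-gram match: any training document that shares at least one n-gram with a known benchmark item is dropped. The function names `ngrams` and `decontaminate`, the tokenizer, and the window size `n=8` are illustrative assumptions, not a description of any particular lab's pipeline.

```python
import re


def ngrams(text, n):
    """Return the set of lowercase word n-grams in text (hypothetical helper)."""
    tokens = re.findall(r"\w+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def decontaminate(train_docs, benchmark_items, n=8):
    """Keep only training documents that share no n-gram with the benchmark.

    n=8 is an assumed window size; real pipelines choose their own.
    """
    banned = set()
    for item in benchmark_items:
        banned |= ngrams(item, n)
    return [doc for doc in train_docs if ngrams(doc, n).isdisjoint(banned)]


benchmark = ["What is the capital city of France and when was it founded?"]
train = [
    "Forum post: what is the capital city of France and when was it founded? Answer: Paris.",
    "A completely unrelated document about growing tomatoes in a small garden at home.",
]
clean = decontaminate(train, benchmark)
```

As the table notes, this only helps when the benchmark text is available at filtering time. Exact n-gram matching would also miss paraphrased or translated copies of a question.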
"We evaluated the contamination of our evaluation sets... and found that up to 15% of the benchmark instances appeared in an n-gram form in the training corpus."
— Touvron et al., Llama 2 technical report (2023)
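A rough sketch of how an "appeared in an n-gram form" rate might be measured follows. It is a simplified illustration, not the method of the cited report: an item counts as contaminated when at least a chosen fraction of its tokens fall inside an n-gram that also occurs in the training corpus. The names `tokenize`, `contaminated_fraction`, and `contamination_rate`, and the parameters `n` and `threshold`, are all assumptions made for this example.

```python
import re


def tokenize(text):
    """Lowercase word tokenizer (an assumption; real setups use model tokenizers)."""
    return re.findall(r"\w+", text.lower())


def contaminated_fraction(item, corpus_ngrams, n):
    """Fraction of the item's tokens covered by an n-gram seen in the corpus."""
    tokens = tokenize(item)
    covered = [False] * len(tokens)
    for i in range(len(tokens) - n + 1):
        if tuple(tokens[i:i + n]) in corpus_ngrams:
            for j in range(i, i + n):
                covered[j] = True
    return sum(covered) / len(tokens) if tokens else 0.0


def contamination_rate(benchmark_items, train_docs, n=8, threshold=0.5):
    """Share of benchmark items whose covered fraction reaches the threshold."""
    corpus_ngrams = set()
    for doc in train_docs:
        toks = tokenize(doc)
        corpus_ngrams.update(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))
    flagged = [
        item for item in benchmark_items
        if contaminated_fraction(item, corpus_ngrams, n) >= threshold
    ]
    return len(flagged) / len(benchmark_items) if benchmark_items else 0.0


items = [
    "the quick brown fox jumps over the lazy dog today",
    "an entirely different question about prime numbers and their sums",
]
corpus = ["someone wrote the quick brown fox jumps over the lazy dog today online"]
rate = contamination_rate(items, corpus, n=4)
```

Here one of the two toy items appears verbatim in the toy corpus, so `rate` is 0.5. Both `n` and `threshold` trade false positives against misses, and a reported contamination percentage depends heavily on those choices.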
The big idea: contamination silently corrupts AI benchmarks. Newer, private, or expert-written tests are the more trustworthy anchors.
15 questions · take it digitally for instant feedback at tendril.neural-forge.io/learn/quiz/end-creators-benchmark-contamination
What is the core idea behind "Benchmark Contamination"?
Which term best describes a foundational idea in "Benchmark Contamination"?
A learner studying Benchmark Contamination would need to understand which concept?
Which of these is directly relevant to Benchmark Contamination?
Which of the following is a key point about Benchmark Contamination?
Which of these does NOT belong in a discussion of Benchmark Contamination?
Which statement is accurate regarding Benchmark Contamination?
Which of these does NOT belong in a discussion of Benchmark Contamination?
What is the key insight about "Even 'private' tests leak" in the context of Benchmark Contamination?
What is the key insight about "The honest report" in the context of Benchmark Contamination?
What is the recommended tip about "Ground your practice in fundamentals" in the context of Benchmark Contamination?
Which statement accurately describes an aspect of Benchmark Contamination?
What does working with Benchmark Contamination typically involve?
Which best describes the scope of "Benchmark Contamination"?
Which section heading best belongs in a lesson about Benchmark Contamination?