Benchmark Saturation
Why the benchmark that was state-of-the-art three years ago is now useless — and what that teaches about measuring AI.
Section 1
When Every Model Scores 99
A benchmark is saturated when the top models cluster near the ceiling (the highest achievable score, often taken to be human performance) and further gains no longer show up in the headline number. Saturation is often a sign of real progress, but it also means the benchmark can no longer separate strong models from one another.
A timeline of saturation
| Benchmark | Launched | Saturated by | Why |
|---|---|---|---|
| ImageNet | 2009 | 2017 | Deep CNNs surpassed human-level accuracy |
| SQuAD 1.1 | 2016 | 2018 | BERT-class models matched human F1 |
| GLUE | 2018 | 2019 | Models passed the human baseline within about a year; SuperGLUE replaced it |
| HumanEval | 2021 | 2024 | Frontier models exceed 90% pass@1 |
| MMLU | 2020 | 2024 | Frontier models above 85-90% |
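To make the timeline easier to compare, the short sketch below computes how many years each benchmark lasted between launch and saturation. The years are copied from the table above; the names `TIMELINE` and `years_to_saturation` are illustrative, not part of any library.

```python
# (benchmark, launch year, year saturated), copied from the table above
TIMELINE = [
    ("ImageNet", 2009, 2017),
    ("SQuAD 1.1", 2016, 2018),
    ("GLUE", 2018, 2019),
    ("HumanEval", 2021, 2024),
    ("MMLU", 2020, 2024),
]


def years_to_saturation(timeline):
    """Return {benchmark: years between launch and saturation}."""
    return {name: saturated - launched for name, launched, saturated in timeline}


lifespans = years_to_saturation(TIMELINE)
for name, years in lifespans.items():
    print(f"{name}: {years} years")
```

By these dates ImageNet stayed useful for eight years, while every later benchmark in the table lasted four years or fewer.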
Three ways benchmarks saturate
1. Ceiling effect: the task is actually solved
2. Contamination: answers leak into training data
3. Overfitting: models optimized specifically for this benchmark
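Contamination is the one cause in this list that can be probed fairly directly. One common heuristic is to look for long word sequences that a benchmark item shares with training text. The sketch below is a minimal version of that idea, assuming whitespace tokenization and a default window of 8 words; both are illustrative choices, and `ngrams` and `overlap_fraction` are hypothetical helper names rather than an established API.

```python
def ngrams(text, n):
    """Return the set of n-word sequences in text (lowercased, split on whitespace)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_fraction(benchmark_item, corpus_docs, n=8):
    """Fraction of the item's n-grams that also appear somewhere in corpus_docs.

    The window size n is an assumption made for illustration.
    """
    item_grams = ngrams(benchmark_item, n)
    if not item_grams:
        # Item is shorter than n words, so there is nothing to compare.
        return 0.0
    corpus_grams = set()
    for doc in corpus_docs:
        corpus_grams |= ngrams(doc, n)
    return len(item_grams & corpus_grams) / len(item_grams)


# Toy example with a short window so the overlap is visible.
item = "the quick brown fox jumps"
corpus = ["he said the quick brown fox ran"]
print(overlap_fraction(item, corpus, n=3))
```

A high fraction suggests the item, or something close to it, appeared in the corpus. A low fraction does not prove the item is clean, since paraphrased or translated copies would slip past an exact-match check like this one.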
How to tell something is saturated
- Top models all within 1-2 points of each other
- The remaining errors are mostly label noise in the dataset itself rather than genuine model mistakes
- Rankings become sensitive to prompt wording rather than model ability
- Freshly human-written variants of the same tasks are much harder for the same models
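The first sign in this list can be checked with a little arithmetic: compare the gap between the top scores with the sampling noise expected from a test set of that size. The sketch below is one rough way to do this, assuming accuracy scores in percent, a binomial standard error, and models treated as independent. That last assumption is a simplification, since models scored on the same items are correlated and a paired test would be more appropriate. The function names, the 2-point threshold, and the example scores are all illustrative.

```python
import math


def accuracy_standard_error(accuracy_pct, n_items):
    """Binomial standard error of an accuracy score, in percentage points."""
    p = accuracy_pct / 100.0
    return 100.0 * math.sqrt(p * (1.0 - p) / n_items)


def saturation_report(scores_pct, n_items, top_k=3, max_spread=2.0):
    """Compare the spread of the top_k scores with a rough noise estimate."""
    top = sorted(scores_pct, reverse=True)[:top_k]
    spread = top[0] - top[-1]
    # Standard error of the difference between two independent scores
    # near the best score (independence is an assumption of this sketch).
    se_diff = math.sqrt(2.0) * accuracy_standard_error(top[0], n_items)
    return {
        "spread": spread,
        "clustered": spread <= max_spread,          # within 1-2 points of each other
        "within_noise": spread <= 2.0 * se_diff,    # gap is about the size of the noise
    }


# Made-up scores for four models on a hypothetical 500-item test set.
report = saturation_report([98.9, 98.4, 97.9, 91.0], n_items=500)
print(report)
```

When both flags are true, the ranking among the top models says little about which one is actually better, which is the practical meaning of saturation.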
“Benchmarks should be treated as disposable diagnostics, not as enduring definitions of progress.”
The big idea: saturation is the endgame of every benchmark. The art is in picking harder, fresher tests before the numbers become meaningless.