Lesson 6 of 1570
Benchmarks, Leaderboards, and Their Limits
Every new model claims a new high score. Before you trust a leaderboard, learn what benchmarks actually measure — and what they miss.
Section 1
Why Benchmarks Exist
A benchmark is a standardized test: a fixed set of questions, scored the same way for every model. The point is to make different models comparable. If every company used its own secret evaluation, you could never check a claim like “30 percent better.”
Popular benchmarks you will see
| Name | What it tests |
|---|---|
| MMLU | Broad knowledge across subjects like history, law, STEM |
| HumanEval | Coding correctness on short programming problems |
| GSM8K | Grade-school math word problems |
| HellaSwag | Commonsense sentence completion |
| ARC | Grade-school science questions that require reasoning |
| GPQA | Graduate-level science questions written by domain experts |
The contamination problem
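Benchmark questions are published online, so they often end up in training data by accident. A model that has seen a test's answer key during training can score artificially high without being any more capable. This is called data contamination, and it is hard to detect from the outside, because training data is rarely disclosed in full.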
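One common heuristic for spotting contamination is to look for long word sequences that a benchmark question shares with training text. The sketch below is a minimal illustration of that idea, not a real detector: the function names, the 8-word window, and the sample question are all assumptions made for this example, and it presumes you can search the training text at all, which outsiders usually cannot.

```python
import re


def ngrams(text: str, n: int = 8) -> set[tuple[str, ...]]:
    """Return the set of word-level n-grams in text (lowercased, punctuation dropped)."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    # range() is empty when the text has fewer than n words, so the result is an empty set.
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def overlap_fraction(question: str, corpus_docs: list[str], n: int = 8) -> float:
    """Fraction of the question's n-grams that also appear in any corpus document."""
    question_grams = ngrams(question, n)
    if not question_grams:
        return 0.0
    corpus_grams: set[tuple[str, ...]] = set()
    for doc in corpus_docs:
        corpus_grams |= ngrams(doc, n)
    return len(question_grams & corpus_grams) / len(question_grams)


# Hypothetical benchmark question and two hypothetical training documents.
question = ("A farmer has 12 hens and each hen lays 3 eggs per day. "
            "How many eggs does the farmer collect in a week?")
leaked_doc = "Homework help thread: " + question + " The answer is 252."
clean_doc = "Hens need fresh water and about sixteen hours of light to lay well."

print(overlap_fraction(question, [leaked_doc]))  # high overlap suggests a leak
print(overlap_fraction(question, [clean_doc]))   # no shared 8-word sequences
```

A high fraction is a warning sign, not proof. Paraphrased or translated copies of a question slip past exact matching, which is one reason contamination is so hard to rule out.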
What benchmarks cannot capture
- Long conversations across many turns
- Style, tone, and voice
- Judgment about when to refuse
- Usefulness on your actual tasks
- Cost, latency, and real-world reliability
Better evaluation approaches
1. Private, held-out tests the model has never seen
2. Real user task completion rates, not synthetic problems
3. Side-by-side human preference comparisons
4. Adversarial probes looking for specific weaknesses
5. Domain-specific evals tied to your own use case
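Two of the approaches above, a private held-out test and side-by-side preference comparisons, come down to simple arithmetic once you have the data. The sketch below shows one way to score them. Everything in it is illustrative: `accuracy`, `win_rate`, the toy model, and the test cases are hypothetical names and data, and exact string matching is a deliberately crude grading rule that real evals often replace with something more forgiving.

```python
from typing import Callable


def accuracy(model: Callable[[str], str], cases: list[tuple[str, str]]) -> float:
    """Share of held-out cases where the model's answer matches the expected one."""
    if not cases:
        raise ValueError("need at least one test case")
    hits = sum(
        model(prompt).strip().lower() == expected.strip().lower()
        for prompt, expected in cases
    )
    return hits / len(cases)


def win_rate(votes: list[str]) -> float:
    """Share of side-by-side votes won by model A, counting each tie as half a win."""
    if not votes:
        raise ValueError("need at least one vote")
    score = sum(1.0 if v == "A" else 0.5 if v == "tie" else 0.0 for v in votes)
    return score / len(votes)


# Hypothetical private test set: prompts you wrote yourself and never published.
private_cases = [
    ("Refund window for annual plans?", "30 days"),
    ("Which plan includes SSO?", "Enterprise"),
]


def toy_model(prompt: str) -> str:
    """Stand-in for a real model call; it only knows one answer."""
    return {"Refund window for annual plans?": "30 days"}.get(prompt, "I don't know")


print(accuracy(toy_model, private_cases))   # 0.5
print(win_rate(["A", "B", "tie", "A"]))     # 0.625
```

With only a handful of cases or votes, these numbers swing a lot from run to run, so treat small differences between models as noise until you have more data.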
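“When a measure becomes a target, it ceases to be a good measure.” This is Goodhart's law, and it applies to leaderboards: once a score becomes the goal, models get tuned to the test rather than to the skill the test was meant to measure.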
The big idea: benchmarks are a starting point, not a verdict. Treat scores like product packaging — useful, but never a substitute for trying the thing yourself on a job you actually care about.
