Why You Should Not Trust the Leaderboard
Leaderboards are compelling, and they can also be deeply misleading. This lesson gives you a checklist for reading them with real skepticism.
Section 1
The Leaderboard Illusion
A clean numerical ranking feels like truth. In reality, leaderboards hide a stack of choices that can swing the ordering: prompt wording, sampling settings, number of attempts, and which subset of the benchmark is reported.
Seven ways leaderboards mislead
1. Different prompt templates across models
2. Best-of-N sampling inflating single-shot numbers
3. Selecting only the subsets favorable to your model
4. Different answer-extraction methods
5. Temperature settings tuned per model
6. Excluded failure cases ('we removed examples that caused timeouts')
7. Non-identical test splits across papers
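To get a feel for how much best-of-N sampling alone can move a score, here is a minimal Python sketch. It assumes every attempt succeeds independently with the same probability and that an item counts as solved if any attempt is correct (pass@k-style scoring). The function name `best_of_n_success` and the 40% single-attempt rate are illustrative assumptions, not figures from any real leaderboard.

```python
def best_of_n_success(p_single: float, n: int) -> float:
    """Chance that at least one of n independent attempts succeeds.

    Assumes each attempt has the same success probability p_single
    and that attempts are independent (a simplification).
    """
    if not 0.0 <= p_single <= 1.0:
        raise ValueError("p_single must be between 0 and 1")
    if n < 1:
        raise ValueError("n must be at least 1")
    # Probability that all n attempts fail is (1 - p)^n.
    return 1.0 - (1.0 - p_single) ** n


if __name__ == "__main__":
    # Hypothetical model that solves 40% of items in a single attempt.
    for n in (1, 5, 10):
        print(f"best-of-{n}: {best_of_n_success(0.40, n):.3f}")
```

Under these assumptions, the same model goes from 40% with one attempt to roughly 92% with five, which is why a best-of-N number cannot be compared with a competitor's single-shot number.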
A pre-flight checklist
- What is the prompt template?
- How many shots (0-shot, 5-shot), and was chain-of-thought prompting used?
- Is this pass@1 or best-of-N?
- What sampling temperature?
- Is the full benchmark reported, or a subset?
- Was the same protocol used for competitors?
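One way to make the checklist mechanical is to write down each reported result's protocol and diff the two. The sketch below is a hypothetical record format in Python, not a standard schema: the `EvalProtocol` class, its field names, and the example values are all assumptions chosen to mirror the questions above.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class EvalProtocol:
    """Hypothetical record of how a benchmark score was produced."""
    prompt_template: str
    num_shots: int
    chain_of_thought: bool
    attempts: int        # 1 means pass@1; more than 1 means best-of-N
    temperature: float
    subset: str          # "full", or the name of the reported subset


def protocol_mismatches(a: EvalProtocol, b: EvalProtocol) -> list[str]:
    """Return the names of the fields where the two protocols differ."""
    return [
        f.name
        for f in fields(EvalProtocol)
        if getattr(a, f.name) != getattr(b, f.name)
    ]


if __name__ == "__main__":
    # Illustrative values only; not taken from any real paper.
    ours = EvalProtocol("template-A", 5, True, 1, 0.0, "full")
    theirs = EvalProtocol("template-A", 0, True, 8, 0.0, "full")
    print(protocol_mismatches(ours, theirs))
```

An empty list means the two numbers were produced under the same recorded protocol. Any non-empty list names the settings that have to be reconciled before the scores can be compared.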
The Arena caveat
Even dynamic leaderboards like LMArena have issues. Style bias, category coverage, user demographics, and rating compression near the top all distort the picture. Arena is still the best we have for subjective quality, but it remains imperfect.
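Rating compression is easier to see with numbers. The sketch below uses the textbook Elo expected-score formula with a 400-point scale. That is an assumption for illustration: Arena's actual rating computation may differ, and the ratings shown are made up.

```python
def expected_win_rate(rating_a: float, rating_b: float) -> float:
    """Textbook Elo expected score for A against B (400-point scale).

    This is the generic Elo formula, used here only as an approximation
    of how rating gaps translate into head-to-head preferences.
    """
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))


if __name__ == "__main__":
    # Hypothetical ratings: a 10-point gap versus a 100-point gap.
    print(f"{expected_win_rate(1310, 1300):.3f}")
    print(f"{expected_win_rate(1400, 1300):.3f}")
```

Under this formula a 10-point lead corresponds to being preferred only about 51% of the time, so two models separated by a few points near the top are close to a coin flip, even though one is ranked above the other.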
“Every benchmark is a map, not the territory. You drive the territory, not the map.”
The big idea: leaderboards are starting points for inquiry, not verdicts. The more confident the ranking looks, the more skeptical you should be.
