Elo Ratings for AI
Born in chess, now everywhere in AI evaluation. Learn why Elo works and where it quietly misleads.
Lesson map
What this lesson covers
Learning path
The main moves in order
1. A System From 1960 Chess
2. Elo
3. Pairwise comparison
4. Logistic curve
Section 1
A System From 1960 Chess
Arpad Elo developed his rating system for the US Chess Federation, which adopted it in 1960. The math is a logistic curve: the probability that player A beats player B is a smooth function of their rating difference. A gap of 400 points corresponds to an expected score of roughly 91 percent for the stronger player.
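That logistic relationship can be sketched in a few lines of Python (the function name and ratings here are illustrative, not from the lesson):

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score for A against B: logistic in the rating difference.

    Only the difference (rating_b - rating_a) matters, not the absolute
    ratings; 400 is the conventional scale factor.
    """
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

# A 400-point gap gives an expected score of about 0.91 for the stronger player:
print(round(expected_score(1800, 1400), 3))  # 0.909
```

Note that `expected_score(1800, 1400)` and `expected_score(2400, 2000)` return the same value: shifting both ratings by the same amount changes nothing.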
Key properties
- Only the difference matters, not the absolute rating
- Ratings update after every game, scaled by expectation
- Beating a stronger opponent earns more points
- Losses to weaker opponents cost more points
- Over many games, rating converges to a stable estimate
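The update rule behind these properties can be sketched as follows, assuming a fixed K-factor of 32 (a common choice; the lesson does not specify one):

```python
K = 32  # assumed K-factor; real systems vary it by rating and game count

def expected_score(rating_a: float, rating_b: float) -> float:
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def update(rating_a: float, rating_b: float, score_a: float):
    """Return new ratings after one game. score_a: 1 win, 0.5 draw, 0 loss."""
    e_a = expected_score(rating_a, rating_b)
    new_a = rating_a + K * (score_a - e_a)
    new_b = rating_b + K * ((1 - score_a) - (1 - e_a))
    return new_a, new_b

# An upset: the 1400-rated player beats the 1800-rated player.
# The winner gains a lot because the expected score was only ~0.09.
low, high = update(1400, 1800, 1.0)
print(round(low - 1400, 1))  # 29.1
```

The update is zero-sum: whatever the winner gains, the loser forfeits, so the pool's total rating is conserved.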
Where Elo breaks for AI
1. Skill is not one-dimensional: a model great at coding and bad at poetry cannot be summarized as one number.
2. Non-transitive preferences exist (A beats B, B beats C, C beats A), and Elo cannot represent them.
3. Ratings inflate as new strong models enter the pool.
4. Models that never played each other can only be compared indirectly.
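The non-transitivity problem is easy to demonstrate with a toy simulation (the setup is illustrative): three models in a perfect cycle, where A always beats B, B always beats C, and C always beats A. Each model wins exactly half its games, so Elo settles all three ratings near the starting value and the cycle becomes invisible in the leaderboard.

```python
K = 32  # assumed K-factor for this toy simulation

def expected(ra: float, rb: float) -> float:
    return 1.0 / (1.0 + 10 ** ((rb - ra) / 400))

ratings = {"A": 1500.0, "B": 1500.0, "C": 1500.0}

def play(winner: str, loser: str) -> None:
    """Apply one decisive game: winner scores 1, loser scores 0."""
    e = expected(ratings[winner], ratings[loser])
    ratings[winner] += K * (1 - e)
    ratings[loser] -= K * (1 - e)

# 1000 rounds of the rock-paper-scissors cycle.
for _ in range(1000):
    play("A", "B")
    play("B", "C")
    play("C", "A")

print({name: round(r) for name, r in ratings.items()})  # all three stay near 1500
```

The ratings oscillate in a narrow band around 1500 forever: a flat leaderboard that tells you nothing about which model to pick against a specific opponent.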
Compare the options
| Elo strength | Elo weakness |
|---|---|
| Simple to compute | Assumes single-dimensional skill |
| Updates online | Needs many games to stabilize |
| Human-interpretable | Ignores task differences |
| Widely familiar | Hides uncertainty in a single number |
“The rating system is not a moral judgment but a best-guess estimate of relative strength.”
The big idea: Elo is a compact, elegant way to rank competitors, but a single number hides a lot. Always look at the confidence interval and the per-category breakdown.
Related lessons
Keep going
Creators · 28 min
Log-Scale Thinking: When Linear Lies
Some things grow multiplicatively, not additively. Log scales reveal patterns that linear scales hide, especially for anything related to scale or growth.
Builders · 30 min
Tokens and Embeddings: How AI Reads Words
AI does not read letters. It reads tokens, which live as vectors in a space of meaning. Learn how text becomes numbers you can do math on.
Creators · 45 min
Uncertainty Quantification in LLMs
A model that says 'I am 95 percent sure' and is wrong 40 percent of the time is miscalibrated. Measuring that gap is uncertainty quantification.
