Elo Ratings for AI
Born in chess, now everywhere in AI evaluation. Learn why Elo works and where it quietly misleads.
Creators · AI Foundations · ~19 min read
A System From 1960 Chess
Arpad Elo devised his rating system for the US Chess Federation, which adopted it in 1960. The math is a logistic curve: the probability that player A beats player B is a smooth function of their rating difference. A 400-point gap corresponds to an expected score of roughly 91 percent for the stronger player.
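The logistic curve described above can be written out in a few lines. This is a minimal sketch using the standard Elo convention (base 10, scale 400); the function name `expected_score` is chosen here for illustration and does not come from the lesson.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score of A against B under the standard Elo logistic curve.

    A draw counts as half a point, so this is an expected score rather
    than a pure win probability.
    """
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))


# A 400-point gap gives the stronger player about 0.909.
print(round(expected_score(1600, 1200), 3))
# Only the difference matters: the same gap at a higher level gives the same value.
print(round(expected_score(2600, 2200), 3))
# Equal ratings give 0.5.
print(round(expected_score(1500, 1500), 3))
```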
Key properties
- Only the difference matters, not the absolute rating
- Ratings update after every game, in proportion to how far the result departed from the expected score
- Beating a stronger opponent earns more points
- Losses to weaker opponents cost more points
- Over many games, rating converges to a stable estimate
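The update behaviour in the list above can be sketched as follows. The step size `k=32` is an assumption made for this example (different rating pools use different values), and the function names are illustrative rather than taken from any particular library.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score of A against B under the standard Elo logistic curve."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))


def update(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Return the new (rating_a, rating_b) after one game.

    score_a is 1.0 for an A win, 0.5 for a draw, 0.0 for an A loss.
    k is an assumed step size, not a value given in the lesson.
    """
    delta = k * (score_a - expected_score(rating_a, rating_b))
    # Whatever A gains, B loses, so the total rating in the pool is conserved.
    return rating_a + delta, rating_b - delta


# The underdog (1400) beats the favorite (1600): a large transfer.
upset_a, upset_b = update(1400, 1600, 1.0)
# The favorite (1600) beats the underdog (1400): a small transfer.
fav_a, fav_b = update(1600, 1400, 1.0)

print(round(upset_a - 1400, 1), round(fav_a - 1600, 1))
```

With these inputs the upset moves about 24 points while the expected result moves about 8, which is the asymmetry the bullets describe.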
Where Elo breaks for AI
- Skill is not one-dimensional: a model that is great at coding and bad at poetry cannot be summarized as one number
- Non-transitive preferences exist (A beats B, B beats C, C beats A), and Elo cannot represent them
- Ratings can inflate as new strong models enter the pool
- Models that never played each other can only be compared indirectly, through shared opponents
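The non-transitivity point can be illustrated with a toy simulation. This is a sketch under stated assumptions: three hypothetical players `A`, `B`, and `C` in a strict rock-paper-scissors cycle, with an assumed step size `k=32` and starting rating 1500. Because every rating is a single number, any ordering Elo produces is transitive, so the cycle has nowhere to go.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score of A against B under the standard Elo logistic curve."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))


def play_cycle(rounds: int = 200, k: float = 32.0, start: float = 1500.0):
    """Feed Elo a strict cycle: A always beats B, B beats C, C beats A."""
    ratings = {"A": start, "B": start, "C": start}
    cycle = [("A", "B"), ("B", "C"), ("C", "A")]  # (winner, loser) pairs
    for _ in range(rounds):
        for winner, loser in cycle:
            delta = k * (1.0 - expected_score(ratings[winner], ratings[loser]))
            ratings[winner] += delta
            ratings[loser] -= delta
    return ratings


ratings = play_cycle()
for name, value in sorted(ratings.items()):
    print(name, round(value, 1))
```

In this setup the three ratings should stay bunched near the starting value, so Elo predicts something close to a coin flip for matchups that are, by construction, completely one-sided.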
Strengths and weaknesses at a glance
| Elo strength | Elo weakness |
|---|---|
| Simple to compute | Assumes single-dimensional skill |
| Updates online | Needs many games to stabilize |
| Human-interpretable | Ignores task differences |
| Widely familiar | Hides uncertainty in a single number |
“The rating system is not a moral judgment but a best-guess estimate of relative strength.”
The big idea: Elo is a compact, elegant way to rank competitors, but a single number hides a lot. Always look at the confidence interval and the per-category breakdown.
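One way to put an interval around a rating is to bootstrap the match log: resample the games with replacement, refit, and read off percentiles. The sketch below is illustrative only. The match data is a made-up toy record for two hypothetical models, `model_x` and `model_y`, and the step size `k=16`, the 200 resampling rounds, and the function names are all assumptions made for this example.

```python
import random


def expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score of A against B under the standard Elo logistic curve."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))


def fit_elo(games, k: float = 16.0, base: float = 1500.0):
    """Run sequential Elo updates over a list of (winner, loser) pairs."""
    ratings = {}
    for winner, loser in games:
        rw = ratings.setdefault(winner, base)
        rl = ratings.setdefault(loser, base)
        delta = k * (1.0 - expected_score(rw, rl))
        ratings[winner] = rw + delta
        ratings[loser] = rl - delta
    return ratings


def bootstrap_interval(games, name, rounds: int = 200, seed: int = 0):
    """Approximate 95 percent interval for one player's rating."""
    rng = random.Random(seed)
    samples = []
    for _ in range(rounds):
        # Resample the same number of games, with replacement.
        resampled = [rng.choice(games) for _ in games]
        samples.append(fit_elo(resampled).get(name, 1500.0))
    samples.sort()
    return samples[int(0.025 * rounds)], samples[int(0.975 * rounds) - 1]


# Toy record (fabricated for illustration): model_x wins 14 of 20 games.
games = [("model_x", "model_y")] * 14 + [("model_y", "model_x")] * 6
low, high = bootstrap_interval(games, "model_x")
print(round(low, 1), round(high, 1))
```

With only 20 games the interval is typically wide, which is exactly the uncertainty a single headline number hides.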
