The world's most influential 'leaderboard' for AI is not a test — it is humans voting blindly. Here is how that works.
Chatbot Arena, run by LMSYS Org (now often branded LMArena), is a website where you type a prompt, two anonymous models respond, and you vote on which response is better. After millions of votes, a ranking emerges. It is harder to game than a fixed benchmark because there is no fixed test set to memorize: the prompts are whatever real people happen to ask.
Arena uses the Elo rating system from chess. Each model starts at 1000. When model A beats model B, A's score rises and B's falls, with the change scaled by how surprising the outcome was. Over millions of games, ratings converge to a stable ranking.
Simplified Elo update:
Expected(A vs B) = 1 / (1 + 10^((Rb - Ra)/400))
New Ra = Ra + K * (actual - expected)
K is usually 16-32. Beating a higher-rated opponent earns more points than beating a weaker one.
Elo rating in one paragraph of math

"We collect over 100,000 pairwise votes to analyze the strengths and weaknesses of various LLMs."
— Chiang et al., LMSYS Chatbot Arena paper (2024)
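The simplified Elo update described earlier can be sketched in a few lines of Python. This is an illustrative sketch, not Arena's actual code: the function names `expected_score` and `update_ratings` are hypothetical, K = 32 is just one value from the usual 16-32 range, and `score_a` is assumed to be 1 for a win, 0 for a loss, and 0.5 for a tie.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability-like expected score of A against B under Elo."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def update_ratings(rating_a: float, rating_b: float,
                   score_a: float, k: float = 32.0) -> tuple[float, float]:
    """Return the new (A, B) ratings after one vote.

    score_a is assumed to be 1.0 if A won, 0.0 if A lost, 0.5 for a tie.
    k is an assumed step size; the lesson gives 16-32 as the usual range.
    """
    expected_a = expected_score(rating_a, rating_b)
    # The more surprising the result, the larger the change.
    delta = k * (score_a - expected_a)
    # Whatever A gains, B loses, so the total rating stays constant.
    return rating_a + delta, rating_b - delta


# Two evenly matched models at the starting rating of 1000: the winner gains K/2.
even_a, even_b = update_ratings(1000, 1000, score_a=1.0)

# An upset: a 1000-rated model beats a 1200-rated one and gains more than K/2.
upset_a, upset_b = update_ratings(1000, 1200, score_a=1.0)

print(even_a, even_b)
print(round(upset_a, 1), round(upset_b, 1))
```

With equal ratings the expected score is 0.5, so the winner moves by exactly half of K. In the upset case the expected score for the weaker model is well below 0.5, which is why the same win is worth more points.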
The big idea: Arena measures what people like, not what is true. That makes it an excellent signal for chat assistants and a poor one for correctness-critical work.
8 questions · take it digitally for instant feedback at tendril.neural-forge.io/learn/quiz/end-creators-chatbot-arena
What is the main idea of "How Chatbot Arena Works"?
Which concept is most central to "How Chatbot Arena Works"?
Which use of AI fits this topic best?
What should a careful learner remember about "Categories matter"?
You want to use AI after this lesson. What is the safest next step?
How should AI output about Chatbot Arena be treated?
Name one way to verify an AI answer about Chatbot Arena.
Which action would help you apply "How Chatbot Arena Works" responsibly?