Lesson 257 of 2116
LLM-as-Judge: Promise and Pitfalls
Using one LLM to grade another is the cheapest human-like evaluation you can run. It is also full of traps.
Lesson map
What this lesson covers
Learning path
The main moves in order
1. A Cheap Oracle
2. LLM-as-judge
3. Grader
4. Position bias
Section 1
A Cheap Oracle
LLM-as-judge (sometimes called LLM-as-a-Judge, or automated preference evaluation) uses a strong language model — often GPT-4 class or Claude Sonnet class — to score responses from other models. It has become the default for preference evaluation since 2023.
Where it works well
- Agrees with human preferences roughly 80-85 percent of the time on most open-ended tasks
- Scales to thousands of comparisons for pennies
- Consistent — same prompt, same judgment
- Reproducible in research papers
Known biases
1. Position bias: judges often prefer the response shown first
2. Verbosity bias: longer responses are rated higher, even when worse
3. Self-preference: GPT-4 prefers GPT-4 outputs; Claude prefers Claude
4. Authority bias: responses starting with 'As an expert...' score higher
5. Format bias: markdown, headers, and bullet points raise scores
Mitigations that help
Compare the options
| Bias | Mitigation |
|---|---|
| Position bias | Randomize order, or run both orderings and average |
| Verbosity bias | Include length in the rubric; ask judge to ignore it |
| Self-preference | Use a judge from a different model family |
| Format bias | Strip formatting before judging |
| Inconsistency | Average over 3-5 judge runs |
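The position-bias mitigation in the first row of the table can be sketched as a small wrapper. This is a sketch under assumptions: `judge_fn` is a hypothetical callable that takes a question and two answers in presentation order and returns "A", "B", or "TIE" relative to that order.

```python
def swap_verdict(verdict):
    """Map a verdict back through a swapped presentation order."""
    return {"A": "B", "B": "A", "TIE": "TIE"}[verdict]

def judge_both_orders(judge_fn, question, answer_a, answer_b):
    """Mitigate position bias by running the judge in both orderings.

    `judge_fn(question, first, second)` is a hypothetical interface; it
    returns "A", "B", or "TIE" relative to the order it was shown.
    """
    forward = judge_fn(question, answer_a, answer_b)
    # In the swapped run, a verdict of "A" actually refers to answer_b,
    # so map it back before comparing.
    backward = swap_verdict(judge_fn(question, answer_b, answer_a))
    if forward == backward:
        return forward
    return "TIE"  # the orderings disagree, so the judge was position-sensitive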
A judge prompt that tries to neutralize the obvious biases
A minimal LLM-as-judge prompt:
You are a careful evaluator.
Given a user question and two answers A and B,
decide which is more helpful, accurate, and honest.
Question: {{question}}
Answer A: {{answer_a}}
Answer B: {{answer_b}}
Ignore length and formatting.
Output exactly one of: A, B, or TIE.
Then give a one-sentence justification.

> "We find that strong LLM judges like GPT-4 can match both controlled and crowdsourced human preferences well, achieving over 80% agreement." (Zheng et al., 2023, "Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena")
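The prompt above can be turned into a reusable helper: fill the template, then parse the judge's reply back into a verdict. This is a minimal sketch, not a definitive implementation; the template mirrors the lesson's prompt (with Python-style `{question}` placeholders instead of `{{question}}`), and the parsing convention of reading the first token is an assumption about how the judge model formats its reply.

```python
JUDGE_TEMPLATE = """You are a careful evaluator.
Given a user question and two answers A and B,
decide which is more helpful, accurate, and honest.

Question: {question}
Answer A: {answer_a}
Answer B: {answer_b}

Ignore length and formatting.
Output exactly one of: A, B, or TIE.
Then give a one-sentence justification."""

def build_judge_prompt(question, answer_a, answer_b):
    """Fill the judge template with one question and two candidate answers."""
    return JUDGE_TEMPLATE.format(
        question=question, answer_a=answer_a, answer_b=answer_b
    )

def parse_verdict(judge_output):
    """Extract A/B/TIE from the first token of the judge's reply.

    Returns None for malformed output so the caller can retry or discard
    the run instead of silently miscounting it.
    """
    tokens = judge_output.strip().split()
    if not tokens:
        return None
    first = tokens[0].strip(".,:").upper()
    return first if first in {"A", "B", "TIE"} else None
```

Returning `None` on malformed output matters in practice: judges occasionally ignore the "output exactly one of" instruction, and dropping those runs is safer than guessing.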
The big idea: LLM-as-judge is a powerful tool and a fragile tool. Use it with calibration, bias mitigations, and regular human spot-checks.
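The "regular human spot-checks" part can be made concrete: periodically hand-label a small sample and measure how often the judge agrees with the human labels. A minimal sketch, assuming verdicts are the "A"/"B"/"TIE" strings used above:

```python
def agreement_rate(judge_verdicts, human_verdicts):
    """Fraction of spot-checked items where the LLM judge matches a human.

    Calibration sketch: hand-label a sample of items, then track this
    rate over time; a sustained drop is a signal to re-examine the judge
    prompt or model before trusting its aggregate numbers.
    """
    if len(judge_verdicts) != len(human_verdicts):
        raise ValueError("verdict lists must be the same length")
    matches = sum(j == h for j, h in zip(judge_verdicts, human_verdicts))
    return matches / len(judge_verdicts)
```

The ~80 percent agreement figures quoted earlier in the lesson give a rough baseline for what "healthy" looks like on open-ended tasks.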
Related lessons
- What Is Intelligence, Really? A Working Framework: Before we can judge whether an AI is intelligent, we need a framework for what intelligence even means. Draw on Chollet, Dennett, and modern evals.
- The Economics and Ethics of Training Data: Data is the strategic asset of AI. Understand the supply chain, the legal fight, and the philosophical stakes before you build anything on top.
- Emergence, Capability Forecasting, and Safety: Emergent abilities make AI both more exciting and more dangerous. How do labs forecast what the next model will do, and what happens when they are wrong?
