Using one LLM to grade another is the cheapest human-like evaluation you can run. It is also full of traps.
LLM-as-judge (sometimes called LLM-as-a-Judge, or automated preference evaluation) uses a strong language model — often GPT-4 class or Claude Sonnet class — to score responses from other models. It has become the default for preference evaluation since 2023.
| Bias | Mitigation |
|---|---|
| Position bias | Randomize order, or run both orderings and average |
| Verbosity bias | Include length in the rubric; ask judge to ignore it |
| Self-preference | Use a judge from a different model family |
| Format bias | Strip formatting before judging |
| Inconsistency | Average over 3-5 judge runs |
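Two of these mitigations (order randomization and averaging over judge runs) are mechanical enough to sketch in code. A minimal Python sketch, assuming a caller-supplied `judge_fn(question, first, second)` that wraps whatever judge model you use and returns `"A"`, `"B"`, or `"TIE"` for the pair *as shown* (names are illustrative, not from any particular library):

```python
import random
from collections import Counter

def judge_pair(question, answer_a, answer_b, judge_fn, n_runs=5, seed=0):
    """Mitigate position bias and inconsistency: run the judge n_runs
    times, showing the answers in a random order each run, then take
    a majority vote. judge_fn must return "A", "B", or "TIE", where
    "A" means the *first* answer shown is preferred."""
    rng = random.Random(seed)
    votes = Counter()
    for _ in range(n_runs):
        swapped = rng.random() < 0.5
        first, second = (answer_b, answer_a) if swapped else (answer_a, answer_b)
        verdict = judge_fn(question, first, second)
        if verdict == "TIE":
            votes["TIE"] += 1
        elif swapped:
            # Map the positional verdict back to the original labels.
            votes["B" if verdict == "A" else "A"] += 1
        else:
            votes[verdict] += 1
    return votes.most_common(1)[0][0]
```

Because each run independently randomizes order, a judge with pure position bias splits its votes roughly evenly instead of systematically favoring one answer.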
A minimal LLM-as-judge prompt:

```text
You are a careful evaluator.
Given a user question and two answers A and B,
decide which is more helpful, accurate, and honest.

Question: {{question}}
Answer A: {{answer_a}}
Answer B: {{answer_b}}

Ignore length and formatting.
Output exactly one of: A, B, or TIE.
Then give a one-sentence justification.
```

*A judge prompt that tries to neutralize the obvious biases.*

> We find that strong LLM judges like GPT-4 can match both controlled and crowdsourced human preferences well, achieving over 80% agreement.
>
> — Zheng et al., MT-Bench paper (2023)
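The judge's reply still has to be turned into a verdict, and judges do not always obey "output exactly one of". A defensive parsing sketch (the `{question}` placeholders below use Python `str.format` syntax instead of the `{{…}}` style shown above, and `parse_verdict` is a hypothetical helper, not part of any library):

```python
JUDGE_PROMPT = """You are a careful evaluator.
Given a user question and two answers A and B,
decide which is more helpful, accurate, and honest.

Question: {question}
Answer A: {answer_a}
Answer B: {answer_b}

Ignore length and formatting.
Output exactly one of: A, B, or TIE.
Then give a one-sentence justification."""

def parse_verdict(judge_output: str) -> str:
    """Take the first token of the judge's reply as the verdict.
    Anything unexpected is treated as TIE rather than crashing,
    so a single malformed reply cannot skew a win-rate tally."""
    stripped = judge_output.strip()
    first = stripped.split()[0].rstrip(".,:") if stripped else ""
    return first if first in {"A", "B", "TIE"} else "TIE"
```

Treating unparseable output as a tie is one design choice; logging and re-asking the judge is another.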
The big idea: LLM-as-judge is powerful but fragile. Use it with calibration, bias mitigations, and regular human spot-checks.
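The human spot-checks can be quantified as a simple agreement rate between judge verdicts and human labels on a shared sample. A minimal sketch (the function name and the 0.8 threshold are illustrative; the 80% figure comes from the Zheng et al. quote above):

```python
def agreement_rate(judge_labels, human_labels):
    """Fraction of items where the LLM judge's verdict ("A"/"B"/"TIE")
    matches the human label on the same item. Zheng et al. report over
    80% agreement for strong judges; a spot-check score well below that
    suggests the judge setup needs recalibration."""
    if len(judge_labels) != len(human_labels):
        raise ValueError("label lists must be the same length")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(judge_labels)
```

Re-running this on a small freshly labeled sample every few evaluation cycles catches judge drift before it silently corrupts your leaderboards.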
15 questions · take it digitally for instant feedback at tendril.neural-forge.io/learn/quiz/end-creators-llm-as-judge
What is the core idea behind "LLM-as-Judge: Promise and Pitfalls"?
Which term best describes a foundational idea in "LLM-as-Judge: Promise and Pitfalls"?
A learner studying LLM-as-Judge: Promise and Pitfalls would need to understand which concept?
Which of these is directly relevant to LLM-as-Judge: Promise and Pitfalls?
Which of the following is a key point about LLM-as-Judge: Promise and Pitfalls?
Which of these does NOT belong in a discussion of LLM-as-Judge: Promise and Pitfalls?
Which statement is accurate regarding LLM-as-Judge: Promise and Pitfalls?
What is the key insight about "Never judge your own homework" in the context of LLM-as-Judge: Promise and Pitfalls?
What is the key insight about "Calibration with humans" in the context of LLM-as-Judge: Promise and Pitfalls?
What is the recommended tip about "Ground your practice in fundamentals" in the context of LLM-as-Judge: Promise and Pitfalls?
Which statement accurately describes an aspect of LLM-as-Judge: Promise and Pitfalls?
What does working with LLM-as-Judge: Promise and Pitfalls typically involve?
Which best describes the scope of "LLM-as-Judge: Promise and Pitfalls"?
Which section heading best belongs in a lesson about LLM-as-Judge: Promise and Pitfalls?