Using one LLM to grade another is the cheapest human-like evaluation you can run. It is also full of traps.
LLM-as-judge (sometimes called LLM-as-a-Judge, or automated preference evaluation) uses a strong language model, often GPT-4 class or Claude Sonnet class, to score responses from other models. It has been the default approach to preference evaluation since 2023.
| Bias | Mitigation |
|---|---|
| Position bias | Randomize order, or run both orderings and average |
| Verbosity bias | Include length in the rubric; ask judge to ignore it |
| Self-preference | Use a judge from a different model family |
| Format bias | Strip formatting before judging |
| Inconsistency | Average over 3-5 judge runs |
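Two of the mitigations in the table, running both orderings and repeating the judge call, can be sketched in a few lines. This is a minimal illustration, not a reference implementation: `judge` is a hypothetical callable standing in for whatever model call you use, and two simplifications are assumed. A verdict that flips when the answers swap places is counted as a tie, and repeated runs are combined by majority vote rather than a numeric average.

```python
from collections import Counter
from typing import Callable

# Hypothetical judge signature: (question, answer_a, answer_b) -> "A", "B", or "TIE".
Judge = Callable[[str, str, str], str]

# Maps a verdict from the swapped ordering back to the original ordering.
FLIP = {"A": "B", "B": "A", "TIE": "TIE"}
NAMES = {"A": "first", "B": "second", "TIE": "TIE"}


def judge_both_orders(judge: Judge, question: str, first: str, second: str) -> str:
    """Judge the pair in both orders; return "first", "second", or "TIE"."""
    forward = judge(question, first, second)          # `first` shown as Answer A
    backward = FLIP[judge(question, second, first)]   # `first` shown as Answer B
    if forward == backward:
        return NAMES[forward]
    # The verdict changed with position, so no reliable preference is recorded.
    return "TIE"


def judge_repeated(judge: Judge, question: str, first: str, second: str,
                   runs: int = 3) -> str:
    """Repeat the order-swapped comparison and take a majority vote."""
    votes = Counter(
        judge_both_orders(judge, question, first, second) for _ in range(runs)
    )
    ranked = votes.most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return "TIE"  # no clear majority
    return ranked[0][0]
```

A judge that always answers "A" regardless of content comes out as "TIE" under `judge_both_orders`, which is the behavior you want from a position-bias check.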
A minimal LLM-as-judge prompt:

You are a careful evaluator. Given a user question and two answers A and B, decide which is more helpful, accurate, and honest.
Question: {{question}}
Answer A: {{answer_a}}
Answer B: {{answer_b}}
Ignore length and formatting. Output exactly one of: A, B, or TIE. Then give a one-sentence justification.

A judge prompt that tries to neutralize the obvious biases

We find that strong LLM judges like GPT-4 can match both controlled and crowdsourced human preferences well, achieving over 80% agreement.
— Zheng et al., MT-Bench paper (2023)
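To use the judge prompt shown earlier in code, you need to fill its `{{…}}` placeholders and read the verdict back out of the reply. The sketch below is one possible way to do that. The helper names `render_prompt` and `parse_verdict` are hypothetical, and the parser assumes the judge follows the instruction to lead with exactly `A`, `B`, or `TIE`.

```python
import re
from typing import Optional

JUDGE_TEMPLATE = (
    "You are a careful evaluator. Given a user question and two answers A and B, "
    "decide which is more helpful, accurate, and honest.\n"
    "Question: {{question}}\n"
    "Answer A: {{answer_a}}\n"
    "Answer B: {{answer_b}}\n"
    "Ignore length and formatting. Output exactly one of: A, B, or TIE. "
    "Then give a one-sentence justification."
)


def render_prompt(question: str, answer_a: str, answer_b: str) -> str:
    """Fill the template in a single pass over its placeholders."""
    fields = {"question": question, "answer_a": answer_a, "answer_b": answer_b}
    # One pass over the template only, so placeholder-like text inside an
    # answer is inserted verbatim and never substituted a second time.
    return re.sub(r"\{\{(\w+)\}\}", lambda m: fields[m.group(1)], JUDGE_TEMPLATE)


def parse_verdict(reply: str) -> Optional[str]:
    """Return "A", "B", or "TIE" if the reply starts with one, else None."""
    match = re.match(r"\s*(TIE|A|B)\s*(?:[.:,\n]|$)", reply)
    return match.group(1) if match else None
```

Returning `None` for a reply that does not start with a clean verdict lets the caller retry or flag the item for review instead of guessing.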
The big idea: LLM-as-judge is a powerful tool and a fragile tool. Use it with calibration, bias mitigations, and regular human spot-checks.
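A human spot-check can be as simple as comparing judge verdicts with human labels on a small sample and computing the agreement rate, the same kind of figure as the 80% quoted above. The function below is a minimal sketch. The name `agreement_rate` and the label values in the example are illustrative and do not come from any dataset.

```python
from typing import Sequence


def agreement_rate(judge_labels: Sequence[str], human_labels: Sequence[str]) -> float:
    """Fraction of items where the judge verdict equals the human verdict."""
    if len(judge_labels) != len(human_labels):
        raise ValueError("label lists must have the same length")
    if not judge_labels:
        raise ValueError("need at least one labeled item")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(judge_labels)


# Illustrative labels only: the judge and the human agree on 3 of 5 items.
judge = ["A", "B", "TIE", "A", "B"]
human = ["A", "B", "A", "A", "A"]
rate = agreement_rate(judge, human)
```

If the rate on your own sample falls well below what you expect, that is a signal to revisit the prompt or the choice of judge before trusting its scores at scale.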
8 questions · take it digitally for instant feedback at tendril.neural-forge.io/learn/quiz/end-creators-llm-as-judge
What is the main idea of "LLM-as-Judge: Promise and Pitfalls"?
Which concept is most central to "LLM-as-Judge: Promise and Pitfalls"?
Which use of AI fits this topic best?
What should a careful learner remember about "Never judge your own homework"?
You want to use AI after this lesson. What is the safest next step?
How should AI output about LLM-as-judge be treated?
Name one way to verify an AI answer about LLM-as-judge.
Which action would help you apply "LLM-as-Judge: Promise and Pitfalls" responsibly?