Lesson 257 of 2116
LLM-as-Judge: Promise and Pitfalls
Using one LLM to grade another is the cheapest human-like evaluation you can run. It is also full of traps.
Lesson map
What this lesson covers
Learning path
The main moves in order
1. A Cheap Oracle
2. LLM-as-judge
3. Grader
4. Position bias
Section 1
A Cheap Oracle
LLM-as-judge (sometimes called LLM-as-a-Judge, or automated preference evaluation) uses a strong language model — often GPT-4 class or Claude Sonnet class — to score responses from other models. It has become the default for preference evaluation since 2023.
Where it works well
- Agrees with human preferences roughly 80-85 percent of the time on most open-ended tasks
- Scales to thousands of comparisons for pennies
- Consistent — same prompt, same judgment
- Reproducible in research papers
Known biases
1. Position bias: judges often prefer the response shown first
2. Verbosity bias: longer responses are rated higher, even when worse
3. Self-preference: GPT-4 prefers GPT-4 outputs; Claude prefers Claude
4. Authority bias: responses starting with 'As an expert...' score higher
5. Format bias: markdown, headers, and bullet points raise scores
Mitigations that help
Compare the options
| Bias | Mitigation |
|---|---|
| Position bias | Randomize order, or run both orderings and average |
| Verbosity bias | Include length in the rubric; ask judge to ignore it |
| Self-preference | Use a judge from a different model family |
| Format bias | Strip formatting before judging |
| Inconsistency | Average over 3-5 judge runs |
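The position-bias mitigation in the first row of the table can be sketched as a small wrapper. This is a sketch under assumptions: `judge_fn` is a hypothetical callable that takes a question and two answers in presentation order and returns "A", "B", or "TIE" relative to that order.

```python
def swap_verdict(verdict):
    """Map a verdict back through a swapped presentation order."""
    return {"A": "B", "B": "A", "TIE": "TIE"}[verdict]

def judge_both_orders(judge_fn, question, answer_a, answer_b):
    """Mitigate position bias by running the judge in both orderings.

    `judge_fn(question, first, second)` is a hypothetical interface; it
    returns "A", "B", or "TIE" relative to the order it was shown.
    """
    forward = judge_fn(question, answer_a, answer_b)
    # In the swapped run, a verdict of "A" actually refers to answer_b,
    # so map it back before comparing.
    backward = swap_verdict(judge_fn(question, answer_b, answer_a))
    if forward == backward:
        return forward
    return "TIE"  # the orderings disagree, so the judge was position-sensitive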
A judge prompt that tries to neutralize the obvious biases
A minimal LLM-as-judge prompt:
You are a careful evaluator.
Given a user question and two answers A and B,
decide which is more helpful, accurate, and honest.
Question: {{question}}
Answer A: {{answer_a}}
Answer B: {{answer_b}}
Ignore length and formatting.
Output exactly one of: A, B, or TIE.
Then give a one-sentence justification.

> "We find that strong LLM judges like GPT-4 can match both controlled and crowdsourced human preferences well, achieving over 80% agreement." (Zheng et al., 2023, "Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena")
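The prompt above can be turned into a reusable helper: fill the template, then parse the judge's reply back into a verdict. This is a minimal sketch, not a definitive implementation; the template mirrors the lesson's prompt (with Python-style `{question}` placeholders instead of `{{question}}`), and the parsing convention of reading the first token is an assumption about how the judge model formats its reply.

```python
JUDGE_TEMPLATE = """You are a careful evaluator.
Given a user question and two answers A and B,
decide which is more helpful, accurate, and honest.

Question: {question}
Answer A: {answer_a}
Answer B: {answer_b}

Ignore length and formatting.
Output exactly one of: A, B, or TIE.
Then give a one-sentence justification."""

def build_judge_prompt(question, answer_a, answer_b):
    """Fill the judge template with one question and two candidate answers."""
    return JUDGE_TEMPLATE.format(
        question=question, answer_a=answer_a, answer_b=answer_b
    )

def parse_verdict(judge_output):
    """Extract A/B/TIE from the first token of the judge's reply.

    Returns None for malformed output so the caller can retry or discard
    the run instead of silently miscounting it.
    """
    tokens = judge_output.strip().split()
    if not tokens:
        return None
    first = tokens[0].strip(".,:").upper()
    return first if first in {"A", "B", "TIE"} else None
```

Returning `None` on malformed output matters in practice: judges occasionally ignore the "output exactly one of" instruction, and dropping those runs is safer than guessing.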
The big idea: LLM-as-judge is a powerful tool and a fragile tool. Use it with calibration, bias mitigations, and regular human spot-checks.
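The "regular human spot-checks" part can be made concrete: periodically hand-label a small sample and measure how often the judge agrees with the human labels. A minimal sketch, assuming verdicts are the "A"/"B"/"TIE" strings used above:

```python
def agreement_rate(judge_verdicts, human_verdicts):
    """Fraction of spot-checked items where the LLM judge matches a human.

    Calibration sketch: hand-label a sample of items, then track this
    rate over time; a sustained drop is a signal to re-examine the judge
    prompt or model before trusting its aggregate numbers.
    """
    if len(judge_verdicts) != len(human_verdicts):
        raise ValueError("verdict lists must be the same length")
    matches = sum(j == h for j, h in zip(judge_verdicts, human_verdicts))
    return matches / len(judge_verdicts)
```

The ~80 percent agreement figures quoted earlier in the lesson give a rough baseline for what "healthy" looks like on open-ended tasks.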
Related lessons
- What Is Intelligence, Really? A Working Framework: Before we can judge whether an AI is intelligent, we need a framework for what intelligence even means. Draw on Chollet, Dennett, and modern evals.
- The Economics and Ethics of Training Data: Data is the strategic asset of AI. Understand the supply chain, the legal fight, and the philosophical stakes before you build anything on top.
- Emergence, Capability Forecasting, and Safety: Emergent abilities make AI both more exciting and more dangerous. How do labs forecast what the next model will do, and what happens when they are wrong?
