Human Evaluation 101
Automatic metrics miss a lot. Humans catch what metrics cannot. Here is how to run a simple human eval.
What this lesson covers
1. When Humans Are the Measuring Stick
2. Human evaluation
3. Rubrics
4. Inter-rater agreement
Section 1: When Humans Are the Measuring Stick
For judging quality, creativity, and helpfulness, there is no substitute for human eyes. Human evaluation is expensive and slow — but when done well, it is the gold standard.
The minimum viable human eval
1. Write a clear rubric (what 'good' means)
2. Collect 30-50 prompts representative of real use
3. Generate responses from the models you want to compare
4. Show pairs blindly (see the sketch after this list); ask which is better and why
5. Have at least two raters; measure agreement
6. Report with confidence intervals
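Steps 4 and 6 are the easiest to get wrong, so here is a minimal Python sketch of both. Every name in it (`make_blind_pairs`, `a_win_outcomes`, `bootstrap_ci`) is illustrative rather than from any library: it randomizes which model appears on the left, recovers model A's wins from the raters' votes, and bootstraps a 95% confidence interval for the win rate.

```python
import random
import statistics

def make_blind_pairs(prompts, responses_a, responses_b, seed=0):
    """Randomize which model appears on the left so raters can't learn a pattern."""
    rng = random.Random(seed)
    pairs = []
    for prompt, resp_a, resp_b in zip(prompts, responses_a, responses_b):
        a_on_left = rng.random() < 0.5
        left, right = (resp_a, resp_b) if a_on_left else (resp_b, resp_a)
        pairs.append({"prompt": prompt, "left": left, "right": right,
                      "a_on_left": a_on_left})
    return pairs

def a_win_outcomes(pairs, votes):
    """votes[i] is 'left' or 'right': the side the rater preferred for pair i.
    Returns a list of 1 (model A won) and 0 (model A lost)."""
    return [int((vote == "left") == pair["a_on_left"])
            for pair, vote in zip(pairs, votes)]

def bootstrap_ci(outcomes, n_resamples=10_000, seed=0):
    """95% bootstrap confidence interval for the mean of 0/1 outcomes."""
    rng = random.Random(seed)
    means = sorted(statistics.mean(rng.choices(outcomes, k=len(outcomes)))
                   for _ in range(n_resamples))
    return means[round(0.025 * n_resamples)], means[round(0.975 * n_resamples)]
```

Report the win rate together with the interval; with only 30-50 prompts the interval will be wide, and that width is exactly what tells you whether the difference you see is real.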
Inter-rater agreement
If your two raters disagree on half the items, your rubric is broken or your task is noisy. Cohen's Kappa is a common agreement metric: above 0.6 is acceptable; above 0.8 is strong.
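Kappa corrects raw agreement for chance: kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e is the agreement two raters would reach by luck given how often each uses each label. Here is a minimal sketch (our own helper, not a library API, though `sklearn.metrics.cohen_kappa_score` computes the same thing):

```python
from collections import Counter

def cohens_kappa(rater1, rater2):
    """Cohen's kappa = (p_o - p_e) / (1 - p_e) for two raters over the same items."""
    n = len(rater1)
    p_o = sum(a == b for a, b in zip(rater1, rater2)) / n  # observed agreement
    c1, c2 = Counter(rater1), Counter(rater2)
    # chance agreement from each rater's marginal label frequencies
    p_e = sum(c1[label] * c2[label] for label in c1.keys() | c2.keys()) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Two raters' pairwise preferences on ten items
rater1 = ["A", "A", "B", "A", "B", "B", "A", "A", "B", "A"]
rater2 = ["A", "B", "B", "A", "B", "B", "A", "A", "A", "A"]
print(round(cohens_kappa(rater1, rater2), 2))  # 0.57: just under the 0.6 bar
```

If kappa comes out low, read the disagreements before blaming the raters: most often they point to an ambiguity in the rubric.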
Cost and ethics
- Amazon Mechanical Turk is cheap but quality varies; Prolific is more consistent
- Pay at least $15-20/hour — low pay corrupts results
- Warn about disturbing content before the session
- Keep a feedback channel open — raters see bugs in your rubric
“There is no substitute for watching a human try to use your system.”
The big idea: human evaluation is the ultimate check. It is slow and expensive, so use it sparingly — but never ignore it.