Human Evaluation 101
Automatic metrics miss a lot. Humans catch what metrics cannot. Here is how to run a simple human eval.
What this lesson covers
1. When Humans Are the Measuring Stick
2. Human evaluation
3. Rubrics
4. Inter-rater agreement
Section 1: When Humans Are the Measuring Stick
For judging quality, creativity, and helpfulness, there is no substitute for human eyes. Human evaluation is expensive and slow — but when done well, it is the gold standard.
The minimum viable human eval
1. Write a clear rubric (what 'good' means)
2. Collect 30-50 prompts representative of real use
3. Generate responses from the models you want to compare
4. Show pairs blindly (see the sketch after this list); ask which is better and why
5. Have at least two raters; measure agreement
6. Report with confidence intervals
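Steps 4 and 6 are the easiest to get wrong, so here is a minimal Python sketch of both. Every name in it (`make_blind_pairs`, `a_win_outcomes`, `bootstrap_ci`) is illustrative rather than from any library: it randomizes which model appears on the left, recovers model A's wins from the raters' votes, and bootstraps a 95% confidence interval for the win rate.

```python
import random
import statistics

def make_blind_pairs(prompts, responses_a, responses_b, seed=0):
    """Randomize which model appears on the left so raters can't learn a pattern."""
    rng = random.Random(seed)
    pairs = []
    for prompt, resp_a, resp_b in zip(prompts, responses_a, responses_b):
        a_on_left = rng.random() < 0.5
        left, right = (resp_a, resp_b) if a_on_left else (resp_b, resp_a)
        pairs.append({"prompt": prompt, "left": left, "right": right,
                      "a_on_left": a_on_left})
    return pairs

def a_win_outcomes(pairs, votes):
    """votes[i] is 'left' or 'right': the side the rater preferred for pair i.
    Returns a list of 1 (model A won) and 0 (model A lost)."""
    return [int((vote == "left") == pair["a_on_left"])
            for pair, vote in zip(pairs, votes)]

def bootstrap_ci(outcomes, n_resamples=10_000, seed=0):
    """95% bootstrap confidence interval for the mean of 0/1 outcomes."""
    rng = random.Random(seed)
    means = sorted(statistics.mean(rng.choices(outcomes, k=len(outcomes)))
                   for _ in range(n_resamples))
    return means[round(0.025 * n_resamples)], means[round(0.975 * n_resamples)]
```

Report the win rate together with the interval; with only 30-50 prompts the interval will be wide, and that width is exactly what tells you whether the difference you see is real.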
Inter-rater agreement
If your two raters disagree on half the items, your rubric is broken or your task is noisy. Cohen's Kappa is a common agreement metric: above 0.6 is acceptable; above 0.8 is strong.
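Kappa corrects raw agreement for chance: kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e is the agreement two raters would reach by luck given how often each uses each label. Here is a minimal sketch (our own helper, not a library API, though `sklearn.metrics.cohen_kappa_score` computes the same thing):

```python
from collections import Counter

def cohens_kappa(rater1, rater2):
    """Cohen's kappa = (p_o - p_e) / (1 - p_e) for two raters over the same items."""
    n = len(rater1)
    p_o = sum(a == b for a, b in zip(rater1, rater2)) / n  # observed agreement
    c1, c2 = Counter(rater1), Counter(rater2)
    # chance agreement from each rater's marginal label frequencies
    p_e = sum(c1[label] * c2[label] for label in c1.keys() | c2.keys()) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Two raters' pairwise preferences on ten items
rater1 = ["A", "A", "B", "A", "B", "B", "A", "A", "B", "A"]
rater2 = ["A", "B", "B", "A", "B", "B", "A", "A", "A", "A"]
print(round(cohens_kappa(rater1, rater2), 2))  # 0.57: just under the 0.6 bar
```

If kappa comes out low, read the disagreements before blaming the raters: most often they point to an ambiguity in the rubric.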
Cost and ethics
- Amazon Mechanical Turk is cheap but quality varies; Prolific is more consistent
- Pay at least $15-20/hour — low pay corrupts results
- Warn about disturbing content before the session
- Keep a feedback channel open — raters see bugs in your rubric
“There is no substitute for watching a human try to use your system.”
The big idea: human evaluation is the ultimate check. It is slow and expensive, so use it sparingly — but never ignore it.