Human Evaluation 101

Automatic metrics miss a lot. Humans catch what metrics cannot. Here is how to run a simple human eval.

28 min · Reviewed 2026

When Humans Are the Measuring Stick

For judging quality, creativity, and helpfulness, there is no substitute for human eyes. Human evaluation is expensive and slow — but when done well, it is the gold standard.

The minimum viable human eval

Write a clear rubric (what 'good' means)
Collect 30-50 prompts representative of real use
Generate responses from the models you want to compare
Show pairs blindly; ask which is better and why
Have at least two raters; measure agreement
Report with confidence intervals
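
The blinding and reporting steps above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed tool: the function names (make_blind_pairs, unblind, wilson_interval) and the task layout are hypothetical, and the Wilson score interval is one reasonable choice of confidence interval for a win rate, not something the lesson mandates.

```python
import math
import random


def make_blind_pairs(prompts, responses_a, responses_b, seed=0):
    """Randomize which model appears on the left for each prompt.

    Returns rater-facing tasks (no model names) and a private answer key.
    """
    rng = random.Random(seed)
    tasks, key = [], {}
    for i, (prompt, a, b) in enumerate(zip(prompts, responses_a, responses_b)):
        a_on_left = rng.random() < 0.5
        left, right = (a, b) if a_on_left else (b, a)
        tasks.append({"id": i, "prompt": prompt, "left": left, "right": right})
        key[i] = "A" if a_on_left else "B"  # model shown on the left; never show this to raters
    return tasks, key


def unblind(votes, key):
    """Map each 'left'/'right' vote (keyed by task id) back to the winning model."""
    other = {"A": "B", "B": "A"}
    return [key[i] if side == "left" else other[key[i]] for i, side in votes.items()]


def wilson_interval(wins, n, z=1.96):
    """Wilson score interval for a win rate; z=1.96 gives roughly 95% coverage."""
    if n == 0:
        raise ValueError("need at least one judgment")
    p = wins / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half
```

For example, if model A wins 30 of 40 blind comparisons, wilson_interval(30, 40) gives roughly 0.60 to 0.86. With only 30-50 prompts the interval is wide, which is the reason to report it alongside the raw win rate.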

Inter-rater agreement

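If your two raters disagree on half the items, your rubric is broken or your task is noisy. Cohen's Kappa is a common agreement metric: it measures how far two raters agree beyond what chance alone would produce, where 1 is perfect agreement and 0 is chance level. As a rule of thumb, above 0.6 is acceptable and above 0.8 is strong.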
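
To make the metric concrete, here is a minimal pure-Python sketch of Cohen's Kappa for two raters. The function name cohens_kappa and the example labels are illustrative assumptions. The formula is the standard one: observed agreement minus chance agreement, divided by one minus chance agreement.

```python
from collections import Counter


def cohens_kappa(rater1, rater2):
    """Cohen's Kappa for two raters who labeled the same items in the same order."""
    if len(rater1) != len(rater2) or not rater1:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater1)
    # Fraction of items on which the two raters gave the same label.
    observed = sum(a == b for a, b in zip(rater1, rater2)) / n
    # Agreement expected by chance, from each rater's own label frequencies.
    c1, c2 = Counter(rater1), Counter(rater2)
    expected = sum(c1[label] * c2[label] for label in c1) / (n * n)
    if expected == 1.0:
        # Both raters used one identical label throughout; kappa is undefined (0/0).
        return float("nan")
    return (observed - expected) / (1 - expected)


# Hypothetical votes on ten blind pairs ("A" or "B" judged better).
rater1 = ["A", "A", "B", "B", "A", "B", "A", "A", "B", "A"]
rater2 = ["A", "A", "B", "A", "A", "B", "B", "A", "B", "A"]
print(round(cohens_kappa(rater1, rater2), 2))  # 0.58
```

In this made-up example the raters agree on 8 of 10 items, yet Kappa is only about 0.58, because roughly half of that agreement would be expected by chance. Raw percent agreement alone can therefore look better than it is.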

Cost and ethics

Amazon Mechanical Turk is cheap but quality varies; Prolific is more consistent
Pay at least $15-20/hour — low pay corrupts results
Warn about disturbing content before the session
Keep a feedback channel open — raters see bugs in your rubric
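
A rough budget follows directly from the pay guideline above. The sketch below is only back-of-the-envelope arithmetic: estimate_budget is a hypothetical helper, the 90 seconds per item is an assumed figure you should replace with a timing from a pilot run, and platform fees are deliberately left out.

```python
import math


def estimate_budget(n_items, n_raters, seconds_per_item, hourly_rate,
                    max_session_minutes=60):
    """Estimate rater time and pay for a small human eval (platform fees excluded)."""
    hours_per_rater = n_items * seconds_per_item / 3600
    # Split the work into sessions no longer than max_session_minutes to limit fatigue.
    sessions_per_rater = math.ceil(hours_per_rater * 60 / max_session_minutes)
    total_cost = hours_per_rater * n_raters * hourly_rate
    return {
        "hours_per_rater": hours_per_rater,
        "sessions_per_rater": sessions_per_rater,
        "total_cost": total_cost,
    }


# 50 prompts, 2 raters, an assumed 90 seconds per item, $20/hour.
print(estimate_budget(50, 2, 90, 20.0))
```

Under these assumptions each rater works 1.25 hours, split into two sessions, for a total of $50 in rater pay.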

There is no substitute for watching a human try to use your system.
— Common refrain in UX research

The big idea: human evaluation is the ultimate check. It is slow and expensive, so use it sparingly — but never ignore it.

End-of-lesson check

15 questions · take it digitally for instant feedback at tendril.neural-forge.io/learn/quiz/end-builders-human-evaluation-101

What is the core idea behind "Human Evaluation 101"?
1. Automatic metrics miss a lot. Humans catch what metrics cannot. Here is how to run a simple human eval.
2. oral
3. Bayes rule
4. HumanEval

Which term best describes a foundational idea in "Human Evaluation 101"?
1. rubric
2. human evaluation
3. inter-rater agreement
4. Cohen's Kappa

A learner studying Human Evaluation 101 would need to understand which concept?
1. human evaluation
2. inter-rater agreement
3. rubric
4. Cohen's Kappa

Which of these is directly relevant to Human Evaluation 101?
1. human evaluation
2. rubric
3. Cohen's Kappa
4. inter-rater agreement

Which of the following is a key point about Human Evaluation 101?
1. Write a clear rubric (what 'good' means)
2. Collect 30-50 prompts representative of real use
3. Generate responses from the models you want to compare
4. Show pairs blindly; ask which is better and why

Which of these does NOT belong in a discussion of Human Evaluation 101?
1. Collect 30-50 prompts representative of real use
2. Write a clear rubric (what 'good' means)
3. Generate responses from the models you want to compare
4. oral

Which statement is accurate regarding Human Evaluation 101?
1. Pay at least $15-20/hour — low pay corrupts results
2. Warn about disturbing content before the session
3. Amazon Mechanical Turk is cheap but quality varies; Prolific is more consistent
4. Keep a feedback channel open — raters see bugs in your rubric

Which of these does NOT belong in a discussion of Human Evaluation 101?
1. Pay at least $15-20/hour — low pay corrupts results
2. Amazon Mechanical Turk is cheap but quality varies; Prolific is more consistent
3. oral
4. Warn about disturbing content before the session

What is the key insight about "Always blind" in the context of Human Evaluation 101?
1. If raters know which model wrote which response, bias leaks in.
2. oral
3. Bayes rule
4. HumanEval

What is the key insight about "Fatigue is real" in the context of Human Evaluation 101?
1. oral
2. Raters rarely maintain quality past 60-90 minutes. Pay well, chunk into short sessions, and include catch items to verif…
3. Bayes rule
4. HumanEval

What is the recommended tip about "Build your mental model" in the context of Human Evaluation 101?
1. oral
2. Bayes rule
3. AI isn't magic — it's pattern recognition at scale. The more you understand how it works, the more effectively you can u…
4. HumanEval

Which statement accurately describes an aspect of Human Evaluation 101?
1. oral
2. Bayes rule
3. HumanEval
4. For judging quality, creativity, and helpfulness, there is no substitute for human eyes.

What does working with Human Evaluation 101 typically involve?
1. If your two raters disagree on half the items, your rubric is broken or your task is noisy.
2. oral
3. Bayes rule
4. HumanEval

Which of the following is true about Human Evaluation 101?
1. oral
2. The big idea: human evaluation is the ultimate check. It is slow and expensive, so use it sparingly — but never ignore it.
3. Bayes rule
4. HumanEval

Which best describes the scope of "Human Evaluation 101"?
1. It is unrelated to foundations workflows
2. It applies only to the opposite beginner tier
3. It focuses on Automatic metrics miss a lot. Humans catch what metrics cannot. Here is how to run a simple human ev
4. It was deprecated in 2024 and no longer relevant

