Lesson 286 of 2116
Inter-Annotator Agreement: Measuring Reality
If two reasonable humans cannot agree on a label, neither can a model. Inter-annotator agreement tells you if a task is even well-defined.
Lesson map
What this lesson covers
Learning path
The main moves in order
1. The Most Underused Metric
2. Inter-annotator agreement
3. Kappa
4. Labeling
Concept cluster
Terms to connect while reading
Section 1
The Most Underused Metric
Before you train a model to do a task, check if humans can even do it. If two qualified annotators disagree 30 percent of the time, a model at 85 percent accuracy might actually be superhuman. If humans agree 99 percent and the model hits 85, the model is the problem.
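The quickest version of this check is raw percent agreement. Here is a minimal sketch, using made-up spam labels for illustration:

```python
def percent_agreement(labels_a, labels_b):
    """Fraction of items where both annotators chose the same label."""
    assert len(labels_a) == len(labels_b)
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)

annotator_a = ['spam', 'not_spam', 'spam', 'spam', 'not_spam']
annotator_b = ['spam', 'not_spam', 'not_spam', 'spam', 'not_spam']
print(f'Raw agreement: {percent_agreement(annotator_a, annotator_b):.0%}')  # 80%
```

Raw agreement is inflated by chance: if one label dominates, two annotators guessing would still agree often. That is the gap the kappa statistics below correct for.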
Ways to measure agreement
Compare the options
| Metric | Good for | Range |
|---|---|---|
| Raw percent agreement | Quick sanity check | 0 to 100% |
| Cohen's kappa | Two annotators, categorical labels | -1 to 1 |
| Fleiss's kappa | Many annotators | -1 to 1 |
| Krippendorff's alpha | Missing data, any label type | -1 to 1 |
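For the many-annotator case, Fleiss's kappa works from a count table: one row per item, one column per category, each cell holding how many raters chose that category. A self-contained sketch (the count table is invented for illustration):

```python
def fleiss_kappa(counts):
    """counts[i][j] = number of raters assigning item i to category j.
    Every row must sum to the same number of raters."""
    n_items = len(counts)
    n_raters = sum(counts[0])
    # Per-item agreement: fraction of rater pairs that agree on item i.
    P_i = [(sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
           for row in counts]
    P_bar = sum(P_i) / n_items
    # Chance agreement from the overall category proportions.
    totals = [sum(row[j] for row in counts) for j in range(len(counts[0]))]
    p_j = [t / (n_items * n_raters) for t in totals]
    P_e = sum(p * p for p in p_j)
    return (P_bar - P_e) / (1 - P_e)

# Three annotators labeling four items as spam / not_spam (made-up counts).
table = [[3, 0], [0, 3], [2, 1], [1, 2]]
print(f"Fleiss's kappa: {fleiss_kappa(table):.2f}")  # 0.33
```

The same correction logic underlies Cohen's kappa; Fleiss's version just generalizes the chance-agreement term to any number of raters.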
A simple calculation
Computing annotator agreement
from sklearn.metrics import cohen_kappa_score
annotator_a = ['spam', 'not_spam', 'spam', 'spam', 'not_spam']
annotator_b = ['spam', 'not_spam', 'not_spam', 'spam', 'not_spam']
kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f'Cohen\'s kappa: {kappa:.2f}')
# Agreement corrected for chance: 0.62 = substantial
When disagreement is a signal
High disagreement is not always a bug. For subjective tasks (is this joke funny?), disagreement reveals real human diversity. Modern work in perspectivism argues that for some tasks, we should train models to predict a distribution over labels, not collapse to a single ground truth.
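The perspectivist move is mechanically simple: instead of majority-voting annotations down to one label, keep the full vote as a soft label. A minimal sketch, with invented annotator ratings:

```python
from collections import Counter

def label_distribution(annotations):
    """Turn one item's annotator labels into a probability distribution."""
    counts = Counter(annotations)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}

# Five annotators rating one joke (made-up labels).
ratings = ['funny', 'funny', 'not_funny', 'funny', 'not_funny']
print(label_distribution(ratings))  # {'funny': 0.6, 'not_funny': 0.4}
```

A model trained against these distributions (e.g. with a cross-entropy loss on soft targets) learns that the item is genuinely contested, rather than being penalized for failing to match an arbitrary tiebreak.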
Key terms in this lesson
The big idea: the upper bound on model performance is human agreement. If humans cannot do a task consistently, no model can. Measure agreement before you measure accuracy.
End-of-lesson quiz
Check what stuck
15 questions · Score saves to your progress.
Related lessons
Keep going
Creators · 35 min
Labeling at Scale: The Hidden Human Layer
Behind every supervised model is an army of human labelers. Understanding how labeling works is understanding who really builds AI.
Creators · 45 min
Creating Your First Small Labeled Dataset
Creating a dataset from scratch teaches you more than using someone else's. Here is how to build a high-quality small labeled dataset for a real task.
Creators · 45 min
Open vs. Closed Models: Philosophy and Strategy
Open-source AI is both a technical movement and a political one. Understand the arguments so you can pick a stack and defend it.
