If two reasonable humans cannot agree on a label, neither can a model. Inter-annotator agreement tells you whether a task is even well-defined.
Before you train a model to do a task, check whether humans can do it consistently. If two qualified annotators disagree 30 percent of the time, a model at 85 percent accuracy agrees with the reference labels more often than a second human would, so it might actually be superhuman. If humans agree 99 percent of the time and the model hits 85, the model is the problem.
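The comparison above can be sketched in a few lines. This is a minimal illustration, not a prescribed workflow: the labels, the `percent_agreement` helper, and the 0.85 model accuracy are all made up for the example.

```python
def percent_agreement(labels_a, labels_b):
    """Fraction of items on which two annotators chose the same label."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both annotators must label the same items")
    if not labels_a:
        raise ValueError("need at least one item")
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)


# Hypothetical labels for ten items from two qualified annotators.
human_a = ["pos", "neg", "pos", "pos", "neg", "neg", "pos", "neg", "pos", "neg"]
human_b = ["pos", "neg", "neg", "pos", "neg", "pos", "pos", "neg", "neg", "neg"]

human_agreement = percent_agreement(human_a, human_b)  # 7 of 10 items match
model_accuracy = 0.85  # assumed figure, measured against one annotator's labels

print(f"Human agreement: {human_agreement:.0%}")
if model_accuracy > human_agreement:
    print("The model matches the reference more often than a second human does.")
else:
    print("The model lags behind human agreement; look at the model first.")
```

Raw agreement is only a first check because it ignores matches that would happen by chance, which is what the kappa-style metrics correct for.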
| Metric | Good for | Range |
|---|---|---|
| Raw percent agreement | Quick sanity check | 0 to 100% |
| Cohen's kappa | Two annotators, categorical labels | -1 to 1 |
| Fleiss's kappa | Many annotators | -1 to 1 |
| Krippendorff's alpha | Missing data, any label type | -1 to 1 |
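For the "many annotators" row, here is a rough from-scratch sketch of Fleiss's kappa using only the standard library. It assumes every item was rated by the same number of annotators. The `fleiss_kappa` function name and the example ratings are illustrative, not taken from any library.

```python
from collections import Counter


def fleiss_kappa(ratings, categories):
    """Fleiss's kappa; ratings[i] is the list of labels given to item i."""
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2:
        raise ValueError("need at least two ratings per item")
    if any(len(r) != n_raters for r in ratings):
        raise ValueError("every item needs the same number of ratings")

    counts = [Counter(r) for r in ratings]

    # Observed agreement: per item, the share of rater pairs that agree.
    per_item = [
        (sum(c[k] ** 2 for k in categories) - n_raters)
        / (n_raters * (n_raters - 1))
        for c in counts
    ]
    p_observed = sum(per_item) / n_items

    # Expected agreement from the overall share of each category.
    total = n_items * n_raters
    shares = [sum(c[k] for c in counts) / total for k in categories]
    p_expected = sum(s ** 2 for s in shares)

    if p_expected == 1:
        return float("nan")  # undefined when only one category is ever used
    return (p_observed - p_expected) / (1 - p_expected)


# Hypothetical data: four items, three annotators each.
ratings = [
    ["spam", "spam", "spam"],
    ["spam", "spam", "not_spam"],
    ["not_spam", "not_spam", "not_spam"],
    ["not_spam", "spam", "not_spam"],
]
kappa = fleiss_kappa(ratings, ["spam", "not_spam"])
print(f"Fleiss's kappa: {kappa:.2f}")
```

In this toy data two of the four items are unanimous, yet the chance-corrected score is low because both labels are equally common overall, so a fair amount of agreement is expected by chance.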
```python
from sklearn.metrics import cohen_kappa_score

# Labels from two annotators for the same five messages
annotator_a = ['spam', 'not_spam', 'spam', 'spam', 'not_spam']
annotator_b = ['spam', 'not_spam', 'not_spam', 'spam', 'not_spam']

kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"Cohen's kappa: {kappa:.2f}")
# Agreement corrected for chance: (0.80 - 0.48) / (1 - 0.48) = 0.62,
# right on the border between moderate and substantial agreement
```

Computing annotator agreement

High disagreement is not always a bug. For subjective tasks (is this joke funny?), disagreement reveals real human diversity. Recent work on perspectivism argues that for some tasks, we should train models to predict a distribution over labels rather than collapse the annotations into a single ground truth.
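To make the "distribution over labels" idea concrete, here is a small sketch that turns several annotators' votes on one item into a soft label and compares it with the majority-vote label. The votes and the `label_distribution` helper are hypothetical, and this is only one simple way to build such a target.

```python
from collections import Counter


def label_distribution(votes, categories):
    """Share of annotators who chose each category for a single item."""
    if not votes:
        raise ValueError("need at least one vote")
    counts = Counter(votes)
    return {k: counts[k] / len(votes) for k in categories}


# Hypothetical votes from five annotators on "is this joke funny?"
votes = ["funny", "funny", "not_funny", "funny", "not_funny"]
categories = ["funny", "not_funny"]

soft_label = label_distribution(votes, categories)
hard_label = max(soft_label, key=soft_label.get)

print(soft_label)   # keeps the 3-to-2 split between annotators
print(hard_label)   # majority vote discards the minority view
```

A model trained on the soft label is asked to reproduce the split itself, while the hard label treats two of the five annotators as simply wrong.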
The big idea: human agreement caps the model performance you can meaningfully measure. If humans cannot do a task consistently, there is no stable ground truth to score a model against. Measure agreement before you measure accuracy.
6 questions · take it digitally for instant feedback at tendril.neural-forge.io/learn/quiz/end-data-annotator-agreement
What is the main idea of "Inter-Annotator Agreement: Measuring Reality"?
Which concept is most central to "Inter-Annotator Agreement: Measuring Reality"?
What should a careful learner remember about "Interpreting kappa"?
You want to use AI after this lesson. What is the safest next step?
How should AI output about inter-annotator agreement be treated?
Name one way to verify an AI answer about inter-annotator agreement.