If two reasonable humans cannot agree on a label, neither can a model. Inter-annotator agreement tells you if a task is even well-defined.
Before you train a model to do a task, check if humans can even do it. If two qualified annotators disagree 30 percent of the time, a model at 85 percent accuracy might actually be superhuman. If humans agree 99 percent and the model hits 85, the model is the problem.
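The quickest check is raw percent agreement, the fraction of items two annotators labeled identically. A minimal sketch in plain Python (the label lists below are made-up examples, not data from this lesson):

```python
# Raw percent agreement: share of items where two annotators chose the same label.
labels_a = ['spam', 'spam', 'not_spam', 'spam', 'not_spam', 'spam']
labels_b = ['spam', 'not_spam', 'not_spam', 'spam', 'not_spam', 'spam']

agreement = sum(a == b for a, b in zip(labels_a, labels_b)) / len(labels_a)
print(f'Raw agreement: {agreement:.0%}')  # 83% here; compare this to your model's accuracy
```

Raw agreement flatters imbalanced label sets, because two annotators who both default to the majority class agree by accident. That is why the chance-corrected metrics below exist.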
| Metric | Good for | Range |
|---|---|---|
| Raw percent agreement | Quick sanity check | 0 to 100% |
| Cohen's kappa | Two annotators, categorical labels | -1 to 1 |
| Fleiss's kappa | Many annotators | -1 to 1 |
| Krippendorff's alpha | Missing data, any label type | -1 to 1 |
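The last two rows are worth a quick sketch because they cover the cases Cohen's kappa cannot: more than two annotators, and missing labels. This is a hedged example, assuming the third-party statsmodels and krippendorff packages (neither is part of this lesson) and made-up rating matrices:

```python
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa
import krippendorff

# Fleiss's kappa: three annotators label five items (0 = not_spam, 1 = spam).
ratings = np.array([
    [1, 1, 1],
    [0, 0, 1],
    [1, 1, 0],
    [0, 0, 0],
    [1, 1, 1],
])
counts, _ = aggregate_raters(ratings)  # items x categories count table
print(f'Fleiss kappa: {fleiss_kappa(counts):.2f}')

# Krippendorff's alpha: rows are annotators, columns are items; np.nan marks missing labels.
reliability = np.array([
    [1, 0, 1, 0, 1],
    [1, 0, 0, np.nan, 1],
    [1, 1, 1, 0, np.nan],
])
alpha = krippendorff.alpha(reliability_data=reliability, level_of_measurement='nominal')
print(f"Krippendorff's alpha: {alpha:.2f}")
```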
Computing annotator agreement

```python
from sklearn.metrics import cohen_kappa_score

# Labels from two annotators for the same five items
annotator_a = ['spam', 'not_spam', 'spam', 'spam', 'not_spam']
annotator_b = ['spam', 'not_spam', 'not_spam', 'spam', 'not_spam']

kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"Cohen's kappa: {kappa:.2f}")
# Agreement corrected for chance: 0.62, substantial on the Landis-Koch scale
```

High disagreement is not always a bug. For subjective tasks (is this joke funny?), disagreement reveals real human diversity. Modern work in perspectivism argues that for some tasks, we should train models to predict a distribution over labels rather than collapse them to a single ground truth.
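What "predict a distribution" can look like in code: keep each item's vote shares as a soft target instead of collapsing them with a majority vote. A minimal sketch with made-up annotations (the item name and labels are illustrative only):

```python
from collections import Counter

# Five annotators rated the same joke; the split is signal, not noise.
annotations = {'joke_17': ['funny', 'funny', 'not_funny', 'funny', 'not_funny']}

for item, labels in annotations.items():
    counts = Counter(labels)
    soft_label = {label: n / len(labels) for label, n in counts.items()}
    print(item, soft_label)  # joke_17 {'funny': 0.6, 'not_funny': 0.4}
```

Training against soft targets like these (standard cross-entropy accepts them) preserves the disagreement that a single "ground truth" label would erase.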
The big idea: the upper bound on model performance is human agreement. If humans cannot do a task consistently, no model can. Measure agreement before you measure accuracy.
Quiz: 15 questions. Take it online for instant feedback at tendril.neural-forge.io/learn/quiz/end-data-annotator-agreement.
1. What is the core idea behind "Inter-Annotator Agreement: Measuring Reality"?
2. Which term best describes a foundational idea in "Inter-Annotator Agreement: Measuring Reality"?
3. A learner studying Inter-Annotator Agreement: Measuring Reality would need to understand which concept?
4. Which of these is directly relevant to Inter-Annotator Agreement: Measuring Reality?
5. What is the key insight about "Interpreting kappa" in the context of Inter-Annotator Agreement: Measuring Reality?
6. What is the key insight about "When disagreement kills a project" in the context of Inter-Annotator Agreement: Measuring Reality?
7. What is the recommended tip about "Ground your practice in fundamentals" in the context of Inter-Annotator Agreement: Measuring Reality?
8. Which statement accurately describes an aspect of Inter-Annotator Agreement: Measuring Reality?
9. What does working with Inter-Annotator Agreement: Measuring Reality typically involve?
10. Which of the following is true about Inter-Annotator Agreement: Measuring Reality?
11. Which best describes the scope of "Inter-Annotator Agreement: Measuring Reality"?
12. Which section heading best belongs in a lesson about Inter-Annotator Agreement: Measuring Reality?
13. Which section heading best belongs in a lesson about Inter-Annotator Agreement: Measuring Reality?
14. Which section heading best belongs in a lesson about Inter-Annotator Agreement: Measuring Reality?
15. Which of the following is a concept covered in Inter-Annotator Agreement: Measuring Reality?