Inter-Annotator Agreement: Measuring Reality

If two reasonable humans cannot agree on a label, neither can a model. Inter-annotator agreement tells you if a task is even well-defined.

28 min · Reviewed 2026

The Most Underused Metric

Before you train a model to do a task, check whether humans can do it consistently. If two qualified annotators disagree 30 percent of the time, a model that scores 85 percent against one annotator's labels may already be more consistent than the humans are with each other. If humans agree 99 percent of the time and the model hits 85, the model is the problem.
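To make that comparison concrete, here is a minimal sketch that puts human-versus-human agreement next to model-versus-human agreement. The helper name percent_agreement and the ten-item toy data are hypothetical, chosen only for illustration.

```python
def percent_agreement(labels_x, labels_y):
    """Fraction of items on which two label sequences match."""
    if len(labels_x) != len(labels_y):
        raise ValueError("label sequences must have the same length")
    matches = sum(x == y for x, y in zip(labels_x, labels_y))
    return matches / len(labels_x)

# Hypothetical toy data: two human annotators and one model on ten items.
human_a = ["pos", "neg", "pos", "pos", "neg", "neg", "pos", "neg", "pos", "neg"]
human_b = ["pos", "neg", "neg", "pos", "neg", "pos", "pos", "neg", "neg", "neg"]
model = ["pos", "neg", "pos", "pos", "neg", "neg", "pos", "neg", "neg", "neg"]

human_vs_human = percent_agreement(human_a, human_b)  # the human baseline
model_vs_a = percent_agreement(model, human_a)
model_vs_b = percent_agreement(model, human_b)

print(f"human vs human: {human_vs_human:.0%}")  # 70%
print(f"model vs A:     {model_vs_a:.0%}")      # 90%
print(f"model vs B:     {model_vs_b:.0%}")      # 80%
```

In this toy case the model agrees with each annotator more often than the annotators agree with each other, so its accuracy score says more about the labels than about the model.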

Ways to measure agreement

Metric	Good for	Range
Raw percent agreement	Quick sanity check	0 to 100%
Cohen's kappa	Two annotators, categorical labels	-1 to 1
Fleiss's kappa	Many annotators	-1 to 1
Krippendorff's alpha	Missing data, any label type	-1 to 1
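For the many-annotator row of the table, Fleiss's kappa can be sketched from its textbook formula in a few lines. This is an illustrative implementation, not a library API: the function name fleiss_kappa and the toy ratings are assumptions, and it expects every item to carry the same number of ratings with at least two distinct labels appearing overall.

```python
from collections import Counter

def fleiss_kappa(ratings):
    """Fleiss's kappa for a list of items, each a list of labels
    from the same number of raters (illustrative sketch)."""
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if any(len(item) != n_raters for item in ratings):
        raise ValueError("every item needs the same number of ratings")

    counts = [Counter(item) for item in ratings]
    categories = sorted({label for item in ratings for label in item})

    # Per-item agreement: share of rater pairs that chose the same label.
    per_item = [
        (sum(c[cat] ** 2 for cat in categories) - n_raters)
        / (n_raters * (n_raters - 1))
        for c in counts
    ]
    observed = sum(per_item) / n_items

    # Chance agreement from the overall share of each label.
    shares = [
        sum(c[cat] for c in counts) / (n_items * n_raters)
        for cat in categories
    ]
    expected = sum(p ** 2 for p in shares)

    # Undefined (division by zero) if only one label ever appears.
    return (observed - expected) / (1 - expected)

# Hypothetical toy data: four items, three raters each.
ratings = [
    ["spam", "spam", "spam"],
    ["spam", "spam", "not_spam"],
    ["not_spam", "not_spam", "not_spam"],
    ["not_spam", "spam", "not_spam"],
]
fleiss = fleiss_kappa(ratings)
print(f"Fleiss's kappa: {fleiss:.2f}")  # 0.33
```

Observed agreement here is about 0.67 and chance agreement is 0.50, which is why the corrected score is much lower than the raw number suggests.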

A simple calculation

from sklearn.metrics import cohen_kappa_score

annotator_a = ['spam', 'not_spam', 'spam', 'spam', 'not_spam']
annotator_b = ['spam', 'not_spam', 'not_spam', 'spam', 'not_spam']

kappa = cohen_kappa_score(annotator_a, annotator_b)
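print(f"Cohen's kappa: {kappa:.2f}")
# Agreement corrected for chance: 0.62, just above the 0.4 to 0.6 'moderate' band.
# Raw agreement is 0.80 and chance agreement is 0.48, so (0.80 - 0.48) / (1 - 0.48) ≈ 0.62.

Computing annotator agreement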
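The chance correction can be reproduced by hand, which shows where the gap between raw agreement and kappa comes from. The following is a minimal sketch using only the standard library; the function name cohen_kappa_by_hand is hypothetical, and the data is the same five-item example as above.

```python
from collections import Counter

def cohen_kappa_by_hand(labels_a, labels_b):
    """Return (observed agreement, chance agreement, kappa) for two annotators."""
    n = len(labels_a)
    # Observed agreement: share of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: probability both pick the same label
    # if each labels at random according to their own label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(
        (freq_a[label] / n) * (freq_b[label] / n)
        for label in set(labels_a) | set(labels_b)
    )
    kappa = (observed - expected) / (1 - expected)
    return observed, expected, kappa

annotator_a = ["spam", "not_spam", "spam", "spam", "not_spam"]
annotator_b = ["spam", "not_spam", "not_spam", "spam", "not_spam"]

observed, expected, kappa = cohen_kappa_by_hand(annotator_a, annotator_b)
print(f"observed={observed:.2f} expected={expected:.2f} kappa={kappa:.2f}")
# observed=0.80 expected=0.48 kappa=0.62
```

With only two labels, the annotators would agree almost half the time by guessing, so 80 percent raw agreement shrinks to a kappa of about 0.62.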

When disagreement is a signal

High disagreement is not always a bug. For subjective tasks (is this joke funny?), disagreement reveals real human diversity. Modern work in perspectivism argues that for some tasks, we should train models to predict a distribution over labels, not collapse to a single ground truth.
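As a rough illustration of predicting a distribution rather than a single label, the sketch below turns raw annotator votes into soft targets and scores two candidate predictions against them with cross-entropy. The helper names, the joke data, and the two predictions are all hypothetical; a real training setup would feed such targets to a model's loss function.

```python
from collections import Counter
import math

LABELS = ["funny", "not_funny"]

def label_distribution(votes, labels):
    """Turn a list of annotator votes into a probability per label."""
    counts = Counter(votes)
    return {label: counts[label] / len(votes) for label in labels}

def cross_entropy(target, predicted, eps=1e-12):
    """Cross-entropy of a predicted distribution against a soft target."""
    return -sum(
        p * math.log(max(predicted[label], eps))
        for label, p in target.items()
    )

# Hypothetical votes from five annotators per joke.
votes_per_joke = {
    "joke_1": ["funny", "funny", "funny", "funny", "funny"],
    "joke_2": ["funny", "not_funny", "funny", "not_funny", "funny"],
}
soft_targets = {
    joke: label_distribution(votes, LABELS)
    for joke, votes in votes_per_joke.items()
}
print(soft_targets["joke_2"])  # {'funny': 0.6, 'not_funny': 0.4}

# Two hypothetical model outputs for the contested joke.
prediction_confident = {"funny": 0.99, "not_funny": 0.01}
prediction_calibrated = {"funny": 0.6, "not_funny": 0.4}

loss_confident = cross_entropy(soft_targets["joke_2"], prediction_confident)
loss_calibrated = cross_entropy(soft_targets["joke_2"], prediction_calibrated)
print(f"confident: {loss_confident:.2f}, calibrated: {loss_calibrated:.2f}")
```

Against a soft target, the prediction that mirrors the annotators' split gets the lower loss, while a majority-vote label would have rewarded the overconfident one.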

The big idea: the upper bound on model performance is human agreement. If humans cannot do a task consistently, no model can. Measure agreement before you measure accuracy.

End-of-lesson check

15 questions · take it digitally for instant feedback at tendril.neural-forge.io/learn/quiz/end-data-annotator-agreement

What is the core idea behind "Inter-Annotator Agreement: Measuring Reality"?
1. If two reasonable humans cannot agree on a label, neither can a model. Inter-annotator agreement tells you if a task is even well-defined.
2. data-centric AI
3. Encoding gotchas: UTF-8 vs Latin-1 produces garbled text
4. Pandas is the Python library that made data science what it is today.
Which term best describes a foundational idea in "Inter-Annotator Agreement: Measuring Reality"?
1. Cohen's kappa
2. inter-annotator agreement
3. Fleiss's kappa
4. perspectivism
A learner studying Inter-Annotator Agreement: Measuring Reality would need to understand which concept?
1. inter-annotator agreement
2. Fleiss's kappa
3. Cohen's kappa
4. perspectivism
Which of these is directly relevant to Inter-Annotator Agreement: Measuring Reality?
1. inter-annotator agreement
2. Cohen's kappa
3. perspectivism
4. Fleiss's kappa
What is the key insight about "Interpreting kappa" in the context of Inter-Annotator Agreement: Measuring Reality?
1. Kappa below 0 means worse than chance. 0.0 to 0.2 is slight. 0.2 to 0.4 fair. 0.4 to 0.6 moderate. 0.6 to 0.
2. data-centric AI
3. Encoding gotchas: UTF-8 vs Latin-1 produces garbled text
4. Pandas is the Python library that made data science what it is today.
What is the key insight about "When disagreement kills a project" in the context of Inter-Annotator Agreement: Measuring Reality?
1. data-centric AI
2. If kappa is below 0.3, your task is probably ill-defined. Rewrite the annotation guidelines. Train annotators better.
3. Encoding gotchas: UTF-8 vs Latin-1 produces garbled text
4. Pandas is the Python library that made data science what it is today.
What is the recommended tip about "Ground your practice in fundamentals" in the context of Inter-Annotator Agreement: Measuring Reality?
1. data-centric AI
2. Encoding gotchas: UTF-8 vs Latin-1 produces garbled text
3. Every AI capability has an underlying mechanism. Understanding that mechanism tells you where it'll fail — which is more…
4. Pandas is the Python library that made data science what it is today.
Which statement accurately describes an aspect of Inter-Annotator Agreement: Measuring Reality?
1. data-centric AI
2. Encoding gotchas: UTF-8 vs Latin-1 produces garbled text
3. Pandas is the Python library that made data science what it is today.
4. Before you train a model to do a task, check if humans can even do it. If two qualified annotators disagree 30 percent of the time, a model …
What does working with Inter-Annotator Agreement: Measuring Reality typically involve?
1. High disagreement is not always a bug. For subjective tasks (is this joke funny?), disagreement reveals real human diversity.
2. data-centric AI
3. Encoding gotchas: UTF-8 vs Latin-1 produces garbled text
4. Pandas is the Python library that made data science what it is today.
Which of the following is true about Inter-Annotator Agreement: Measuring Reality?
1. data-centric AI
2. The big idea: the upper bound on model performance is human agreement. If humans cannot do a task consistently, no model can.
3. Encoding gotchas: UTF-8 vs Latin-1 produces garbled text
4. Pandas is the Python library that made data science what it is today.
Which best describes the scope of "Inter-Annotator Agreement: Measuring Reality"?
1. It is unrelated to foundations workflows
2. It applies only to the opposite beginner tier
3. It focuses on If two reasonable humans cannot agree on a label, neither can a model. Inter-annotator agreement tel
4. It was deprecated in 2024 and no longer relevant
Which section heading best belongs in a lesson about Inter-Annotator Agreement: Measuring Reality?
1. data-centric AI
2. Encoding gotchas: UTF-8 vs Latin-1 produces garbled text
3. Pandas is the Python library that made data science what it is today.
4. Ways to measure agreement
Which section heading best belongs in a lesson about Inter-Annotator Agreement: Measuring Reality?
1. A simple calculation
2. data-centric AI
3. Encoding gotchas: UTF-8 vs Latin-1 produces garbled text
4. Pandas is the Python library that made data science what it is today.
Which section heading best belongs in a lesson about Inter-Annotator Agreement: Measuring Reality?
1. data-centric AI
2. When disagreement is a signal
3. Encoding gotchas: UTF-8 vs Latin-1 produces garbled text
4. Pandas is the Python library that made data science what it is today.
Which of the following is a concept covered in Inter-Annotator Agreement: Measuring Reality?
1. Cohen's kappa
2. Fleiss's kappa
3. inter-annotator agreement
4. perspectivism
