Lesson 286 of 2116
Inter-Annotator Agreement: Measuring Reality
If two reasonable humans cannot agree on a label, neither can a model. Inter-annotator agreement tells you if a task is even well-defined.
Lesson map
What this lesson covers
Learning path
The main moves in order
1. The Most Underused Metric
2. Inter-annotator agreement
3. Kappa
4. Labeling
Concept cluster
Terms to connect while reading
Section 1
The Most Underused Metric
Before you train a model to do a task, check if humans can even do it. If two qualified annotators disagree 30 percent of the time, a model at 85 percent accuracy might actually be superhuman. If humans agree 99 percent and the model hits 85, the model is the problem.
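The quickest version of this check is raw percent agreement. Here is a minimal sketch, using made-up spam labels for illustration:

```python
def percent_agreement(labels_a, labels_b):
    """Fraction of items where both annotators chose the same label."""
    assert len(labels_a) == len(labels_b)
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)

annotator_a = ['spam', 'not_spam', 'spam', 'spam', 'not_spam']
annotator_b = ['spam', 'not_spam', 'not_spam', 'spam', 'not_spam']
print(f'Raw agreement: {percent_agreement(annotator_a, annotator_b):.0%}')  # 80%
```

Raw agreement is inflated by chance: if one label dominates, two annotators guessing would still agree often. That is the gap the kappa statistics below correct for.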
Ways to measure agreement
Compare the options
| Metric | Good for | Range |
|---|---|---|
| Raw percent agreement | Quick sanity check | 0 to 100% |
| Cohen's kappa | Two annotators, categorical labels | -1 to 1 |
| Fleiss's kappa | Many annotators | -1 to 1 |
| Krippendorff's alpha | Missing data, any label type | -1 to 1 |
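For the many-annotator case, Fleiss's kappa works from a count table: one row per item, one column per category, each cell holding how many raters chose that category. A self-contained sketch (the count table is invented for illustration):

```python
def fleiss_kappa(counts):
    """counts[i][j] = number of raters assigning item i to category j.
    Every row must sum to the same number of raters."""
    n_items = len(counts)
    n_raters = sum(counts[0])
    # Per-item agreement: fraction of rater pairs that agree on item i.
    P_i = [(sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
           for row in counts]
    P_bar = sum(P_i) / n_items
    # Chance agreement from the overall category proportions.
    totals = [sum(row[j] for row in counts) for j in range(len(counts[0]))]
    p_j = [t / (n_items * n_raters) for t in totals]
    P_e = sum(p * p for p in p_j)
    return (P_bar - P_e) / (1 - P_e)

# Three annotators labeling four items as spam / not_spam (made-up counts).
table = [[3, 0], [0, 3], [2, 1], [1, 2]]
print(f"Fleiss's kappa: {fleiss_kappa(table):.2f}")  # 0.33
```

The same correction logic underlies Cohen's kappa; Fleiss's version just generalizes the chance-agreement term to any number of raters.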
A simple calculation
Computing annotator agreement
from sklearn.metrics import cohen_kappa_score
annotator_a = ['spam', 'not_spam', 'spam', 'spam', 'not_spam']
annotator_b = ['spam', 'not_spam', 'not_spam', 'spam', 'not_spam']
kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f'Cohen\'s kappa: {kappa:.2f}')
# Agreement corrected for chance: 0.62 = substantial
When disagreement is a signal
High disagreement is not always a bug. For subjective tasks (is this joke funny?), disagreement reveals real human diversity. Modern work in perspectivism argues that for some tasks, we should train models to predict a distribution over labels, not collapse to a single ground truth.
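The perspectivist move is mechanically simple: instead of majority-voting annotations down to one label, keep the full vote as a soft label. A minimal sketch, with invented annotator ratings:

```python
from collections import Counter

def label_distribution(annotations):
    """Turn one item's annotator labels into a probability distribution."""
    counts = Counter(annotations)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}

# Five annotators rating one joke (made-up labels).
ratings = ['funny', 'funny', 'not_funny', 'funny', 'not_funny']
print(label_distribution(ratings))  # {'funny': 0.6, 'not_funny': 0.4}
```

A model trained against these distributions (e.g. with a cross-entropy loss on soft targets) learns that the item is genuinely contested, rather than being penalized for failing to match an arbitrary tiebreak.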
Key terms in this lesson
The big idea: the upper bound on model performance is human agreement. If humans cannot do a task consistently, no model can. Measure agreement before you measure accuracy.
End-of-lesson quiz
Check what stuck
15 questions · Score saves to your progress.
Related lessons
Keep going
Creators · 35 min
Labeling at Scale: The Hidden Human Layer
Behind every supervised model is an army of human labelers. Understanding how labeling works is understanding who really builds AI.
Creators · 45 min
Creating Your First Small Labeled Dataset
Creating a dataset from scratch teaches you more than using someone else's. Here is how to build a high-quality small labeled dataset for a real task.
Creators · 45 min
Open vs. Closed Models: Philosophy and Strategy
Open-source AI is both a technical movement and a political one. Understand the arguments so you can pick a stack and defend it.
