Label Noise: When Your Ground Truth Is Wrong
Every labeled dataset has mistakes. Studies have found error rates of 3 to 6 percent in famous benchmarks like ImageNet. Noisy labels confuse models and mislead evaluations.
Section 1
Your Ground Truth Is Not Ground Truth
In 2021, researchers at MIT published Pervasive Label Errors in Test Sets Destabilize Machine Learning Benchmarks, showing that the most famous ML benchmarks carry significant error rates. ImageNet's test set had at least 5.8 percent mislabeled examples; even MNIST had 0.15 percent. When leaderboard models differ by tenths of a percent, errors of this size can reorder the rankings.
Types of label noise
- Random noise: uniform mistakes across classes (simulated in the sketch after this list)
- Systematic noise: certain classes often confused (golden retriever vs. labrador)
- Adversarial noise: deliberately mislabeled data (data poisoning)
- Label flip: the correct label exists but was swapped
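To make the first two categories concrete, here is a minimal NumPy sketch for injecting each kind of noise into a label array. The function names and the 20 percent rate are illustrative choices, not from any library:
import numpy as np

rng = np.random.default_rng(0)

def add_random_noise(labels, noise_rate, num_classes):
    # Random (symmetric) noise: each example flips with probability
    # noise_rate to a uniformly chosen *different* class.
    noisy = labels.copy()
    flip = rng.random(len(noisy)) < noise_rate
    offsets = rng.integers(1, num_classes, size=len(noisy))
    noisy[flip] = (noisy[flip] + offsets[flip]) % num_classes
    return noisy

def add_systematic_noise(labels, confusions, noise_rate):
    # Systematic (class-conditional) noise: specific source classes get
    # confused with one specific target class, e.g. {retriever: labrador}.
    noisy = labels.copy()
    for src, dst in confusions.items():
        idx = np.flatnonzero(labels == src)
        flipped = idx[rng.random(len(idx)) < noise_rate]
        noisy[flipped] = dst
    return noisy

labels = rng.integers(0, 4, size=1000)
print((add_random_noise(labels, 0.2, 4) != labels).mean())           # ~0.20, spread across classes
print((add_systematic_noise(labels, {0: 1}, 0.2) != labels).mean())  # ~0.05, all of it 0 -> 1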
How noisy labels hurt
1. Models learn the noise along with the signal
2. Accuracy plateaus even with more data
3. Benchmarks become unreliable ranking tools (see the arithmetic after this list)
4. Real-world deployment reveals embarrassing failures
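The third point is simple arithmetic: if a fraction e of test labels are wrong, a perfect model is marked incorrect on exactly those examples, so its measured accuracy tops out near 1 - e. Using the ImageNet estimate quoted above:
e = 0.058  # estimated label-error rate in the test set
print(f'Accuracy ceiling for a perfect model: {1 - e:.3f}')  # 0.942
Score gaps much smaller than that noise floor are hard to trust as rankings.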
Using cleanlab to find mislabels
# Detecting likely label errors with confident learning
from cleanlab.filter import find_label_issues

# pred_probs: out-of-sample predicted probabilities, shape (N, K)
# labels: the given (possibly noisy) labels, shape (N,)
issues = find_label_issues(
    labels=labels,
    pred_probs=pred_probs,
    return_indices_ranked_by='self_confidence',  # worst offenders first
)
print(f'Likely mislabeled: {len(issues)} examples')
print('Top 10 suspects:', issues[:10])
The returned indices are review candidates, not guaranteed errors: they mark the examples whose given label the model most strongly disagrees with. For the statistics to work, pred_probs must come from held-out predictions (for example, cross-validation), not from a model that trained on these same labels.
Mitigations
- Double-labeling: two annotators per item, resolve disagreements
- Active learning: train model, find low-confidence predictions, re-label
- Noise-robust loss functions (e.g., symmetric cross-entropy; sketched after this list)
- Confident learning to flag statistical outliers
- Publish cleaned splits alongside originals so benchmarks improve over time
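For the loss-function route, here is a minimal PyTorch sketch of symmetric cross-entropy in the spirit of Wang et al. (2019). The function name is ours, and alpha, beta, and the clamp value are illustrative defaults rather than anything this lesson prescribes:
import torch
import torch.nn.functional as F

def symmetric_cross_entropy(logits, targets, num_classes, alpha=0.1, beta=1.0):
    # Standard cross-entropy: trusts the labels completely.
    ce = F.cross_entropy(logits, targets)
    # Reverse cross-entropy: swaps the roles of prediction and label.
    # Clamping the one-hot targets keeps log(0) finite, which bounds the
    # loss any single example can contribute.
    pred = F.softmax(logits, dim=1)
    one_hot = F.one_hot(targets, num_classes).float().clamp(min=1e-4)
    rce = -(pred * one_hot.log()).sum(dim=1).mean()
    return alpha * ce + beta * rce
Because the reverse term is bounded, a confidently mislabeled example cannot dominate the gradient the way it can under plain cross-entropy.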
The big idea: no label is sacred. Every dataset has errors. Building systems that can measure and tolerate label noise is a core skill in production ML.
