Anonymization and Why It Often Fails
Removing names does not make data anonymous. Combinations of a few seemingly innocent fields can re-identify nearly anyone.
What this lesson covers
1. The Illusion of Anonymity
2. Anonymization
3. Re-identification
4. Differential privacy
The Illusion of Anonymity
In 2006, Netflix released a supposedly anonymized dataset of 100 million movie ratings for a public prize competition. Researchers de-anonymized users by cross-referencing the ratings with public IMDb reviews. The same year, the AOL search-log leak exposed identifiable users from supposedly scrubbed queries. Anonymization is hard, and naive anonymization is almost always broken.
The Sweeney result
Latanya Sweeney showed that roughly 87% of the US population is uniquely identified by just three fields: 5-digit ZIP code, birth date, and sex. She used exactly this kind of linkage, joining "anonymized" hospital records against a public voter roll, to re-identify the medical record of Massachusetts Governor William Weld.
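A rough back-of-envelope calculation shows why so few fields suffice; every figure below is a loose approximation:

```python
# Back-of-envelope: why three mundane fields single people out.
# All figures are loose approximations, not exact counts.
zip_codes = 42_000          # roughly how many 5-digit ZIPs the US has
birth_dates = 365 * 80      # a birth date across ~80 plausible years
sexes = 2
buckets = zip_codes * birth_dates * sexes    # ~2.45 billion combinations
us_population = 330_000_000
print(f"{buckets / us_population:.1f}x more buckets than people")
```

With several times more buckets than people, and people spread very unevenly across them, most occupied buckets hold exactly one person.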
Why naive anonymization fails
- Quasi-identifiers combine: age + ZIP + job is often unique (see the sketch after this list)
- Auxiliary data: an attacker can cross-reference with public sources
- High-dimensional data (location traces, browsing history) is almost always unique
- Rare attributes (unusual diseases, rare job titles) trivially identify
- Linking attacks: merge two 'anonymous' datasets and identities emerge
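The first failure mode is easy to verify on any table: count how many records are unique on a quasi-identifier combination. A minimal sketch with toy data (the column names and values are invented for illustration):

```python
import pandas as pd

# Six toy records with three innocent-looking fields.
df = pd.DataFrame({
    "age": [34, 34, 51, 28, 34, 51],
    "zip": ["02139", "02139", "60614", "02139", "02140", "60614"],
    "job": ["nurse", "teacher", "pilot", "nurse", "nurse", "pilot"],
})

# How many records are unique on the quasi-identifier combination?
group_sizes = df.groupby(["age", "zip", "job"]).size()
unique_rows = (group_sizes == 1).sum()
print(f"{unique_rows} of {len(df)} records are unique on (age, zip, job)")
```

Even in this tiny table, four of six records are unique; in real high-dimensional data the fraction approaches one.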
Formal techniques
| Technique | Strength | Weakness |
|---|---|---|
| Pseudonymization | Simple | Weakest, easily reversed |
| k-anonymity | Each record's quasi-identifier values match at least k-1 other records | Vulnerable to homogeneity attacks |
| l-diversity | Adds variety in sensitive fields | Fails against skewed distributions |
| t-closeness | Sensitive distributions match the full dataset | Reduces utility substantially |
| Differential privacy | Mathematical guarantee | Adds noise, reduces accuracy |
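The k-anonymity row translates directly into a check: a table is k-anonymous for a set of quasi-identifiers when its smallest quasi-identifier group has at least k members. A sketch, with an invented k_of helper and toy columns, that also shows how generalizing raises k:

```python
import pandas as pd

def k_of(df, quasi_identifiers):
    """The k for which the table is k-anonymous: its smallest group size."""
    return int(df.groupby(quasi_identifiers).size().min())

df = pd.DataFrame({
    "age": [34, 36, 51, 53],
    "zip": ["02139", "02141", "60614", "60615"],
})
print(k_of(df, ["age", "zip"]))            # 1: every record is unique

# Generalize the quasi-identifiers and k rises.
df["zip3"] = df["zip"].str[:3]             # 5-digit ZIP -> 3-digit prefix
df["age_band"] = (df["age"] // 10) * 10    # exact age -> decade band
print(k_of(df, ["age_band", "zip3"]))      # 2: records now blend in pairs
```

The homogeneity weakness is visible here too: if both records in a group share the same sensitive value, k-anonymity hides identity but not the secret.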
Differential privacy: the gold standard
Differential privacy, formalized by Cynthia Dwork and colleagues in 2006, adds carefully calibrated noise so that the output of an analysis barely changes whether any one individual's data is included or not. Apple, Google, and the US Census Bureau all use differential privacy in production.
A minimal Laplace-mechanism example
```python
import numpy as np

def dp_count(true_count, epsilon=1.0):
    """Release a count under the Laplace mechanism.

    Noise scale = sensitivity / epsilon; a count has sensitivity 1.
    """
    noise = np.random.laplace(loc=0.0, scale=1.0 / epsilon)
    return true_count + noise

# 1000 people have condition X. DP releases a noisy count that is
# still useful in aggregate but hides any single individual.
noisy = dp_count(true_count=1000, epsilon=1.0)
print(f"Reported count: {noisy:.0f}")
```
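To see the "barely changes" claim concretely, compare releases from two neighboring datasets that differ by exactly one person. This sketch reuses the dp_count function defined above:

```python
# Neighboring datasets: identical except that one person is present
# (true count 1000) or absent (true count 999). Five releases of each:
print(sorted(round(dp_count(1000)) for _ in range(5)))
print(sorted(round(dp_count(999)) for _ in range(5)))
# Typical output: the two lists overlap heavily, so an observer cannot
# tell which dataset produced a given release. That is the guarantee.
```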
Practical guidance
1. Do not release raw data, even with names removed.
2. Aggregate to coarse categories (age ranges, not exact ages).
3. Generalize quasi-identifiers (a 5-digit ZIP becomes a 3-digit prefix), as in the sketch after this list.
4. Suppress rare combinations.
5. Use differential privacy for any computation on sensitive data.
6. Assume attackers have auxiliary data you don't know about.
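Steps 2 through 4 are mechanical enough to sketch. The column names, bin edges, and threshold k below are all illustrative assumptions, not a prescribed recipe:

```python
import pandas as pd

# Toy records; columns and values are invented for illustration.
df = pd.DataFrame({
    "age": [34, 29, 51, 34],
    "zip": ["02139", "02139", "60614", "02139"],
})

# Step 2: aggregate exact ages into coarse ranges.
df["age_range"] = pd.cut(df["age"], bins=[17, 30, 45, 65],
                         labels=["18-30", "31-45", "46-65"])

# Step 3: generalize the quasi-identifier (5-digit ZIP -> 3-digit prefix).
df["zip3"] = df["zip"].str[:3]

# Step 4: suppress rare combinations (groups smaller than k records).
k = 2
sizes = df.groupby(["age_range", "zip3"], observed=True)["age"].transform("size")
released = df.loc[sizes >= k, ["age_range", "zip3"]]
print(released)
```

Only the two records that blend into a group of at least k survive suppression; the singletons are withheld rather than released.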
The big idea: anonymization is harder than it looks. Formal techniques like differential privacy are the only reliable path. If you cannot afford formal guarantees, do not release the data.