Representation Bias: Who Is in the Data?
If your training data is 90 percent men, your model will likely work worse for women. Representation bias is one of the most pervasive problems in AI.
Creators · AI Foundations · ~19 min read
The Gender Shades Study
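In 2018, Joy Buolamwini and Timnit Gebru tested commercial gender classification systems (face analysis tools that label a face as male or female) from IBM, Microsoft, and Face++. Accuracy was nearly perfect for lighter-skinned men but dropped to about 65 percent for darker-skinned women in the worst-performing systems. The vendors' training data was not public, but the likely explanation was brutally simple: the face datasets commonly used to build and benchmark such systems were overwhelmingly lighter-skinned.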
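To see why a single headline number can hide a gap like this, here is a minimal sketch of the arithmetic. The group shares and accuracies below are hypothetical, chosen only to echo the 90 percent example above; they are not figures from the Gender Shades study.

```python
# Hypothetical test set: shares and per-group accuracies are made up
# for illustration and are NOT taken from the Gender Shades paper.
groups = {
    "majority": {"share": 0.90, "accuracy": 0.99},
    "minority": {"share": 0.10, "accuracy": 0.65},
}

# Overall accuracy is the share-weighted average of per-group accuracy.
overall = sum(g["share"] * g["accuracy"] for g in groups.values())

# The worst-group accuracy is what the minority group actually experiences.
worst = min(g["accuracy"] for g in groups.values())

print(f"overall accuracy:     {overall:.3f}")
print(f"worst-group accuracy: {worst:.3f}")
```

Under these assumed numbers the overall figure stays above 95 percent even though one group is misclassified about a third of the time, because that group contributes only a tenth of the test examples.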
Where representation bias hides
- Speech recognition: worse for non-native accents, African American English, and children
- Image classification: worse for non-Western contexts (a photo of a Nigerian wedding might be labeled ceremony rather than wedding)
- Medical AI: models trained mostly on white adult patients can underperform on darker skin or pediatric cases
- Language models: fluent in English and Chinese, clumsy in Swahili and Tagalog
Detecting representation bias
A quick representation audit
    import pandas as pd

    df = pd.read_csv('face_dataset.csv')

    # Check representation across demographic columns
    print(df['skin_tone'].value_counts(normalize=True))
    print(df['gender'].value_counts(normalize=True))
    print(df['age_group'].value_counts(normalize=True))

    # Cross-tab: are some combinations missing?
    print(pd.crosstab(df['skin_tone'], df['gender']))

    # Flag underrepresented groups: any skin tone below 5% of the data
    threshold = 0.05  # 5%
    shares = df['skin_tone'].value_counts(normalize=True)
    print('Underrepresented:', shares[shares < threshold])

Fixing it
1. Actively oversample from underrepresented groups during training
2. Use stratified sampling when collecting new data
3. Publicly report accuracy metrics per subgroup (not just overall)
4. Set a minimum accuracy floor before deployment (no subgroup below X%)
5. Invite audits by affected communities before release
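The first step in the list can be sketched with pandas. This is one simple way to oversample, not the only one: it duplicates randomly chosen rows from smaller groups until every group matches the largest. The function name `oversample_to_balance`, the `skin_tone` column, and the toy data are assumptions for illustration.

```python
import pandas as pd

def oversample_to_balance(df, group_col, random_state=0):
    """Pad each smaller group with resampled rows until it matches the largest group."""
    target = df[group_col].value_counts().max()
    parts = []
    for _, group in df.groupby(group_col):
        extra = target - len(group)
        if extra > 0:
            # Draw the missing rows with replacement from the same group.
            padding = group.sample(n=extra, replace=True, random_state=random_state)
            group = pd.concat([group, padding])
        parts.append(group)
    # Shuffle so duplicated rows are not clustered at the end.
    return (
        pd.concat(parts)
        .sample(frac=1, random_state=random_state)
        .reset_index(drop=True)
    )

# Toy training set (hypothetical): 8 lighter-skinned rows, 2 darker-skinned rows.
train = pd.DataFrame({
    "skin_tone": ["lighter"] * 8 + ["darker"] * 2,
    "label": [0, 1] * 5,
})

balanced = oversample_to_balance(train, "skin_tone")
print(balanced["skin_tone"].value_counts())
```

Oversampling only repeats examples the dataset already has, so it adds no new variety; collecting more data with stratified sampling (step 2) is usually the stronger fix. Apply it to the training split only, so the test set still contains unduplicated examples.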
The big idea: you cannot fix what you do not measure. Every serious ML deployment should report accuracy per group, not just an overall number that hides disparities.
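A per-group report can be a few lines of pandas. The sketch below assumes a results table with `y_true` and `y_pred` columns plus demographic columns; the helper name `accuracy_by_group`, the toy data, and the 90 percent floor are all hypothetical choices for illustration.

```python
import pandas as pd

def accuracy_by_group(df, group_cols, y_true="y_true", y_pred="y_pred"):
    """Return accuracy and sample count for every combination of group_cols."""
    correct = df[y_true] == df[y_pred]
    return (
        df.assign(correct=correct)
        .groupby(group_cols)["correct"]
        .agg(accuracy="mean", n="count")
    )

# Toy evaluation results (hypothetical): two rows per skin_tone x gender cell.
results = pd.DataFrame({
    "skin_tone": ["lighter"] * 4 + ["darker"] * 4,
    "gender": ["man", "man", "woman", "woman"] * 2,
    "y_true": [1, 0, 1, 0, 1, 0, 1, 0],
    "y_pred": [1, 0, 1, 0, 1, 0, 0, 1],
})

report = accuracy_by_group(results, ["skin_tone", "gender"])
overall = (results["y_true"] == results["y_pred"]).mean()

MIN_ACCURACY = 0.90  # hypothetical deployment floor
failing = report[report["accuracy"] < MIN_ACCURACY]

print(report)
print("Overall accuracy:", overall)
print("Subgroups below floor:", list(failing.index))
```

Grouping by two columns at once matters: a model can look acceptable by skin tone alone and by gender alone while still failing one intersection of the two. Reporting `n` alongside accuracy also shows when a subgroup is too small for its number to be trusted.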
