Underrepresented Groups: Building Inclusive Datasets
Small populations get hurt first when datasets are built carelessly. Fixing this requires intentional collection, not just better algorithms.
Section 1
The 1 Percent Problem
Imagine your training data is 99 percent able-bodied adults, and a model trained on it is deployed as an accessibility tool. It fails for people who use wheelchairs, screen readers, or sign language. The team was not malicious; the 1 percent was simply invisible in the data.
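A quick arithmetic sketch shows why this failure is easy to miss. The shares and accuracy values below are invented for illustration, not measurements: they assume a model that is perfect on the 99 percent and wrong on every case from the 1 percent.

```python
# Hypothetical shares of each group among the people the model serves.
majority_share = 0.99
minority_share = 0.01

# Assumed for illustration: perfect on the majority, wrong on every minority case.
majority_accuracy = 1.0
minority_accuracy = 0.0

# Overall accuracy is the share-weighted average of the per-group accuracies.
overall_accuracy = (
    majority_share * majority_accuracy + minority_share * minority_accuracy
)

print(f"Overall accuracy:  {overall_accuracy:.2%}")
print(f"Minority accuracy: {minority_accuracy:.2%}")
```

Under these assumptions the headline number still looks excellent, which is why a single aggregate score can hide a complete failure for a small group.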
Common groups that get underrepresented
- People with disabilities (blind, deaf, motor impairment)
- Non-English speakers and speakers of creoles/dialects
- Indigenous peoples and their languages
- Rural populations (data skews urban)
- Elderly adults (tech adoption skews younger)
- Children (extra privacy protections limit data collection)
- Trans and non-binary people (demographic forms often exclude them)
Why small groups stay small
1. Random sampling replicates population ratios, so a group that is 1 percent of the population ends up as roughly 1 percent of the sample
2. Self-selected online data over-samples tech-heavy demographics
3. Data collection platforms (such as Amazon Mechanical Turk) have their own demographic skews
4. Privacy rules (rightly) make minority data harder to collect
5. Researchers often lack connections to the communities they need data from
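One way to see these effects in a dataset is to compare each group's share of the rows with a reference share, such as a census figure. The sketch below is a minimal illustration: the helper `representation_gap`, the toy rows, and the reference shares are all hypothetical, and it assumes every reference share is greater than zero.

```python
import pandas as pd

def representation_gap(df, group_col, reference_shares):
    """Compare each group's share of the dataset with a reference share."""
    observed = df[group_col].value_counts(normalize=True)
    report = pd.DataFrame({"reference_share": pd.Series(reference_shares)})
    # Groups that never appear in the dataset get a share of 0.
    report["dataset_share"] = observed.reindex(report.index).fillna(0.0)
    # A ratio below 1 means the group is underrepresented relative to the reference.
    report["ratio"] = report["dataset_share"] / report["reference_share"]
    return report.sort_values("ratio")

# Toy data and reference shares, invented for illustration.
toy = pd.DataFrame({"demographic_group": ["urban"] * 90 + ["rural"] * 10})
reference = {"urban": 0.8, "rural": 0.2}

gap_report = representation_gap(toy, "demographic_group", reference)
print(gap_report)
```

Sorting by the ratio puts the most underrepresented groups first, which gives a starting list for targeted collection.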
Intentional inclusion
Oversampling to rebalance minorities
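import pandas as pd

df = pd.read_csv('training_data.csv')

# Oversample smaller groups so every group has the same number of rows
def rebalance(df, group_col, target_size=None):
    if target_size is None:
        # Default: bring every group up to the size of the largest group
        target_size = df[group_col].value_counts().max()
    balanced = []
    # groupby skips rows whose group value is missing (NaN)
    for _, subset in df.groupby(group_col):
        # Sample with replacement only when the group is smaller than the target,
        # so the largest group keeps its original rows instead of being resampled
        needs_replacement = len(subset) < target_size
        balanced.append(subset.sample(target_size, replace=needs_replacement, random_state=42))
    # Shuffle so rows from the same group are not contiguous
    return pd.concat(balanced).sample(frac=1, random_state=42)

balanced_df = rebalance(df, 'demographic_group')
print(balanced_df['demographic_group'].value_counts())
Beyond oversampling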
- Partner with community organizations for targeted data collection
- Pay data contributors from underrepresented groups fair wages
- Report metrics per subgroup in every benchmark
- Give affected communities a say in how the data is used
- Allow data contributors to withdraw their data if the use changes
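The per-subgroup reporting item above can be sketched in a few lines of pandas. This is an illustration under assumptions, not a standard benchmark harness: the helper `accuracy_by_group`, the column names `label` and `prediction`, and the toy results are hypothetical.

```python
import pandas as pd

def accuracy_by_group(results, group_col, label_col="label", pred_col="prediction"):
    """Return overall accuracy plus accuracy and row count for each group."""
    correct = results[label_col] == results[pred_col]
    per_group = (
        results.assign(correct=correct)
        .groupby(group_col)["correct"]
        .agg(accuracy="mean", n="count")
    )
    return correct.mean(), per_group

# Toy evaluation results, invented for illustration.
results = pd.DataFrame({
    "demographic_group": ["A"] * 8 + ["B"] * 2,
    "label":      [1, 0, 1, 1, 0, 1, 0, 1, 1, 0],
    "prediction": [1, 0, 1, 1, 0, 1, 0, 1, 0, 1],
})

overall, per_group = accuracy_by_group(results, "demographic_group")
print(f"Overall accuracy: {overall:.2f}")
print(per_group)
```

Reporting the row count next to each accuracy matters because a score computed on a handful of rows is noisy, and that is itself a signal that more data from that group is needed.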
The big idea: equal representation in data is not automatic. It takes deliberate effort, community relationships, and a willingness to reshape your sampling priorities. Inclusive data is always the result of intentional choices.