neural-forge.io

Underrepresented Groups: Building Inclusive Datasets

Small populations get hurt first when datasets are built carelessly. Fixing this requires intentional collection, not just better algorithms.

Creators · AI Foundations · ~18 min read

The 1 Percent Problem

Imagine your training data is 99 percent able-bodied adults, and a model trained on it is deployed to help with accessibility. It fails for people who use wheelchairs, screen readers, or sign language. The team was not malicious; the remaining 1 percent was simply invisible in the data.
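
A small worked calculation shows why this failure is easy to miss. The accuracy figures below are hypothetical, chosen only for illustration: when one group makes up 99 percent of the evaluation data, the overall score barely moves even if the model fails badly for the other 1 percent.

```python
# Hypothetical numbers chosen for illustration; they are not measured results.
majority_share, majority_accuracy = 0.99, 0.95
minority_share, minority_accuracy = 0.01, 0.40

# Overall accuracy is the share-weighted average of the per-group accuracies.
overall_accuracy = (majority_share * majority_accuracy
                    + minority_share * minority_accuracy)

print(f"Overall accuracy:  {overall_accuracy:.4f}")   # 0.9445
print(f"Minority accuracy: {minority_accuracy:.4f}")  # 0.4000
```

Under these assumed numbers the headline score stays above 94 percent while the minority group gets a model that is wrong more often than it is right.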

Common groups that get underrepresented

People with disabilities (for example, blindness, deafness, or motor impairments)
Non-English speakers and speakers of creoles/dialects
Indigenous peoples and their languages
Rural populations (data skews urban)
Elderly adults (tech adoption skews younger)
Children (extra privacy protections limit data collection)
Trans and non-binary people (demographic forms often exclude them)

Why small groups stay small

1. Random sampling replicates population ratios, so a group that is 1 percent of the population stays about 1 percent of the sample
2. Self-selected online data over-samples tech-heavy demographics
3. Data collection platforms (such as Amazon Mechanical Turk) have their own demographic skews
4. Privacy rules (rightly) make minority data harder to collect
5. Researchers often lack connections to the communities they need to reach
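
The first point in the list above can be sketched in a few lines of plain Python. The population, group labels, and quota here are hypothetical. The sketch contrasts a simple random sample, which inherits the population's ratio, with a stratified sample where the collector sets a fixed quota per group.

```python
import random
from collections import Counter

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical population: 1 percent of people belong to group "B".
population = ["B"] * 1_000 + ["A"] * 99_000

# Simple random sampling: every person is equally likely to be drawn,
# so the sample inherits roughly the population's 99:1 ratio.
random_sample = rng.sample(population, 1_000)
print("random:", Counter(random_sample))

# Stratified sampling with a fixed quota per group: the ratio is set
# by the collector, not by the population.
quota = 500
strata = {
    "A": [p for p in population if p == "A"],
    "B": [p for p in population if p == "B"],
}
stratified_sample = [
    p for members in strata.values() for p in rng.sample(members, quota)
]
print("stratified:", Counter(stratified_sample))
```

The stratified version only works if enough members of the smaller group can actually be reached, which is where the remaining points in the list become the real bottleneck.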

Intentional inclusion

Oversampling to rebalance minorities

python

import pandas as pd

df = pd.read_csv('training_data.csv')

# Oversample minority groups to equal representation
def rebalance(df, group_col, target_size=None):
    # Default target: the size of the largest group
    if target_size is None:
        target_size = df[group_col].value_counts().max()
    balanced = []
    for g in df[group_col].unique():
        subset = df[df[group_col] == g]
        # Sample with replacement only when the group is smaller than the target,
        # so groups that are already large enough are not duplicated
        needs_replacement = len(subset) < target_size
        balanced.append(
            subset.sample(target_size, replace=needs_replacement, random_state=42)
        )
    # Shuffle so rows from the same group are not contiguous
    return pd.concat(balanced).sample(frac=1, random_state=42).reset_index(drop=True)

balanced_df = rebalance(df, 'demographic_group')
print(balanced_df['demographic_group'].value_counts())
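
One caveat when using an oversampling step like the one above: it duplicates existing rows, so if it runs before the train/test split, copies of the same row can land on both sides and inflate test scores. The following is a minimal, dependency-free sketch of the safer order (split first, then oversample only the training split). The toy records and the 80/20 split are assumptions made for illustration.

```python
import random

rng = random.Random(42)

# Hypothetical toy records: (row_id, group). Group "B" is the minority.
rows = [(i, "A") for i in range(90)] + [(i, "B") for i in range(90, 100)]

# 1. Split first, so no original row can land on both sides.
rng.shuffle(rows)
test_rows, train_rows = rows[:20], rows[20:]

# 2. Oversample inside the training split only.
by_group = {}
for row in train_rows:
    by_group.setdefault(row[1], []).append(row)
target = max(len(members) for members in by_group.values())

balanced_train = []
for members in by_group.values():
    balanced_train.extend(members)
    # Top up smaller groups by drawing extra copies with replacement.
    balanced_train.extend(rng.choices(members, k=target - len(members)))

# The test split keeps its natural distribution and shares no rows with training.
overlap = {r[0] for r in balanced_train} & {r[0] for r in test_rows}
print("rows shared between train and test:", len(overlap))
```

A plain random split can still leave very few minority rows in the test set, so a split stratified by group may be worth considering when the smallest group is tiny.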

Beyond oversampling

Partner with community organizations for targeted data collection
Pay data contributors from underrepresented groups fair wages
Report metrics per subgroup in every benchmark
Give affected communities a say in how the data is used
Allow data contributors to withdraw their data if the use changes
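
Reporting metrics per subgroup, as suggested in the list above, can be as simple as grouping evaluation results before averaging. The sketch below uses only the standard library. The records, the group names, and the `accuracy_by_group` helper are hypothetical, not part of any particular benchmark.

```python
from collections import defaultdict

# Hypothetical evaluation records: (group, true_label, predicted_label).
records = [
    ("urban", 1, 1), ("urban", 0, 0), ("urban", 1, 1), ("urban", 0, 1),
    ("rural", 1, 0), ("rural", 0, 0),
]

def accuracy_by_group(records):
    """Return {group: (n_examples, accuracy)} for each subgroup."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for group, y_true, y_pred in records:
        total[group] += 1
        correct[group] += int(y_true == y_pred)
    return {g: (total[g], correct[g] / total[g]) for g in total}

report = accuracy_by_group(records)
for group, (n, acc) in sorted(report.items()):
    print(f"{group:<6} n={n:<3} accuracy={acc:.2f}")
```

Printing the example count next to each score matters: an accuracy computed from a handful of examples is a noisy estimate, and a tiny `n` is itself a sign that the group is underrepresented in the benchmark.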

The big idea: equal representation in data is not automatic. It takes deliberate effort, community relationships, and willingness to reshape your sampling priorities. Inclusive data is always the result of intentional choices.
