Lesson 289 of 2116
Language Bias: Why English Dominates AI
Native English speakers are roughly 6 percent of the world's population, yet English makes up 50+ percent of AI training data. This asymmetry shapes every model we use.
Lesson map
What this lesson covers
Learning path
The main moves in order
1. The English-first internet
2. Language bias
3. Low-resource languages
4. Multilingual models
Section 1
The English-First Internet
About 1.5 billion people speak English, roughly 20 percent of humanity (including second-language speakers). Yet over half of the internet's content is in English. When models are trained on the web, they inherit this imbalance and amplify it.
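To make the imbalance concrete, here is a minimal sketch of how you might measure language shares in a crawl sample. The data and function names are hypothetical; real pipelines attach language tags with a classifier, but the bookkeeping looks much like this:

```python
from collections import Counter

# Hypothetical sample: (doc_id, language tag) pairs, as a web crawl's
# metadata might provide. The skew mirrors the lesson's point:
# English dominates even though its speakers do not.
crawl_sample = [
    ("d1", "en"), ("d2", "en"), ("d3", "en"), ("d4", "en"), ("d5", "en"),
    ("d6", "zh"), ("d7", "es"), ("d8", "ru"), ("d9", "en"), ("d10", "sw"),
]

def language_shares(docs):
    """Return each language's fraction of the corpus, largest first."""
    counts = Counter(lang for _, lang in docs)
    total = sum(counts.values())
    return {lang: n / total for lang, n in counts.most_common()}

shares = language_shares(crawl_sample)
print(shares)  # English takes a 0.6 share of this toy sample
```

Training on such a corpus without reweighting hands English a majority of every gradient update, which is exactly how the web's imbalance becomes the model's.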
Three tiers of languages in AI
Compare the options
| Tier | Example languages | Model support |
|---|---|---|
| Well-resourced | English, Chinese, Spanish | Full fluency, billions of tokens |
| Medium-resourced | Arabic, Russian, Portuguese | Decent but uneven |
| Low-resourced | Swahili, Yoruba, Quechua | Struggles, makes errors |
| No-resourced | Most of the world's 7,000+ languages | Essentially no support |
Where language bias shows up
- Factual errors: confidently wrong about non-English topics
- Grammar errors: trips over grammatical gender and tone
- Code-switching: struggles when speakers mix languages, as people naturally do
- Translation drift: translations pivot through English, losing nuance
- Reasoning degradation: the same prompt in a low-resource language gets a worse answer
The multilingual paradox
Researchers have found that models often reason better in English even when answering in another language: internally they translate to English, reason, and translate back, losing fidelity at each step. The result is a two-tier system in which English queries get better answers, even when the user is working through a native-language interface.
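The nuance loss from pivoting through English can be shown with a toy pivot translator. The dictionaries and function below are invented for illustration; real systems use neural models, but the failure mode is the same: when two source words collapse into one English word, the distinction cannot be recovered on the way out.

```python
# Toy bilingual dictionaries (hypothetical, for illustration only).
FR_TO_EN = {"fleuve": "river", "rivière": "river", "maison": "house"}
EN_TO_ES = {"river": "río", "house": "casa"}

def pivot_translate(word_fr):
    """Translate French -> Spanish by pivoting through English."""
    english = FR_TO_EN[word_fr]   # step 1: collapse into English
    return EN_TO_ES[english]      # step 2: expand from English

# French distinguishes a river that flows to the sea (fleuve) from one
# that flows into another river (rivière). English has only "river",
# so after the pivot both come out as the same Spanish word.
print(pivot_translate("fleuve"))   # río
print(pivot_translate("rivière"))  # río
```

A direct French-to-Spanish system could preserve the distinction; the pivot guarantees it is lost, which is why English-centric routing quietly degrades quality between non-English pairs.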
Efforts that matter
- Mozilla Common Voice: 100+ languages, community-collected speech
- Masakhane: African NLP with datasets for 40+ languages
- NLLB (Meta's No Language Left Behind): translation for 200 languages
- FineWeb-2: multilingual version of FineWeb released in 2024
- Aya: collaborative multilingual instruction dataset from Cohere
The big idea: AI is not language-neutral. Which languages have data determines which cultures thrive in the AI era. The future of linguistic diversity depends on where the data flows.
End-of-lesson quiz
Check what stuck
15 questions · Score saves to your progress.
Related lessons
Keep going
Creators · 30 min
Debate Prep: Researching Both Sides Fast
Debate rewards knowing the other side's best argument better than they do. AI is built for exactly this kind of fast, balanced research.
Creators · 35 min
Running a Literature Review With AI
AI turns weeks of literature review into days — if you know how to use it. Here is a workflow that actually works.
Creators · 30 min
Citing AI-Assisted Work Honestly
The norms for disclosing AI use in research are still being written. Here is the emerging consensus and how to stay on the right side of it.
