Big Data vs. Good Data: The Tradeoff
The old mantra was more data always wins. The new reality is more complicated. Sometimes a small, hand-crafted dataset beats a giant messy one.
Lesson map
What this lesson covers
Learning path
The main moves in order
1. The Classic Big Data Argument
2. Data quality
3. Data-centric AI
4. Scaling
Section 1
The Classic Big Data Argument
For years, the philosophy most closely associated with Google was simple: more data wins. In "The Unreasonable Effectiveness of Data" (2009), Alon Halevy, Peter Norvig, and Fernando Pereira argued that simple algorithms trained on massive datasets outperform more elaborate algorithms trained on small ones. This idea helped power the first wave of deep learning.
The counter-movement: data-centric AI
Andrew Ng popularized the term data-centric AI around 2021. His argument: instead of endlessly tweaking the model, spend your effort improving the data. For many real-world problems, fixing 100 mislabeled examples can yield a bigger accuracy boost than tripling the dataset size.
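One way to make "improve the data" concrete is to look for labels that disagree with their neighbors and send those for human review. The sketch below is a toy illustration, not Ng's method: the function name `flag_suspect_labels`, the single numeric feature, and the `toy` dataset are all hypothetical, and a real project would use richer features and a proper review step.

```python
from collections import Counter


def flag_suspect_labels(points, k=3):
    """Return indices whose label disagrees with the majority label
    of their k nearest neighbours (distance on a single feature)."""
    suspects = []
    for i, (x_i, label_i) in enumerate(points):
        # Distance from point i to every other point, paired with that point's label.
        others = [
            (abs(x_i - x_j), label_j)
            for j, (x_j, label_j) in enumerate(points)
            if j != i
        ]
        others.sort(key=lambda pair: pair[0])
        neighbour_labels = [label for _, label in others[:k]]
        majority_label, _ = Counter(neighbour_labels).most_common(1)[0]
        if majority_label != label_i:
            suspects.append(i)
    return suspects


# Hypothetical toy data: low feature values are "cat", high values are "dog".
# Index 2 sits among the cats but is labeled "dog".
toy = [
    (0.1, "cat"), (0.2, "cat"), (0.3, "dog"), (0.4, "cat"),
    (2.0, "dog"), (2.1, "dog"), (2.2, "dog"), (2.3, "dog"),
]

print(flag_suspect_labels(toy))  # → [2]
```

A flagged index is only a candidate for relabeling. The point of the exercise is that a short review list like this is often cheaper to act on than collecting a much larger dataset.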
Evidence from modern models
Compare the options
| Model or dataset | Parameters | Training tokens | Data quality approach |
|---|---|---|---|
| GPT-3 (2020) | 175B | 300B | Light filtering |
| Chinchilla (2022) | 70B | 1.4T | Heavy filtering (DeepMind) |
| Phi-3-mini (2024) | 3.8B | 3.3T | Heavily filtered web data plus synthetic textbook-style data |
| FineWeb-Edu (2024) | n/a (dataset, not a model) | 1.3T | Education-focused filtering |
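A quick way to read the table is to divide training tokens by parameters for each model. The snippet below is a small arithmetic sketch that uses only the figures from the table; the helper name `tokens_per_param` is introduced here for illustration, and FineWeb-Edu is left out because it is a dataset with no parameter count.

```python
# Figures copied from the table above: (name, parameters, training tokens).
models = [
    ("GPT-3", 175e9, 300e9),
    ("Chinchilla", 70e9, 1.4e12),
    ("Phi-3", 3.8e9, 3.3e12),
]


def tokens_per_param(params, tokens):
    """Training tokens seen per model parameter."""
    return tokens / params


ratios = {name: round(tokens_per_param(p, t), 1) for name, p, t in models}

for name, ratio in ratios.items():
    print(f"{name}: {ratio} tokens per parameter")
# GPT-3: 1.7 tokens per parameter
# Chinchilla: 20.0 tokens per parameter
# Phi-3: 868.4 tokens per parameter
```

The ratio climbs from under 2 to over 800 across the three rows: the smaller models in the table were trained on far more tokens per parameter, and the later ones on more carefully selected tokens.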
When big still wins
- Broad world knowledge (need coverage of every topic)
- Rare languages and dialects
- Long-tail facts and entities
- Visual recognition in novel environments
When good wins
- Classification with clear labels (medical imaging)
- Reasoning with step-by-step solutions
- Instruction-following (a few thousand great examples beat millions of scraped Q&A pairs)
- Domain-specific tasks (legal, financial)
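To see what turning scraped Q&A pairs into a smaller, cleaner set might look like, here is a minimal sketch of a heuristic filter. Everything in it is an assumption made for illustration: the function name `filter_pairs`, the five-word answer threshold, and the sample `scraped` list. Real curation pipelines rely on much stronger signals, including human review.

```python
def filter_pairs(pairs, min_answer_words=5):
    """Keep Q&A pairs with a substantive answer and a question not seen before.

    Questions are compared after lowercasing and collapsing whitespace,
    so trivially different copies count as duplicates.
    """
    seen_questions = set()
    kept = []
    for question, answer in pairs:
        # Drop answers too short to be useful (hypothetical threshold).
        if len(answer.split()) < min_answer_words:
            continue
        normalized = " ".join(question.lower().split())
        # Drop repeated questions; the first acceptable copy wins.
        if normalized in seen_questions:
            continue
        seen_questions.add(normalized)
        kept.append((question, answer))
    return kept


# Hypothetical scraped examples.
scraped = [
    ("What is overfitting?",
     "Overfitting is when a model memorizes training data and fails to generalize."),
    ("what is  overfitting?",
     "It is when the model memorizes the training set instead of learning patterns."),
    ("What is a token?", "idk"),
    ("What is a learning rate?",
     "The learning rate controls how large each parameter update step is."),
]

curated = filter_pairs(scraped)
print(len(scraped), "->", len(curated))  # → 4 -> 2
```

The filter throws away half of this toy set, which is the trade the list above describes: fewer examples, each one more likely to teach the model something.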
The big idea: the size-versus-quality debate is over. The answer is both, applied at different stages: scale for broad coverage, curation for targeted skills. The skill is knowing which lever to pull and when.