The old mantra was more data always wins. The new reality is more complicated. Sometimes a small, hand-crafted dataset beats a giant messy one.
For years, the Google philosophy was simple: more data wins. Peter Norvig and his co-authors famously argued in "The Unreasonable Effectiveness of Data" (2009) that simple algorithms trained on massive datasets outperform complex algorithms trained on small data. This thinking helped power the first wave of deep learning.
Andrew Ng popularized the term data-centric AI around 2021. His argument: instead of endlessly tweaking the model, spend your effort improving the data. For many real-world problems, fixing 100 mislabeled examples can yield a bigger accuracy boost than tripling the dataset size.
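One small, concrete way to "improve the data" is to look for identical inputs that carry conflicting labels, since at least one of them must be mislabeled. The sketch below is only an illustration of that idea: the function name `find_label_conflicts` and the toy sentiment examples are hypothetical, not taken from any particular library or from Ng's own tooling.

```python
from collections import defaultdict

def find_label_conflicts(examples):
    """Return {normalized_text: sorted labels} for texts with more than one label."""
    labels_by_text = defaultdict(set)
    for text, label in examples:
        # Normalize lightly so case and whitespace differences still match.
        key = " ".join(text.lower().split())
        labels_by_text[key].add(label)
    return {t: sorted(ls) for t, ls in labels_by_text.items() if len(ls) > 1}

# Toy, made-up sentiment data for illustration only.
examples = [
    ("Great battery life", "positive"),
    ("great battery  life", "negative"),  # same text, conflicting label
    ("Screen cracked in a week", "negative"),
    ("Screen cracked in a week", "negative"),
]

conflicts = find_label_conflicts(examples)
print(conflicts)
```

Each flagged text is a candidate for human review. This check only catches exact duplicates after normalization; it will not find mislabeled examples that are unique.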
| Model or dataset | Parameters | Training tokens | Data quality approach |
|---|---|---|---|
| GPT-3 (2020) | 175B | 300B | Light filtering |
| Chinchilla (DeepMind, 2022) | 70B | 1.4T | Heavy filtering |
| Phi-3 (2024) | 3.8B | 3.3T | Heavy synthetic textbook data |
| FineWeb-Edu (2024) | - | 1.3T | Education-focused filtering |
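
To make the table easier to compare, it can help to work out how many training tokens each model saw per parameter. The sketch below uses only the parameter and token figures from the table; the helper name `tokens_per_param` is made up for illustration. FineWeb-Edu is left out because it is a dataset and has no parameter count.

```python
# Figures copied from the table above: (parameters, training tokens).
MODELS = {
    "GPT-3 (2020)": (175e9, 300e9),
    "Chinchilla (2022)": (70e9, 1.4e12),
    "Phi-3 (2024)": (3.8e9, 3.3e12),
}

def tokens_per_param(params: float, tokens: float) -> float:
    """Training tokens seen per model parameter."""
    return tokens / params

ratios = {name: tokens_per_param(p, t) for name, (p, t) in MODELS.items()}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:,.1f} tokens per parameter")
```

With these figures the ratios come out to roughly 1.7 for GPT-3, 20 for Chinchilla, and 868 for Phi-3. The ratio measures quantity only; it says nothing about how carefully the tokens were filtered, which is what the last column of the table describes.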
The big idea: the size-versus-quality debate is over. The answer is both, used at different stages. The skill is knowing which lever to pull and when.
8 questions · take it digitally for instant feedback at tendril.neural-forge.io/learn/quiz/end-data-big-vs-good
What is the main idea of "Big Data vs. Good Data: The Tradeoff"?
Which concept is most central to "Big Data vs. Good Data: The Tradeoff"?
Which use of AI fits this topic best?
What should a careful learner remember about "Phi's surprising result"?
You want to use AI after this lesson. What is the safest next step?
How should AI output about data quality be treated?
Name one way to verify an AI answer about data quality.
Which action would help you apply "Big Data vs. Good Data: The Tradeoff" responsibly?