The old mantra was more data always wins. The new reality is more complicated. Sometimes a small, hand-crafted dataset beats a giant messy one.
For years, the Google philosophy was simple: more data wins. In the 2009 essay The Unreasonable Effectiveness of Data, Alon Halevy, Peter Norvig, and Fernando Pereira argued that simple algorithms trained on massive datasets outperform more elaborate algorithms trained on small data. That thinking helped power the first wave of deep learning.
Andrew Ng popularized the term data-centric AI around 2021. His argument: instead of endlessly tweaking the model, spend your effort improving the data. For many real-world problems, fixing 100 mislabeled examples can yield a bigger accuracy boost than tripling the dataset size.
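One low-cost way to start improving the data is to look for identical inputs that carry different labels. The sketch below is a minimal illustration, not a method taken from Ng's work: the function name `find_label_conflicts`, the normalization rule (lowercase, collapsed whitespace), and the sample reviews are all hypothetical.

```python
from collections import defaultdict


def find_label_conflicts(examples):
    """Return normalized texts that appear with more than one label.

    `examples` is an iterable of (text, label) pairs. The normalization
    used here (lowercase, collapsed whitespace) is an assumption made
    for this sketch; real projects usually need task-specific rules.
    """
    labels_by_text = defaultdict(set)
    for text, label in examples:
        key = " ".join(text.lower().split())
        labels_by_text[key].add(label)
    # Keep only the texts whose copies disagree on the label.
    return {
        text: sorted(labels)
        for text, labels in labels_by_text.items()
        if len(labels) > 1
    }


# Hypothetical sentiment data: the first two rows are the same review
# with different casing and spacing, but contradictory labels.
examples = [
    ("Great battery life", "positive"),
    ("great  battery life", "negative"),
    ("Screen cracked in a week", "negative"),
    ("Arrived on time", "positive"),
]

conflicts = find_label_conflicts(examples)
print(conflicts)
```

Rows flagged this way are candidates for manual review rather than confirmed errors. The check only catches contradictions among near-identical inputs, so it is a starting point for a label audit, not a complete one.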
| Model or dataset | Parameters | Training tokens | Data quality approach |
|---|---|---|---|
| GPT-3 (2020) | 175B | 300B | Light filtering |
| Chinchilla (2022) | 70B | 1.4T | Heavy filtering (DeepMind) |
| Phi-3 (2024) | 3.8B | 3.3T | Heavy synthetic textbook data |
| FineWeb-Edu (2024) | n/a (dataset) | 1.3T | Education-focused filtering |
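A quick way to read the table is to divide training tokens by parameters. The sketch below does that arithmetic using only the figures from the table. The names `MODELS`, `RATIOS`, and `tokens_per_parameter` are illustrative, and FineWeb-Edu is left out because it is a dataset with no parameter count.

```python
# (parameters, training tokens) copied from the table above.
MODELS = {
    "GPT-3 (2020)": (175e9, 300e9),
    "Chinchilla (2022)": (70e9, 1.4e12),
    "Phi-3 (2024)": (3.8e9, 3.3e12),
}


def tokens_per_parameter(params: float, tokens: float) -> float:
    """Return how many training tokens the model saw per parameter."""
    return tokens / params


RATIOS = {
    name: tokens_per_parameter(params, tokens)
    for name, (params, tokens) in MODELS.items()
}

for name, ratio in RATIOS.items():
    print(f"{name}: {ratio:,.1f} tokens per parameter")
```

By these numbers GPT-3 saw fewer than 2 tokens per parameter, Chinchilla about 20, and Phi-3 well over 800. The ratio says nothing about quality on its own, but it shows how far the balance between model size and data volume has shifted across the rows.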
The big idea: the size-versus-quality debate is over. The answer is both, used at different stages. The skill is knowing which lever to pull and when.
15 questions · take it digitally for instant feedback at tendril.neural-forge.io/learn/quiz/end-data-big-vs-good
What is the core idea behind "Big Data vs. Good Data: The Tradeoff"?
Which term best describes a foundational idea in "Big Data vs. Good Data: The Tradeoff"?
A learner studying Big Data vs. Good Data: The Tradeoff would need to understand which concept?
Which of these is directly relevant to Big Data vs. Good Data: The Tradeoff?
Which of the following is a key point about Big Data vs. Good Data: The Tradeoff?
Which of these does NOT belong in a discussion of Big Data vs. Good Data: The Tradeoff?
Which statement is accurate regarding Big Data vs. Good Data: The Tradeoff?
Which of these does NOT belong in a discussion of Big Data vs. Good Data: The Tradeoff?
What is the key insight about "Phi's surprising result" in the context of Big Data vs. Good Data: The Tradeoff?
What is the recommended tip about "Ground your practice in fundamentals" in the context of Big Data vs. Good Data: The Tradeoff?
Which statement accurately describes an aspect of Big Data vs. Good Data: The Tradeoff?
What does working with Big Data vs. Good Data: The Tradeoff typically involve?
Which of the following is true about Big Data vs. Good Data: The Tradeoff?
Which best describes the scope of "Big Data vs. Good Data: The Tradeoff"?
Which section heading best belongs in a lesson about Big Data vs. Good Data: The Tradeoff?