Synthetic Data: When AI Trains on AI
Real data is expensive, private, or scarce. Synthetic data, generated by models themselves, is rapidly becoming as important as scraped data.
Lesson map
What this lesson covers
Learning path
The main moves in order
1. Making Data From Thin Air
2. Synthetic data
3. Model collapse
4. Distillation
Section 1
Making Data From Thin Air
Synthetic data is information generated by an algorithm rather than collected from the real world. In modern AI, this usually means using a large model to generate examples that train another model. Microsoft's Phi series famously used GPT-4 to generate textbook-quality training data.
Why labs want synthetic data
- Real data for rare scenarios (medical emergencies, edge cases) is scarce
- Scraped data carries legal and privacy risk
- Controlled generation lets you balance for underrepresented groups
- You can generate exactly the format you need (step-by-step reasoning, tool calls); see the sketch after this list
- It is often cheaper than paying humans to create examples
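To make the last two points concrete, here is a minimal, hypothetical sketch of a controlled generation pipeline. The `teacher_model`, topic list, and prompt are illustrative stand-ins, not any specific lab's API:

```python
# Hypothetical sketch of controlled synthetic-data generation.
# `teacher_model.generate` stands in for any large-model API call.
import json
import random

TOPICS = ["triage decisions", "drug interactions", "rare allergies"]  # underrepresented cases

PROMPT = (
    "Write one training example about {topic}. "
    "Respond as JSON with keys 'question', 'reasoning', 'answer'."
)

def make_synthetic_dataset(teacher_model, n_examples=1000):
    examples = []
    for _ in range(n_examples):
        topic = random.choice(TOPICS)      # balance rare scenarios on purpose
        raw = teacher_model.generate(PROMPT.format(topic=topic))
        try:
            example = json.loads(raw)      # enforce exactly the format we need
        except json.JSONDecodeError:
            continue                       # discard malformed generations
        examples.append(example)
    return examples
```

Notice that the pipeline controls both the topic mix and the output format, two things you rarely get from scraped data.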
Real-world examples
How different systems use it
| System | Use of synthetic data |
|---|---|
| Phi-3 (Microsoft) | Textbook-quality synthetic training examples |
| Llama 3.1 | Synthetic tool-use examples for agent capabilities |
| AlphaGo Zero | 100% self-play, no human games at all |
| Waymo self-driving | Simulator-generated driving scenarios |
| Medical AI | Synthetic patient records when real ones are HIPAA-protected |
The catch: model collapse
As AI-generated text floods the web, models trained on fresh scrapes increasingly learn from the outputs of earlier models. Repeated over generations, this feedback loop erodes the rare, tail-end examples in the data until quality degrades, a failure mode known as model collapse. This is why labs now care deeply about dating their scrapes (ideally before ChatGPT's late-2022 release) and about carefully mixing synthetic data with fresh human data.
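One way to build intuition is a toy simulation: fit a simple statistical model to data, sample from the fit, refit on those samples, and repeat. The Gaussian below is only a stand-in for a real model, but its shrinking variance mirrors how the tails of a distribution vanish over generations:

```python
# Toy model-collapse simulation: each generation "trains" (fits a Gaussian)
# on data sampled from the previous generation's model.
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=0.0, scale=1.0, size=200)  # generation 0: "human" data

for generation in range(1, 11):
    mu, sigma = data.mean(), data.std()       # fit a model to the current data
    data = rng.normal(mu, sigma, size=200)    # next generation trains on model output
    print(f"gen {generation:2d}: fitted sigma = {sigma:.3f}")
# sigma tends to drift downward: rare, tail-end examples gradually disappear.
```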
Distillation: the safest synthetic recipe
Distillation is a disciplined form of synthetic data. Use a big, smart teacher model to generate high-quality outputs. Use those outputs to train a smaller student model. This is how many efficient models (Claude Haiku, Gemini Flash, Llama 3.2-1B) get their intelligence.
A toy distillation pattern
```python
# Simplified distillation loop (pseudocode: the loader and the two
# models are placeholders, not a real training API)
questions = load_hard_questions()
for q in questions:
    # Teacher (large model) generates rich reasoning
    teacher_answer = big_model.generate(q, reasoning=True)
    # Student learns to mimic the teacher's output
    student_loss = student_model.train_step(
        input=q,
        target=teacher_answer,
    )
```
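In practice, distillation often matches the teacher's full output distribution rather than just its final text. One common formulation (a minimal sketch assuming PyTorch; the temperature and scaling follow the classic Hinton et al. recipe) softens both models' logits and minimizes the KL divergence between them:

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    # Soften both distributions so the student sees the teacher's
    # "dark knowledge": which wrong answers the teacher finds plausible.
    t = temperature
    log_student = F.log_softmax(student_logits / t, dim=-1)
    soft_teacher = F.softmax(teacher_logits / t, dim=-1)
    # kl_div expects log-probabilities as input and probabilities as target;
    # scaling by t^2 keeps gradient magnitudes stable across temperatures.
    return F.kl_div(log_student, soft_teacher, reduction="batchmean") * (t ** 2)
```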
The big idea: synthetic data is not cheating; it is compression. The best synthetic data pipelines use large models to distill their knowledge into smaller, faster ones, and they carefully guard against collapse.
