Distillation: Making Big Models Cheap
How to compress a large model's behavior into a smaller, cheaper one.
Lesson map
What this lesson covers, in order:
1. The premise
2. Distillation
3. Teacher-student
4. Fine-tuning
Section 1
The premise
Distillation uses a large 'teacher' model to generate training data for a smaller 'student' model that approximates the teacher's behavior on a specific task at a fraction of the cost.
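Here is a minimal sketch of that pipeline in Python. It assumes a hypothetical `call_model` helper standing in for whatever inference API you use; the model names, prompt template, and output file format are illustrative placeholders, not part of the lesson.

```python
# Minimal distillation sketch: the teacher labels task-specific inputs,
# and the resulting prompt/completion pairs become the student's
# fine-tuning set. `call_model`, the model names, and the prompt
# template below are hypothetical placeholders.

import json

def call_model(model_name: str, prompt: str) -> str:
    """Hypothetical wrapper around your inference API."""
    raise NotImplementedError

TEACHER = "large-teacher-model"   # placeholder, not a real model ID
STUDENT = "small-student-model"   # placeholder, not a real model ID

TASK_PROMPT = "Classify the sentiment of this review as positive or negative:\n{text}"

def build_distillation_set(inputs: list[str]) -> list[dict]:
    """Have the teacher label each input; collect prompt/completion pairs."""
    examples = []
    for text in inputs:
        prompt = TASK_PROMPT.format(text=text)
        label = call_model(TEACHER, prompt)
        examples.append({"prompt": prompt, "completion": label})
    return examples

def save_for_finetuning(examples: list[dict], path: str) -> None:
    """Write JSONL pairs, the format most fine-tuning endpoints accept."""
    with open(path, "w") as f:
        for ex in examples:
            f.write(json.dumps(ex) + "\n")

# The student model is then fine-tuned on this file with your provider's
# fine-tuning endpoint and served in place of the teacher for this one task.
```

The key design point is that the distillation set defines the student's scope: it only sees the teacher's behavior on this task's inputs, which is exactly why the limits listed later apply.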
What AI does well here
- Cutting per-call cost 5-20x for narrow, well-defined tasks (see the rough cost sketch after this list)
- Reducing latency to enable real-time use cases
- Running on cheaper hardware or even on-device
- Capturing 80-95% of teacher quality for many specific tasks
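To make the first point concrete, here is a back-of-envelope cost comparison. The per-token prices, token counts, and call volumes are invented placeholders for illustration, not quoted rates from any provider.

```python
# Back-of-envelope cost comparison with made-up prices (hypothetical numbers,
# purely illustrative; real prices vary by provider and model).
teacher_price_per_1k = 0.010   # hypothetical $ per 1K tokens
student_price_per_1k = 0.001   # hypothetical $ per 1K tokens
tokens_per_call = 500
calls_per_day = 100_000

def daily_cost(price_per_1k: float) -> float:
    return price_per_1k * tokens_per_call / 1000 * calls_per_day

print(f"teacher: ${daily_cost(teacher_price_per_1k):,.0f}/day")   # $500/day
print(f"student: ${daily_cost(student_price_per_1k):,.0f}/day")   # $50/day
# With these placeholder numbers the student is 10x cheaper, the kind of gap
# that makes the re-distillation overhead worthwhile for high-volume tasks.
```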
What AI cannot do
- Match the teacher on tasks outside the distillation set
- Update easily as the teacher improves — re-distillation is needed
- Replace the teacher for novel or open-ended tasks
Related lessons
- Distillation Tradeoffs: When Smaller Models Quietly Lose (11 min): Distilled models look great on aggregate evals but quietly lose long-tail capabilities — the tradeoff matrix matters for production decisions.
- Fine-Tuning vs Prompting vs RAG: Choosing the Right Tool (11 min): When to fine-tune, when to prompt-engineer, and when to retrieve.
- Transfer Learning (35 min): Models trained on one task can often do many others. Understanding why is one of the deepest lessons in modern ML.
