How to compress a large model's behavior into a smaller, cheaper one.
11 min · Reviewed 2026
The premise
Distillation uses a large 'teacher' model to generate training data for a smaller 'student' model that approximates the teacher's behavior on a specific task at a fraction of the cost.
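To make the workflow concrete, here is a minimal sketch of the first step: generating input/output pairs from the teacher. It assumes a hypothetical call_teacher wrapper around whatever frontier-model API you use, and JSONL is just one convenient storage format; neither is prescribed by the technique.

    import json

    def call_teacher(prompt: str) -> str:
        """Hypothetical stand-in: send the prompt to the large teacher
        model and return its completion text. Wire this to your own
        frontier-model API client."""
        raise NotImplementedError

    def build_distillation_set(task_inputs, out_path="pairs.jsonl"):
        # One teacher call per representative task input. Each resulting
        # (input, output) pair becomes a supervised training example for
        # the student.
        with open(out_path, "w") as f:
            for text in task_inputs:
                pair = {"input": text, "output": call_teacher(text)}
                f.write(json.dumps(pair) + "\n")

The second step is fine-tuning: adapting a small pre-trained model so it replicates the teacher's behavior on those pairs. Below is a hedged sketch using the Hugging Face Transformers Trainer; the gpt2 checkpoint, the sequence length, and the hyperparameters are illustrative placeholders, not recommendations.

    import json
    from torch.utils.data import Dataset
    from transformers import (AutoModelForCausalLM, AutoTokenizer,
                              Trainer, TrainingArguments)

    class PairDataset(Dataset):
        """Turns (input, output) pairs into causal-LM training examples."""
        def __init__(self, path, tokenizer, max_len=512):
            with open(path) as f:
                self.examples = [json.loads(line) for line in f]
            self.tok, self.max_len = tokenizer, max_len

        def __len__(self):
            return len(self.examples)

        def __getitem__(self, i):
            ex = self.examples[i]
            # The input plus the teacher's output becomes one sequence
            # the student learns to reproduce.
            text = ex["input"] + "\n" + ex["output"] + self.tok.eos_token
            enc = self.tok(text, truncation=True, max_length=self.max_len,
                           padding="max_length", return_tensors="pt")
            ids = enc["input_ids"].squeeze(0)
            mask = enc["attention_mask"].squeeze(0)
            labels = ids.clone()
            labels[mask == 0] = -100  # ignore padding tokens in the loss
            return {"input_ids": ids, "attention_mask": mask, "labels": labels}

    tok = AutoTokenizer.from_pretrained("gpt2")  # placeholder student model
    tok.pad_token = tok.eos_token                # gpt2 ships without a pad token
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir="student", num_train_epochs=3,
                               per_device_train_batch_size=4),
        train_dataset=PairDataset("pairs.jsonl", tok),
    )
    trainer.train()
    trainer.save_model("student")

A sensible follow-up is to score the student against held-out teacher outputs on the same task; that comparison is where figures like the 80-95% quality retention below come from.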
What distillation does well
Cutting per-call cost 5-20x for narrow, well-defined tasks
Reducing latency to enable real-time use cases
Running on cheaper hardware or even on-device
Capturing 80-95% of teacher quality for many specific tasks
What a distilled model cannot do
Match the teacher on tasks outside the distillation set
Update easily as the teacher improves — re-distillation is needed
Replace the teacher for novel or open-ended tasks
End-of-lesson check
15 questions · take it online for instant feedback at tendril.neural-forge.io/learn/quiz/end-ai-foundations-distillation-final1-creators
What is the primary goal of model distillation?
A. To make a large model produce more accurate outputs
B. To replace frontier models entirely with smaller alternatives
C. To compress a large model's behavior into a smaller, cheaper model
D. To increase the training data available for machine learning

In teacher-student distillation, what is the 'teacher'?
A. The smaller model that learns from the student
B. A human expert who labels training data
C. A large, capable model whose behavior gets replicated
D. A dataset containing optimal input-output pairs

A company wants to reduce its AI API costs by 10x for a narrow, repetitive task. What approach would likely achieve this?
A. Switch to a rule-based system with hardcoded responses
B. Use a more powerful frontier model with more parameters
C. Apply distillation using its current model as the teacher
D. Deploy the same model on more expensive hardware

According to the recommended distillation process, how many input/output pairs should you generate from the frontier model?
A. 100 to 500 pairs
B. As few as 10 pairs, but quality matters more than quantity
C. 1000 to 5000 pairs
D. 10,000 to 50,000 pairs

A distilled model performs poorly on a task slightly different from its training data. What term describes this limitation?
A. Model drift
B. Distribution shift
C. Brittleness outside the training distribution
D. Overfitting to the teacher

Why might you need to perform distillation again after initially creating a student model?
A. The teacher model has been updated with new capabilities
B. The hardware has become faster
C. The student model has become too large
D. The student model has memorized all possible inputs

What is a key advantage of running a distilled model on-device rather than calling a remote API?
A. Access to more training data
B. Higher accuracy due to more computational power
C. Ability to use larger models than cloud services allow
D. Eliminated network latency and reduced per-call costs

Which of the following is NOT a typical benefit of model distillation?
A. Retaining 80-95% of teacher quality
B. Complete replacement of the frontier model
C. 5-20x reduction in per-call cost
D. Running on cheaper hardware or on-device

What type of task is distillation LEAST suitable for?
A. A simple sentiment analysis task
B. A structured data extraction task
C. An open-ended creative writing task
D. A narrow, repetitive classification task

What quality level can a well-distilled model typically achieve compared to its teacher?
A. 10-30% of teacher quality
B. 80-95% of teacher quality
C. 100% or more of teacher quality
D. 40-50% of teacher quality

After creating a distilled model, why should the original teacher model remain available?
A. To serve as a backup if the student model fails
B. To verify the student model's outputs are correct
C. To continuously train the student model in real time
D. To handle the long tail of tasks the distilled model can't cover

In the distillation workflow, what is the purpose of fine-tuning the smaller model?
A. To reduce the model's size further after training
B. To add new capabilities the teacher doesn't have
C. To make the model larger and more capable
D. To adapt the pre-trained model to replicate the teacher's behavior on specific pairs

What does it mean that a distilled model is 'brittle outside its training distribution'?
A. The model fails when given inputs significantly different from its training examples
B. The model requires constant retraining to maintain performance
C. The model becomes less accurate over time
D. The model will eventually forget its training

What type of hardware can typically run a distilled model?
A. Hardware that requires constant internet connectivity
B. Cheaper hardware or even consumer devices
C. Only expensive data center GPUs
D. Only specialized AI accelerators costing thousands of dollars
How does distillation affect latency compared to calling a frontier model?