The premise
Distillation captures most of a teacher model's behavior in a smaller student; what is lost is rarely uniform across tasks.
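For concreteness, here is a minimal sketch of the usual soft-target objective that pushes a student to track the teacher's output distribution. This is a generic illustration assuming a PyTorch setup; the temperature T and mixing weight alpha are illustrative defaults, not values from this lesson.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Soft targets: match the teacher's temperature-softened distribution.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)  # rescale so gradients stay comparable across temperatures
    # Hard targets: still learn from ground-truth labels where available.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```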
What AI does well here
- Sketch a distillation pipeline from teacher outputs to student training (see the sketch after this list).
- Estimate the per-task capability gap between teacher and student.
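A minimal pipeline sketch for the first point above, assuming a callable teacher and a student object exposing a hypothetical train_step method; every name here is a placeholder, not part of the lesson.

```python
def build_distillation_set(teacher, inputs):
    # Student training pairs: each input paired with the teacher's output.
    return [(x, teacher(x)) for x in inputs]

def distill(teacher, student, inputs, epochs=3):
    dataset = build_distillation_set(teacher, inputs)
    for _ in range(epochs):
        for x, teacher_output in dataset:
            student.train_step(x, teacher_output)  # fit student to teacher
    return student
```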
What AI cannot do
- Guarantee the student matches the teacher on rare or especially hard cases.
- Replace task-specific evaluation.
End-of-lesson check
15 questions · take it digitally for instant feedback at tendril.neural-forge.io/learn/quiz/end-creators-distillation-fundamentals
What does it mean that the capability gap between teacher and student is 'rarely uniform across tasks'?
- The student model loses equal performance on every possible task
- Both models perform identically on simple tasks but differently on complex ones
- The gap can be eliminated by training longer
- The performance difference varies depending on which task is being evaluated
Why might relying solely on aggregate performance metrics be misleading when evaluating a distilled model?
- The student might perform well overall while failing badly on rare but high-stakes cases
- Aggregate metrics are always accurate and never misleading
- Aggregates require more computational resources to calculate
- Teachers rarely report aggregate metrics to students
What is the fundamental trade-off when using model distillation?
- Better privacy versus reduced performance
- Smaller model size versus some loss of capability compared to the teacher
- Faster training time versus lower energy consumption
- Higher accuracy versus simpler code
In a distillation pipeline, what data does the student model use for training?
- Randomly generated synthetic data
- The teacher model's outputs combined with input data
- Original human-labeled training data only
- A combination of the student's own predictions and human feedback
What does the term 'knowledge transfer' refer to in model distillation?
- The process of the student model learning to replicate the teacher model's behavior
- Converting explicit knowledge into implicit knowledge
- Moving a trained model from one server to another
- Exporting model weights to a different format
When designing an audit to compare teacher and student models, what is the purpose of defining 'acceptable gap thresholds'?
- To calculate the maximum training time allowed
- To determine how much the student can deviate from the teacher's pricing
- To set performance limits that trigger alerts when exceeded
- To measure the physical size difference between models
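For illustration, a gap-threshold check of that kind might look like the sketch below; the per-task gaps, thresholds, and the 0.02 fallback are assumed inputs, not values from the lesson.

```python
def flag_gaps(gaps, thresholds, default=0.02):
    # gaps: {task: teacher_accuracy - student_accuracy}
    # Returns the tasks whose gap exceeds the acceptable threshold,
    # i.e. where an alert should fire.
    return [task for task, gap in gaps.items()
            if gap > thresholds.get(task, default)]
```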
Why is it important to 'slice' your evaluations when assessing a distilled model?
- To reduce the total amount of data needed for testing
- To examine performance on specific subsets rather than just overall metrics
- To remove outlier results from the dataset
- To compare only the simplest test cases
What is a key reason a distilled student model cannot guarantee matching the teacher on rare or hardest cases?
- The training algorithm prevents learning from rare examples
- Student models are too small to handle any edge cases
- Rare cases often require knowledge the teacher itself may not have fully learned, making them harder to transfer
- Teachers deliberately withhold knowledge from students
What does the phrase 'mostly as good' in the lesson title imply about distilled models?
- They perform adequately on most tasks but with known limitations
- They require no further testing after distillation
- They should never be deployed in production
- They are completely identical to their teacher models
What is the primary motivation for deploying a distilled (student) model instead of the original teacher model?
- To achieve faster inference and lower computational costs
- To eliminate all potential biases
- To improve the model's reasoning capabilities
- To increase the amount of training data available
When creating an audit comparing teacher and student models, what does 'sample sizes' refer to?
- The total cost of training each model
- The file size of each model in megabytes
- The number of test examples evaluated per task category
- The amount of memory required to run each model
Why should task categories where distillation typically suffers most be flagged during evaluation?
- Because they prove the distillation process failed completely
- Because they require hiring additional engineers
- Because they indicate where the student may be unreliable and need monitoring or fallback
- Because these categories should be removed from the model entirely
What does it mean to 'estimate per-task capability gap'?
- Measure the performance difference between teacher and student on each individual task type
- Calculate the total number of parameters in both models
- Calculate the monetary cost difference between models
- Estimate the training time required for both models
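A minimal sketch of that estimate, assuming teacher and student are both callables scored against the same labeled slices; the slice layout and accuracy metric are illustrative choices.

```python
def accuracy(model, examples):
    return sum(model(x) == y for x, y in examples) / len(examples)

def per_task_gap(teacher, student, slices):
    # slices: {task_name: [(input, expected_output), ...]}
    # Positive values mean the teacher outperforms the student on that task.
    return {task: accuracy(teacher, ex) - accuracy(student, ex)
            for task, ex in slices.items()}
```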
In distillation, what is lost when compressing a large teacher model into a smaller student model?
- The model's ability to be updated
- The entire model architecture and design
- All ability to process text inputs
- Some capability, particularly on difficult or rare cases
What is the purpose of sketching a distillation pipeline from teacher outputs to student training?
- To determine the optimal number of teachers per student
- To calculate the exact memory requirements of each model
- To create visual diagrams for marketing materials
- To design the systematic flow of how knowledge transfers from teacher to student