How to compress a large model's behavior into a smaller, cheaper one.
11 min · Reviewed 2026
The premise
Distillation uses a large 'teacher' model to generate training data for a smaller 'student' model that approximates the teacher's behavior on a specific task at a fraction of the cost.
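To make the workflow concrete, here is a minimal sketch of the first step: generating input/output pairs from the teacher. It assumes a hypothetical call_teacher wrapper around whatever frontier-model API you use, and JSONL is just one convenient storage format; neither is prescribed by the technique.

    import json

    def call_teacher(prompt: str) -> str:
        """Hypothetical stand-in: send the prompt to the large teacher
        model and return its completion text. Wire this to your own
        frontier-model API client."""
        raise NotImplementedError

    def build_distillation_set(task_inputs, out_path="pairs.jsonl"):
        # One teacher call per representative task input. Each resulting
        # (input, output) pair becomes a supervised training example for
        # the student.
        with open(out_path, "w") as f:
            for text in task_inputs:
                pair = {"input": text, "output": call_teacher(text)}
                f.write(json.dumps(pair) + "\n")

The second step is fine-tuning: adapting a small pre-trained model so it replicates the teacher's behavior on those pairs. Below is a hedged sketch using the Hugging Face Transformers Trainer; the gpt2 checkpoint, the sequence length, and the hyperparameters are illustrative placeholders, not recommendations.

    import json
    from torch.utils.data import Dataset
    from transformers import (AutoModelForCausalLM, AutoTokenizer,
                              Trainer, TrainingArguments)

    class PairDataset(Dataset):
        """Turns (input, output) pairs into causal-LM training examples."""
        def __init__(self, path, tokenizer, max_len=512):
            with open(path) as f:
                self.examples = [json.loads(line) for line in f]
            self.tok, self.max_len = tokenizer, max_len

        def __len__(self):
            return len(self.examples)

        def __getitem__(self, i):
            ex = self.examples[i]
            # The input plus the teacher's output becomes one sequence
            # the student learns to reproduce.
            text = ex["input"] + "\n" + ex["output"] + self.tok.eos_token
            enc = self.tok(text, truncation=True, max_length=self.max_len,
                           padding="max_length", return_tensors="pt")
            ids = enc["input_ids"].squeeze(0)
            mask = enc["attention_mask"].squeeze(0)
            labels = ids.clone()
            labels[mask == 0] = -100  # ignore padding tokens in the loss
            return {"input_ids": ids, "attention_mask": mask, "labels": labels}

    tok = AutoTokenizer.from_pretrained("gpt2")  # placeholder student model
    tok.pad_token = tok.eos_token                # gpt2 ships without a pad token
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir="student", num_train_epochs=3,
                               per_device_train_batch_size=4),
        train_dataset=PairDataset("pairs.jsonl", tok),
    )
    trainer.train()
    trainer.save_model("student")

A sensible follow-up is to score the student against held-out teacher outputs on the same task; that comparison is where figures like the 80-95% quality retention below come from.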
What distillation does well
Cutting per-call cost 5-20x for narrow, well-defined tasks
Reducing latency to enable real-time use cases
Running on cheaper hardware or even on-device
Capturing 80-95% of teacher quality for many specific tasks
What a distilled model cannot do
Match the teacher on tasks outside the distillation set
Update easily as the teacher improves — re-distillation is needed
Replace the teacher for novel or open-ended tasks
End-of-lesson check
15 questions · take it online for instant feedback at tendril.neural-forge.io/learn/quiz/end-ai-foundations-distillation-final1-creators
What is the primary goal of model distillation?
A. To make a large model produce more accurate outputs
B. To replace frontier models entirely with smaller alternatives
C. To compress a large model's behavior into a smaller, cheaper model
D. To increase the training data available for machine learning

In teacher-student distillation, what is the 'teacher'?
A. The smaller model that learns from the student
B. A human expert who labels training data
C. A large, capable model whose behavior gets replicated
D. A dataset containing optimal input-output pairs

A company wants to reduce its AI API costs by 10x for a narrow, repetitive task. What approach would likely achieve this?
A. Switch to a rule-based system with hardcoded responses
B. Use a more powerful frontier model with more parameters
C. Apply distillation using its current model as the teacher
D. Deploy the same model on more expensive hardware

According to the recommended distillation process, how many input/output pairs should you generate from the frontier model?
A. 100 to 500 pairs
B. As few as 10 pairs, but quality matters more than quantity
C. 1000 to 5000 pairs
D. 10,000 to 50,000 pairs

A distilled model performs poorly on a task slightly different from its training data. What term describes this limitation?
A. Model drift
B. Distribution shift
C. Brittleness outside the training distribution
D. Overfitting to the teacher

Why might you need to perform distillation again after initially creating a student model?
A. The teacher model has been updated with new capabilities
B. The hardware has become faster
C. The student model has become too large
D. The student model has memorized all possible inputs

What is a key advantage of running a distilled model on-device rather than calling a remote API?
A. Access to more training data
B. Higher accuracy due to more computational power
C. Ability to use larger models than cloud services allow
D. Eliminated network latency and reduced per-call costs

Which of the following is NOT a typical benefit of model distillation?
A. Retaining 80-95% of teacher quality
B. Complete replacement of the frontier model
C. 5-20x reduction in per-call cost
D. Running on cheaper hardware or on-device

What type of task is distillation LEAST suitable for?
A. A simple sentiment analysis task
B. A structured data extraction task
C. An open-ended creative writing task
D. A narrow, repetitive classification task

What quality level can a well-distilled model typically achieve compared to its teacher?
A. 10-30% of teacher quality
B. 80-95% of teacher quality
C. 100% or more of teacher quality
D. 40-50% of teacher quality

After creating a distilled model, why should the original teacher model remain available?
A. To serve as a backup if the student model fails
B. To verify the student model's outputs are correct
C. To continuously train the student model in real time
D. To handle the long tail of tasks the distilled model can't cover

In the distillation workflow, what is the purpose of fine-tuning the smaller model?
A. To reduce the model's size further after training
B. To add new capabilities the teacher doesn't have
C. To make the model larger and more capable
D. To adapt the pre-trained model to replicate the teacher's behavior on specific pairs

What does it mean that a distilled model is 'brittle outside its training distribution'?
A. The model fails when given inputs significantly different from its training examples
B. The model requires constant retraining to maintain performance
C. The model becomes less accurate over time
D. The model will eventually forget its training

What type of hardware can typically run a distilled model?
A. Hardware that requires constant internet connectivity
B. Cheaper hardware or even consumer devices
C. Only expensive data center GPUs
D. Only specialized AI accelerators costing thousands of dollars
How does distillation affect latency compared to calling a frontier model?