Model Extraction and Distillation Attacks
If you query a closed model enough, you can sometimes reconstruct it. Here is the research on extraction attacks and what it means for proprietary AI.
Lesson map
What this lesson covers
Learning path
The main moves in order
1. Stealing a Model You Can Only Talk To
Concept cluster
Terms to connect while reading
- model extraction
- distillation attack
- query budget
Section 1
Stealing a Model You Can Only Talk To
Imagine a closed API: you send prompts, you get outputs. The model's weights are secret. Can you reconstruct enough of it to build a useful clone? This is model extraction, and in many real scenarios the answer is yes, at least partially, and at a price that is often surprisingly low.
The basic attack
1. Collect a diverse set of input prompts (use a large language corpus).
2. Query the target model; log (prompt, output) pairs.
3. Train a student model on those pairs via standard fine-tuning or knowledge distillation.
4. The student approximates the teacher's behavior on the queried distribution.
5. If the student matches well enough on downstream tasks, the attack succeeded.
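Here is a minimal Python sketch of that loop. The endpoint URL, API key, response format, and prompt-corpus file are hypothetical placeholders, and `gpt2` stands in for whatever open checkpoint an attacker would actually fine-tune; a real pipeline would add batching, retries, deduplication, label masking, and far more data.

```python
# Sketch only: query a hypothetical closed API, log pairs, fine-tune a student.
import requests
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          Trainer, TrainingArguments)

API_URL = "https://api.example.com/v1/complete"  # hypothetical target endpoint
API_KEY = "sk-..."                               # attacker's own account key

def query_target(prompt: str) -> str:
    """Send one prompt to the closed model and return its text output."""
    resp = requests.post(
        API_URL,
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={"prompt": prompt, "max_tokens": 256},
        timeout=30,
    )
    return resp.json()["text"]

# Steps 1-2: collect (prompt, output) pairs from a diverse prompt corpus.
prompts = [line.strip() for line in open("prompt_corpus.txt")]
pairs = [{"text": p + query_target(p)} for p in prompts]

# Step 3: fine-tune an open student model on the teacher's outputs.
student_name = "gpt2"  # stand-in for any open student checkpoint
tokenizer = AutoTokenizer.from_pretrained(student_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(student_name)

def tokenize(batch):
    enc = tokenizer(batch["text"], truncation=True, max_length=512,
                    padding="max_length")
    # Sloppy but workable for a sketch: real code would mask pad tokens in labels.
    enc["labels"] = enc["input_ids"].copy()
    return enc

ds = Dataset.from_list(pairs).map(tokenize, batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="student", num_train_epochs=1,
                           per_device_train_batch_size=4),
    train_dataset=ds,
)
trainer.train()  # Steps 4-5: the student now imitates the teacher on this distribution
```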
Real results
- Tramèr et al. 2016: extracted near-exact copies of decision trees and other simple models behind ML-as-a-service APIs for under $10 in queries
- Orekondy et al. 2019: Knockoff Nets extracted image classifiers from commercial APIs
- Krishna et al. 2020: extracted BERT-class models for natural language
- Carlini, Jagielski et al. 2024: Stealing Part of a Production Language Model recovered the last-layer embedding projection matrix of several OpenAI and Google models via API queries (the core trick is sketched after this list)
- Various 2023-2024 papers: fine-tuning smaller open models on outputs of GPT-4 produced surprisingly capable students
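The last-layer result rests on a simple linear-algebra fact: every full logit vector lies in a subspace whose dimension equals the model's hidden size, because the logits are a fixed projection of the final hidden state. The toy simulation below uses made-up sizes and no real API; it only illustrates how SVD on enough logit vectors reveals that dimension. The real attack additionally recovers the projection matrix up to a linear transform and has to coax full logit information out of APIs that expose only top-k log-probabilities.

```python
# Toy simulation of the rank/SVD trick behind last-layer extraction.
# All sizes are illustrative, not those of any production model.
import numpy as np

hidden_size, vocab_size, n_queries = 256, 4_096, 1_024
rng = np.random.default_rng(0)

# Secret final layer of the "target": logits = hidden_state @ W.T
W = rng.normal(size=(vocab_size, hidden_size))

# Simulate the API: each query yields one full logit vector.
hidden_states = rng.normal(size=(n_queries, hidden_size))
logits = hidden_states @ W.T                 # shape (n_queries, vocab_size)

# Attacker side: the singular values collapse after index = hidden_size,
# so counting the non-negligible ones recovers the hidden dimension.
singular_values = np.linalg.svd(logits, compute_uv=False)
recovered_dim = int((singular_values > 1e-6 * singular_values[0]).sum())
print(recovered_dim)                         # prints 256
```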
What extraction costs
Compare the options
| Target | Approximate cost (USD) | Quality of copy |
|---|---|---|
| Small classification API | $10-$1,000 | Often near-parity |
| Mid-size chat model via outputs | $1,000-$100,000 | Useful downstream model, not identical |
| Frontier chat model (full match) | Not currently feasible | Practical distillation gets you ~70-90% on benchmarks |
| Last-layer parameters | $100-$1,000 (when the API exposes enough logit information) | Exact on the queried dimensions |
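To see why the middle band of this table is plausible, here is a back-of-envelope query-budget calculation. The per-token prices are assumptions for illustration, not any provider's actual price sheet; plug in real numbers for the API you care about.

```python
# Back-of-envelope query budget with assumed per-token prices.
PRICE_PER_1K_INPUT = 0.005    # USD per 1,000 prompt tokens (assumed)
PRICE_PER_1K_OUTPUT = 0.015   # USD per 1,000 output tokens (assumed)

def extraction_cost(n_queries, avg_prompt_tokens, avg_output_tokens):
    """Total API spend for one pass over the query set."""
    input_cost = n_queries * avg_prompt_tokens / 1_000 * PRICE_PER_1K_INPUT
    output_cost = n_queries * avg_output_tokens / 1_000 * PRICE_PER_1K_OUTPUT
    return input_cost + output_cost

# e.g. one million queries averaging 200 prompt / 400 output tokens:
print(f"${extraction_cost(1_000_000, 200, 400):,.0f}")   # -> $7,000
```

Under these assumed prices, a million-query distillation run lands comfortably inside the mid-size row above.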
Why this matters legally and strategically
- Distillation may violate API terms of service (most providers now explicitly prohibit using outputs to train competing models)
- The open-model landscape includes many models whose quality traces back to GPT-4-class teachers
- Export controls on a closed model are only as strong as the model's resistance to extraction
- Defensive watermarking of outputs can help prove that a downstream model was distilled
- Lab competition dynamics shift when the leader's model can be approximated cheaply
Defenses
- Rate limiting: caps on queries per account
- Monitoring: flag accounts whose query distributions look like systematic probing
- Output perturbation: add small noise to returned logits or probabilities, at some cost to quality (sketched after this list)
- Watermarking: embed a signal in outputs that a distilled model would inherit
- Legal: terms of service and enforcement (OpenAI has publicly accused competitors of distilling from its models' outputs)
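The output-perturbation idea is easy to see in miniature. The sketch below adds Gaussian noise to logits server-side before anything derived from them leaves the API; the noise scale and response shape are illustrative, not any real provider's scheme.

```python
# Toy output-perturbation defense: never expose clean logits.
import numpy as np

rng = np.random.default_rng()

def perturb_logits(logits: np.ndarray, scale: float = 0.1) -> np.ndarray:
    """Larger scale = noisier signal for an extractor, worse quality for users."""
    return logits + rng.normal(0.0, scale, size=logits.shape)

def serve(logits: np.ndarray, return_logprobs: bool = False) -> dict:
    noisy = perturb_logits(logits)
    response = {"token_id": int(np.argmax(noisy))}   # greedy pick from noisy logits
    if return_logprobs:
        # Expose only perturbed values, never the clean logits.
        logprobs = noisy - np.log(np.exp(noisy).sum())
        response["logprobs"] = logprobs.tolist()
    return response

# Example with a fake 8-token vocabulary:
print(serve(np.array([2.0, 0.1, -1.3, 0.0, 0.5, 1.9, -0.7, 0.2]), True))
```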
“When a model is exposed via API, you should assume that a lower-fidelity copy of it exists somewhere you do not control.”
The big idea: APIs leak. Extraction is not always cheap, but for most intermediate-quality models it is cheaper than people think. The field is adjusting, and so are the contracts, but the underlying game is adversarial and ongoing.
Related lessons
Keep going
Creators · 50 min
AI Alignment: The Actual Technical Problem
Alignment is not a vibes debate. It is a concrete technical problem about getting systems to pursue goals we actually want. Here is what researchers work on when they say they work on alignment.
Creators · 55 min
Alignment: The Full Technical Picture
What alignment actually is as a research program, how it is done in practice, what the open problems are, and where the actual papers live. A model that is always helpful will help you do harmful things.
Creators · 45 min
Constitutional AI: A Deep Dive on Anthropic's Approach
What a constitution actually contains, how the training loop works, where the research is now, and the honest trade-offs.
