Model Extraction and Distillation Attacks
If you query a closed model enough, you can sometimes reconstruct it. Here is the research on extraction attacks and what it means for proprietary AI.
Creators · Ethics & Society · ~21 min read
Stealing a Model You Can Only Talk To
Imagine a closed API: you send prompts, you get outputs. The model's weights are secret. Can you reconstruct enough of it to build a useful clone? This is model extraction, and the answer for many real scenarios is yes, partially, and for a price that is often surprisingly low.
The basic attack
1. Collect a diverse set of input prompts (use a large language corpus).
2. Query the target model; log (prompt, output) pairs.
3. Train a student model on those pairs via standard fine-tuning or knowledge distillation.
4. The student approximates the teacher's behavior on the queried distribution.
5. If the student matches well enough on downstream tasks, the attack succeeded.
Real results
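- Tramèr et al. 2016: extracted logistic regressions, decision trees, and small neural networks from ML-as-a-service platforms such as BigML and Amazon ML, in some cases for under $10 in queries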
- Orekondy et al. 2019: Knockoff Nets extracted image classifiers from commercial APIs
- Krishna et al. 2020: extracted BERT-class models for natural language
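- Carlini et al. 2024: "Stealing Part of a Production Language Model" recovered the embedding projection layer (the final layer, up to symmetries) of OpenAI's Ada and Babbage models via API queries, and the hidden dimension of gpt-3.5-turbo; the authors report the technique also applied to other production APIs, including Google's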
- Various 2023-2024 projects (for example Alpaca, Vicuna, and Orca): fine-tuning smaller open models on outputs of GPT-3.5- and GPT-4-class models produced surprisingly capable students
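The last-layer result rests on a simple linear-algebra observation: every logit vector is the final projection matrix applied to a hidden state, so a stack of logit vectors has rank no larger than the hidden dimension. The sketch below shows only that observation on a toy matrix. It assumes the API returns full logit vectors (production APIs rarely do, and the paper works around that), the sizes are invented, and Gaussian elimination is swapped in for the singular-value analysis the paper uses, to keep the example dependency-free.

```python
import random

random.seed(1)

HIDDEN_DIM = 8    # secret in a real setting; the quantity the attacker wants
VOCAB_SIZE = 40
NUM_QUERIES = 20  # must exceed the hidden dimension

# Hypothetical final projection layer: VOCAB_SIZE x HIDDEN_DIM.
W = [[random.gauss(0, 1) for _ in range(HIDDEN_DIM)] for _ in range(VOCAB_SIZE)]


def query_logits():
    """Stand-in for one API call: logits = W @ h for some unseen hidden state h."""
    h = [random.gauss(0, 1) for _ in range(HIDDEN_DIM)]
    return [sum(w_ij * h_j for w_ij, h_j in zip(row, h)) for row in W]


def numerical_rank(rows, tol=1e-8):
    """Rank via Gaussian elimination with partial pivoting."""
    m = [row[:] for row in rows]
    n_rows, n_cols = len(m), len(m[0])
    rank = 0
    for col in range(n_cols):
        if rank == n_rows:
            break
        pivot = max(range(rank, n_rows), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < tol:
            continue  # no usable pivot in this column
        m[rank], m[pivot] = m[pivot], m[rank]
        for r in range(rank + 1, n_rows):
            factor = m[r][col] / m[rank][col]
            for c in range(col, n_cols):
                m[r][c] -= factor * m[rank][c]
        rank += 1
    return rank


# Each row is one logged logit vector; the rows span at most HIDDEN_DIM dimensions.
logit_matrix = [query_logits() for _ in range(NUM_QUERIES)]
recovered_dim = numerical_rank(logit_matrix)
print("recovered hidden dimension:", recovered_dim)
```

Recovering the dimension is the first step. Recovering the matrix itself, up to a linear transformation, needs the same stacked logits plus a factorization step that is omitted here.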
What extraction costs
| Target | Approximate cost (USD) | Quality of copy |
|---|---|---|
| Small classification API | $10-1,000 | Often near-parity |
| Mid-size chat model via outputs | $1,000-100,000 | Useful downstream model, not identical |
| Frontier chat model (full match) | Not currently feasible | Practical distillation reaches roughly 70-90% of the teacher's benchmark scores |
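| Last-layer parameters | Roughly $20-$2,000 reported or estimated (when the API exposes logit bias or log-probabilities) | Recovered up to a linear transformation; the rest of the model stays hidden |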
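The dollar figures in the table come down to simple arithmetic: number of queries times tokens per query times price per token. Here is a rough calculator for the query bill alone. The function name and every number in the example are made up for illustration; real prices vary by provider and change often, and the cost of training the student is not included.

```python
def estimate_query_cost(
    num_queries,
    prompt_tokens,
    output_tokens,
    usd_per_1k_input,
    usd_per_1k_output,
):
    """Estimate the API bill in USD for logging (prompt, output) pairs."""
    input_cost = num_queries * prompt_tokens / 1000 * usd_per_1k_input
    output_cost = num_queries * output_tokens / 1000 * usd_per_1k_output
    return input_cost + output_cost


# Hypothetical scenario: 100,000 queries, 200 prompt tokens and 300 output
# tokens each, at invented prices of $0.001 and $0.002 per 1,000 tokens.
cost = estimate_query_cost(
    num_queries=100_000,
    prompt_tokens=200,
    output_tokens=300,
    usd_per_1k_input=0.001,
    usd_per_1k_output=0.002,
)
print(f"estimated query cost: ${cost:,.2f}")
```

Scaling the query count or the per-token price by ten scales the bill by ten, which is why the ranges in the table span several orders of magnitude.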
Why this matters legally and strategically
- Distillation may violate API terms of service (most now explicitly prohibit training competing models on outputs)
- The open-model landscape includes many models whose quality traces to GPT-4-class teachers
- Export controls on a closed model are only as strong as the model is hard to extract
- Defensive watermarking of outputs can help prove a downstream model was distilled
- Lab competition dynamics shift when the leader can be approximated cheaply
Defenses
- Rate limiting: caps on queries per account
- Monitoring for distributional probing patterns
- Output perturbation: add small noise to logits at the cost of quality
- Watermarking: embed a signal in outputs that a distilled model would inherit
- Legal and contractual: terms of service, account suspensions, and potentially lawsuits (OpenAI has publicly alleged that competitors distilled its models)
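Of the defenses above, rate limiting is the easiest to picture in code. This is a minimal sketch of a per-account sliding-window cap; the class name, the limits, and the account IDs are all hypothetical, and timestamps are passed in explicitly so the behavior is easy to follow.

```python
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most max_queries per account within any window_seconds span."""

    def __init__(self, max_queries, window_seconds):
        self.max_queries = max_queries
        self.window_seconds = window_seconds
        self._history = defaultdict(deque)  # account_id -> recent query times

    def allow(self, account_id, now):
        recent = self._history[account_id]
        # Forget queries that have aged out of the window.
        while recent and now - recent[0] >= self.window_seconds:
            recent.popleft()
        if len(recent) >= self.max_queries:
            return False  # over the cap: reject and do not record the query
        recent.append(now)
        return True


limiter = SlidingWindowLimiter(max_queries=3, window_seconds=60.0)
decisions = [limiter.allow("acct-1", t) for t in (0.0, 1.0, 2.0, 3.0, 61.0)]
print(decisions)  # [True, True, True, False, True]
```

A cap like this raises the cost of an extraction run but does not stop one: an attacker can spread queries across many accounts, which is why it is usually paired with monitoring for probing patterns.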
“When a model is exposed via API, you should assume that a lower-fidelity copy of it exists somewhere you do not control.”
The big idea: APIs leak. Extraction is not always cheap, but for most intermediate-quality models it is cheaper than people think. The field is adjusting, and so are the contracts, but the underlying game is adversarial and ongoing.
