Model Extraction and Distillation Attacks

If you query a closed model enough, you can sometimes reconstruct it. Here is the research on extraction attacks and what it means for proprietary AI.

35 min · Reviewed 2026

Stealing a Model You Can Only Talk To

Imagine a closed API: you send prompts, you get outputs. The model's weights are secret. Can you reconstruct enough of it to build a useful clone? This is model extraction, and the answer for many real scenarios is yes: partially, and at a price that is often surprisingly low.

The basic attack

Collect a diverse set of input prompts (use a large language corpus).
Query the target model; log (prompt, output) pairs.
Train a student model on those pairs via standard fine-tuning or knowledge distillation.
The student approximates the teacher's behavior on the queried distribution.
If the student matches well enough on downstream tasks, the attack succeeded.
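The steps above can be sketched end to end on a toy problem. This is a minimal illustration, not a real attack: the "teacher" is a hypothetical secret linear classifier standing in for a remote API, query_teacher is an assumed stand-in for the API call, and the student is a small logistic regression trained only on the logged pairs.

```python
import math
import random

random.seed(0)

# Hypothetical black-box teacher: secret parameters the attacker never sees.
_SECRET_W = [1.5, -2.0, 0.7]
_SECRET_B = 0.3

def query_teacher(x):
    """Stand-in for an API call: returns only the predicted label."""
    score = sum(w * xi for w, xi in zip(_SECRET_W, x)) + _SECRET_B
    return 1 if score > 0 else 0

def random_input():
    """Step 1: draw a diverse input (here, uniform random features)."""
    return [random.uniform(-1, 1) for _ in range(3)]

def predict(w, b, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0

def train_student(pairs, epochs=50, lr=0.5):
    """Step 3: fit a logistic-regression student on (input, teacher label) pairs."""
    w, b = [0.0, 0.0, 0.0], 0.0
    for _ in range(epochs):
        for x, y in pairs:
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            z = max(-30.0, min(30.0, z))  # clamp to keep exp() well behaved
            p = 1.0 / (1.0 + math.exp(-z))
            grad = p - y
            w = [wi - lr * grad * xi for wi, xi in zip(w, x)]
            b -= lr * grad
    return w, b

# Step 2: query the target and log the pairs.
queries = [random_input() for _ in range(500)]
pairs = [(x, query_teacher(x)) for x in queries]

# Steps 3-5: train the student, then measure agreement on fresh inputs.
w, b = train_student(pairs)
held_out = [random_input() for _ in range(1000)]
agreement = sum(predict(w, b, x) == query_teacher(x) for x in held_out) / len(held_out)
print(f"student/teacher agreement: {agreement:.3f}")
```

The student never sees the secret weights, only the labels the teacher returns. Agreement is measured on inputs drawn from the same distribution as the queries, which is exactly the caveat in step 4: nothing here says how the student behaves off that distribution.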

Real results

Tramèr et al. 2016: extracted near-equivalent copies of ML-as-a-service models, including decision trees, for under $10
Orekondy et al. 2019: Knockoff Nets extracted image classifiers from commercial APIs
Krishna et al. 2020: extracted BERT-class models for natural language
Carlini, Jagielski et al. 2024: "Stealing Part of a Production Language Model" recovered the last-layer embedding projection matrix, up to symmetries, of several OpenAI and Google models via API queries that exploited the logit-bias feature
Various 2023-2024 papers: fine-tuning smaller open models on outputs of GPT-4 produced surprisingly capable students
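The last-layer result in the list above rests on a linear-algebra observation: if logits are a fixed projection of a hidden state, then any collection of logit vectors spans a subspace whose dimension is at most the hidden size. The sketch below shows only that core idea on a toy model. It assumes a hypothetical query_logits call that returns full logit vectors, which real APIs generally do not expose directly (the published attack works through logit bias and restricted log-probabilities), and it uses plain Gaussian elimination where a real analysis would more likely inspect singular values.

```python
import random

random.seed(1)

VOCAB, HIDDEN = 20, 5  # toy sizes; the attacker does not know HIDDEN

# Secret final projection: logits = W @ h, with W of shape VOCAB x HIDDEN.
_W = [[random.gauss(0, 1) for _ in range(HIDDEN)] for _ in range(VOCAB)]

def query_logits(prompt_id):
    """Hypothetical API: returns the full logit vector for one prompt."""
    rng = random.Random(prompt_id)
    h = [rng.gauss(0, 1) for _ in range(HIDDEN)]  # hidden state for this prompt
    return [sum(wij * hj for wij, hj in zip(row, h)) for row in _W]

def numerical_rank(rows, tol=1e-8):
    """Rank of a matrix via Gaussian elimination with partial pivoting."""
    m = [row[:] for row in rows]
    n_rows, n_cols = len(m), len(m[0])
    rank = 0
    for col in range(n_cols):
        if rank == n_rows:
            break
        pivot = max(range(rank, n_rows), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < tol:
            continue  # no usable pivot in this column
        m[rank], m[pivot] = m[pivot], m[rank]
        for r in range(rank + 1, n_rows):
            factor = m[r][col] / m[rank][col]
            for c in range(col, n_cols):
                m[r][c] -= factor * m[rank][c]
        rank += 1
    return rank

# Query more prompts than the suspected hidden size, then measure the rank.
logit_rows = [query_logits(i) for i in range(12)]
recovered_hidden = numerical_rank(logit_rows)
print("recovered hidden dimension:", recovered_hidden)
```

Twelve logit vectors of length 20 could in principle have rank 12, so a much lower measured rank is the signal that leaks the hidden width.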

What extraction costs

Target	Approximate cost (USD)	Quality of copy
Small classification API	$10-1,000	Often near-parity
Mid-size chat model via outputs	$1,000-100,000	Useful downstream model, not identical
Frontier chat model (full match)	Not currently feasible	Practical distillation reaches roughly 70-90% of the teacher's benchmark scores
Last-layer parameters	$100-$1,000 (when API allows)	Exact on queried dimensions
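To see how figures like those in the table arise, a back-of-the-envelope estimate multiplies query count by tokens per query and by per-token price. The function name and every number below are hypothetical placeholders chosen for illustration, not actual prices from any provider.

```python
def estimate_extraction_cost(num_queries, prompt_tokens, output_tokens,
                             usd_per_million_input, usd_per_million_output):
    """Rough API bill for collecting a distillation dataset.

    All prices are per one million tokens. Student training compute is
    deliberately left out; this covers query costs only.
    """
    input_cost = num_queries * prompt_tokens * usd_per_million_input / 1_000_000
    output_cost = num_queries * output_tokens * usd_per_million_output / 1_000_000
    return input_cost + output_cost

# Hypothetical scenario: one million queries against a mid-size chat model.
cost = estimate_extraction_cost(
    num_queries=1_000_000,
    prompt_tokens=200,
    output_tokens=400,
    usd_per_million_input=1.0,   # assumed price, for illustration only
    usd_per_million_output=4.0,  # assumed price, for illustration only
)
print(f"estimated query cost: ${cost:,.0f}")
```

Under these assumed prices the bill is $200 for input plus $1,600 for output, which lands inside the mid-size chat model row.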

Why this matters legally and strategically

Distillation may violate API terms of service (most now explicitly prohibit training competing models on outputs)
The open-model landscape includes many models whose quality traces to GPT-4-class teachers
Export controls on a closed model are only as strong as the model is hard to extract
Defensive watermarking of outputs can help prove a downstream model was distilled
Lab competition dynamics shift when the leader can be approximated cheaply
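The watermarking point can be made concrete with a statistical test. The sketch below follows the general shape of green-list text watermarks: a secret key splits token pairs into "green" and "red" halves, generation favors green tokens, and detection counts how far the green fraction sits above chance. The key, vocabulary, and generator are all hypothetical, and whether a student trained on watermarked text actually inherits a detectable signal is an empirical question this toy does not settle.

```python
import hashlib
import math
import random

KEY = "hypothetical-secret-key"
VOCAB = [f"tok{i}" for i in range(50)]

def is_green(prev_token, token, key=KEY):
    """Pseudo-randomly assign about half of all (prev, token) pairs to the green list."""
    digest = hashlib.sha256(f"{key}|{prev_token}|{token}".encode()).digest()
    return digest[0] % 2 == 0

def watermark_z_score(tokens, key=KEY):
    """z-score of the green-pair count against the 50% expected by chance."""
    n = len(tokens) - 1
    if n < 1:
        raise ValueError("need at least two tokens")
    green = sum(is_green(a, b, key) for a, b in zip(tokens, tokens[1:]))
    return (green - 0.5 * n) / math.sqrt(0.25 * n)

def generate(length, rng, watermark):
    """Toy generator: uniform tokens, optionally restricted to green continuations."""
    tokens = [rng.choice(VOCAB)]
    while len(tokens) < length:
        candidates = rng.sample(VOCAB, len(VOCAB))  # shuffled copy of the vocabulary
        if watermark:
            nxt = next(t for t in candidates if is_green(tokens[-1], t))
        else:
            nxt = candidates[0]
        tokens.append(nxt)
    return tokens

rng = random.Random(0)
z_marked = watermark_z_score(generate(200, rng, watermark=True))
z_plain = watermark_z_score(generate(200, rng, watermark=False))
print(f"watermarked z = {z_marked:.1f}, plain z = {z_plain:.1f}")
```

To use this as evidence of distillation, the same z-score would be computed on a suspect model's outputs: a value far above what unwatermarked text produces suggests the model learned from watermarked data.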

Defenses

Rate limiting: caps on queries per account
Monitoring for distributional probing patterns
Output perturbation: add small noise to logits at the cost of quality
Watermarking: embed a signal in outputs that a distilled model would inherit
Legal: terms of service, account suspensions, lawsuits (OpenAI has publicly alleged that competitors distilled its models)
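Two of the defenses above are simple enough to sketch directly. The code below is a minimal illustration under stated assumptions: SlidingWindowLimiter and perturb_logits are hypothetical names, the limiter takes the current time as an argument so the example stays deterministic, and the noise scale is arbitrary.

```python
import random
from collections import defaultdict, deque

class SlidingWindowLimiter:
    """Rate limiting: allow at most max_queries per account per window."""

    def __init__(self, max_queries, window_seconds):
        self.max_queries = max_queries
        self.window_seconds = window_seconds
        self._history = defaultdict(deque)  # account -> timestamps of allowed calls

    def allow(self, account, now):
        calls = self._history[account]
        # Drop timestamps that have aged out of the window.
        while calls and now - calls[0] >= self.window_seconds:
            calls.popleft()
        if len(calls) >= self.max_queries:
            return False
        calls.append(now)
        return True

def perturb_logits(logits, scale, rng):
    """Output perturbation: add zero-mean Gaussian noise to each returned logit."""
    return [x + rng.gauss(0.0, scale) for x in logits]

# Three calls per 60 seconds: the fourth is refused, a later one is allowed again.
limiter = SlidingWindowLimiter(max_queries=3, window_seconds=60)
decisions = [limiter.allow("acct-1", now=t) for t in (0, 1, 2, 3, 61)]
print(decisions)

# Small noise blurs the exact values an attacker would log.
clean = [2.0, 0.5, -1.0]
noisy = perturb_logits(clean, scale=0.01, rng=random.Random(0))
print(noisy)
```

Per-account caps are only as strong as account creation is costly, since an attacker can spread queries across many accounts, and larger noise scales protect more while degrading what legitimate users receive.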

When a model is exposed via API, you should assume that a lower-fidelity copy of it exists somewhere you do not control.
— Nicholas Carlini, Google DeepMind

The big idea: APIs leak. Extraction is not always cheap, but for most intermediate-quality models it is cheaper than people think. The field is adjusting, and so are the contracts, but the underlying game is adversarial and ongoing.

End-of-lesson check

15 questions · take it digitally for instant feedback at tendril.neural-forge.io/learn/quiz/end-safety-model-extraction-creators

What is the fundamental first step in a model extraction attack against a closed API?
1. Reverse engineer the binary file of the model
2. Purchase the model directly from the developer
3. Hack into the company's servers to steal weights
4. Collect diverse input prompts and query the target model
Which research paper demonstrated that ML-as-a-service models could be stolen for under $10?
1. Carlini, Jagielski et al. 2024
2. Orekondy et al. 2019
3. Tramèr et al. 2016
4. Krishna et al. 2020
In the context of model extraction, what is the definition of 'knowledge distillation'?
1. A legal framework for licensing model architectures
2. Training a student model on (prompt, output) pairs collected from a teacher model
3. A defensive technique that adds noise to prevent copying
4. A method to compress models by removing unnecessary layers
What specific API feature did Carlini, Jagielski et al. (2024) exploit to extract the projection matrix of several closed models?
1. Temperature parameter
2. Frequency penalty
3. Logit bias feature
4. Top-p sampling
Why might model distillation potentially violate API terms of service?
1. Distillation reduces server costs
2. Distillation improves model quality for free
3. Most APIs now explicitly prohibit training competing models on their outputs
4. Distillation always produces inferior models
Which defensive technique embeds a signal in outputs that a distilled model would inherit?
1. Watermarking
2. Output perturbation
3. Rate limiting
4. Monitoring for probing patterns
What did various 2023-2024 papers find about fine-tuning smaller open models on outputs from frontier models like GPT-4?
1. The process was ruled legally safe in all jurisdictions
2. The resulting models were always worse than the teacher
3. They required more compute than training from scratch
4. They produced surprisingly capable student models
What is the primary economic insight about model extraction presented in the lesson?
1. Extraction is cheaper than most people think for intermediate-quality models
2. Extraction is always extremely expensive
3. Extraction has no economic impact on AI markets
4. Extraction only works on small classification models
What attack specifically used numerical precision vulnerabilities in the logit-bias API feature?
1. Knockoff Nets
2. Carlini, Jagielski et al. 2024
3. Tramèr et al. 2016
4. Krishna et al. 2020
What is the primary purpose of rate limiting as a defense against model extraction?
1. To increase API revenue
2. To improve the quality of model outputs
3. To cap queries per account and prevent mass data collection
4. To reduce the latency of API responses
Why might model extraction be significant for export control policy?
1. It only affects domestic AI companies
2. It reduces the need for export controls entirely
3. It is an end-run around chip access controls like the US AI Diffusion Rule
4. It allows companies to avoid export taxes
The lesson's quote 'When a model is exposed via API, you should assume that a lower-fidelity copy of it exists somewhere you do not control' implies what about API security?
1. Any exposed model is inherently vulnerable to some degree of extraction
2. APIs are completely secure and cannot be exploited
3. Only open-source models can be extracted
4. Extraction is impossible with proper encryption
Which company does the lesson name in connection with alleged distillation of its models?
1. OpenAI
2. Anthropic
3. Meta
4. Google
What is 'output perturbation' as a defensive technique against extraction?
1. Training only on verified public data
2. Encrypting all API responses
3. Adding small noise to logits at the cost of quality
4. Completely blocking all API access
The lesson emphasizes that extraction attacks depend critically on what factor?
1. The specific affordances provided by the API
2. The price of NVIDIA GPUs
3. The physical location of the model servers
4. The year the model was originally trained

