If you query a closed model enough, you can sometimes reconstruct it. Here is the research on extraction attacks and what it means for proprietary AI.
35 min · Reviewed 2026
Stealing a Model You Can Only Talk To
Imagine a closed API: you send prompts, you get outputs. The model's weights are secret. Can you reconstruct enough of it to build a useful clone? This is model extraction, and the answer for many real scenarios is yes, partially, and for a price that is often surprisingly low.
The basic attack
Collect a diverse set of input prompts (use a large language corpus).
Query the target model; log (prompt, output) pairs.
Train a student model on those pairs via standard fine-tuning or knowledge distillation.
The student approximates the teacher's behavior on the queried distribution.
If the student matches well enough on downstream tasks, the attack succeeded.
Real results
Tramèr et al. 2016: extracted near-equivalent copies of models (logistic regressions, neural networks, decision trees) from ML-as-a-service prediction APIs, in some cases for under $10
Orekondy et al. 2019: Knockoff Nets extracted image classifiers from commercial APIs
Krishna et al. 2020: extracted BERT-based natural language models through their APIs, even when the queries were nonsensical text
Carlini, Jagielski et al. 2024: "Stealing Part of a Production Language Model" recovered the final embedding projection layer, up to symmetries, of several OpenAI and Google models via API queries
Various 2023-2024 papers: fine-tuning smaller open models on outputs of GPT-4 produced surprisingly capable students
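One idea behind the 2024 last-layer result can be shown in a few lines: every logit vector a model returns is the final projection matrix times a hidden state, so logit vectors from many prompts all lie in a subspace whose dimension equals the hidden size. The sketch below is a toy under stated assumptions: `query_logits` is a hypothetical API that returns full logit vectors, the sizes are tiny, and rank is computed with plain Gaussian elimination rather than the singular value decomposition a real analysis would use. It only estimates the hidden dimension and does not recover the matrix itself.

```python
import random

VOCAB_SIZE = 12  # toy value; real vocabularies have tens of thousands of tokens
HIDDEN_DIM = 4   # the "secret" the attacker wants to learn

_rng = random.Random(0)
# Hidden final projection matrix, shape VOCAB_SIZE x HIDDEN_DIM.
_W_OUT = [[_rng.gauss(0, 1) for _ in range(HIDDEN_DIM)] for _ in range(VOCAB_SIZE)]


def query_logits(prompt_seed):
    """Hypothetical API: returns the full logit vector for one prompt.

    The hidden state is simulated with a seeded random vector.
    """
    h_rng = random.Random(prompt_seed)
    hidden = [h_rng.gauss(0, 1) for _ in range(HIDDEN_DIM)]
    return [sum(w * h for w, h in zip(row, hidden)) for row in _W_OUT]


def numerical_rank(rows, tol=1e-8):
    """Rank of a matrix via Gaussian elimination with partial pivoting."""
    m = [list(r) for r in rows]
    n_rows, n_cols = len(m), len(m[0])
    rank = 0
    for col in range(n_cols):
        if rank == n_rows:
            break
        pivot = max(range(rank, n_rows), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < tol:
            continue  # no usable pivot in this column
        m[rank], m[pivot] = m[pivot], m[rank]
        for r in range(rank + 1, n_rows):
            factor = m[r][col] / m[rank][col]
            for c in range(col, n_cols):
                m[r][c] -= factor * m[rank][c]
        rank += 1
    return rank


def estimate_hidden_dim(n_queries):
    """Stack logit vectors from many prompts and count independent directions."""
    logit_rows = [query_logits(seed) for seed in range(1, n_queries + 1)]
    return numerical_rank(logit_rows)


if __name__ == "__main__":
    print("estimated hidden dimension:", estimate_hidden_dim(10))
```

The estimate is only meaningful when the number of queries exceeds the hidden size. With fewer queries, the rank is capped by the query count.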
What extraction costs
| Target | Approximate cost (USD) | Quality of copy |
| --- | --- | --- |
| Small classification API | $10-1,000 | Often near-parity |
| Mid-size chat model via outputs | $1,000-100,000 | Useful downstream model, not identical |
| Frontier chat model (full match) | Not currently feasible | Practical distillation reaches roughly 70-90% of the target's benchmark performance |
| Last-layer parameters | $100-1,000 (when the API allows) | Exact on queried dimensions |
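The cost figures above are, at bottom, query-count arithmetic. A rough back-of-the-envelope helper is sketched below. The function name, token counts, and per-token prices are all hypothetical placeholders, not any provider's actual pricing, and the estimate ignores the cost of training the student.

```python
def estimate_query_cost(n_queries, prompt_tokens, output_tokens,
                        usd_per_1k_prompt, usd_per_1k_output):
    """Rough API bill for an extraction run, billed per 1,000 tokens."""
    prompt_cost = n_queries * prompt_tokens / 1000 * usd_per_1k_prompt
    output_cost = n_queries * output_tokens / 1000 * usd_per_1k_output
    return prompt_cost + output_cost


# Hypothetical run: 100,000 queries, 200 prompt tokens and 300 output tokens
# each, at made-up prices of $0.001 and $0.002 per 1,000 tokens.
cost = estimate_query_cost(100_000, 200, 300, 0.001, 0.002)
print(f"${cost:,.2f}")  # prints $80.00
```

Under these assumed prices, a hundred thousand logged pairs costs less than a typical cloud bill for a weekend of GPU time, which is the point the table makes about intermediate-quality targets.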
Why this matters legally and strategically
Distillation may violate API terms of service (most now explicitly prohibit training competing models on outputs)
The open-model landscape includes many models whose quality traces in part to GPT-4-class teachers
Export controls on a closed model are only as strong as the model is hard to extract
Defensive watermarking of outputs can help prove a downstream model was distilled
Lab competition dynamics shift when the leader can be approximated cheaply
Defenses
Rate limiting: caps on queries per account
Monitoring for distributional probing patterns
Output perturbation: add small noise to logits at the cost of quality
Watermarking: embed a signal in outputs that a distilled model would inherit
Legal: terms of service backed by enforcement such as account suspensions or lawsuits (OpenAI has publicly alleged that competitors distilled its models)
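Two of the defenses above, rate limiting and output perturbation, are simple enough to sketch. The class and function names below are hypothetical, the sliding window is one of several possible limiting policies, and the Gaussian noise scale is an arbitrary illustrative parameter. Treat this as a sketch of the idea rather than a production design.

```python
import math
import random
import time
from collections import defaultdict, deque


class QueryBudget:
    """Sliding-window cap on queries per account."""

    def __init__(self, max_queries, window_seconds):
        self.max_queries = max_queries
        self.window_seconds = window_seconds
        self._history = defaultdict(deque)  # account_id -> recent query times

    def allow(self, account_id, now=None):
        """Return True and record the query if the account is under its cap."""
        now = time.monotonic() if now is None else now
        recent = self._history[account_id]
        # Drop timestamps that have aged out of the window.
        while recent and now - recent[0] >= self.window_seconds:
            recent.popleft()
        if len(recent) >= self.max_queries:
            return False
        recent.append(now)
        return True


def softmax(logits):
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def noisy_probs(logits, scale, rng):
    """Add Gaussian noise to the logits before converting to probabilities.

    Larger `scale` leaks less precise information but degrades output quality.
    """
    noisy = [v + rng.gauss(0.0, scale) for v in logits]
    return softmax(noisy)


if __name__ == "__main__":
    budget = QueryBudget(max_queries=3, window_seconds=60)
    print([budget.allow("acct-1", now=t) for t in (0, 1, 2, 3)])
    print(noisy_probs([2.0, 1.0, 0.1], scale=0.1, rng=random.Random(0)))
```

Per-account caps only slow a determined attacker who can open many accounts, which is why the list pairs rate limiting with monitoring for probing patterns.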
When a model is exposed via API, you should assume that a lower-fidelity copy of it exists somewhere you do not control.
— Nicholas Carlini, Google DeepMind
The big idea: APIs leak. Extraction is not always cheap, but for most intermediate-quality models it is cheaper than people think. The field is adjusting, and so are the contracts, but the underlying game is adversarial and ongoing.
End-of-lesson check
8 questions · take it digitally for instant feedback at tendril.neural-forge.io/learn/quiz/end-safety-model-extraction-creators
What is the main idea of "Model Extraction and Distillation Attacks"?
If you query a closed model enough, you can sometimes reconstruct it. Here is the research on extraction attacks and what it means for proprietary AI.
Use AI as the final authority for the whole decision
Avoid checking the answer once it sounds polished
Focus only on speed instead of judgment
Which concept is most central to "Model Extraction and Distillation Attacks"?
distillation attack
model extraction
query budget
knowledge distillation
Which use of AI fits this topic best?
Let the AI decide what matters without your review
Use the answer before checking whether it fits the situation
Collect a diverse set of input prompts (use a large language corpus).
Treat the AI output as automatically correct
What should a careful learner remember about "The 2024 Carlini attack is the big one"?
Use AI to draft or organize ideas about model extraction, then verify before acting.
Skip the context so the tool can guess faster
Treat the output as private even after sharing it online
Use the answer without checking the source
You want to use AI after this lesson. What is the safest next step?
Act immediately because the AI answer is written clearly
AI cannot make the human values decision for you.
Hide uncertainty so the final answer looks cleaner
Use private or sensitive details before checking permission
How should AI output about model extraction be treated?
As proof that no other source is needed
As a replacement for context, consent, or expert review
As a draft or helper output that still needs human judgment and verification
As something that becomes correct when it sounds confident
Name one way to verify an AI answer about model extraction.
Which action would help you apply "Model Extraction and Distillation Attacks" responsibly?
Use the tool to avoid thinking through the tradeoff
Keep going even if the output conflicts with a trusted source
Treat the AI output as automatically correct
Query the target model; log (prompt, output) pairs.