If you query a closed model enough, you can sometimes reconstruct it. Here is the research on extraction attacks and what it means for proprietary AI.
Imagine a closed API: you send prompts, you get outputs. The model's weights are secret. Can you reconstruct enough of it to build a useful clone? This is model extraction, and for many real scenarios the answer is: yes, partially, and often at a surprisingly low price.
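The core loop is simple: query the target on attacker-chosen inputs, record its answers, and train a student model on them. A toy numpy sketch of that loop, where `query_target` and its secret weights are hypothetical stand-ins for a closed classification API, not any real endpoint:

```python
import numpy as np

rng = np.random.default_rng(0)

# --- The "closed API": weights are secret from the attacker's point of view ---
_SECRET_W = np.array([1.5, -2.0])
_SECRET_B = 0.3

def query_target(x):
    """Hypothetical black-box endpoint: returns only the predicted label."""
    return int(x @ _SECRET_W + _SECRET_B > 0)

# --- Extraction: query on attacker-chosen inputs, train a student on the answers ---
X = rng.normal(size=(2000, 2))                      # attacker-chosen queries
y = np.array([query_target(x) for x in X])          # harvested labels

# Fit a simple perceptron student on the harvested labels
w, b = np.zeros(2), 0.0
for _ in range(20):
    for xi, yi in zip(X, y):
        pred = int(xi @ w + b > 0)
        if pred != yi:
            w += (yi - pred) * xi
            b += (yi - pred)

# Fidelity: how often the clone agrees with the target on fresh inputs
X_test = rng.normal(size=(1000, 2))
agree = np.mean([int(x @ w + b > 0) == query_target(x) for x in X_test])
print(f"clone/target agreement: {agree:.2f}")
```

With 2,000 queries the clone agrees with the target on the vast majority of fresh inputs, which is the pattern behind the low costs in the table below: the attacker pays per query, and simple targets need few queries.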
| Target | Approximate cost (USD) | Quality of copy |
|---|---|---|
| Small classification API | $10-$1,000 | Often near-parity |
| Mid-size chat model via outputs | $1,000-$100,000 | Useful downstream model, not identical |
| Frontier chat model (full match) | Not currently feasible | Practical distillation gets you ~70-90% on benchmarks |
| Last-layer parameters | $100-$1,000 (when the API allows) | Exact on queried dimensions |
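The last table row refers to logit-level attacks such as the one by Carlini, Jagielski et al. (2024): when an API exposes full logit vectors, stacking responses to many prompts yields a matrix whose rank equals the model's hidden dimension, and an SVD recovers the final projection layer up to a linear transform. A toy numpy sketch, with `query_logits` and the secret matrix `W` as simulated stand-ins for the real API:

```python
import numpy as np

rng = np.random.default_rng(1)
hidden, vocab, n_queries = 16, 100, 200

# Simulated closed model: secret final projection W (vocab x hidden)
W = rng.normal(size=(vocab, hidden))

def query_logits(prompt_seed):
    """Hypothetical API returning the full logit vector for one prompt."""
    h = rng.normal(size=hidden)   # hidden state, unknown to the attacker
    return W @ h

# Stack logit vectors from many queries: each row lies in the column space of W
Q = np.stack([query_logits(i) for i in range(n_queries)])   # (n_queries, vocab)

# The numerical rank of Q reveals the secret hidden dimension
rank = np.linalg.matrix_rank(Q)
print("recovered hidden dimension:", rank)

# The top right-singular vectors span the columns of W (i.e. W up to rotation)
U, S, Vt = np.linalg.svd(Q, full_matrices=False)
W_hat = Vt[:rank].T               # (vocab, rank): same subspace as W

# Check: projecting W onto the recovered subspace reproduces it almost exactly
P = W_hat @ W_hat.T
print("subspace error:", np.linalg.norm(P @ W - W) / np.linalg.norm(W))
```

This is why the table calls the result "exact on queried dimensions": the attack recovers the projection layer's subspace precisely, but says nothing about the layers behind it.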
When a model is exposed via API, you should assume that a lower-fidelity copy of it exists somewhere you do not control.
— Nicholas Carlini, Google DeepMind
The big idea: APIs leak. Extraction is not always cheap, but for most intermediate-quality models it is cheaper than people think. The field is adjusting, and so are the contracts, but the underlying game is adversarial and ongoing.
15 questions · take the quiz online for instant feedback at tendril.neural-forge.io/learn/quiz/end-safety-model-extraction-creators
What is the fundamental first step in a model extraction attack against a closed API?
Which research paper demonstrated that ML-as-a-service models could be stolen for under $10?
In the context of model extraction, what is the definition of 'knowledge distillation'?
What specific API feature did Carlini, Jagielski et al. (2024) exploit to extract the projection matrix of several closed models?
Why might model distillation potentially violate API terms of service?
Which defensive technique embeds a signal in outputs that a distilled model would inherit?
What did various 2023-2024 papers find about fine-tuning smaller open models on outputs from frontier models like GPT-4?
What is the primary economic insight about model extraction presented in the lesson?
What attack specifically used numerical precision vulnerabilities in the logit-bias API feature?
What is the primary purpose of rate limiting as a defense against model extraction?
Why might model extraction be significant for export control policy?
The lesson's quote 'When a model is exposed via API, you should assume that a lower-fidelity copy of it exists somewhere you do not control' implies what about API security?
Which company was mentioned in the lesson as having pursued several alleged distillation cases in court?
What is 'output perturbation' as a defensive technique against extraction?
The lesson emphasizes that extraction attacks depend critically on what factor?