When AI can read documents and act on them, hidden instructions become attacks. Here is what prompt injection is and why nobody has fully solved it.
When a model only reads what you type, it is clear what counts as an instruction. When a model reads a webpage, a PDF, an email, or a tool response, the boundary gets fuzzy. Any text anywhere in its context window might say "do this instead," and the model might obey.
In the web era, SQL injection happened because developers concatenated user input into database queries. The database could not tell code from data. Prompt injection has the same shape: the model cannot reliably tell instructions from content.
The difference is what makes it worse: SQL injection is fixed by parameterized queries, but there is no known full fix for prompt injection. Current defenses reduce the risk; none eliminate it.
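To see the shape of the analogy in code, here is a minimal sketch using Python's built-in sqlite3 module. The table and the attacker string are invented for illustration; the contrast between the two queries is the point.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, role TEXT)")
conn.execute("INSERT INTO users VALUES ('alice', 'admin')")

user_input = "alice' OR '1'='1"  # attacker-supplied text

# Vulnerable: the input is concatenated into the query string, so the
# database parses the attacker's text as SQL code and returns every row.
rows = conn.execute(
    "SELECT * FROM users WHERE name = '" + user_input + "'"
).fetchall()
print(rows)  # [('alice', 'admin')]: the OR '1'='1' clause ran as code

# Fixed: a parameterized query binds the input as a value. Whatever the
# attacker types, it can only ever be data, never code.
rows = conn.execute(
    "SELECT * FROM users WHERE name = ?", (user_input,)
).fetchall()
print(rows)  # []: no user is literally named "alice' OR '1'='1"
```

Prompt injection has no equivalent of that second query: there is no channel that delivers a webpage or email to a model strictly as data, guaranteed never to be interpreted as instructions.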
| Attacker goal | Vector | Defense |
|---|---|---|
| Exfiltrate data | Hidden text tells agent to email files | Block outbound email without human approval |
| Manipulate decisions | Biased content in a document | Cross-reference against trusted sources |
| Run harmful tools | Instruction to call delete-all tool | Require confirmation for destructive actions |
| Phish the user | Fake authority instruction in page | Warn user, never auto-click injected links |
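A common thread in the Defense column is that text alone should never trigger consequences. Here is a minimal sketch of capability gating with human confirmation; the tool registry, tool names, and approval flag are hypothetical, not any real framework's API.

```python
# Hypothetical capability gate between a model's tool requests and
# their execution. Tool names and the approval flow are illustrative.
TOOLS = {
    "read_page": lambda url: f"(contents of {url})",
    "send_email": lambda to, body: f"email sent to {to}",
}
DESTRUCTIVE = {"send_email"}  # anything outbound or irreversible

def run_tool(name: str, args: dict, approved_by_human: bool = False) -> str:
    """Run a model-requested tool call, gating the risky ones.

    Injected text can make the model *request* send_email, but the
    approval flag is set in the UI, outside the model's context
    window, so nothing the agent reads can flip it.
    """
    if name in DESTRUCTIVE and not approved_by_human:
        raise PermissionError(f"'{name}' requires explicit human approval")
    return TOOLS[name](**args)

# Reading is allowed; the exfiltration attempt from the table is not.
print(run_tool("read_page", {"url": "https://example.test/invoice"}))
try:
    run_tool("send_email", {"to": "attacker@example.test", "body": "files"})
except PermissionError as err:
    print(err)  # 'send_email' requires explicit human approval
```

The design point is that the approval signal travels on a channel the model's inputs cannot write to; gating limits the blast radius even when the injection itself succeeds.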
> Every piece of data an agent reads is a potential prompt. Design like every document is a letter from an adversary.
>
> — Simon Willison, independent researcher
The big idea: the more useful AI agents get, the more they read from the world, and the more they read, the more attack surface they have. Mitigations exist; a perfect fix does not.
15 questions · take it digitally for instant feedback at tendril.neural-forge.io/learn/quiz/end-safety-prompt-injection-builders
1. Why do security experts call prompt injection "the agent era's SQL injection"?
2. What makes the "trust boundary problem" especially difficult for AI agents?
3. A researcher hid text on a webpage that made Bing Chat respond in pirate English and insert a phishing link. What type of attack was this?
4. Which defense involves training an AI to treat certain instructions as more trustworthy than others?
5. Why is there no complete fix for prompt injection the way there is for SQL injection?
6. What does "capability gating" mean as a defense against prompt injection?
7. What is the "big idea" about AI agents and prompt injection risk?
8. What does "sandboxing" do as a defense against prompt injection?
9. An attacker hides instructions in a document telling an AI to email sensitive files to an attacker-controlled address. What defense would help most against this?
10. What is "content labeling" as a defense?
11. A webpage contains fake authority instructions telling an AI to present malicious links to users. What should users be warned about?
12. Why is it risky to let an AI agent take irreversible actions on content you haven't read?
13. The quote "Every piece of data an agent reads is a potential prompt" means:
14. A biased document might trick an AI into making unfair decisions. What defense helps against this?
15. What makes prompt injection worse than SQL injection in terms of available solutions?