Loading lesson…
Most jailbreaks come from a small number of patterns. Here are the ones that keep working, and why they are hard to kill. The Jailbreak Zoo A jailbreak is any prompt or setup that makes a model break its own rules.
A jailbreak is any prompt or setup that makes a model break its own rules. Since late 2022, researchers have catalogued dozens. Most are variations on a handful of patterns. If you understand the patterns, you can spot new ones.
Ask the model to pretend to be something else. The classic DAN (Do Anything Now) prompt from 2023 told ChatGPT it was a free version with no rules. Early versions complied. Fix: train models to refuse at the meta level, regardless of framing.
Ask for the bad thing in base64, rot13, pig latin, a made-up language. The safety classifier trained on English sometimes misses the foreign form. Fix: train on encoded versions too. New encodings keep appearing.
Anthropic researchers showed in 2024 that stuffing a long context window with hundreds of faux dialogues where a model cheerfully answers harmful questions eventually makes the real model comply. The attack exploits in-context learning itself.
CMU researchers in 2023 found garbled strings that, appended to any prompt, reliably unlock refused requests across many models. These are found via gradient optimization, not natural-language cleverness. They look like nonsense but work.
The model reads a webpage or document that contains instructions pretending to be from the user. The model follows them. This is the scariest family because agents with tools can do real damage.
| Family | Canonical example | Strong defense |
|---|---|---|
| Role-play | DAN | Meta-level refusal training |
| Encoding | Base64 the harmful ask | Train on encoded forms |
| Many-shot | 100 fake dialogues | Long-context safety fine-tuning |
| Adversarial suffix | GCG optimized strings | Adversarial training + detection |
| Indirect injection | Hidden text on a webpage | Content-origin rules, sandboxing |
Every jailbreak is a gift. It shows us the shape of the thing we didn't know we hadn't taught the model.
— An alignment researcher at Anthropic
The big idea: jailbreaks are not a moral failure of the model. They are the emergent consequence of training on following instructions. Studying the families is how the field actually improves.
15 questions · take it digitally for instant feedback at tendril.neural-forge.io/learn/quiz/end-safety-jailbreak-families-builders
A researcher asks an AI to pretend it is an unrestricted version of itself with no safety guidelines. Which jailbreak family does this represent?
An attacker writes a harmful request in a made-up language that the AI can understand but its safety filter cannot. What vulnerability does this exploit?
In 2024, Anthropic researchers demonstrated that filling a long context window with hundreds of fake dialogues where the AI answers harmful questions can make it comply with real harmful requests. What underlying capability does this attack exploit?
What makes adversarial suffixes (sometimes called GCG strings) different from other jailbreak families?
Why is indirect prompt injection considered the 'scariest' family of jailbreaks?
A website contains text invisible to human readers but visible to AI models, instructing the AI to ignore its safety guidelines. What is this technique called?
What defense effectively blocks role-play jailbreaks like DAN prompts?
What do companies like OpenAI, Anthropic, Google, and Meta typically offer through responsible disclosure programs?
Training safety classifiers on encoded versions of harmful content (like base64) addresses which weakness?
The lesson states that 'better models follow more kinds of instructions.' What is the intended implication for AI safety?
How does long-context safety fine-tuning help defend against many-shot jailbreaking?
Which defense approach is specifically designed to catch adversarial suffixes like GCG strings?
What two defenses are recommended against indirect prompt injection?
The lesson compares jailbreaks to 'gifts' because they reveal what the model wasn't taught. What does this imply about the purpose of studying jailbreaks?
In the role-play family of jailbreaks, what specific training approach made early DAN prompts ineffective?