Loading lesson…
Most jailbreaks come from a small number of patterns. Here are the ones that keep working, and why they are hard to kill. The Jailbreak Zoo A jailbreak is any prompt or setup that makes a model break its own rules.
A jailbreak is any prompt or setup that makes a model break its own rules. Since late 2022, researchers have catalogued dozens. Most are variations on a handful of patterns. If you understand the patterns, you can spot new ones.
Ask the model to pretend to be something else. The classic DAN (Do Anything Now) prompt from 2023 told ChatGPT it was a free version with no rules. Early versions complied. Fix: train models to refuse at the meta level, regardless of framing.
Ask for the bad thing in base64, rot13, pig latin, a made-up language. The safety classifier trained on English sometimes misses the foreign form. Fix: train on encoded versions too. New encodings keep appearing.
Anthropic researchers showed in 2024 that stuffing a long context window with hundreds of faux dialogues where a model cheerfully answers harmful questions eventually makes the real model comply. The attack exploits in-context learning itself.
CMU researchers in 2023 found garbled strings that, appended to any prompt, reliably unlock refused requests across many models. These are found via gradient optimization, not natural-language cleverness. They look like nonsense but work.
The model reads a webpage or document that contains instructions pretending to be from the user. The model follows them. This is the scariest family because agents with tools can do real damage.
| Family | Canonical example | Strong defense |
|---|---|---|
| Role-play | DAN | Meta-level refusal training |
| Encoding | Base64 the harmful ask | Train on encoded forms |
| Many-shot | 100 fake dialogues | Long-context safety fine-tuning |
| Adversarial suffix | GCG optimized strings | Adversarial training + detection |
| Indirect injection | Hidden text on a webpage | Content-origin rules, sandboxing |
Every jailbreak is a gift. It shows us the shape of the thing we didn't know we hadn't taught the model.
— An alignment researcher at Anthropic
The big idea: jailbreaks are not a moral failure of the model. They are the emergent consequence of training on following instructions. Studying the families is how the field actually improves.
6 questions · take it digitally for instant feedback at tendril.neural-forge.io/learn/quiz/end-safety-jailbreak-families-builders
What is the main idea of "Jailbreaks: The Families You Will See"?
Which concept is most central to "Jailbreaks: The Families You Will See"?
What should a careful learner remember about "The structural problem"?
You want to use AI after this lesson. What is the safest next step?
How should AI output about jailbreak be treated?
Name one way to verify an AI answer about jailbreak.