Lesson 256 of 1570
Jailbreaks: The Families You Will See
Most jailbreaks come from a small number of patterns. Here are the ones that keep working, and why they are hard to kill. The Jailbreak Zoo A jailbreak is any prompt or setup that makes a model break its own rules.
Lesson map
What this lesson covers
Learning path
The main moves in order
- 1The Jailbreak Zoo
- 2jailbreak
- 3role-play
- 4many-shot
Concept cluster
Terms to connect while reading
Section 1
The Jailbreak Zoo
A jailbreak is any prompt or setup that makes a model break its own rules. Since late 2022, researchers have catalogued dozens. Most are variations on a handful of patterns. If you understand the patterns, you can spot new ones.
Family 1: role-play
Ask the model to pretend to be something else. The classic DAN (Do Anything Now) prompt from 2023 told ChatGPT it was a free version with no rules. Early versions complied. Fix: train models to refuse at the meta level, regardless of framing.
Family 2: encoding
Ask for the bad thing in base64, rot13, pig latin, a made-up language. The safety classifier trained on English sometimes misses the foreign form. Fix: train on encoded versions too. New encodings keep appearing.
Family 3: many-shot
Anthropic researchers showed in 2024 that stuffing a long context window with hundreds of faux dialogues where a model cheerfully answers harmful questions eventually makes the real model comply. The attack exploits in-context learning itself.
Family 4: adversarial suffixes (GCG)
CMU researchers in 2023 found garbled strings that, appended to any prompt, reliably unlock refused requests across many models. These are found via gradient optimization, not natural-language cleverness. They look like nonsense but work.
Family 5: indirect prompt injection
The model reads a webpage or document that contains instructions pretending to be from the user. The model follows them. This is the scariest family because agents with tools can do real damage.
Compare the options
| Family | Canonical example | Strong defense |
|---|---|---|
| Role-play | DAN | Meta-level refusal training |
| Encoding | Base64 the harmful ask | Train on encoded forms |
| Many-shot | 100 fake dialogues | Long-context safety fine-tuning |
| Adversarial suffix | GCG optimized strings | Adversarial training + detection |
| Indirect injection | Hidden text on a webpage | Content-origin rules, sandboxing |
“Every jailbreak is a gift. It shows us the shape of the thing we didn't know we hadn't taught the model.”
Key terms in this lesson
The big idea: jailbreaks are not a moral failure of the model. They are the emergent consequence of training on following instructions. Studying the families is how the field actually improves.
End-of-lesson quiz
Check what stuck
15 questions · Score saves to your progress.
Tutor
Curious about “Jailbreaks: The Families You Will See”?
Ask anything about this lesson. I’ll answer using just what you’re reading — short, friendly, grounded.
Progress saved locally in this browser. Sign in to sync across devices.
Related lessons
Keep going
Builders · 28 min
Where Bias in AI Actually Comes From
AI bias is not magic and not moral failure. It is math operating on imperfect data. Here is exactly where the bias enters the system.
Builders · 28 min
Your Data Is Somebody's Training Fuel
Your posts, chats, photos, and behavior have been scraped, sold, and fed to models. Here is what has actually happened and what you can actually do.
Builders · 25 min
The Environmental Cost of Training a Big Model
Training a frontier model uses the electricity of a small city for months. Running inference at scale matches a large country's load. Here is what the numbers actually look like.
