Jailbreaks: The Families You Will See

Most jailbreaks come from a small number of patterns. Here are the ones that keep working, and why they are hard to kill. The Jailbreak Zoo A jailbreak is any prompt or setup that makes a model break its own rules.

30 min · Reviewed 2026

The Jailbreak Zoo

A jailbreak is any prompt or setup that makes a model break its own rules. Since late 2022, researchers have catalogued dozens. Most are variations on a handful of patterns. If you understand the patterns, you can spot new ones.

Family 1: role-play

Ask the model to pretend to be something else. The classic DAN (Do Anything Now) prompt from 2023 told ChatGPT it was a free version with no rules. Early versions complied. Fix: train models to refuse at the meta level, regardless of framing.

Family 2: encoding

Ask for the bad thing in base64, rot13, pig latin, a made-up language. The safety classifier trained on English sometimes misses the foreign form. Fix: train on encoded versions too. New encodings keep appearing.

Family 3: many-shot

Anthropic researchers showed in 2024 that stuffing a long context window with hundreds of faux dialogues where a model cheerfully answers harmful questions eventually makes the real model comply. The attack exploits in-context learning itself.

Family 4: adversarial suffixes (GCG)

CMU researchers in 2023 found garbled strings that, appended to any prompt, reliably unlock refused requests across many models. These are found via gradient optimization, not natural-language cleverness. They look like nonsense but work.

Family 5: indirect prompt injection

The model reads a webpage or document that contains instructions pretending to be from the user. The model follows them. This is the scariest family because agents with tools can do real damage.

Family	Canonical example	Strong defense
Role-play	DAN	Meta-level refusal training
Encoding	Base64 the harmful ask	Train on encoded forms
Many-shot	100 fake dialogues	Long-context safety fine-tuning
Adversarial suffix	GCG optimized strings	Adversarial training + detection
Indirect injection	Hidden text on a webpage	Content-origin rules, sandboxing

Every jailbreak is a gift. It shows us the shape of the thing we didn't know we hadn't taught the model.
— An alignment researcher at Anthropic

The big idea: jailbreaks are not a moral failure of the model. They are the emergent consequence of training on following instructions. Studying the families is how the field actually improves.

End-of-lesson check

15 questions · take it digitally for instant feedback at tendril.neural-forge.io/learn/quiz/end-safety-jailbreak-families-builders

A researcher asks an AI to pretend it is an unrestricted version of itself with no safety guidelines. Which jailbreak family does this represent?
1. Encoding jailbreaking, because the request uses a special format
2. Role-play jailbreaking, because the model is asked to adopt a fictional persona
3. Indirect injection, because hidden instructions are embedded in text
4. Adversarial suffix attack, because the prompt includes optimized strings
An attacker writes a harmful request in a made-up language that the AI can understand but its safety filter cannot. What vulnerability does this exploit?
1. The model refuses all requests in fictional languages
2. The model follows the developer, not the user, when language is unknown
3. The model has a built-in translator that bypasses safety checks
4. The safety classifier was only trained on English-language data
In 2024, Anthropic researchers demonstrated that filling a long context window with hundreds of fake dialogues where the AI answers harmful questions can make it comply with real harmful requests. What underlying capability does this attack exploit?
1. The model's improved reasoning when given more time
2. The model's tendency to agree with majority opinions in long contexts
3. The model's in-context learning ability to adapt from examples
4. The model's memory limitations causing it to forget safety guidelines
What makes adversarial suffixes (sometimes called GCG strings) different from other jailbreak families?
1. They only work on older AI models from 2020
2. They are found through mathematical optimization, not human cleverness
3. They require the user to have coding experience
4. They are written in natural language and require creativity
Why is indirect prompt injection considered the 'scariest' family of jailbreaks?
1. It is impossible to defend against
2. It requires the most technical skill to execute
3. It can cause real-world damage when AI agents use tools
4. It works on every AI model ever created
A website contains text invisible to human readers but visible to AI models, instructing the AI to ignore its safety guidelines. What is this technique called?
1. Role-playing attack
2. Indirect prompt injection
3. Adversarial suffix attack
4. Encoding attack
What defense effectively blocks role-play jailbreaks like DAN prompts?
1. Banning the word 'pretend' from all inputs
2. Limiting the model's ability to generate creative fiction
3. Training the model to recognize fictional scenarios
4. Training the model to refuse at the meta level regardless of framing
What do companies like OpenAI, Anthropic, Google, and Meta typically offer through responsible disclosure programs?
1. Legal protection for researchers who break their terms of service
2. Free premium subscriptions to researchers who find bugs
3. Guaranteed employment for researchers who discover jailbreaks
4. Financial bounties or bug bounties for reported vulnerabilities
Training safety classifiers on encoded versions of harmful content (like base64) addresses which weakness?
1. The classifier was too aggressive and blocked legitimate requests
2. The classifier was unable to read special characters
3. The classifier couldn't recognize harmful content in any language
4. The classifier only recognized cleartext English harmful requests
The lesson states that 'better models follow more kinds of instructions.' What is the intended implication for AI safety?
1. We should make AI models less capable to keep them safe
2. Safety will automatically improve as models become more capable
3. Capability and safety training must advance together, not sequentially
4. We should stop training models to follow instructions entirely
How does long-context safety fine-tuning help defend against many-shot jailbreaking?
1. It makes the model refuse all questions with long inputs
2. It trains the model to recognize and resist manipulated context patterns
3. It deletes old conversation history automatically
4. It increases the model's context window size
Which defense approach is specifically designed to catch adversarial suffixes like GCG strings?
1. Adversarial training combined with detection systems
2. Sandboxing all user inputs
3. Removing all special characters from prompts
4. Training the model in multiple languages
What two defenses are recommended against indirect prompt injection?
1. Long-context fine-tuning and adversarial training
2. Content-origin rules and sandboxing
3. Meta-level refusal and sandboxing
4. Encoding detection and role-play prevention
The lesson compares jailbreaks to 'gifts' because they reveal what the model wasn't taught. What does this imply about the purpose of studying jailbreaks?
1. Jailbreaks should be celebrated as achievements
2. Studying jailbreaks helps researchers improve model safety
3. Jailbreaks prove AI is fundamentally unsafe
4. Researchers should ignore jailbreaks to avoid giving ideas to attackers
In the role-play family of jailbreaks, what specific training approach made early DAN prompts ineffective?
1. Training the model to recognize fictional scenarios by name
2. Training the model to refuse at the meta level regardless of framing
3. Banning the word 'DAN' from all inputs
4. Removing the model's ability to role-play entirely

← Back to interactive lesson

Tendril · Builders · Ethics & Society