Tendril

Lesson 256 of 1570

Jailbreaks: The Families You Will See

Most jailbreaks come from a small number of patterns. Here are the ones that keep working, and why they are hard to kill. The Jailbreak Zoo A jailbreak is any prompt or setup that makes a model break its own rules.

BuildersEthics & Society~18 min readIntermediateBI5 · Societal ImpactBI3 · LearningPrint / PDF

Lesson map

What this lesson covers

30 min20 blocks4 concepts

Learning path

The main moves in order

1The Jailbreak Zoo
2jailbreak
3role-play
4many-shot

Concept cluster

Terms to connect while reading

jailbreakrole-playmany-shotadversarial suffix

Sections6

Notes4

Compare1

Quotes1

Terms1

Section 1

The Jailbreak Zoo

A jailbreak is any prompt or setup that makes a model break its own rules. Since late 2022, researchers have catalogued dozens. Most are variations on a handful of patterns. If you understand the patterns, you can spot new ones.

Family 1: role-play

Ask the model to pretend to be something else. The classic DAN (Do Anything Now) prompt from 2023 told ChatGPT it was a free version with no rules. Early versions complied. Fix: train models to refuse at the meta level, regardless of framing.

Family 2: encoding

Ask for the bad thing in base64, rot13, pig latin, a made-up language. The safety classifier trained on English sometimes misses the foreign form. Fix: train on encoded versions too. New encodings keep appearing.

Check-in 1. Got it so far?

Family 3: many-shot

Anthropic researchers showed in 2024 that stuffing a long context window with hundreds of faux dialogues where a model cheerfully answers harmful questions eventually makes the real model comply. The attack exploits in-context learning itself.

Family 4: adversarial suffixes (GCG)

CMU researchers in 2023 found garbled strings that, appended to any prompt, reliably unlock refused requests across many models. These are found via gradient optimization, not natural-language cleverness. They look like nonsense but work.

Family 5: indirect prompt injection

The model reads a webpage or document that contains instructions pretending to be from the user. The model follows them. This is the scariest family because agents with tools can do real damage.

Check-in 2. Got it so far?

Compare the options

Family	Canonical example	Strong defense
Role-play	DAN	Meta-level refusal training
Encoding	Base64 the harmful ask	Train on encoded forms
Many-shot	100 fake dialogues	Long-context safety fine-tuning
Adversarial suffix	GCG optimized strings	Adversarial training + detection
Indirect injection	Hidden text on a webpage	Content-origin rules, sandboxing

Check-in 3. Got it so far?

“Every jailbreak is a gift. It shows us the shape of the thing we didn't know we hadn't taught the model.”
An alignment researcher at Anthropic

Key terms in this lesson

The big idea: jailbreaks are not a moral failure of the model. They are the emergent consequence of training on following instructions. Studying the families is how the field actually improves.

Check-in 4. Got it so far?

End-of-lesson quiz

Check what stuck

15 questions · Score saves to your progress.

Tutor

Curious about “Jailbreaks: The Families You Will See”?

Ask anything about this lesson. I’ll answer using just what you’re reading — short, friendly, grounded.

Progress saved locally in this browser. Sign in to sync across devices.

Related lessons

Jailbreaks: The Families You Will See

The Jailbreak Zoo

Family 1: role-play

Family 2: encoding

Family 3: many-shot

Family 4: adversarial suffixes (GCG)

Family 5: indirect prompt injection

Curious about “Jailbreaks: The Families You Will See”?

Keep going

Jailbreaks: The Families You Will See

The Jailbreak Zoo

Family 1: role-play

Family 2: encoding

Family 3: many-shot

Family 4: adversarial suffixes (GCG)

Family 5: indirect prompt injection

Curious about “Jailbreaks: The Families You Will See”?

Keep going