neural-forge.io

Sign inStartStart learning

Tendril

Ethics & Society0%

Lesson 23 of 2116

Jailbreak Case Studies: What Actually Broke

Abstract jailbreak theory is less useful than real cases. Here are the techniques that worked on production models, what they taught us, and what is still unsolved.

CreatorsEthics & Society~24 min readAdvancedProfessionalBI5 · Societal ImpactBI3 · LearningPrint / PDF

Lesson map

What this lesson covers

40 min25 blocks4 concepts

Learning path

The main moves in order

1The Jailbreak Zoo
2jailbreak
3prompt injection
4DAN

Concept cluster

Terms to connect while reading

jailbreakprompt injectionDANuniversal jailbreak

Read9

Sections9

Notes4

Compare1

Quotes1

Terms1

Section 1

The Jailbreak Zoo

A jailbreak is any prompt, attack, or setup that causes a model to violate its policies. Since ChatGPT's public release in late 2022, dozens of distinct families have emerged. Each one teaches something about how safety training actually works — and where it breaks.

Case 1: DAN (Do Anything Now), 2023

A Reddit user posted a role-play prompt telling ChatGPT to pretend it was DAN, a version with no restrictions. Early versions simply worked. The technique taught that models trained on helpfulness will override safety when embedded in a plausible fictional frame. Defense moved from pattern-matching specific phrases to training on meta-level refusal regardless of framing.

Case 2: Universal adversarial suffixes, Zou et al. 2023

Researchers at CMU showed that appending a specific garbled string (found via gradient-based optimization) to any prompt bypassed safety on multiple production models. The string looked like nonsense but reliably unlocked refused requests across Llama, Vicuna, GPT-3.5, and Claude. It proved that safety layers can be circumvented without natural-language cleverness at all — just math against the model's gradients. Many labs hardened against the specific attack; the general class is still an open problem.

Check-in 1. Got it so far?

Case 3: Many-shot jailbreaking, Anthropic 2024

As context windows grew to hundreds of thousands of tokens, researchers at Anthropic showed that stuffing the prompt with hundreds of faux-dialogue examples of a model cheerfully answering harmful questions eventually caused the real model to comply. The attack exploits in-context learning itself. Long-context hardening is now a standard part of safety training.

Case 4: Prompt injection via tools and documents

As models gained browsing and tool use, attackers realized that a malicious webpage or document could contain instructions the model would follow. A 2023 demo had a Bing Chat user's session hijacked by hidden instructions on a webpage the model was summarizing. This opened an entire attack class: indirect prompt injection. It remains one of the hardest unsolved problems in agentic AI.

Check-in 2. Got it so far?

Case 5: Role-play and fiction framing

Asking a model to write a story, a play, or a lesson in which the unsafe content appears continues to work against weakly trained models. Recent frontier models have learned the pattern — but novel framings still pop up. The underlying tension: models that refuse fiction become useless for creative writing; models that do not refuse fiction leak information through it.

Case 6: Multimodal jailbreaks

Vision-language models introduced a new surface. A 2024 paper showed that imperceptible perturbations to an image could cause a VLM to ignore its system prompt entirely. Another line of work embeds instructions in images that humans cannot read but the model can. Audio attacks followed. Each new modality is a new attack surface.

Compare: what defenses have worked

Compare the options

Attack	Strong defense	Weak defense
DAN role-play	Meta-level refusal training	Keyword blocking
Adversarial suffix	Adversarial training + detection	Static filters
Many-shot	Long-context safety fine-tuning	System prompt hardening alone
Prompt injection	Tool sandboxing, instruction isolation	Trusting document content
Multimodal	Cross-modal consistency training	Text-only safety

Check-in 3. Got it so far?

The structural reason this keeps happening

Models are trained to follow instructions. Safety is an additional constraint layered on top. The model's core competence and its safety are often in tension — better models follow more kinds of instructions, including clever malicious ones. This is why most researchers believe alignment and capability must advance together, not sequentially.

“Every jailbreak is a gift. It shows us the shape of the thing we didn't know we hadn't taught the model.”
An alignment researcher at Anthropic

Check-in 4. Got it so far?

Key terms in this lesson

The big idea: jailbreaks are not a moral failure of AI. They are the emergent consequence of how models are trained. Studying them is how the field actually learns what safety layers really do, and what they do not.

Check-in 5. Got it so far?

End-of-lesson quiz

Check what stuck

15 questions · Score saves to your progress.

Tutor

Curious about “Jailbreak Case Studies: What Actually Broke”?

Ask anything about this lesson. I’ll answer using just what you’re reading — short, friendly, grounded.

Progress saved locally in this browser. Sign in to sync across devices.

Related lessons

Keep going