Jailbreak Case Studies: What Actually Broke

Abstract jailbreak theory is less useful than real cases. Here are the techniques that worked on production models, what they taught us, and what is still unsolved.

40 min · Reviewed 2026

The Jailbreak Zoo

A jailbreak is any prompt, attack, or setup that causes a model to violate its policies. Since ChatGPT's public release in late 2022, dozens of distinct families have emerged. Each one teaches something about how safety training actually works — and where it breaks.

Case 1: DAN (Do Anything Now), 2023

A Reddit user posted a role-play prompt telling ChatGPT to pretend it was DAN, a version with no restrictions. Early versions simply worked. The technique taught that models trained on helpfulness will override safety when embedded in a plausible fictional frame. Defense moved from pattern-matching specific phrases to training on meta-level refusal regardless of framing.

Case 2: Universal adversarial suffixes, Zou et al. 2023

Researchers at CMU showed that appending a specific garbled string (found via gradient-based optimization) to any prompt bypassed safety on multiple production models. The string looked like nonsense but reliably unlocked refused requests across Llama, Vicuna, GPT-3.5, and Claude. It proved that safety layers can be circumvented without natural-language cleverness at all — just math against the model's gradients. Many labs hardened against the specific attack; the general class is still an open problem.

Case 3: Many-shot jailbreaking, Anthropic 2024

As context windows grew to hundreds of thousands of tokens, researchers at Anthropic showed that stuffing the prompt with hundreds of faux-dialogue examples of a model cheerfully answering harmful questions eventually caused the real model to comply. The attack exploits in-context learning itself. Long-context hardening is now a standard part of safety training.

Case 4: Prompt injection via tools and documents

As models gained browsing and tool use, attackers realized that a malicious webpage or document could contain instructions the model would follow. A 2023 demo had a Bing Chat user's session hijacked by hidden instructions on a webpage the model was summarizing. This opened an entire attack class: indirect prompt injection. It remains one of the hardest unsolved problems in agentic AI.

Case 5: Role-play and fiction framing

Asking a model to write a story, a play, or a lesson in which the unsafe content appears continues to work against weakly trained models. Recent frontier models have learned the pattern — but novel framings still pop up. The underlying tension: models that refuse fiction become useless for creative writing; models that do not refuse fiction leak information through it.

Case 6: Multimodal jailbreaks

Vision-language models introduced a new surface. A 2024 paper showed that imperceptible perturbations to an image could cause a VLM to ignore its system prompt entirely. Another line of work embeds instructions in images that humans cannot read but the model can. Audio attacks followed. Each new modality is a new attack surface.

Compare: what defenses have worked

Attack	Strong defense	Weak defense
DAN role-play	Meta-level refusal training	Keyword blocking
Adversarial suffix	Adversarial training + detection	Static filters
Many-shot	Long-context safety fine-tuning	System prompt hardening alone
Prompt injection	Tool sandboxing, instruction isolation	Trusting document content
Multimodal	Cross-modal consistency training	Text-only safety

The structural reason this keeps happening

Models are trained to follow instructions. Safety is an additional constraint layered on top. The model's core competence and its safety are often in tension — better models follow more kinds of instructions, including clever malicious ones. This is why most researchers believe alignment and capability must advance together, not sequentially.

Every jailbreak is a gift. It shows us the shape of the thing we didn't know we hadn't taught the model.
— An alignment researcher at Anthropic

The big idea: jailbreaks are not a moral failure of AI. They are the emergent consequence of how models are trained. Studying them is how the field actually learns what safety layers really do, and what they do not.

End-of-lesson check

6 questions · take it digitally for instant feedback at tendril.neural-forge.io/learn/quiz/end-ethics-jailbreak-case-studies-creators

What is the main idea of "Jailbreak Case Studies: What Actually Broke"?
1. Abstract jailbreak theory is less useful than real cases.
2. Use AI as the final authority for the whole decision
3. Avoid checking the answer once it sounds polished
4. Focus only on speed instead of judgment
Which concept is most central to "Jailbreak Case Studies: What Actually Broke"?
1. prompt injection
2. jailbreak
3. DAN
4. universal jailbreak
What should a careful learner remember about "Indirect prompt injection is the agent era's SQL injection"?
1. Use AI to draft or organize ideas about jailbreak, then verify before acting.
2. Skip the context so the tool can guess faster
3. Treat the output as private even after sharing it online
4. Use the answer without checking the source
You want to use AI after this lesson. What is the safest next step?
1. Act immediately because the AI answer is written clearly
2. AI cannot make the human values decision for you.
3. Hide uncertainty so the final answer looks cleaner
4. Use private or sensitive details before checking permission
How should AI output about jailbreak be treated?
1. As proof that no other source is needed
2. As a replacement for context, consent, or expert review
3. As a draft or helper output that still needs human judgment and verification
4. As something that becomes correct when it sounds confident
Name one way to verify an AI answer about jailbreak.

← Back to interactive lesson

Tendril · Creators · Ethics & Society