Tendril

Lesson 1109 of 1570

AI and spotting jailbreak prompts: when a 'fun trick' is actually shady

Learn to recognize jailbreak prompts your friends paste so you don't help break the rules.

BuildersSafety & Governance~16 min readBI1 · PerceptionBI2 · Representation & ReasoningBI3 · LearningBI5 · Societal ImpactPrint / PDF

Lesson map

What this lesson covers

26 min20 blocks5 concepts

Learning path

The main moves in order

1The big idea
2Why Sharing 'Jailbreak' Prompts Can Get You Banned (or Worse)
3The big idea

Concept cluster

Terms to connect while reading

jailbreakprompt injectionguardrailsterms of servicelogged behavior

Sections7

Lists2

Notes6

Terms1

Section 1

The big idea

A jailbreak prompt is a sneaky prompt that tries to trick AI into ignoring its safety rules. Sometimes it's silly, sometimes it's getting weapons info or harmful content. Knowing the patterns keeps you and the model safer.

How to use it

Spot the 'pretend you have no rules' opener
Notice nested role-plays inside role-plays
Watch for 'just for educational purposes' as a cover phrase
Ask AI to explain why a specific prompt counts as a jailbreak

Check-in 1. Got it so far?

Try it

Have AI show you 3 common jailbreak patterns (without doing them) so you recognize them when a friend texts one.

Check-in 2. Got it so far?

Section 2

Why Sharing 'Jailbreak' Prompts Can Get You Banned (or Worse)

Section 3

The big idea

Jailbreak prompts trick an AI into ignoring its safety rules. They feel anonymous, but every prompt is logged with your account ID, IP address, and device fingerprint. OpenAI, Anthropic, and Google ban accounts permanently for repeat jailbreaking, and in 2024 the UK started prosecuting people who used jailbroken AI to generate CSAM under the Online Safety Act.

Some examples

OpenAI bans roughly 250,000 accounts per quarter for safety-policy violations (their own transparency report).
A 'grandma jailbreak' that gets ChatGPT to recite napalm instructions still logs that you asked — and the log is what gets handed to law enforcement on request.
Discord servers that share jailbreaks get reported by other users and shut down by Trust & Safety teams within days.
If a jailbroken model outputs CSAM, malware, or bioweapon synthesis, federal law treats the prompter as the originator, not the AI company.

Check-in 3. Got it so far?

Try it!

If you actually want to study how AI safety works, look up the academic field of 'red teaming' — Anthropic, OpenAI, and DeepMind all hire teenagers as paid red teamers through programs like Apart Research. That's the legal version of the same skill.

Check-in 4. Got it so far?

Key terms in this lesson

End-of-lesson quiz

Check what stuck

15 questions · Score saves to your progress.

Tutor

Curious about “AI and spotting jailbreak prompts: when a 'fun trick' is actually shady”?

Ask anything about this lesson. I’ll answer using just what you’re reading — short, friendly, grounded.

Progress saved locally in this browser. Sign in to sync across devices.

Related lessons

AI and spotting jailbreak prompts: when a 'fun trick' is actually shady

The big idea

How to use it

Try it

Why Sharing 'Jailbreak' Prompts Can Get You Banned (or Worse)

The big idea

Some examples

Try it!

Curious about “AI and spotting jailbreak prompts: when a 'fun trick' is actually shady”?

Keep going

AI and spotting jailbreak prompts: when a 'fun trick' is actually shady

The big idea

How to use it

Try it

Why Sharing 'Jailbreak' Prompts Can Get You Banned (or Worse)

The big idea

Some examples

Try it!

Curious about “AI and spotting jailbreak prompts: when a 'fun trick' is actually shady”?

Keep going