Red-Teaming: The Ethics of Breaking AI on Purpose
Red-teamers get paid to make AI misbehave. The field has grown into a real discipline — with its own methods, its own ethics, and its own unresolved questions.
Lesson map
What this lesson covers

Learning path
The main moves in order
1. What Red-Teaming Actually Is

Concept cluster
Terms to connect while reading
- red-teaming
- adversarial testing
- responsible disclosure
Section 1
What Red-Teaming Actually Is
Red-teaming is a term borrowed from military and cybersecurity practice. A red team plays the adversary: they probe a system, find weaknesses, and write them up before a real adversary does. Applied to AI, the red team's job is to make the model do things it should not, then hand the findings to the blue team (the builders) for patching.
What red-teamers look for
- Jailbreaks: prompts that bypass safety policies
- Prompt injection: hidden instructions in documents or tool outputs
- Dangerous capability uplift: can the model meaningfully help with cyber offense or CBRN (chemical, biological, radiological, nuclear) threats, such as bioweapons?
- Bias and fairness failures on underserved populations
- Privacy leakage: memorized training data surfacing
- Manipulation capability: persuasion, deception, sycophancy
- Agent misbehavior: scheming, self-exfiltration attempts, sandbagging
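To make this taxonomy concrete, here is a minimal sketch of how an attack suite might tag test cases by harm category. The names (`HarmCategory`, `AttackCase`) and the placeholder prompt are hypothetical illustrations, not any lab's real tooling.

```python
# A minimal sketch of organizing an attack suite around the harm
# categories listed above. All names here are hypothetical.
from dataclasses import dataclass
from enum import Enum, auto

class HarmCategory(Enum):
    JAILBREAK = auto()            # prompts that bypass safety policies
    PROMPT_INJECTION = auto()     # hidden instructions in documents or tool outputs
    CAPABILITY_UPLIFT = auto()    # cyber offense, CBRN assistance
    BIAS_FAIRNESS = auto()        # failures on underserved populations
    PRIVACY_LEAKAGE = auto()      # memorized training data surfacing
    MANIPULATION = auto()         # persuasion, deception, sycophancy
    AGENT_MISBEHAVIOR = auto()    # scheming, self-exfiltration, sandbagging

@dataclass
class AttackCase:
    case_id: str
    category: HarmCategory
    prompt: str                    # the adversarial input
    expected_refusal: bool = True  # a safe model should decline

# One placeholder case; real suites hold thousands, organized by category.
suite = [
    AttackCase("jb-001", HarmCategory.JAILBREAK,
               "<redacted adversarial prompt>"),
]
```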
The standard flow
1. Define the threat model: who attacks, what they want, what they have.
2. Build the attack suite: a taxonomy of known techniques plus creative novel ones.
3. Run it against the model in controlled conditions with detailed logging.
4. Score outcomes against harm categories, with inter-rater reliability (a minimal sketch follows this list).
5. Report to the developer with reproducible examples and severity ratings.
6. The developer patches; the red team re-tests; the cycle continues.
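Here is a minimal sketch of steps 2 through 5, reusing the hypothetical names from the taxonomy sketch above: run the suite against a model, have two raters score each transcript, check inter-rater reliability with Cohen's kappa, and collect findings for the report. `query_model`, `rater_a`, and `rater_b` are assumed stand-ins for whatever model access and grading process the team actually uses.

```python
# A sketch of the core evaluation loop, not any lab's real pipeline.
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Agreement between two raters, corrected for chance agreement."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: probability both raters pick the same label at random.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum((freq_a[k] / n) * (freq_b[k] / n) for k in freq_a | freq_b)
    return (observed - expected) / (1 - expected) if expected < 1 else 1.0

def run_red_team(suite, query_model, rater_a, rater_b):
    findings, scores_a, scores_b = [], [], []
    for case in suite:
        transcript = query_model(case.prompt)      # step 3: controlled run, logged
        a = rater_a(case, transcript)              # step 4: score vs. harm category,
        b = rater_b(case, transcript)              # e.g. "harmful" or "refused"
        scores_a.append(a)
        scores_b.append(b)
        if "harmful" in (a, b):
            findings.append((case.case_id, case.category.name, transcript))
    kappa = cohens_kappa(scores_a, scores_b)       # step 4: inter-rater reliability
    return findings, kappa                         # step 5: material for the report
```

A kappa far below 1 usually means the harm rubric is ambiguous and the scores should not be trusted until the raters recalibrate, which is why step 4 calls out inter-rater reliability explicitly.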
The unique ethical wrinkles of AI red-teaming
In cybersecurity, red-teaming has a 40-year tradition of responsible disclosure. AI red-teaming is newer and messier. Three issues keep arising.
Issue 1: the publication dilemma
If you find a reliable jailbreak, publishing it helps defense. It also hands it to attackers. Cybersecurity has evolved coordinated vulnerability disclosure (CVD) norms: tell the vendor, give them time, publish with a patch. AI is trying to catch up — OpenAI and Anthropic now run coordinated disclosure programs, but many researchers still publish openly on X or arXiv.
Issue 2: dangerous capability evaluation itself
To know if a model can help a novice build a bioweapon, you have to ask it to help build a bioweapon. Institutional review boards, BSL-2 protocols, and strict need-to-know access control now govern this work at serious labs. METR, Apollo, and the AISIs share information through secure channels rather than publishing raw outputs.
Issue 3: labor
A 2023 TIME investigation revealed OpenAI's contracted Kenyan labelers, paid under $2/hour, were shown graphic abuse content to label for safety training. Red-teaming creates real psychological harm for humans who spend weeks trying to make models produce disturbing output. Labor protections for this work are still being written.
Compare: red-team, blue-team, purple-team
Compare the options
| Team | Goal | Output |
|---|---|---|
| Red | Break the system | Reproducible attacks |
| Blue | Defend and patch | Hardened model + monitoring |
| Purple | Red + blue iterating together | Faster feedback loops |
| Government evaluator | Independent verification | Pre-deployment gate or warning |
Frontier examples that shaped the field
- 2023: DEF CON Generative AI Red Team event, 2,200 participants testing major models
- 2024: Apollo's o1 evals showing 5 percent scheming rate when given instrumental goals
- 2024-2025: UK AISI coordinated pre-release evaluations of Anthropic, OpenAI, and DeepMind frontier models
- 2025: METR's time-horizon benchmark formalizing autonomy evaluation
“If you are not red-teaming your own model, somebody else is, and they are not writing you a report.”
Key terms in this lesson
- Red-teaming: playing the adversary against your own system, finding weaknesses and reporting them before a real attacker does.
- Adversarial testing: deliberately probing a model under controlled, logged conditions to surface failures across harm categories.
- Responsible disclosure: telling the developer first, giving them time to patch, and publishing only alongside a fix.
The big idea: red-teaming is now a real profession with a real code of ethics. It is also the closest thing the AI industry has to airline crash investigators — the people who find out what went wrong before enough people get hurt to change the rules.
Related lessons
Keep going
Creators · 34 min
UK AI Safety Institute
The UK stood up the world's first government AI safety institute in November 2023. Its structure, scope, and access model are templates other nations are following.
Creators · 40 min
Your Own Ethical Checklist as an AI Builder
If you ship AI, ethics is not abstract. It is a set of decisions you make with real trade-offs. Here is the working checklist serious builders actually use.
Builders · 25 min
Red-Teaming: People Paid to Break AI
Red-teamers try to make models misbehave before bad actors do. Here is how the job works, who does it, and what they look for.
