Red-Teaming: The Ethics of Breaking AI on Purpose

Red-teamers get paid to make AI misbehave. The field has grown into a real discipline — with its own methods, its own ethics, and its own unresolved questions.

45 min · Reviewed 2026

What Red-Teaming Actually Is

Red-teaming is a term borrowed from military and cybersecurity practice. A red team plays the adversary: they probe a system, find weaknesses, and write them up before a real adversary does. Applied to AI, the red team's job is to make the model do things it should not, then hand the findings to the blue team (the builders) for patching.

What red-teamers look for

Jailbreaks: prompts that bypass safety policies
Prompt injection: hidden instructions in documents or tool outputs
Dangerous capability uplift: can the model meaningfully help with bioweapons, cyber offense, or CBRN?
Bias and fairness failures on underserved populations
Privacy leakage: memorized training data surfacing
Manipulation capability: persuasion, deception, sycophancy
Agent misbehavior: scheming, self-exfiltration attempts, sandbagging

The standard flow

Define threat model: who attacks, what they want, what they have.
Build attack suite: taxonomy of known techniques + creative novel ones.
Run against model in controlled conditions with detailed logging.
Score outcomes against harm categories, with inter-rater reliability.
Report to developer with reproducible examples and severity ratings.
Developer patches; red team re-tests; cycle continues.

The unique ethical wrinkles of AI red-teaming

In cybersecurity, red-teaming has a 40-year tradition of responsible disclosure. AI red-teaming is newer and messier. Three issues keep arising.

Issue 1: the publication dilemma

If you find a reliable jailbreak, publishing it helps defense. It also hands it to attackers. Cybersecurity has evolved coordinated vulnerability disclosure (CVD) norms: tell the vendor, give them time, publish with a patch. AI is trying to catch up — OpenAI and Anthropic now run coordinated disclosure programs, but many researchers still publish openly on X or arXiv.

Issue 2: dangerous capability evaluation itself

To know if a model can help a novice build a bioweapon, you have to ask it to help build a bioweapon. Institutional review boards, BSL-2 protocols, and strict need-to-know access control now govern this work at serious labs. METR, Apollo, and the AISIs share information through secure channels rather than publishing raw outputs.

Issue 3: labor

A 2023 TIME investigation revealed OpenAI's contracted Kenyan labelers, paid under $2/hour, were shown graphic abuse content to label for safety training. Red-teaming creates real psychological harm for humans who spend weeks trying to make models produce disturbing output. Labor protections for this work are still being written.

Compare: red-team, blue-team, purple-team

Team	Goal	Output
Red	Break the system	Reproducible attacks
Blue	Defend and patch	Hardened model + monitoring
Purple	Red + blue iterating together	Faster feedback loops
Government evaluator	Independent verification	Pre-deployment gate or warning

Frontier examples that shaped the field

2023: DEF CON Generative AI Red Team event, 2,200 participants testing major models
2024: Apollo's o1 evals showing 5 percent scheming rate when given instrumental goals
2024-2025: UK AISI coordinated pre-release evaluations of Anthropic, OpenAI, and DeepMind frontier models
2025: METR's time-horizon benchmark formalizing autonomy evaluation

If you are not red-teaming your own model, somebody else is, and they are not writing you a report.
— A frontier lab safety engineer

The big idea: red-teaming is now a real profession with a real ethics. It is also the closest thing the AI industry has to airline crash investigators — the people who find out what went wrong before enough people get hurt to change the rules.

End-of-lesson check

15 questions · take it digitally for instant feedback at tendril.neural-forge.io/learn/quiz/end-ethics-red-teaming-creators

What is the core idea behind "Red-Teaming: The Ethics of Breaking AI on Purpose"?
1. Red-teamers get paid to make AI misbehave. The field has grown into a real discipline — with its own methods, its own ethics, and its own unresolved questions.
2. Learn what "AI limits" means and why it's important
3. civil discourse
4. Substitute for the fairness team's adjudication
Which term best describes a foundational idea in "Red-Teaming: The Ethics of Breaking AI on Purpose"?
1. jailbreak
2. red-team
3. dangerous capability
4. responsible disclosure
A learner studying Red-Teaming: The Ethics of Breaking AI on Purpose would need to understand which concept?
1. red-team
2. dangerous capability
3. jailbreak
4. responsible disclosure
Which of these is directly relevant to Red-Teaming: The Ethics of Breaking AI on Purpose?
1. red-team
2. jailbreak
3. responsible disclosure
4. dangerous capability
Which of the following is a key point about Red-Teaming: The Ethics of Breaking AI on Purpose?
1. Jailbreaks: prompts that bypass safety policies
2. Prompt injection: hidden instructions in documents or tool outputs
3. Dangerous capability uplift: can the model meaningfully help with bioweapons, cyber offense, or CBRN…
4. Bias and fairness failures on underserved populations
Which of these does NOT belong in a discussion of Red-Teaming: The Ethics of Breaking AI on Purpose?
1. Dangerous capability uplift: can the model meaningfully help with bioweapons, cyber offense, or CBRN…
2. Jailbreaks: prompts that bypass safety policies
3. Learn what "AI limits" means and why it's important
4. Prompt injection: hidden instructions in documents or tool outputs
Which statement is accurate regarding Red-Teaming: The Ethics of Breaking AI on Purpose?
1. Build attack suite: taxonomy of known techniques + creative novel ones.
2. Run against model in controlled conditions with detailed logging.
3. Define threat model: who attacks, what they want, what they have.
4. Score outcomes against harm categories, with inter-rater reliability.
Which of these does NOT belong in a discussion of Red-Teaming: The Ethics of Breaking AI on Purpose?
1. Run against model in controlled conditions with detailed logging.
2. Define threat model: who attacks, what they want, what they have.
3. Learn what "AI limits" means and why it's important
4. Build attack suite: taxonomy of known techniques + creative novel ones.
What is the key insight about "Who actually does this" in the context of Red-Teaming: The Ethics of Breaking AI on Purpose?
1. Labs run internal red teams (Anthropic, OpenAI, Google DeepMind, Meta).
2. Learn what "AI limits" means and why it's important
3. civil discourse
4. Substitute for the fairness team's adjudication
What is the key insight about "The asymmetry that worries researchers" in the context of Red-Teaming: The Ethics of Breaking AI on Purpose?
1. Learn what "AI limits" means and why it's important
2. Attackers only need one working exploit; defenders need to block them all.
3. civil discourse
4. Substitute for the fairness team's adjudication
What is the recommended tip about "Key insight" in the context of Red-Teaming: The Ethics of Breaking AI on Purpose?
1. Learn what "AI limits" means and why it's important
2. civil discourse
3. Red-teamers get paid to make AI misbehave. The field has grown into a real discipline — with its own methods, its own et…
4. Substitute for the fairness team's adjudication
Which statement accurately describes an aspect of Red-Teaming: The Ethics of Breaking AI on Purpose?
1. Learn what "AI limits" means and why it's important
2. civil discourse
3. Substitute for the fairness team's adjudication
4. Red-teaming is a term borrowed from military and cybersecurity practice.
What does working with Red-Teaming: The Ethics of Breaking AI on Purpose typically involve?
1. In cybersecurity, red-teaming has a 40-year tradition of responsible disclosure. AI red-teaming is newer and messier.
2. Learn what "AI limits" means and why it's important
3. civil discourse
4. Substitute for the fairness team's adjudication
Which of the following is true about Red-Teaming: The Ethics of Breaking AI on Purpose?
1. Learn what "AI limits" means and why it's important
2. If you find a reliable jailbreak, publishing it helps defense. It also hands it to attackers.
3. civil discourse
4. Substitute for the fairness team's adjudication
Which best describes the scope of "Red-Teaming: The Ethics of Breaking AI on Purpose"?
1. It is unrelated to ethics workflows
2. It applies only to the opposite beginner tier
3. It focuses on Red-teamers get paid to make AI misbehave. The field has grown into a real discipline — with its own
4. It was deprecated in 2024 and no longer relevant

← Back to interactive lesson

Tendril · Creators · Ethics & Society