Red-Teaming: The Ethics of Breaking AI on Purpose
Red-teamers get paid to make AI misbehave. The field has grown into a real discipline — with its own methods, its own ethics, and its own unresolved questions.
Lesson map
What this lesson covers

Learning path
The main moves in order
1. What Red-Teaming Actually Is

Concept cluster
Terms to connect while reading
- red-teaming
- adversarial testing
- responsible disclosure
Section 1
What Red-Teaming Actually Is
Red-teaming is a term borrowed from military and cybersecurity practice. A red team plays the adversary: they probe a system, find weaknesses, and write them up before a real adversary does. Applied to AI, the red team's job is to make the model do things it should not, then hand the findings to the blue team (the builders) for patching.
What red-teamers look for
- Jailbreaks: prompts that bypass safety policies
- Prompt injection: hidden instructions in documents or tool outputs
- Dangerous capability uplift: can the model meaningfully help with cyber offense or CBRN (chemical, biological, radiological, nuclear) threats, such as bioweapons?
- Bias and fairness failures on underserved populations
- Privacy leakage: memorized training data surfacing
- Manipulation capability: persuasion, deception, sycophancy
- Agent misbehavior: scheming, self-exfiltration attempts, sandbagging
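To make this taxonomy concrete, here is a minimal sketch of how an attack suite might tag test cases by harm category. The names (`HarmCategory`, `AttackCase`) and the placeholder prompt are hypothetical illustrations, not any lab's real tooling.

```python
# A minimal sketch of organizing an attack suite around the harm
# categories listed above. All names here are hypothetical.
from dataclasses import dataclass
from enum import Enum, auto

class HarmCategory(Enum):
    JAILBREAK = auto()            # prompts that bypass safety policies
    PROMPT_INJECTION = auto()     # hidden instructions in documents or tool outputs
    CAPABILITY_UPLIFT = auto()    # cyber offense, CBRN assistance
    BIAS_FAIRNESS = auto()        # failures on underserved populations
    PRIVACY_LEAKAGE = auto()      # memorized training data surfacing
    MANIPULATION = auto()         # persuasion, deception, sycophancy
    AGENT_MISBEHAVIOR = auto()    # scheming, self-exfiltration, sandbagging

@dataclass
class AttackCase:
    case_id: str
    category: HarmCategory
    prompt: str                    # the adversarial input
    expected_refusal: bool = True  # a safe model should decline

# One placeholder case; real suites hold thousands, organized by category.
suite = [
    AttackCase("jb-001", HarmCategory.JAILBREAK,
               "<redacted adversarial prompt>"),
]
```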
The standard flow
1. Define the threat model: who attacks, what they want, what they have.
2. Build the attack suite: a taxonomy of known techniques plus creative novel ones.
3. Run it against the model in controlled conditions with detailed logging.
4. Score outcomes against harm categories, with inter-rater reliability (a minimal sketch follows this list).
5. Report to the developer with reproducible examples and severity ratings.
6. The developer patches; the red team re-tests; the cycle continues.
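Here is a minimal sketch of steps 2 through 5, reusing the hypothetical names from the taxonomy sketch above: run the suite against a model, have two raters score each transcript, check inter-rater reliability with Cohen's kappa, and collect findings for the report. `query_model`, `rater_a`, and `rater_b` are assumed stand-ins for whatever model access and grading process the team actually uses.

```python
# A sketch of the core evaluation loop, not any lab's real pipeline.
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Agreement between two raters, corrected for chance agreement."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: probability both raters pick the same label at random.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum((freq_a[k] / n) * (freq_b[k] / n) for k in freq_a | freq_b)
    return (observed - expected) / (1 - expected) if expected < 1 else 1.0

def run_red_team(suite, query_model, rater_a, rater_b):
    findings, scores_a, scores_b = [], [], []
    for case in suite:
        transcript = query_model(case.prompt)      # step 3: controlled run, logged
        a = rater_a(case, transcript)              # step 4: score vs. harm category,
        b = rater_b(case, transcript)              # e.g. "harmful" or "refused"
        scores_a.append(a)
        scores_b.append(b)
        if "harmful" in (a, b):
            findings.append((case.case_id, case.category.name, transcript))
    kappa = cohens_kappa(scores_a, scores_b)       # step 4: inter-rater reliability
    return findings, kappa                         # step 5: material for the report
```

A kappa far below 1 usually means the harm rubric is ambiguous and the scores should not be trusted until the raters recalibrate, which is why step 4 calls out inter-rater reliability explicitly.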
The unique ethical wrinkles of AI red-teaming
In cybersecurity, red-teaming has a 40-year tradition of responsible disclosure. AI red-teaming is newer and messier. Three issues keep arising.
Issue 1: the publication dilemma
If you find a reliable jailbreak, publishing it helps defense. It also hands it to attackers. Cybersecurity has evolved coordinated vulnerability disclosure (CVD) norms: tell the vendor, give them time, publish with a patch. AI is trying to catch up — OpenAI and Anthropic now run coordinated disclosure programs, but many researchers still publish openly on X or arXiv.
Issue 2: dangerous capability evaluation itself
To know if a model can help a novice build a bioweapon, you have to ask it to help build a bioweapon. Institutional review boards, BSL-2 protocols, and strict need-to-know access control now govern this work at serious labs. METR, Apollo, and the AISIs share information through secure channels rather than publishing raw outputs.
Issue 3: labor
A 2023 TIME investigation revealed OpenAI's contracted Kenyan labelers, paid under $2/hour, were shown graphic abuse content to label for safety training. Red-teaming creates real psychological harm for humans who spend weeks trying to make models produce disturbing output. Labor protections for this work are still being written.
Compare: red-team, blue-team, purple-team
Compare the options
| Team | Goal | Output |
|---|---|---|
| Red | Break the system | Reproducible attacks |
| Blue | Defend and patch | Hardened model + monitoring |
| Purple | Red + blue iterating together | Faster feedback loops |
| Government evaluator | Independent verification | Pre-deployment gate or warning |
Frontier examples that shaped the field
- 2023: DEF CON Generative AI Red Team event, 2,200 participants testing major models
- 2024: Apollo's o1 evals showing 5 percent scheming rate when given instrumental goals
- 2024-2025: UK AISI coordinated pre-release evaluations of Anthropic, OpenAI, and DeepMind frontier models
- 2025: METR's time-horizon benchmark formalizing autonomy evaluation
“If you are not red-teaming your own model, somebody else is, and they are not writing you a report.”
Key terms in this lesson
- Red-teaming: playing the adversary against your own system, finding weaknesses and reporting them before a real attacker does.
- Adversarial testing: deliberately probing a model under controlled, logged conditions to surface failures across harm categories.
- Responsible disclosure: telling the developer first, giving them time to patch, and publishing only alongside a fix.
The big idea: red-teaming is now a real profession with a real code of ethics. It is also the closest thing the AI industry has to airline crash investigators — the people who find out what went wrong before enough people get hurt to change the rules.
Related lessons
Keep going
Creators · 34 min
UK AI Safety Institute
The UK stood up the world's first government AI safety institute in November 2023. Its structure, scope, and access model are templates other nations are following.
Creators · 40 min
Your Own Ethical Checklist as an AI Builder
If you ship AI, ethics is not abstract. It is a set of decisions you make with real trade-offs. Here is the working checklist serious builders actually use.
Builders · 25 min
Red-Teaming: People Paid to Break AI
Red-teamers try to make models misbehave before bad actors do. Here is how the job works, who does it, and what they look for.
