Lesson 127 of 2116
AI Red Teamer in 2026: Breaking Models for a Living
A real job now: adversarially probing LLMs and multimodal systems for jailbreaks, prompt injection, data exfiltration, and harm.
Lesson map
What this lesson covers
Learning path
The main moves in order
- 1What AI red teamers do
- 2Specialized tools
- 3jailbreaks
- 4prompt injection
Concept cluster
Terms to connect while reading
Sam starts a bug-bash sprint Monday on a new agent release. The team has a harm taxonomy — CSAM, weapons, cyber, self-harm, privacy leaks, autonomous-action harms — and a list of new attack patterns from this quarter's research. By Friday Sam has filed 34 confirmed bypasses, 12 of them novel enough to write up for internal distribution. The model ships Tuesday with patches for 28 of them. The other six are scoped in the system card as known limitations.
Section 1
What AI red teamers do
- Prompt injection — indirect and direct, across tools and browsing.
- Jailbreaks — roleplay, encoding tricks, low-resource languages.
- Data exfiltration — from tool use, memory, and system prompts.
- Agent harm testing — do agents take harmful real-world actions?
- Multimodal attacks — image and audio payloads.
- Evaluation design — building automated evals to catch regressions.
- Responsible disclosure — writing up findings for mitigation.
Section 2
Specialized tools
- Tools like PyRIT (Microsoft) and Garak — open-source LLM red-team frameworks.
- Promptfoo, Inspect (UK AISI), and Anthropic's evals for reproducible testing.
- HarmBench and JailbreakBench for benchmarks.
- Agentic sandboxes for browsing/tool agents.
- MITRE ATLAS — adversarial ML threat framework.
- Internal red-team platforms at Anthropic, OpenAI, Google DeepMind.
Compare the options
| Task | Before AI (2020) | Now (2026) |
|---|---|---|
| Finding jailbreaks | Not a job category. | Full-time teams at every frontier lab. |
| Evaluation | Static benchmarks. | Dynamic, adversarial, continuously rotating. |
| Disclosure | Ad hoc. | Formal process mirroring infosec CVDs. |
Key terms in this lesson
If you want to be an AI red teamer: Background in security (offensive security, bug bounty), ML engineering, or adversarial ML research. A CS degree helps; so does a linguistics or psychology background for prompt craft. Read the OpenAI, Anthropic, and DeepMind system cards and model cards cover to cover. Contribute to open red-team tooling. Write up your findings publicly within safe limits. Frontier labs and consultancies hire hard in this space.
End-of-lesson quiz
Check what stuck
15 questions · Score saves to your progress.
Tutor
Curious about “AI Red Teamer in 2026: Breaking Models for a Living”?
Ask anything about this lesson. I’ll answer using just what you’re reading — short, friendly, grounded.
Progress saved locally in this browser. Sign in to sync across devices.
Related lessons
Keep going
Creators · 40 min
Security Engineer in 2026: AI Defends, AI Attacks
Microsoft Security Copilot, CrowdStrike Charlotte, and SentinelOne Purple accelerate defense. Attackers use the same models. The security engineer is the referee in an AI-vs-AI arms race.
Creators · 26 min
Data Labeler in 2026: From Bounding Boxes to Expert Feedback
The job climbed the ladder. Simple image labeling went to workflows; trained humans now do reinforcement learning from human feedback on hard tasks.
Creators · 40 min
Compliance Officer in 2026: AI Governance Is the Job
The EU AI Act, SEC AI disclosure rules, and state-level bills made AI governance a core compliance responsibility. The role grew; it did not shrink.
