Standalone lesson.
Lesson 2114 of 2116
AI Safety Principles
Alignment, jailbreaking, red-teaming.
“AI safety” is a surprisingly wide umbrella. It covers everything from content moderation to existential risk. Four concepts to anchor on.
Alignment
Making an AI actually do what you want — not a lookalike, not a gameable proxy. Alignment research asks: how do we specify goals well enough that a very capable optimizer doesn’t find a clever, harmful shortcut? Anthropic’s Constitutional AI and OpenAI’s Superalignment work both live here.
Jailbreaks
Techniques that get an aligned model to do things it was trained to refuse. Common families:
- Role-play jailbreaks.“Pretend you’re an AI without rules.”
- Encoding tricks. Base64, ROT13, emoji-encoded harmful instructions.
- Prompt injection. A document you had the AI read contains hidden instructions that override yours.
- Many-shot jailbreaks. Using long context to include hundreds of successful harmful-request examples before the real request.
Red-teaming
The practice of trying to break your own (or someone else’s) AI. Every frontier lab employs red-teamers full-time. If you deploy a public-facing AI product, you need a red-team process too — internal, external, or both. Budget for it before launch.
Catastrophic vs. non-catastrophic risk
Most AI harm is non-catastrophic: biased decisions, misinformation, labor displacement, erosion of trust. These are solvable with regulation, red-teaming, and user education. Catastrophicrisks — a sufficiently capable AI acquiring resources, bypassing controls, causing large-scale harm — are speculative but the reason the top AI labs have dedicated safety teams. Take both seriously; don’t dismiss either.
Your safety checklist
- System prompts hardened against prompt injection.
- Tool use scoped to the minimum needed.
- Output moderation as a second line of defense.
- Rate limiting and anomaly detection on user behavior.
- A clear “human in the loop” for high-stakes actions.
- Incident response runbook written before, not after, an incident.
Tutor
Curious about “AI Safety Principles”?
Ask anything about this lesson. I’ll answer using just what you’re reading — short, friendly, grounded.
Progress saved locally in this browser. Sign in to sync across devices.
Related lessons
Keep going
Creators · 30 min
AI Red Teamer in 2026: Breaking Models for a Living
A real job now: adversarially probing LLMs and multimodal systems for jailbreaks, prompt injection, data exfiltration, and harm.
Creators · 26 min
Data Labeler in 2026: From Bounding Boxes to Expert Feedback
The job climbed the ladder. Simple image labeling went to workflows; trained humans now do reinforcement learning from human feedback on hard tasks.
Creators · 50 min
AI Alignment: The Actual Technical Problem
Alignment is not a vibes debate. It is a concrete technical problem about getting systems to pursue goals we actually want. Here is what researchers work on when they say they work on alignment.
