Capability Evaluation vs. Safety Evaluation
Asking 'can the model do it?' and 'will doing it cause harm?' are different questions. Both matter.
Section 1
Two Eval Families
Capability evaluations measure what a model can do at its best. Safety evaluations measure what it will do in adversarial or risky conditions. They use different tools, different mindsets, and different success criteria.
Compare the options
| Capability eval | Safety eval |
|---|---|
| Measures peak skill | Measures behavior under pressure |
| Goal: higher score is better | Goal: no harm, even under attack |
| Public benchmarks | Often private, adversarial, red-teamed |
| Single-shot or best-of-N | Rare, worst-case outcomes matter most |
| Examples: MMLU, GPQA, SWE-bench | Examples: ToxiGen, cyberattack uplift, CBRN probes |
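The contrast between best-of-N scoring and worst-case scoring in the table can be made concrete with a small sketch. The function names and the toy data below are hypothetical, chosen only for illustration. They do not come from any particular benchmark harness.

```python
from statistics import mean

def best_of_n_score(scores_per_prompt):
    """Capability-style summary: each prompt is credited with its best sample."""
    return mean([max(scores) for scores in scores_per_prompt])

def worst_case_violation_rate(flags_per_prompt):
    """Safety-style summary: a prompt fails if ANY of its samples is unsafe."""
    return mean([1.0 if any(flags) else 0.0 for flags in flags_per_prompt])

# Hypothetical data: 3 prompts, 4 sampled responses each.
# 1 = correct answer, 0 = incorrect answer.
correct = [[0, 1, 0, 0], [1, 1, 1, 1], [0, 0, 0, 0]]
# True = the sampled response was judged unsafe.
unsafe = [[False, False, False, True], [False] * 4, [False] * 4]

capability = best_of_n_score(correct)          # rewards the best attempt
violation = worst_case_violation_rate(unsafe)  # penalizes the worst attempt

print(f"best-of-4 accuracy: {capability:.2f}")
print(f"prompts with any unsafe sample: {violation:.2f}")
```

The same set of samples is summarized in opposite directions: the capability number takes the best attempt, while the safety number is driven by a single bad one.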
Why capability eval is not enough
A model can score 95 percent on MMLU and still produce harmful outputs in 2 percent of real conversations. Average performance is a bad summary when catastrophic failures are possible.
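A short calculation shows why a 2 percent failure rate is not small at scale. This sketch assumes, for illustration only, that each conversation fails independently with the same probability. Real traffic is unlikely to be that uniform, and the conversation counts are made up.

```python
def prob_at_least_one_failure(failure_rate, n_conversations):
    """Chance of seeing one or more failures, assuming independent conversations."""
    return 1.0 - (1.0 - failure_rate) ** n_conversations

def expected_failures(failure_rate, n_conversations):
    """Average number of failures over n conversations."""
    return failure_rate * n_conversations

rate = 0.02  # the 2 percent figure from the text

for n in (10, 100, 1_000_000):  # hypothetical traffic volumes
    p = prob_at_least_one_failure(rate, n)
    e = expected_failures(rate, n)
    print(f"n={n:>9,}: P(at least one failure)={p:.3f}, expected failures={e:,.0f}")
```

Under these assumptions, at least one harmful output becomes very likely after only a hundred conversations, even though the average conversation is fine.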
Dangerous capability evaluations
- Cyberattack uplift: can the model help a non-expert write malware?
- CBRN uplift: does the model provide meaningful help with chemical, biological, radiological, or nuclear weapons?
- Persuasion: can it convince humans of false claims?
- Autonomous replication: can an agent set up and run itself?
- Long-horizon planning: multi-step autonomy
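"Uplift" in the list above is not defined in this lesson. One common reading, taken here as an assumption, is the difference in task success rate between people who have model access and a control group who do not. The sketch below uses made-up outcomes on an unspecified benign proxy task. The names `success_rate` and `uplift` are hypothetical.

```python
def success_rate(outcomes):
    """Fraction of participants who completed the task (True = success)."""
    return sum(outcomes) / len(outcomes)

def uplift(assisted, control):
    """Assumed definition: success rate with the model minus success rate without."""
    return success_rate(assisted) - success_rate(control)

# Made-up results for 10 participants per group on a benign proxy task.
control = [False, False, True, False, False, False, False, False, False, False]
assisted = [True, False, True, False, False, True, False, False, False, False]

delta = uplift(assisted, control)
print(f"control: {success_rate(control):.0%}, assisted: {success_rate(assisted):.0%}")
print(f"uplift: {delta:+.0%}")
```

With groups this small, the difference would be too noisy to act on. A real study would need far more participants and a significance test.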
Alignment vs capability
1. A capable but misaligned model is dangerous by design
2. An aligned but weak model is safe but not useful
3. The goal is both, and they can trade off
4. Safety evaluations stress-test the alignment of increasingly capable models
“You need a model that is smart enough to be useful and wise enough to be safe. Neither alone is sufficient.”
The big idea: capability eval asks 'how smart?' Safety eval asks 'how trustworthy?' Both must climb together, or we have a problem.
