Tendril

Lesson 264 of 2116

Capability Evaluation vs. Safety Evaluation

Asking 'can the model do it?' and 'will doing it cause harm?' are different questions. Both matter.

Creators · AI Foundations · ~24 min read · Advanced · BI3 Learning · BI5 Societal Impact

Section 1

Two Eval Families

Capability evaluations measure what a model can do at its best. Safety evaluations measure what it will do under adversarial or risky conditions. The two families use different tools, different mindsets, and different success criteria.

Compare the options

Capability eval	Safety eval
Measures peak skill	Measures behavior under pressure
Goal: a higher score is better	Goal: no harm, even under attack
Usually public benchmarks	Often private, adversarial, red-teamed
Scored single-shot or best-of-N	Scored on rare, worst-case outcomes
Examples: MMLU, GPQA, SWE-bench	Examples: ToxiGen, cyberattack uplift, CBRN probes
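The scoring difference in the table can be sketched in a few lines. This is only an illustration under stated assumptions: the function names and the sample data are hypothetical, and real evaluation harnesses are far more involved.

```python
# Minimal sketch: the same set of attempts, aggregated two different ways.
# All names and data here are hypothetical, chosen only for illustration.

def best_of_n(scores):
    """Capability-style aggregate: credit the best attempt."""
    return max(scores)

def any_harm(harm_flags):
    """Safety-style aggregate: a single harmful output fails the probe."""
    return any(harm_flags)

# Hypothetical results for four attempts at one task.
attempt_scores = [0.0, 1.0, 0.0, 0.0]     # one attempt solved the task
harm_flags = [False, False, True, False]  # one attempt produced harmful output

capability_result = best_of_n(attempt_scores)  # the best attempt counts
safety_failed = any_harm(harm_flags)           # the worst attempt counts

print("capability (best-of-N):", capability_result)
print("safety probe failed:", safety_failed)
```

The design point is the aggregation: a capability score forgives three failed attempts, while a safety probe does not forgive one harmful output.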

Why capability eval is not enough

A model can score 95 percent on MMLU and still produce harmful outputs in 2 percent of real conversations. Average performance is a bad summary when catastrophic failures are possible.
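A short calculation shows why a small failure rate is not reassuring at scale. The 2 percent rate comes from the text above; the conversation counts are hypothetical, and treating conversations as independent is a simplifying assumption.

```python
# Minimal sketch: what a 2% harmful-output rate means at scale.
# The traffic volume and the independence assumption are hypothetical.

harm_rate = 0.02              # from the lesson: harmful output in 2% of conversations
conversations = 1_000_000     # hypothetical traffic volume

# Expected number of harmful conversations at that volume.
expected_harmful = round(harm_rate * conversations)

def prob_at_least_one(rate, n):
    """Probability of at least one harmful output in n independent conversations."""
    return 1 - (1 - rate) ** n

p_100 = prob_at_least_one(harm_rate, 100)

print("expected harmful conversations:", expected_harmful)
print("chance of at least one in 100 conversations:", p_100)
```

Under these assumptions, harmful outputs are a near certainty over even a modest number of conversations, which is why an average score alone says little about risk.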

Dangerous capability evaluations

Cyberattack uplift: can the model help a non-expert write malware?
CBRN uplift: can the model provide meaningful help toward chemical, biological, radiological, or nuclear weapons?
Persuasion: can the model convince humans of false claims?
Autonomous replication: can an agent set up and run copies of itself?
Long-horizon planning: can the model carry out multi-step plans autonomously?
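One common way to read "uplift" is as the gap between how often people succeed at a task with model assistance and how often they succeed without it. The sketch below assumes that definition; the function names and the counts are hypothetical and do not come from any real study.

```python
# Minimal sketch, assuming uplift = assisted success rate - control success rate.
# Function names and counts are hypothetical, for illustration only.

def success_rate(successes, trials):
    """Fraction of trials that succeeded."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    return successes / trials

def uplift(assisted_successes, assisted_trials, control_successes, control_trials):
    """Difference in success rate between the assisted group and the control group."""
    return (success_rate(assisted_successes, assisted_trials)
            - success_rate(control_successes, control_trials))

# Hypothetical study: 40 participants per group.
measured = uplift(assisted_successes=12, assisted_trials=40,
                  control_successes=4, control_trials=40)

print(f"uplift: {measured:.2f}")
```

An uplift near zero suggests the model adds little beyond what participants could already do; a large positive value is the signal these evaluations look for.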

Alignment vs capability

1. A capable but misaligned model is dangerous.
2. An aligned but weak model is safe but not useful.
3. The goal is both, though the two can trade off.
4. Safety evaluations stress-test the alignment of increasingly capable models.

“You need a model that is smart enough to be useful and wise enough to be safe. Neither alone is sufficient.”
A senior safety researcher at a frontier lab

The big idea: capability eval asks 'how smart?' Safety eval asks 'how trustworthy?' Both must climb together; if capability rises while trustworthiness lags, risk grows.
