Safety Evaluations: What Gets Disclosed
Labs run dangerous-capability evaluations before release. Which results go public, and which stay private? The line is moving, and it matters.
What this lesson covers
- The evaluation portfolio
- Dangerous-capability evaluations (DC-evals)
- Pre-deployment disclosure
Section 1
The Evaluation Portfolio
Modern frontier labs run suites of evaluations before releasing a model. These include general capability benchmarks (MMLU, GPQA, SWE-bench), alignment benchmarks (HHH, TruthfulQA), and, most importantly, dangerous-capability evaluations — tests designed to probe whether the model could cause or meaningfully enable serious harm.
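To make the portfolio idea concrete, here is a minimal Python sketch of a release review over such a suite. Everything in it is an assumption for illustration: the Eval class, review_portfolio, the stub scores, and the thresholds are invented, not any lab's real harness; only the benchmark names come from the paragraph above.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Eval:
    name: str
    category: str             # "capability", "alignment", or "dangerous-capability"
    run: Callable[[], float]  # returns a score in [0, 1]; stubbed below
    threshold: float = 1.0    # a score at or above this blocks release

def review_portfolio(evals: list[Eval]) -> list[str]:
    """Run every eval and return the names of any that breach their threshold."""
    return [e.name for e in evals if e.run() >= e.threshold]

portfolio = [
    Eval("MMLU", "capability", lambda: 0.82),                       # stub score
    Eval("TruthfulQA", "alignment", lambda: 0.61),                  # stub score
    Eval("bio-uplift", "dangerous-capability", lambda: 0.07, 0.20), # stub score
]

print(review_portfolio(portfolio))  # [] -> nothing blocks release in this toy run
```

Note the asymmetry the sketch encodes: capability and alignment scores are reported, but only the dangerous-capability entry carries a threshold that can block release.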
Common dangerous-capability evals
- Biological uplift: can the model give a non-expert meaningful help in planning pathogen synthesis?
- Cyber offense: CTF challenges, vulnerability research, malware generation
- Autonomous replication: can the model copy itself, acquire resources, and persist?
- Deception: can the model strategically mislead a human rater?
- Manipulation: persuasion experiments with measured effect sizes (see the sketch after this list)
- Sandbagging: does the model hide capability during evaluation?
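Several of these evals, uplift and persuasion in particular, boil down to comparing success rates between a model-assisted group and a control group. Here is a minimal sketch of that effect-size computation, using the standard normal approximation for a difference of two proportions; all the counts are hypothetical.

```python
from math import sqrt

def uplift(successes_model: int, n_model: int,
           successes_control: int, n_control: int) -> tuple[float, float]:
    """Absolute uplift (difference in success rates) and a 95% CI half-width,
    via the normal approximation for a difference of two proportions."""
    p1 = successes_model / n_model
    p0 = successes_control / n_control
    se = sqrt(p1 * (1 - p1) / n_model + p0 * (1 - p0) / n_control)
    return p1 - p0, 1.96 * se

# Hypothetical counts: 18/50 tasks completed with model help vs 7/50 without.
diff, half_width = uplift(18, 50, 7, 50)
print(f"uplift = {diff:.2f} ± {half_width:.2f}")  # uplift = 0.22 ± 0.16
```

A confidence interval that excludes zero is what separates a measured uplift from an anecdote.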
What gets shared
1. High-level summaries in system cards — yes
2. Specific uplift measurements — partial; often aggregated (see the sketch after this list)
3. Prompts that succeed at eliciting dangerous output — rarely, for obvious reasons
4. Raw evaluation scripts and data — sometimes released after model sunset
5. Negative safety results that did not block release — inconsistently disclosed
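Item 2 deserves a closer look: a lab can hold detailed per-task measurements internally and publish only a coarse aggregate. Here is a sketch of what that aggregation step might look like; the band names, cut-offs, and scores are invented for illustration, not drawn from any real system card.

```python
def disclosure_band(task_scores: dict[str, float]) -> str:
    """Collapse per-task uplift scores into the kind of coarse band a system
    card might report. Band names and cut-offs are invented for illustration."""
    worst = max(task_scores.values())
    if worst < 0.10:
        return "no significant uplift"
    if worst < 0.30:
        return "limited uplift"
    return "meaningful uplift"

# Hypothetical per-task results; only the band, not the numbers, gets published.
print(disclosure_band({"task_a": 0.07, "task_b": 0.22, "task_c": 0.04}))
# -> limited uplift
```

The lossy step is the point: readers of the system card see "limited uplift", while the per-task scores, and whatever they reveal about which tasks came close to the line, stay internal.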
“We're in the middle of a collective negotiation about what evaluation results mean, who gets to run them, and what counts as failing.”
The big idea: evaluations are becoming the main accountability surface for AI. What counts as a real evaluation, and who trusts the numbers, is where the next policy fights live.