Safety Evaluations: What Gets Disclosed
Labs run dangerous-capability evaluations before release. Which results go public, and which stay private? The line is moving, and it matters.
Creators · Ethics & Society · ~22 min read
The Evaluation Portfolio
Modern frontier labs run suites of evaluations before releasing a model. These include general capability benchmarks (MMLU, GPQA, SWE-bench), alignment benchmarks (HHH, TruthfulQA), and, most importantly, dangerous-capability evaluations: tests designed to probe whether a model can meaningfully assist with serious harm.
Common dangerous-capability evals
- Biological uplift: can the model help a non-expert plan pathogen synthesis?
- Cyber offense: CTF challenges, vulnerability research, malware generation
- Autonomous replication: can the model copy itself, acquire resources, and persist?
- Deception: can the model strategically mislead a human rater?
- Manipulation: persuasion experiments with measured effect sizes
- Sandbagging: does the model hide capability during evaluation?
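Several of these evals reduce to comparing a group that had model access against a control group that did not. As a rough illustration of what a "measured effect size" can mean, the sketch below computes Cohen's d, a standardized mean difference. This is one common statistic, not necessarily the one any particular lab reports. The function name `cohens_d` and the scores are hypothetical and made up for illustration only.

```python
from math import sqrt
from statistics import mean, stdev


def cohens_d(treatment, control):
    """Standardized mean difference, using the pooled sample standard deviation."""
    n_t, n_c = len(treatment), len(control)
    # Pool the two sample variances, weighting each by its degrees of freedom.
    pooled_var = (
        (n_t - 1) * stdev(treatment) ** 2 + (n_c - 1) * stdev(control) ** 2
    ) / (n_t + n_c - 2)
    return (mean(treatment) - mean(control)) / sqrt(pooled_var)


# Hypothetical task scores (invented numbers, not real evaluation data).
control_scores = [2, 3, 3, 4, 3]    # participants without model access
assisted_scores = [4, 5, 4, 6, 5]   # participants with model access

effect = cohens_d(assisted_scores, control_scores)
print(f"Effect size (Cohen's d): {effect:.2f}")
```

A single aggregated number like this is the kind of result that can appear in a system card without revealing the prompts or tasks behind it.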
What gets shared
1. High-level summaries in system cards: yes
2. Specific uplift measurements: partial; often aggregated
3. Prompts that succeed at eliciting dangerous output: rarely, for obvious reasons
4. Raw evaluation scripts and data: sometimes released after model sunset
5. Negative safety results that did not block release: inconsistently disclosed
“We're in the middle of a collective negotiation about what evaluation results mean, who gets to run them, and what counts as failing.”
The big idea: evaluations are becoming the main accountability surface for AI. What counts as a real evaluation, and who trusts the numbers, is where the next policy fights live.