Labs run dangerous-capability evaluations before release. Which results go public, and which stay private? The line is moving, and it matters.
Modern frontier labs run suites of evaluations before releasing a model. These include general capability benchmarks (MMLU, GPQA, SWE-bench), alignment benchmarks (HHH, TruthfulQA), and, most importantly, dangerous-capability evaluations: tests designed to probe whether a model can meaningfully assist with serious harm.
We're in the middle of a collective negotiation about what evaluation results mean, who gets to run them, and what counts as failing.
— Beth Barnes, METR (paraphrased)
The big idea: evaluations are becoming the main accountability surface for AI. The next policy fights will be over what counts as a real evaluation and who trusts the numbers.
15 questions · take it online for instant feedback at tendril.neural-forge.io/learn/quiz/end-safety2-safety-evals-disclosure-creators
What is the primary purpose of dangerous-capability evaluations before an AI model is released to the public?
Which organization is explicitly mentioned as running autonomy evaluations for AI models?
Why do AI labs typically keep the specific prompts that successfully elicit dangerous outputs confidential?
In AI safety testing, what does the term 'sandbagging' refer to?
What is the central trade-off discussed regarding the disclosure of evaluation prompts and data?
Which of the following is NOT listed in the lesson as a common dangerous-capability evaluation category?
What does an 'uplift study' specifically measure in AI safety evaluation?
Why might negative safety results that did not block a model's release be inconsistently disclosed?
What does 'structured access' mean in the context of AI safety evaluation disclosure?
What advantage do third-party evaluators like Apollo Research provide over internal lab testing?
Which of the following is specifically identified as an alignment benchmark in the lesson?
When the lesson mentions that specific uplift measurements are 'partially' disclosed, what does this typically mean in practice?
What concern does the lesson identify about the evaluation process itself?
Which organization is mentioned as focusing specifically on 'scheming' behavior in AI models?
What is the primary function of a system card that accompanies an AI model release?