Labs run dangerous-capability evaluations before release. Which results go public, and which stay private? The line is moving, and it matters.
Modern frontier labs run suites of evaluations before releasing a model. These include general capability benchmarks (MMLU, GPQA, SWE-bench), alignment benchmarks (HHH, TruthfulQA), and, most importantly, dangerous-capability evaluations: tests designed to probe whether a model could help cause serious harm.
We're in the middle of a collective negotiation about what evaluation results mean, who gets to run them, and what counts as failing.
— Beth Barnes, METR (paraphrased)
The big idea: evaluations are becoming the main accountability surface for AI. What counts as a real evaluation, and who trusts the numbers, is where the next policy fights live.
8 questions · take it digitally for instant feedback at tendril.neural-forge.io/learn/quiz/end-safety2-safety-evals-disclosure-creators
What is the main idea of "Safety Evaluations: What Gets Disclosed"?
Which concept is most central to "Safety Evaluations: What Gets Disclosed"?
Which use of AI fits this topic best?
What should a careful learner remember about "METR, Apollo, and AISI"?
You want to use AI after this lesson. What is the safest next step?
How should AI output about dangerous capability evaluation be treated?
Name one way to verify an AI answer about dangerous capability evaluation.
Which action would help you apply "Safety Evaluations: What Gets Disclosed" responsibly?