Labs run dangerous-capability evaluations before release. Which results go public, and which stay private? The line is moving, and it matters.
Modern frontier labs run suites of evaluations before releasing a model. These include general capability benchmarks (MMLU, GPQA, SWE-bench), alignment benchmarks (HHH, TruthfulQA), and, most importantly, dangerous-capability evaluations: tests designed to probe whether a model can meaningfully assist with serious harm.
We're in the middle of a collective negotiation about what evaluation results mean, who gets to run them, and what counts as failing.
— Beth Barnes, METR (paraphrased)
The big idea: evaluations are becoming the main accountability surface for AI. The next policy fights will be over what counts as a real evaluation and who trusts the numbers.
15 questions · take it online for instant feedback at tendril.neural-forge.io/learn/quiz/end-safety2-safety-evals-disclosure-creators
What is the primary purpose of dangerous-capability evaluations before an AI model is released to the public?
Which organization is explicitly mentioned as running autonomy evaluations for AI models?
Why do AI labs typically keep the specific prompts that successfully elicit dangerous outputs confidential?
In AI safety testing, what does the term 'sandbagging' refer to?
What is the central trade-off discussed regarding the disclosure of evaluation prompts and data?
Which of the following is NOT listed in the lesson as a common dangerous-capability evaluation category?
What does an 'uplift study' specifically measure in AI safety evaluation?
Why might negative safety results that did not block a model's release be inconsistently disclosed?
What does 'structured access' mean in the context of AI safety evaluation disclosure?
What advantage do third-party evaluators like Apollo Research provide over internal lab testing?
Which of the following is specifically identified as an alignment benchmark in the lesson?
When the lesson mentions that specific uplift measurements are 'partially' disclosed, what does this typically mean in practice?
What concern does the lesson identify about the evaluation process itself?
Which organization is mentioned as focusing specifically on 'scheming' behavior in AI models?
What is the primary function of a system card that accompanies an AI model release?