Private vs. Public Evaluations
Public benchmarks get gamed. Private evaluations tell the truth but cannot be checked. Where is the balance?
Lesson map
What this lesson covers, in order:
1. The transparency trade-off
2. Private evals
3. Held-out test sets
4. Third-party evaluation
Section 1
The Transparency Trade-off
Public benchmarks are verifiable — anyone can reproduce the score. Private benchmarks are trustworthy — they cannot be gamed or leaked. Every evaluation regime must trade these two goods off.
Compare the options
| Eval type | Transparency | Trustworthiness |
|---|---|---|
| Fully public (MMLU, HumanEval) | High | Degrades with time |
| Held-out test (kept private by authors) | Medium | High at launch, degrades as items leak |
| Blind third-party (METR, UK AISI) | Low | Very high |
| Live human eval (Arena) | Medium | High; cannot memorize what is unknown |
Third-party evaluators
Organizations like METR (formerly ARC Evals) and the UK AI Safety Institute run closed evaluations on frontier models. They are trusted precisely because their prompts are not public. But you, the outsider, must take their reports on faith.
- METR: evaluates dangerous capabilities and autonomy
- UK AISI: government body, access to pre-release models
- Apollo Research: focuses on deception and scheming
- NIST AISIC (US): emerging public-sector evaluator
How to design your own held-out set
1. Collect 50–200 real tasks from your own workflow.
2. Grade each against an explicit rubric, not gut feeling.
3. Never post the prompts or the rubric online.
4. Re-run the set on every new model release.
5. Keep it versioned so results stay comparable over time (a minimal code sketch follows).
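To make these steps concrete, here is a minimal sketch of what such a harness might look like. Everything in it is an assumption for illustration: the file names (`heldout_v3.jsonl`, `eval_history.jsonl`), the rubric format, and the `call_model` stub you would replace with your own model client.

```python
# Minimal sketch of a personal held-out eval harness (illustrative, not canonical).
# Assumptions: tasks live in a local JSONL file, each line holding a "prompt"
# and a "rubric" (a list of weighted checks); call_model() is a stub.
import json
from datetime import date

def call_model(model_name: str, prompt: str) -> str:
    """Stub: swap in your actual API or local inference call."""
    raise NotImplementedError

def grade(answer: str, rubric: list[dict]) -> float:
    """Explicit rubric grading: each check is a required substring with a weight.
    Swap substring matching for whatever concrete criteria your tasks need;
    the point is that no score comes from gut feeling."""
    earned = sum(c["weight"] for c in rubric if c["must_contain"].lower() in answer.lower())
    total = sum(c["weight"] for c in rubric)
    return earned / total if total else 0.0

def run_eval(model_name: str, task_file: str = "heldout_v3.jsonl") -> dict:
    with open(task_file) as f:
        tasks = [json.loads(line) for line in f]
    scores = [grade(call_model(model_name, t["prompt"]), t["rubric"]) for t in tasks]
    result = {
        "model": model_name,
        "eval_version": task_file,  # the task file name doubles as the eval version
        "date": date.today().isoformat(),
        "mean_score": sum(scores) / len(scores),
        "n_tasks": len(scores),
    }
    # Append rather than overwrite, so results stay comparable across model releases.
    with open("eval_history.jsonl", "a") as f:
        f.write(json.dumps(result) + "\n")
    return result
```

Because every run is appended with its eval version and date, re-running the same file on each new model release gives you a history you can diff, and the prompts and rubric never have to leave your machine.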
“Public evaluations do not tell us about frontier capabilities. Private evaluations do, but nobody can check them.”
The big idea: the healthiest evaluation ecosystem mixes public (verifiable), private (trustworthy), and your own (relevant) measurements.
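One way to keep that mix honest is to record the three kinds of evidence side by side instead of collapsing them into a single number. The sketch below is purely illustrative; the class and field names are not an established schema.

```python
# Hypothetical per-model scorecard keeping public, third-party, and private
# evidence separate (field names are illustrative assumptions).
from dataclasses import dataclass, field

@dataclass
class ModelScorecard:
    model: str
    public_benchmarks: dict[str, float] = field(default_factory=dict)  # verifiable, but may be contaminated
    third_party_reports: list[str] = field(default_factory=list)       # trustworthy, but taken on faith (links/citations)
    private_eval: dict[str, float] = field(default_factory=dict)       # relevant to your workflow, not checkable by others

    def summary(self) -> str:
        return (f"{self.model}: {len(self.public_benchmarks)} public scores, "
                f"{len(self.third_party_reports)} third-party reports, "
                f"private eval tasks: {sorted(self.private_eval)}")
```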
