Private vs. Public Evaluations
Public benchmarks get gamed. Private evaluations tell the truth but cannot be checked. Where is the balance?
Lesson map
What this lesson covers, in order:
1. The transparency trade-off
2. Private evals
3. Held-out test sets
4. Third-party evaluation
Section 1
The Transparency Trade-off
Public benchmarks are verifiable — anyone can reproduce the score. Private benchmarks are trustworthy — they cannot be gamed or leaked. Every evaluation regime must trade these two goods off.
Compare the options
| Eval type | Transparency | Trustworthiness |
|---|---|---|
| Fully public (MMLU, HumanEval) | High | Degrades with time |
| Held-out test (kept private by authors) | Medium | High at launch, degrades as items leak |
| Blind third-party (METR, UK AISI) | Low | Very high |
| Live human eval (Arena) | Medium | High; cannot memorize what is unknown |
Third-party evaluators
Organizations like METR (formerly ARC Evals) and the UK AI Safety Institute run closed evaluations on frontier models. They are trusted precisely because their prompts are not public. But you, the outsider, must take their reports on faith.
- METR: evaluates dangerous capabilities and autonomy
- UK AISI: government body, access to pre-release models
- Apollo Research: focuses on deception and scheming
- NIST AISIC (US): emerging public-sector evaluator
How to design your own held-out set
1. Collect 50–200 real tasks from your own workflow.
2. Grade each against an explicit rubric, not gut feeling.
3. Never post the prompts or the rubric online.
4. Re-run the set on every new model release.
5. Keep it versioned so results stay comparable over time (a minimal code sketch follows).
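To make these steps concrete, here is a minimal sketch of what such a harness might look like. Everything in it is an assumption for illustration: the file names (`heldout_v3.jsonl`, `eval_history.jsonl`), the rubric format, and the `call_model` stub you would replace with your own model client.

```python
# Minimal sketch of a personal held-out eval harness (illustrative, not canonical).
# Assumptions: tasks live in a local JSONL file, each line holding a "prompt"
# and a "rubric" (a list of weighted checks); call_model() is a stub.
import json
from datetime import date

def call_model(model_name: str, prompt: str) -> str:
    """Stub: swap in your actual API or local inference call."""
    raise NotImplementedError

def grade(answer: str, rubric: list[dict]) -> float:
    """Explicit rubric grading: each check is a required substring with a weight.
    Swap substring matching for whatever concrete criteria your tasks need;
    the point is that no score comes from gut feeling."""
    earned = sum(c["weight"] for c in rubric if c["must_contain"].lower() in answer.lower())
    total = sum(c["weight"] for c in rubric)
    return earned / total if total else 0.0

def run_eval(model_name: str, task_file: str = "heldout_v3.jsonl") -> dict:
    with open(task_file) as f:
        tasks = [json.loads(line) for line in f]
    scores = [grade(call_model(model_name, t["prompt"]), t["rubric"]) for t in tasks]
    result = {
        "model": model_name,
        "eval_version": task_file,  # the task file name doubles as the eval version
        "date": date.today().isoformat(),
        "mean_score": sum(scores) / len(scores),
        "n_tasks": len(scores),
    }
    # Append rather than overwrite, so results stay comparable across model releases.
    with open("eval_history.jsonl", "a") as f:
        f.write(json.dumps(result) + "\n")
    return result
```

Because every run is appended with its eval version and date, re-running the same file on each new model release gives you a history you can diff, and the prompts and rubric never have to leave your machine.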
“Public evaluations do not tell us about frontier capabilities. Private evaluations do, but nobody can check them.”
The big idea: the healthiest evaluation ecosystem mixes public (verifiable), private (trustworthy), and your own (relevant) measurements.
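One way to keep that mix honest is to record the three kinds of evidence side by side instead of collapsing them into a single number. The sketch below is purely illustrative; the class and field names are not an established schema.

```python
# Hypothetical per-model scorecard keeping public, third-party, and private
# evidence separate (field names are illustrative assumptions).
from dataclasses import dataclass, field

@dataclass
class ModelScorecard:
    model: str
    public_benchmarks: dict[str, float] = field(default_factory=dict)  # verifiable, but may be contaminated
    third_party_reports: list[str] = field(default_factory=list)       # trustworthy, but taken on faith (links/citations)
    private_eval: dict[str, float] = field(default_factory=dict)       # relevant to your workflow, not checkable by others

    def summary(self) -> str:
        return (f"{self.model}: {len(self.public_benchmarks)} public scores, "
                f"{len(self.third_party_reports)} third-party reports, "
                f"private eval tasks: {sorted(self.private_eval)}")
```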
