Public benchmarks get gamed. Private evaluations tell the truth but cannot be checked. Where is the balance?
Public benchmarks are verifiable: anyone can reproduce the score. Private benchmarks are trustworthy: their items are unpublished, so they are hard to game, although they can still leak over time. Every evaluation regime must trade these two goods off.
| Eval type | Transparency | Trustworthiness |
|---|---|---|
| Fully public (MMLU, HumanEval) | High | Degrades with time |
| Held-out test (kept private by authors) | Medium | High at launch, degrades as items leak |
| Blind third-party (METR, UK AISI) | Low | Very high |
| Live human eval (Arena) | Medium | High; models cannot memorize prompts they have not seen |
Organizations like METR (formerly ARC Evals) and the UK AI Safety Institute run closed evaluations on frontier models. They are trusted precisely because their prompts are not public. But you, the outsider, must take their reports on faith.
Public evaluations do not tell us about frontier capabilities. Private evaluations do, but nobody can check them.
— A policy researcher at a frontier AI lab
The big idea: the healthiest evaluation ecosystem mixes public (verifiable), private (trustworthy), and your own (relevant) measurements.
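To make the "your own" part concrete, here is a minimal sketch of a personal private eval in Python. Everything in it is a made-up illustration rather than part of any real benchmark or library: the `PRIVATE_EVAL` items, the `score` function, and the `toy_model` stand-in are all assumptions. The only idea it shows is that you write a few prompts that matter to you, keep them unpublished, and measure how often a model gets them right.

```python
from typing import Callable

# Hypothetical private eval set: prompts you wrote yourself and never published.
PRIVATE_EVAL = [
    {"prompt": "What is 17 + 25?", "expected": "42"},
    {"prompt": "What is the capital of France?", "expected": "Paris"},
    {"prompt": "Spell 'cat' backwards.", "expected": "tac"},
]


def score(model: Callable[[str], str], items: list[dict]) -> float:
    """Return the fraction of items whose expected answer appears in the reply."""
    if not items:
        return 0.0
    hits = sum(
        1
        for item in items
        if item["expected"].lower() in model(item["prompt"]).lower()
    )
    return hits / len(items)


def toy_model(prompt: str) -> str:
    """Stand-in for a real model call (assumption); it knows one answer only."""
    canned = {"What is 17 + 25?": "The answer is 42."}
    return canned.get(prompt, "I am not sure.")


if __name__ == "__main__":
    print(f"private eval accuracy: {score(toy_model, PRIVATE_EVAL):.2f}")
```

Substring matching is a deliberately crude grader, chosen only to keep the sketch short. In practice you would swap `toy_model` for a call to whichever model you are testing, and keep the items out of anything that could end up in training data.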
8 questions · take it digitally for instant feedback at tendril.neural-forge.io/learn/quiz/end-creators-private-vs-public-evals
What is the main idea of "Private vs. Public Evaluations"?
Which concept is most central to "Private vs. Public Evaluations"?
Which use of AI fits this topic best?
What should a careful learner remember about "Your private eval"?
You want to use AI after this lesson. What is the safest next step?
How should AI output about private eval be treated?
Name one way to verify an AI answer about private eval.
Which action would help you apply "Private vs. Public Evaluations" responsibly?