Private vs. Public Evaluations

Public benchmarks get gamed. Private evaluations are harder to game but cannot be independently checked. Where is the balance?

35 min · Reviewed 2026

The Transparency Trade-off

Public benchmarks are verifiable: anyone can reproduce the score. Private benchmarks are more trustworthy: because their items are not published, they are much harder to game, though items can still leak over time. Every evaluation regime must trade these two goods off against each other.

Eval type	Transparency	Trustworthiness
Fully public (MMLU, HumanEval)	High	Degrades with time
Held-out test (kept private by authors)	Medium	High at launch, degrades as items leak
Blind third-party (METR, UK AISI)	Low	Very high
Live human eval (Arena)	Medium	High; cannot memorize what is unknown

Third-party evaluators

Organizations like METR (formerly ARC Evals) and the UK AI Safety Institute run closed evaluations on frontier models. They are trusted precisely because their prompts are not public. But you, the outsider, must take their reports on faith.

METR: evaluates dangerous capabilities and autonomy
UK AISI: government body, access to pre-release models
Apollo Research: focuses on deception and scheming
NIST AISIC (US): emerging public-sector evaluator

How to design your own held-out set

Collect 50-200 real tasks from your own workflow
Grade each with an explicit rubric (not gut feeling)
Never post the prompts or the rubric online
Re-run on every new model release
Keep it versioned so results are comparable over time
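
The steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a recommended harness: the names Task, set_version, grade, run_eval, and stub_model are all hypothetical, the rubric is reduced to "the answer must contain these phrases", and the model is a stand-in function rather than a real API call. A real set would hold 50-200 tasks; two are shown for brevity.

```python
import hashlib
import json
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Task:
    task_id: str
    prompt: str
    # Explicit rubric (deliberately crude): phrases the answer must contain.
    required_phrases: tuple


def set_version(tasks: list) -> str:
    """Fingerprint prompts and rubrics so scores are compared only within one version."""
    ordered = sorted(tasks, key=lambda t: t.task_id)
    payload = json.dumps([[t.task_id, t.prompt, list(t.required_phrases)] for t in ordered])
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


def grade(task: Task, answer: str) -> float:
    """Return the fraction of rubric phrases found in the answer (0.0 to 1.0)."""
    if not task.required_phrases:
        return 0.0
    text = answer.lower()
    hits = sum(1 for phrase in task.required_phrases if phrase.lower() in text)
    return hits / len(task.required_phrases)


def run_eval(tasks: list, model_name: str, ask_model: Callable[[str], str]) -> dict:
    """Run every task through the model and record scores next to the set version."""
    scores = {t.task_id: grade(t, ask_model(t.prompt)) for t in tasks}
    return {
        "model": model_name,
        "set_version": set_version(tasks),
        "mean_score": sum(scores.values()) / len(scores),
        "scores": scores,
    }


# Hypothetical tasks; keep the real file out of public repositories.
tasks = [
    Task("refund-01", "Summarize our refund policy in one sentence.", ("30 days", "receipt")),
    Task("sql-02", "Write a query counting orders per customer.", ("group by", "count(")),
]


def stub_model(prompt: str) -> str:
    # Stand-in for a real model call (assumption: replace with your own client).
    return "Refunds are accepted within 30 days."


report = run_eval(tasks, "stub-model", stub_model)
print(report["set_version"], report["mean_score"])
```

Storing the fingerprint next to every score is one way to keep results comparable: if a prompt or rubric changes, the fingerprint changes, and older scores should not be compared against newer ones. Re-running on a new model release then means calling run_eval again with a different ask_model function.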

Public evaluations do not tell us about frontier capabilities. Private evaluations do, but nobody can check them.
— A policy researcher at a frontier AI lab

The big idea: the healthiest evaluation ecosystem mixes public (verifiable), private (trustworthy), and your own (relevant) measurements.

End-of-lesson check

8 questions · take it digitally for instant feedback at tendril.neural-forge.io/learn/quiz/end-creators-private-vs-public-evals

What is the main idea of "Private vs. Public Evaluations"?
1. Public benchmarks get gamed.
2. Use AI as the final authority for the whole decision
3. Avoid checking the answer once it sounds polished
4. Focus only on speed instead of judgment

Which concept is most central to "Private vs. Public Evaluations"?
1. held-out
2. private eval
3. third-party evaluation
4. held-out set

Which use of AI fits this topic best?
1. Let the AI decide what matters without your review
2. Use the answer before checking whether it fits the situation
3. METR: evaluates dangerous capabilities and autonomy
4. Treat the AI output as automatically correct

What should a careful learner remember about "Your private eval"?
1. Use AI to draft or organize ideas about private eval, then verify before acting.
2. Skip the context so the tool can guess faster
3. Treat the output as private even after sharing it online
4. Use the answer without checking the source

You want to use AI after this lesson. What is the safest next step?
1. Act immediately because the AI answer is written clearly
2. Use AI for drafting and comparison, but verify before publishing or relying on it.
3. Hide uncertainty so the final answer looks cleaner
4. Use private or sensitive details before checking permission

How should AI output about private eval be treated?
1. As proof that no other source is needed
2. As a replacement for context, consent, or expert review
3. As a draft or helper output that still needs human judgment and verification
4. As something that becomes correct when it sounds confident

Name one way to verify an AI answer about private eval.

Which action would help you apply "Private vs. Public Evaluations" responsibly?
1. Use the tool to avoid thinking through the tradeoff
2. Keep going even if the output conflicts with a trusted source
3. Treat the AI output as automatically correct
4. UK AISI: government body, access to pre-release models
