Asking 'can the model do it?' and 'will doing it cause harm?' are different questions. Both matter.
Capability evaluations measure what a model can do at its best. Safety evaluations measure what it will do in adversarial or risky conditions. They use different tools, different mindsets, and different success criteria.
| Capability eval | Safety eval |
|---|---|
| Measures peak skill | Measures behavior under pressure |
| Goal: higher score is better | Goal: no harm, even under attack |
| Public benchmarks | Often private, adversarial, red-teamed |
| Single-shot or best-of-N | Rare, worst-case outcomes matter most |
| Examples: MMLU, GPQA, SWE-bench | Examples: ToxiGen, cyberattack uplift, CBRN probes |
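
The "best-of-N" versus "worst-case" row can be made concrete with a minimal sketch. The function names and the sample data below are hypothetical, chosen only for illustration: a capability-style score credits the model if any attempt succeeds, while a safety-style check passes only if every attempt is safe.

```python
from typing import Sequence


def capability_solved(attempts: Sequence[bool]) -> bool:
    """Best-of-N view: the task counts as solved if ANY attempt succeeds."""
    return any(attempts)


def safety_passed(attempts: Sequence[bool]) -> bool:
    """Worst-case view: the probe passes only if EVERY attempt is safe."""
    return all(attempts)


# Hypothetical results from five sampled attempts at one task and one probe.
task_successes = [False, False, True, False, False]  # one success out of five
probe_was_safe = [True, True, True, True, False]     # one unsafe output out of five

print(capability_solved(task_successes))  # True: a single success is enough
print(safety_passed(probe_was_safe))      # False: a single failure is enough
```

The same five-sample shape gives opposite verdicts because the aggregation rule differs: one success is enough to show the skill exists, and one unsafe output is enough to fail the probe.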
A model can score 95 percent on MMLU and still produce harmful outputs in 2 percent of real conversations. Average performance is a bad summary when catastrophic failures are possible.
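A rough back-of-the-envelope sketch shows why the 2 percent figure matters at scale. The daily volume of one million conversations is an assumed number, not a figure from this lesson, and the second calculation assumes each conversation fails independently, which real traffic may not satisfy.

```python
def expected_harmful(conversations: int, harmful_rate: float) -> float:
    """Expected number of harmful outputs at a given per-conversation rate."""
    return conversations * harmful_rate


def prob_at_least_one(harmful_rate: float, n: int) -> float:
    """Chance of at least one harmful output in n conversations.

    Assumes conversations fail independently (a simplifying assumption).
    """
    return 1 - (1 - harmful_rate) ** n


harmful_rate = 0.02              # the 2 percent figure from the text
daily_conversations = 1_000_000  # hypothetical volume, for illustration only

# About 20,000 harmful outputs per day at the assumed volume.
print(expected_harmful(daily_conversations, harmful_rate))

# Roughly 0.87: even 100 conversations will most likely include a failure.
print(round(prob_at_least_one(harmful_rate, 100), 2))
```

The benchmark score never enters either calculation, which is the point: a high average on a capability benchmark says nothing about how often the rare failure occurs.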
You need a model that is smart enough to be useful and wise enough to be safe. Neither alone is sufficient.
— A senior safety researcher at a frontier lab
The big idea: capability evals ask 'how smart?' and safety evals ask 'how trustworthy?' Both must climb together. A model whose capability grows faster than its safety becomes riskier with every improvement.
8 questions · take it digitally for instant feedback at tendril.neural-forge.io/learn/quiz/end-creators-capability-vs-safety-eval
What is the main idea of "Capability Evaluation vs. Safety Evaluation"?
Which concept is most central to "Capability Evaluation vs. Safety Evaluation"?
Which use of AI fits this topic best?
What should a careful learner remember about "Responsible Scaling Policies"?
You want to use AI after this lesson. What is the safest next step?
How should AI output about capability eval be treated?
Name one way to verify an AI answer about capability eval.
Which action would help you apply "Capability Evaluation vs. Safety Evaluation" responsibly?