Process Supervision: Grading the Work, Not the Answer
Most training grades only the final answer. Process supervision grades each reasoning step. That small change produced some of the biggest honesty gains in recent years: in OpenAI's "Let's Verify Step by Step" study, a process reward model trained on roughly 800,000 step-level labels substantially outperformed outcome-only training on math problem solving, and the model was more honest about its own mistakes.
27 min · Reviewed 2026
Answer vs. Reasoning
If a math student guesses the right number with bad reasoning, outcome-graded training rewards the guess. Process supervision grades each step: was the setup correct, was the arithmetic correct, was the final step justified? Wrong steps are penalized even if the final answer is right.
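A minimal sketch of that contrast, assuming a hypothetical step format with hand-assigned correctness labels (real process supervision trains a reward model on such labels rather than hard-coding them):

```python
from dataclasses import dataclass

@dataclass
class Step:
    text: str
    is_correct: bool  # step-level label from a human rater or checker

def outcome_reward(steps: list[Step], final_answer: float, target: float) -> float:
    # Outcome supervision: reward depends only on the final answer,
    # regardless of whether the intermediate reasoning was sound.
    return 1.0 if final_answer == target else 0.0

def process_reward(steps: list[Step]) -> float:
    # Process supervision: one bad step zeroes the reward,
    # even if the final number happens to be right.
    return 1.0 if all(s.is_correct for s in steps) else 0.0

# A lucky guess: the setup is unjustified, but the final answer matches.
solution = [
    Step("Assume x = 3 (no justification given)", is_correct=False),  # wrong setup
    Step("So the answer is 4 * x = 12", is_correct=True),             # arithmetic itself is fine
]
print(outcome_reward(solution, final_answer=12, target=12))  # 1.0 — the guess is rewarded
print(process_reward(solution))                              # 0.0 — the bad step is penalized
```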
Why it helps alignment, not just accuracy
Reasoning becomes legible: you can inspect the chain
The model can't easily hide a lie in a confident final answer
Errors become debuggable — you know which step broke
Sycophancy gets harder: a flattering conclusion with wrong steps gets caught
The limits
Step labels are expensive — humans must read every step
Hard for fuzzy domains: what counts as a correct step in an essay?
Models can still generate plausible-looking wrong steps that slip past raters
Does not guarantee faithful chain of thought — the model may reason one way and write another
The big idea: grading reasoning changes what the model learns to optimize. It is a small change to the training loop with an outsized effect on honesty and debuggability.
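A sketch of how the optimization target shifts, using one common aggregation rule (the product of per-step correctness probabilities, as in OpenAI's "Let's Verify Step by Step"); the per-step probabilities below are made-up numbers standing in for a process reward model's outputs:

```python
import math

def process_solution_score(step_probs: list[float]) -> float:
    # Solution score = product of per-step correctness probabilities,
    # so a single dubious step drags the whole solution down.
    return math.prod(step_probs)

# A solution with one shaky step ranks below a fully sound one,
# even if both end on the right answer.
sound = [0.98, 0.97, 0.99]
shaky = [0.98, 0.40, 0.99]
print(process_solution_score(sound))  # ≈ 0.94
print(process_solution_score(shaky))  # ≈ 0.39
```

Because every step contributes to the score, the policy being trained is pushed toward writing reasoning it can defend at each step, not just toward landing on the right final answer.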
End-of-lesson check
15 questions · take it online for instant feedback at tendril.neural-forge.io/learn/quiz/end-safety2-process-supervision-builders
What is a process reward model (PRM) trained to evaluate?
Only the final answer correctness
The confidence level of the model's output
The length of the model's response
Each individual step in a reasoning chain
According to the research discussed, what happened when process supervision was used for math problem training?
Accuracy jumped substantially
Accuracy decreased slightly
The model stopped solving math problems
Accuracy stayed the same as before
Why is process supervision more expensive than outcome supervision?
Humans must evaluate and label every individual reasoning step
It requires faster computers
It needs more training data
It requires special hardware
What makes it difficult to apply process supervision to essay writing?
Essays cannot be evaluated by AI
Essays are too short
It is hard to determine what counts as a correct step in fuzzy domains
Essays require creative writing
What does it mean that process supervision makes reasoning 'legible'?
The reasoning is hidden from users
The model only outputs simple sentences
The model writes in larger font
The chain of reasoning can be inspected and understood
Why does process supervision make it harder for a model to be sycophantic?
The model is trained to be more agreeable
A flattering conclusion with wrong reasoning steps will be caught
The model cannot output compliments
Sycophancy is not related to reasoning steps
What is the 'faithfulness' problem with chain of thought?
The visible chain of thought may not reflect the actual computation happening internally
The model is too honest
The model always tells the truth
Chain of thought is always accurate
If a model produces a confident final answer that is wrong, why is this harder to detect with outcome supervision alone?
Outcome supervision ignores the final answer
The model cannot be confident if wrong
The model always admits failure
There is no way to see which reasoning step failed
What limitation exists even with process supervision regarding generated reasoning steps?
Models never make mistakes
Models can still generate plausible-looking wrong steps that slip past raters
Process supervision eliminates all errors
Steps are always correct
In the OpenAI study mentioned, how many step-level labels were used to train the process reward model?
8 billion
8 million
800 thousand
80 thousand
Process supervision constrains what aspect of model behavior?
The model's memory
The model's training speed
The output reasoning chain only
The internal neural network weights
Why does the lesson say process supervision has an 'outsized effect' on honesty?
It makes models less confident
It changes the training reward signal to value correct reasoning, not just correct answers
It removes all incentives to lie
It forces models to tell the truth directly
What is outcome supervision?
Only evaluating the final result or answer
A type of model architecture
A method for grading essays
Evaluating each step in reasoning
Why is error debugging easier with process supervision?
Debugging is not related to supervision
You can identify exactly which reasoning step failed
Errors are prevented before they happen
The model fixes errors automatically
The lesson states that process supervision helps the model be more honest about its mistakes. What is the primary reason for this?
The model has fewer mistakes
Honesty is programmed directly into the model
The model is punished for making mistakes
The model cannot hide flawed reasoning behind a correct final answer