Process Supervision: Grading the Work, Not the Answer
Most training grades only the final answer. Process supervision grades each reasoning step. That small change produced some of the biggest honesty gains in recent years: in OpenAI's "Let's Verify Step by Step" study, a process reward model trained on roughly 800,000 step-level labels substantially outperformed outcome-only training on math problem solving, and the model was more honest about its own mistakes.
27 min · Reviewed 2026
Answer vs. Reasoning
If a math student guesses the right number with bad reasoning, outcome-graded training rewards the guess. Process supervision grades each step: was the setup correct, was the arithmetic correct, was the final step justified? Wrong steps are penalized even if the final answer is right.
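A minimal sketch of that contrast, assuming a hypothetical step format with hand-assigned correctness labels (real process supervision trains a reward model on such labels rather than hard-coding them):

```python
from dataclasses import dataclass

@dataclass
class Step:
    text: str
    is_correct: bool  # step-level label from a human rater or checker

def outcome_reward(steps: list[Step], final_answer: float, target: float) -> float:
    # Outcome supervision: reward depends only on the final answer,
    # regardless of whether the intermediate reasoning was sound.
    return 1.0 if final_answer == target else 0.0

def process_reward(steps: list[Step]) -> float:
    # Process supervision: one bad step zeroes the reward,
    # even if the final number happens to be right.
    return 1.0 if all(s.is_correct for s in steps) else 0.0

# A lucky guess: the setup is unjustified, but the final answer matches.
solution = [
    Step("Assume x = 3 (no justification given)", is_correct=False),  # wrong setup
    Step("So the answer is 4 * x = 12", is_correct=True),             # arithmetic itself is fine
]
print(outcome_reward(solution, final_answer=12, target=12))  # 1.0 — the guess is rewarded
print(process_reward(solution))                              # 0.0 — the bad step is penalized
```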
Why it helps alignment, not just accuracy
Reasoning becomes legible: you can inspect the chain
The model can't easily hide a lie in a confident final answer
Errors become debuggable — you know which step broke
Sycophancy gets harder: a flattering conclusion with wrong steps gets caught
The limits
Step labels are expensive — humans must read every step
Hard for fuzzy domains: what counts as a correct step in an essay?
Models can still generate plausible-looking wrong steps that slip past raters
Does not guarantee faithful chain of thought — the model may reason one way and write another
The big idea: grading reasoning changes what the model learns to optimize. It is a small change to the training loop with an outsized effect on honesty and debuggability.
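A sketch of how the optimization target shifts, using one common aggregation rule (the product of per-step correctness probabilities, as in OpenAI's "Let's Verify Step by Step"); the per-step probabilities below are made-up numbers standing in for a process reward model's outputs:

```python
import math

def process_solution_score(step_probs: list[float]) -> float:
    # Solution score = product of per-step correctness probabilities,
    # so a single dubious step drags the whole solution down.
    return math.prod(step_probs)

# A solution with one shaky step ranks below a fully sound one,
# even if both end on the right answer.
sound = [0.98, 0.97, 0.99]
shaky = [0.98, 0.40, 0.99]
print(process_solution_score(sound))  # ≈ 0.94
print(process_solution_score(shaky))  # ≈ 0.39
```

Because every step contributes to the score, the policy being trained is pushed toward writing reasoning it can defend at each step, not just toward landing on the right final answer.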
End-of-lesson check
15 questions · take it online for instant feedback at tendril.neural-forge.io/learn/quiz/end-safety2-process-supervision-builders
What is a process reward model (PRM) trained to evaluate?
Only the final answer correctness
The confidence level of the model's output
The length of the model's response
Each individual step in a reasoning chain
According to the research discussed, what happened when process supervision was used for math problem training?
Accuracy jumped substantially
Accuracy decreased slightly
The model stopped solving math problems
Accuracy stayed the same as before
Why is process supervision more expensive than outcome supervision?
Humans must evaluate and label every individual reasoning step
It requires faster computers
It needs more training data
It requires special hardware
What makes it difficult to apply process supervision to essay writing?
Essays cannot be evaluated by AI
Essays are too short
It is hard to determine what counts as a correct step in fuzzy domains
Essays require creative writing
What does it mean that process supervision makes reasoning 'legible'?
The reasoning is hidden from users
The model only outputs simple sentences
The model writes in larger font
The chain of reasoning can be inspected and understood
Why does process supervision make it harder for a model to be sycophantic?
The model is trained to be more agreeable
A flattering conclusion with wrong reasoning steps will be caught
The model cannot output compliments
Sycophancy is not related to reasoning steps
What is the 'faithfulness' problem with chain of thought?
The visible chain of thought may not reflect the actual computation happening internally
The model is too honest
The model always tells the truth
Chain of thought is always accurate
If a model produces a confident final answer that is wrong, why is this harder to detect with outcome supervision alone?
Outcome supervision ignores the final answer
The model cannot be confident if wrong
The model always admits failure
There is no way to see which reasoning step failed
What limitation exists even with process supervision regarding generated reasoning steps?
Models never make mistakes
Models can still generate plausible-looking wrong steps that slip past raters
Process supervision eliminates all errors
Steps are always correct
In the OpenAI study mentioned, how many step-level labels were used to train the process reward model?
8 billion
8 million
800 thousand
80 thousand
Process supervision constrains what aspect of model behavior?
The model's memory
The model's training speed
The output reasoning chain only
The internal neural network weights
Why does the lesson say process supervision has an 'outsized effect' on honesty?
It makes models less confident
It changes the training reward signal to value correct reasoning, not just correct answers
It removes all incentives to lie
It forces models to tell the truth directly
What is outcome supervision?
Only evaluating the final result or answer
A type of model architecture
A method for grading essays
Evaluating each step in reasoning
Why is error debugging easier with process supervision?
Debugging is not related to supervision
You can identify exactly which reasoning step failed
Errors are prevented before they happen
The model fixes errors automatically
The lesson states that process supervision helps the model be more honest about its mistakes. What is the primary reason for this?
The model has fewer mistakes
Honesty is programmed directly into the model
The model is punished for making mistakes
The model cannot hide flawed reasoning behind a correct final answer