When AI outputs get too long, too technical, or too fast for humans to check, how do you know the model is doing the right thing? Scalable oversight is the research program trying to answer that.
Human feedback is the backbone of modern alignment. Raters read a model's answer and upvote or downvote it. That works well when the answer is short and the rater is qualified. It breaks down when the answer is a 30-page research paper, a 10,000-line codebase, or a claim in a field no rater actually knows.
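For readers who want the mechanics, here is a minimal sketch of how pairwise rater judgments typically become a training signal: a reward model is fit so the rater-preferred answer scores higher, via a Bradley-Terry style loss of the kind commonly used in RLHF. The scores and names below are illustrative, not taken from any particular library.

```python
import torch
import torch.nn.functional as F

def preference_loss(reward_chosen: torch.Tensor,
                    reward_rejected: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry style pairwise loss: minimized when the
    rater-preferred answer consistently outscores the rejected one."""
    # -log sigmoid(r_chosen - r_rejected), averaged over the batch
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Illustrative usage: fake reward-model scores for three rater
# comparisons (each row is one chosen-vs-rejected answer pair).
r_chosen = torch.tensor([1.2, 0.4, 2.0])
r_rejected = torch.tensor([0.3, 0.9, 1.1])
loss = preference_loss(r_chosen, r_rejected)
```

The key point is that every gradient step here is anchored to a human judgment between two complete answers, which is exactly the step that stops scaling once answers outgrow what a rater can actually check.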
The hope is that we can use AI to help us align smarter AI, bootstrapping supervision up the capability curve.
— Jan Leike, former co-lead of OpenAI's Superalignment team
The big idea: alignment at scale is not just better labels. It is a research bet that supervision itself can be amplified without losing the human anchor.
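One concrete picture of "amplifying supervision" is the decomposition loop behind iterated amplification: a limited overseer never judges the whole hard output directly, but delegates subquestions to AI assistants and only has to combine their answers. The sketch below is a toy rendering of that loop under assumed interfaces; decompose, assistant, and human_combine are hypothetical stand-ins, not a real API.

```python
from typing import Callable, List

def amplified_judge(question: str,
                    decompose: Callable[[str], List[str]],
                    assistant: Callable[[str], str],
                    human_combine: Callable[[str, List[str]], str]) -> str:
    """Toy iterated-amplification step: AI assistants answer smaller
    subquestions, and the human only combines those answers -- a task
    that stays within human bandwidth."""
    subquestions = decompose(question)                   # break the hard task down
    sub_answers = [assistant(q) for q in subquestions]   # AI does the heavy lifting
    return human_combine(question, sub_answers)          # human stays the anchor

# Hypothetical stubs standing in for real components.
verdict = amplified_judge(
    "Is this 30-page proof correct?",
    decompose=lambda q: ["Is lemma 1 sound?", "Is lemma 2 sound?"],
    assistant=lambda q: f"Checked: {q} Looks sound.",
    human_combine=lambda q, answers: "accept" if all("sound" in a for a in answers) else "flag",
)
```

Debate, recursive reward modeling, and critique models are different instantiations of the same bet: restructure the evaluation so that each human judgment stays small, checkable, and genuinely human.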
15 questions · take it digitally for instant feedback at tendril.neural-forge.io/learn/quiz/end-safety2-scalable-oversight-builders
What is the 'bandwidth problem' in AI oversight?
Why might raters struggle to evaluate AI outputs in specialized domains like advanced biology or mathematics?
In the 'debate' approach to scalable oversight, what do the two models do?
What does 'iterated amplification' mean as an oversight strategy?
What is 'recursive reward modeling'?
How does 'process supervision' differ from standard outcome-based evaluation?
What is a 'critique model' in scalable oversight?
What risk does 'rater fatigue' create in human feedback systems?
What is the main goal of scalable oversight research?
Why might RLHF alone become insufficient for very capable future AI systems?
What did the lesson say about whether current scalable oversight methods are proven to work?
In scalable oversight, what keeps the 'human in the loop' truly human?
Why is it dangerous if a model is wrong in ways the rater cannot see?
What does it mean to 'bootstrap supervision up the capability curve'?
The lesson compares scalable oversight to a 'research bet.' What does this mean?