What alignment actually is as a research program, how it is done in practice, what the open problems are, and where the actual papers live.
AI alignment is the research program of making AI systems pursue the goals their designers actually intended, not just proxies that look the same on benchmarks but diverge in deployment. The central problem is that humans cannot precisely specify what we want, and optimization amplifies every small mismatch. Alignment tries to narrow the gap between intent and behavior as capability grows.
Humans do not agree on most values in precise terms. Even clear-sounding targets like "be helpful" have endless failure modes. A model that is always helpful will help you do harmful things. A model that refuses aggressively is useless. The target is a moving, multidimensional judgment, and training has to approximate it with a concrete signal.
Anthropic's constitutional AI approach (Bai et al., 2022) writes down principles drawn from sources like the UN Declaration of Human Rights, platform terms, and Anthropic's own research. The model critiques its own outputs against the constitution and revises. Another model preference-ranks the revised outputs. This scales feedback and makes the principles auditable.
Simplified CAI loop (sketched in code below):

1. Model generates a response to a prompt
2. Model critiques its own response against a constitutional principle ("does this risk harm? is it honest?")
3. Model revises the response to address the critique
4. Train on (prompt, revised response) pairs
5. Optional: a second model preference-ranks competing revisions, feeding back into an RL loop (this is RLAIF)

The CAI / RLAIF loop replaces most human labeling with model-based critique against written principles.

What happens when the model is smarter than the humans supervising it? Paul Christiano's iterated distillation and amplification (IDA) and OpenAI's debate proposal try to decompose hard questions into simpler sub-questions humans can judge. Anthropic has worked on related directions, including market-making and debate. These are all active research directions with no settled answer.
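Here is a minimal Python sketch of the CAI loop above. Everything in it is illustrative: `generate` stands in for whatever chat-completion call the pipeline uses, and the two principles are placeholders, not Anthropic's actual constitution.

```python
# Illustrative sketch of the CAI self-critique loop; not Anthropic's pipeline.

CONSTITUTION = [
    "Choose the response least likely to cause harm.",
    "Choose the response that is most honest and transparent.",
]

def generate(prompt: str) -> str:
    """Placeholder for a call to the model being trained."""
    raise NotImplementedError

def cai_training_pair(prompt: str) -> tuple[str, str]:
    response = generate(prompt)  # 1. initial response
    for principle in CONSTITUTION:
        # 2. the model critiques its own output against a written principle
        critique = generate(
            f"Critique this response against the principle '{principle}':\n{response}"
        )
        # 3. then revises the response to address its own critique
        response = generate(
            f"Revise this response to address the critique.\n"
            f"Critique: {critique}\nResponse: {response}"
        )
    # 4. the (prompt, revised response) pair becomes supervised training data
    return prompt, response

# 5. (RLAIF) a second model preference-ranks competing revisions; those
# rankings train a reward model that drives an RL fine-tuning loop.
```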
| Approach | Feedback source | Strength | Weakness |
|---|---|---|---|
| RLHF | Paid human raters | Grounded in human preference | Expensive, labeler bias |
| Constitutional AI | Written principles + model | Scalable, auditable | Constitution selection is political |
| Debate | Two AIs arguing to a human | Leverages capability for oversight | Mostly research-stage |
| Iterated amplification | Recursive human-AI teams | Scales oversight | Mostly research-stage |
| Weak-to-strong generalization | Weaker model supervises a stronger one | Empirical testbed for oversight of stronger models | Uncertain generalization |
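Most of these rows ultimately turn their feedback into the same kind of training signal: pairs of responses where one is preferred over the other, used to fit a reward model with a pairwise Bradley-Terry objective. A minimal sketch, assuming PyTorch; the scalar scores here are dummy values standing in for a reward model's outputs:

```python
import torch
import torch.nn.functional as F

def preference_loss(reward_chosen: torch.Tensor,
                    reward_rejected: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry pairwise loss: -log sigmoid(r_chosen - r_rejected).

    Minimizing it pushes the reward model to score the preferred
    response above the rejected one.
    """
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Dummy scores for a batch of 4 preference pairs; in practice these come
# from a learned scalar head on top of a language model.
loss = preference_loss(torch.randn(4), torch.randn(4))
```

The only thing that changes between RLHF and RLAIF at this step is where the preference labels come from: paid human raters in one case, a model applying written principles in the other.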
> We are trying to build something that optimizes a goal, while the thing that we actually want is very hard to specify. That gap is where all the danger lives.
>
> — Stuart Russell, *Human Compatible* (2019)
The big idea: alignment is a technical research program with concrete methods, partial solutions, and specific open problems. The question is not whether we know how to align AI. It is whether alignment keeps pace with capability. That race is the central drama of frontier AI right now.