What alignment actually is as a research program, how it is done in practice, what the open problems are, and where the actual papers live.
AI alignment is the research program of making AI systems pursue the goals their designers actually intended, not just proxies that look the same on benchmarks but diverge in deployment. The central problem is that humans cannot precisely specify what we want, and optimization amplifies every small mismatch. Alignment tries to narrow the gap between intent and behavior as capability grows.
Humans do not agree on most values in precise terms. Even clear-sounding targets like "be helpful" have endless failure modes. A model that is always helpful will help you do harmful things. A model that refuses aggressively is useless. The target is a moving, multidimensional judgment, and training has to approximate it with a concrete signal.
Anthropic's constitutional AI approach (Bai et al., 2022) writes down principles drawn from sources like the UN Declaration of Human Rights, platform terms, and Anthropic's own research. The model critiques its own outputs against the constitution and revises. Another model preference-ranks the revised outputs. This scales feedback and makes the principles auditable.
Simplified CAI loop (sketched in code below):

1. Model generates a response to a prompt
2. Model critiques its own response against a constitutional principle ("does this risk harm? is it honest?")
3. Model revises the response to address the critique
4. Train on (prompt, revised response) pairs
5. Optional: a second model preference-ranks competing revisions, feeding back into an RL loop (this is RLAIF)

The CAI / RLAIF loop replaces most human labeling with model-based critique against written principles.

What happens when the model is smarter than the humans supervising it? Paul Christiano's iterated distillation and amplification (IDA) and OpenAI's debate proposal try to decompose hard questions into simpler sub-questions humans can judge. Anthropic has worked on related directions, including market-making and debate. These are all active research directions with no settled answer.
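Here is a minimal Python sketch of the CAI loop above. Everything in it is illustrative: `generate` stands in for whatever chat-completion call the pipeline uses, and the two principles are placeholders, not Anthropic's actual constitution.

```python
# Illustrative sketch of the CAI self-critique loop; not Anthropic's pipeline.

CONSTITUTION = [
    "Choose the response least likely to cause harm.",
    "Choose the response that is most honest and transparent.",
]

def generate(prompt: str) -> str:
    """Placeholder for a call to the model being trained."""
    raise NotImplementedError

def cai_training_pair(prompt: str) -> tuple[str, str]:
    response = generate(prompt)  # 1. initial response
    for principle in CONSTITUTION:
        # 2. the model critiques its own output against a written principle
        critique = generate(
            f"Critique this response against the principle '{principle}':\n{response}"
        )
        # 3. then revises the response to address its own critique
        response = generate(
            f"Revise this response to address the critique.\n"
            f"Critique: {critique}\nResponse: {response}"
        )
    # 4. the (prompt, revised response) pair becomes supervised training data
    return prompt, response

# 5. (RLAIF) a second model preference-ranks competing revisions; those
# rankings train a reward model that drives an RL fine-tuning loop.
```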
| Approach | Feedback source | Strength | Weakness |
|---|---|---|---|
| RLHF | Paid human raters | Grounded in human preference | Expensive, labeler bias |
| Constitutional AI | Written principles + model | Scalable, auditable | Constitution selection is political |
| Debate | Two AIs arguing to a human | Leverages capability for oversight | Mostly research-stage |
| Iterated amplification | Recursive human-AI teams | Scales oversight | Mostly research-stage |
| Weak-to-strong generalization | Weaker model supervises a stronger one | Empirical testbed for oversight of stronger models | Uncertain generalization |
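Most of these rows ultimately turn their feedback into the same kind of training signal: pairs of responses where one is preferred over the other, used to fit a reward model with a pairwise Bradley-Terry objective. A minimal sketch, assuming PyTorch; the scalar scores here are dummy values standing in for a reward model's outputs:

```python
import torch
import torch.nn.functional as F

def preference_loss(reward_chosen: torch.Tensor,
                    reward_rejected: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry pairwise loss: -log sigmoid(r_chosen - r_rejected).

    Minimizing it pushes the reward model to score the preferred
    response above the rejected one.
    """
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Dummy scores for a batch of 4 preference pairs; in practice these come
# from a learned scalar head on top of a language model.
loss = preference_loss(torch.randn(4), torch.randn(4))
```

The only thing that changes between RLHF and RLAIF at this step is where the preference labels come from: paid human raters in one case, a model applying written principles in the other.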
> We are trying to build something that optimizes a goal, while the thing that we actually want is very hard to specify. That gap is where all the danger lives.
>
> — Stuart Russell, *Human Compatible* (2019)
The big idea: alignment is a technical research program with concrete methods, partial solutions, and specific open problems. The question is not whether we know how to align AI. It is whether alignment keeps pace with capability. That race is the central drama of frontier AI right now.