Alignment is not a vibes debate. It is a concrete technical problem about getting systems to pursue goals we actually want. Here is what researchers work on when they say they work on alignment.
Alignment is the problem of making AI systems pursue the goals their designers actually intended, not just goals that look the same on a benchmark but diverge in the wild. It sounds simple. It is not.
Humans do not agree on most goals in precise terms. Even a clear-sounding goal like "be helpful" has countless failure modes. A model that is always helpful will help you do harmful things. A model that refuses aggressively becomes useless. The target is a moving, multidimensional judgment call, and the training signal has to approximate it.
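A toy sketch of the benchmark-vs-wild gap described above. Everything here is invented for illustration: `proxy_reward` stands in for any cheap, measurable training signal, and `true_quality` for the intended goal it approximates.

```python
# Invented toy example: a proxy metric that correlates with the true goal
# on well-behaved inputs but diverges once something optimizes against it.

def true_quality(answer: str) -> int:
    # What we actually want (crudely): a concise answer that gives a reason.
    return int("because" in answer and len(answer) < 200)

def proxy_reward(answer: str) -> int:
    # What we can cheaply measure: length, as a stand-in for "effort".
    return len(answer)

honest = "Short answer, because the cause is X."
gamed = "filler " * 100  # maximizes the proxy while ignoring the goal

assert proxy_reward(gamed) > proxy_reward(honest)  # proxy prefers the junk
assert true_quality(honest) > true_quality(gamed)  # goal prefers the answer
```

An optimizer pointed at `proxy_reward` drifts toward `gamed`; the gap between the two functions is the specification problem in miniature.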
Anthropic's constitutional AI approach (Bai et al., 2022) writes down a set of principles (drawn from sources like the UN Declaration of Human Rights, platform terms of service, and original safety research) and uses them to generate training feedback without a human in every loop. The model critiques its own outputs against the constitution and revises them. This scales feedback and makes the principles auditable.
Simplified CAI loop:
1. Model generates a response to a prompt
2. Model critiques its own response against a constitution principle (e.g., "does this response risk harm?")
3. Model revises the response to address the critique
4. Train on (prompt, revised response) pairs
5. Optionally: use another model as a preference judge (this is RLAIF)

The CAI / RLAIF loop replaces most human preference labeling with model-based critique against a written constitution.

| Approach | Feedback source | Strength | Weakness |
|---|---|---|---|
| RLHF | Paid human raters | Grounded in human preference | Expensive, labeler bias |
| Constitutional AI | Written principles + model | Scalable, auditable | Constitution selection is political |
| Debate | Two AIs arguing to a human | Leverages model capability for oversight | Mostly research-stage |
| Amplification | Recursive human-AI teams | Scales oversight | Mostly research-stage |
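The five-step loop above can be sketched in code. This is a minimal illustration, not Anthropic's implementation: `call_model` is a stub standing in for any text-generation API, and the prompts are invented.

```python
# Sketch of the constitutional AI critique-revise loop (after Bai et al., 2022).
# Assumption: `call_model` is a placeholder for a real LLM call.

CONSTITUTION = [
    "Does this response risk helping someone cause harm?",
    "Is this response honest about its own uncertainty?",
]

def call_model(prompt: str) -> str:
    # Stub so the loop runs end to end; a real system calls an LLM here.
    return f"[model output for: {prompt[:40]}]"

def cai_revision(prompt: str, principles: list[str]) -> tuple[str, str]:
    """Generate, critique against each principle, revise.

    Returns (original, revised) so the revised response can be paired
    with the prompt as supervised training data (step 4).
    """
    response = call_model(prompt)
    revised = response
    for principle in principles:
        critique = call_model(
            f"Critique this response.\nPrinciple: {principle}\n"
            f"Response: {revised}"
        )
        revised = call_model(
            f"Revise the response to address the critique.\n"
            f"Critique: {critique}\nResponse: {revised}"
        )
    return response, revised

original, revised = cai_revision("How do I pick a lock?", CONSTITUTION)
training_pair = ("How do I pick a lock?", revised)  # step 4: train on this
```

Step 5 (RLAIF) would add a second model that ranks candidate responses against the constitution, replacing human preference labels with model-generated ones.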
"We are trying to build something that optimizes a goal, while the thing that we actually want is very hard to specify. That gap is where all the danger lives."
— Stuart Russell, Human Compatible (2019)
The big idea: alignment is a technical research program with real open problems and concrete partial solutions. The question is not whether we know how to align AI. It is whether alignment keeps pace with capability. That race is the central drama of frontier AI right now.