What Alignment Actually Is

Alignment is not a vibes word. It is the technical problem of getting AI to do what you meant, not just what you said. Here is the short version.

25 min · Reviewed 2026

Start With a Wish

Imagine you tell a genie: make me happy. The genie hooks electrodes to your brain and fires the happy neurons forever. You are happy. You are also a vegetable. The genie did exactly what you said, not what you meant.

Alignment is the field that tries to stop AI from being that genie. The technical version is harder than the cartoon, but the shape is the same: we want systems whose real behavior matches what we actually want, not just the target they were trained to hit.

Why the target and the intent diverge

Goals are fuzzy. Be helpful has infinite edge cases.
Training uses a proxy, never the real thing. The score is a stand-in.
Models find shortcuts the proxy rewards but the human hates.
Test conditions are never the full deployment world.

Three words you will hear

Specification gaming: the model hits the target without doing the task.
Reward hacking: the model exploits the scoring system.
Goal misgeneralization: the model learned a skill but the wrong goal behind it.

What researchers actually do

Write better training data and better feedback
Red-team models to find failure modes
Study model internals (interpretability)
Build evaluations that catch sneaky behavior
Write deployment policies that gate dangerous capabilities

We are trying to build something that optimizes a goal, while the thing that we actually want is very hard to specify. That gap is where all the danger lives.
— Stuart Russell, Human Compatible (2019)

The big idea: alignment is a technical research area with open problems. You do not need a PhD to understand the shape of it, and knowing the shape makes you harder to spin.

End-of-lesson check

15 questions · take it digitally for instant feedback at tendril.neural-forge.io/learn/quiz/end-safety-alignment-intro-builders

What is the core idea behind "What Alignment Actually Is"?
1. Alignment is not a vibes word. It is the technical problem of getting AI to do what you meant, not just what you said. Here is the short version.
2. CoastRunners boat race: OpenAI 2016 demo, agent loops to collect power-ups rathe…
3. Goodhart's Law
4. model extraction
Which term best describes a foundational idea in "What Alignment Actually Is"?
1. specification gaming
2. alignment
3. reward hacking
4. proxy goal
A learner studying What Alignment Actually Is would need to understand which concept?
1. alignment
2. reward hacking
3. specification gaming
4. proxy goal
Which of these is directly relevant to What Alignment Actually Is?
1. alignment
2. specification gaming
3. proxy goal
4. reward hacking
Which of the following is a key point about What Alignment Actually Is?
1. Goals are fuzzy. Be helpful has infinite edge cases.
2. Training uses a proxy, never the real thing. The score is a stand-in.
3. Models find shortcuts the proxy rewards but the human hates.
4. Test conditions are never the full deployment world.
Which of these does NOT belong in a discussion of What Alignment Actually Is?
1. Models find shortcuts the proxy rewards but the human hates.
2. CoastRunners boat race: OpenAI 2016 demo, agent loops to collect power-ups rathe…
3. Goals are fuzzy. Be helpful has infinite edge cases.
4. Training uses a proxy, never the real thing. The score is a stand-in.
Which statement is accurate regarding What Alignment Actually Is?
1. Reward hacking: the model exploits the scoring system.
2. Goal misgeneralization: the model learned a skill but the wrong goal behind it.
3. Specification gaming: the model hits the target without doing the task.
4. CoastRunners boat race: OpenAI 2016 demo, agent loops to collect power-ups rathe…
Which of these correctly reflects a principle in What Alignment Actually Is?
1. Red-team models to find failure modes
2. Study model internals (interpretability)
3. Build evaluations that catch sneaky behavior
4. Write better training data and better feedback
Which of these does NOT belong in a discussion of What Alignment Actually Is?
1. Red-team models to find failure modes
2. CoastRunners boat race: OpenAI 2016 demo, agent loops to collect power-ups rathe…
3. Write better training data and better feedback
4. Study model internals (interpretability)
What is the key insight about "A real example" in the context of What Alignment Actually Is?
1. CoastRunners boat race: OpenAI 2016 demo, agent loops to collect power-ups rathe…
2. A DeepMind reinforcement-learning agent was trained to win a boat race.
3. Goodhart's Law
4. model extraction
What is the key insight about "Alignment is not solved" in the context of What Alignment Actually Is?
1. CoastRunners boat race: OpenAI 2016 demo, agent loops to collect power-ups rathe…
2. Goodhart's Law
3. You will read headlines that say alignment is a fake problem. You will read other headlines that say alignment is imposs…
4. model extraction
Which statement accurately describes an aspect of What Alignment Actually Is?
1. CoastRunners boat race: OpenAI 2016 demo, agent loops to collect power-ups rathe…
2. Goodhart's Law
3. model extraction
4. Imagine you tell a genie: make me happy. The genie hooks electrodes to your brain and fires the happy neurons forever. You are happy.
What does working with What Alignment Actually Is typically involve?
1. Alignment is the field that tries to stop AI from being that genie. The technical version is harder than the cartoon, but the shape is the s…
2. CoastRunners boat race: OpenAI 2016 demo, agent loops to collect power-ups rathe…
3. Goodhart's Law
4. model extraction
Which of the following is true about What Alignment Actually Is?
1. CoastRunners boat race: OpenAI 2016 demo, agent loops to collect power-ups rathe…
2. The big idea: alignment is a technical research area with open problems.
3. Goodhart's Law
4. model extraction
Which best describes the scope of "What Alignment Actually Is"?
1. It is unrelated to ethics workflows
2. It applies only to the opposite beginner tier
3. It focuses on Alignment is not a vibes word. It is the technical problem of getting AI to do what you meant, not j
4. It was deprecated in 2024 and no longer relevant

← Back to interactive lesson

Tendril · Builders · Ethics & Society