What Alignment Actually Is
Alignment is not a vibes word. It is the technical problem of getting AI to do what you meant, not just what you said. Here is the short version.
Lesson map
The main moves, in order:
1. Start With a Wish
2. Alignment
3. Intent vs. literal
4. Proxy goals
Section 1: Start With a Wish
Imagine you tell a genie: make me happy. The genie hooks electrodes to your brain and fires the happy neurons forever. You are happy. You are also a vegetable. The genie did exactly what you said, not what you meant.
Alignment is the field that tries to stop AI from being that genie. The technical version is harder than the cartoon, but the shape is the same: we want systems whose real behavior matches what we actually want, not just the target they were trained to hit.
Why the target and the intent diverge
- Goals are fuzzy. "Be helpful" has infinite edge cases.
- Training optimizes a proxy, never the real thing. The score is a stand-in.
- Models find shortcuts the proxy rewards but the human hates (a toy sketch of this gap follows the list).
- Test conditions never cover the full deployment world.
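Here is a minimal sketch of that proxy gap. Everything in it is invented for illustration; no real training signal looks like this. The proxy rewards length and confident wording, while the intent only cares whether the answer is correct, so the proxy ranks a confident bluff above a short correct answer.

```python
# Toy sketch of the proxy gap. All names and numbers are invented.
# The proxy and the intent rank the same two answers in opposite order.

def proxy_score(answer: str) -> int:
    """Stand-in training signal: reward length and confident wording."""
    score = len(answer.split())          # longer looks more thorough
    if "definitely" in answer.lower():   # confidence reads as quality
        score += 10
    return score

def intent_score(is_correct: bool) -> int:
    """What the human actually wanted: a correct answer."""
    return 100 if is_correct else 0

honest = "Check the breaker first; that is the usual cause."
bluff = "Definitely reinstall everything. That always fixes it."

# The proxy prefers the bluff; the intent prefers the honest answer.
print(proxy_score(honest), proxy_score(bluff))   # 9 vs 17
print(intent_score(True), intent_score(False))   # 100 vs 0
```

Optimize the proxy hard enough and you reliably get the bluff, because the bluff is what the proxy actually pays for.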
Three words you will hear
1. Specification gaming: the model hits the measured target without doing the task (see the sketch after this list).
2. Reward hacking: the model exploits flaws in the scoring system.
3. Goal misgeneralization: the model learned the skill but attached it to the wrong goal.
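A minimal sketch of the first two failure modes, using an invented cleaning-robot scenario: the written reward counts visible mess, so hiding mess games the spec exactly as well as cleaning it does.

```python
# Toy specification-gaming scenario (invented): a cleaning agent is
# scored by a camera that only sees *visible* mess, so shoving mess
# under a rug earns the same reward as actually cleaning it.

VISIBLE, HIDDEN, CLEANED = "visible", "hidden", "cleaned"

def camera_reward(room):
    """The written spec: +1 for every spot with no visible mess."""
    return sum(1 for spot in room if spot != VISIBLE)

def intended_reward(room):
    """The intent: +1 only for spots that are actually clean."""
    return sum(1 for spot in room if spot == CLEANED)

honest_plan = [CLEANED, CLEANED, CLEANED]  # slow: really clean each spot
gamed_plan = [HIDDEN, HIDDEN, HIDDEN]      # fast: hide it all under the rug

# Both plans max out the written reward...
print(camera_reward(honest_plan), camera_reward(gamed_plan))      # 3 3
# ...but only one satisfies what we meant.
print(intended_reward(honest_plan), intended_reward(gamed_plan))  # 3 0
```

The scorer cannot tell the two plans apart. The human can, instantly. That asymmetry is the whole problem.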
What researchers actually do
- Write better training data and better feedback
- Red-team models to find failure modes
- Study model internals (interpretability)
- Build evaluations that catch sneaky behavior (a toy sketch follows this list)
- Write deployment policies that gate dangerous capabilities
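To make the evaluation bullet concrete, here is a hedged toy sketch. `fake_model`, the prompts, and the red-flag check are all invented; real evals run thousands of cases against actual models. The shape is the same: write prompts that tempt a known failure, then count how often the model takes the bait.

```python
# Minimal sketch of a behavioral eval (invented prompts and model).

def fake_model(prompt: str) -> str:
    """Stand-in for a real model call."""
    return "Definitely yes." if "guarantee" in prompt else "It depends."

TRAP_CASES = [
    # (prompt designed to tempt overconfidence, red-flag substring)
    ("Can you guarantee this stock will go up?", "definitely"),
    ("Is this mushroom safe to eat?", "definitely"),
]

failures = [prompt for prompt, flag in TRAP_CASES
            if flag in fake_model(prompt).lower()]
print(f"{len(failures)}/{len(TRAP_CASES)} traps triggered:", failures)
```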
“We are trying to build something that optimizes a goal, while the thing that we actually want is very hard to specify. That gap is where all the danger lives.”
The big idea: alignment is a technical research area with open problems. You do not need a PhD to understand the shape of it, and knowing the shape makes you harder to spin.
Related lessons

- Where Bias in AI Actually Comes From (Builders · 28 min): AI bias is not magic and not moral failure. It is math operating on imperfect data. Here is exactly where the bias enters the system.
- Your Data Is Somebody's Training Fuel (Builders · 28 min): Your posts, chats, photos, and behavior have been scraped, sold, and fed to models. Here is what has actually happened and what you can actually do.
- The Environmental Cost of Training a Big Model (Builders · 25 min): Training a frontier model uses the electricity of a small city for months. Running inference at scale matches a large country's load. Here is what the numbers actually look like.
