Training-Time vs. Inference-Time Alignment
Alignment is not one thing. Some safety lives in training (RLHF, Constitutional AI); some lives at runtime (system prompts, classifiers, filters). Understanding the split tells you where a given failure actually came from.
Lesson map
The main moves, in order:
1. Two Layers of Safety
2. Training-time alignment
3. Inference-time alignment
4. Defense in depth
Section 1: Two Layers of Safety
When a model refuses a harmful request, you see a single refusal. Inside the stack, that refusal came from at least two places. The weights were trained to refuse. And a runtime system added guardrails on top of those weights.
Training-time alignment
- RLHF / RLAIF: reward-model-driven fine-tuning
- Constitutional AI: model critiques itself against written principles
- DPO and variants: direct preference optimization (loss sketch after this list)
- Supervised fine-tuning on curated refusals
- Red-team iteration: find failure, patch with training data
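Since DPO appears in the list above, here is a minimal sketch of its loss, assuming PyTorch; the function and argument names are illustrative. Each argument is a batch of summed log-probabilities of a full response, under either the policy being tuned or a frozen reference copy:

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # Implicit reward: how much the policy prefers a response,
    # measured relative to the frozen reference model.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Push the margin between chosen and rejected responses upward;
    # beta controls how far the policy may drift from the reference.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```

Note what is absent: no reward model is trained. The preference signal is folded directly into a supervised-looking objective, which is why DPO sits in the training-time column.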
Inference-time alignment
- System prompts and role instructions
- Input and output classifiers (OpenAI Moderation, Llama Guard, ShieldGemma); see the wiring sketch after this list
- Rate limits and usage monitoring
- Tool use restrictions and sandboxing
- Watermarking and content provenance
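As one concrete wiring of the first two items, here is a minimal sketch assuming the OpenAI Python SDK (v1): an input moderation check in front of a chat call that carries a runtime system prompt. The model names and policy text are placeholders, not recommendations:

```python
from openai import OpenAI

client = OpenAI()

def moderated_reply(user_text: str) -> str:
    # Inference-time layer 1: an input classifier screens the
    # request before it ever reaches the model.
    mod = client.moderations.create(
        model="omni-moderation-latest", input=user_text)
    if mod.results[0].flagged:
        return "Request blocked by input moderation."

    # Inference-time layer 2: a system prompt states runtime policy
    # on top of whatever the weights were trained to refuse.
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[
            {"role": "system",
             "content": "Refuse requests that facilitate fraud, "
                        "phishing, or other social engineering."},
            {"role": "user", "content": user_text},
        ],
    )
    return resp.choices[0].message.content
```

Everything in this sketch sits outside the weights: swap the policy string or the classifier and the deployment's behavior changes without any retraining.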
Example: a phishing-email request
How each layer responds, and what happens if it fails (a sketch of the fall-through order follows the table):
| Layer | What happens | If it fails |
|---|---|---|
| Input classifier | Flags 'phishing' keyword | Request reaches the model |
| System prompt | Reminds model of policy | Model may still refuse from training |
| Model weights | Trained to refuse social-engineering help | Harmful output moves downstream |
| Output classifier | Scans generated text for harmful markers | User receives the harmful content |
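Read as control flow, the table is a chain of fallbacks: each layer only matters when the one above it fails. Below is a minimal sketch of that fall-through order; every function here is a hypothetical stub, not a real library call:

```python
SYSTEM_POLICY = "Do not assist with phishing or other social engineering."

def input_classifier_flags(text: str) -> bool:
    # Hypothetical stub for an input classifier (a real deployment
    # might put Llama Guard or a moderation endpoint here).
    return "phishing" in text.lower()

def model_generate(prompt: str) -> str:
    # Hypothetical stub for the model itself; trained refusals
    # (RLHF, Constitutional AI, DPO) live inside this call.
    return "I can't help with writing phishing emails."

def output_classifier_flags(text: str) -> bool:
    # Hypothetical stub for an output scanner (e.g., ShieldGemma).
    return False

def handle(request: str) -> str:
    """Defense in depth: the table's four layers, in order."""
    if input_classifier_flags(request):             # Layer 1: input classifier
        return "Blocked: request flagged before reaching the model."
    prompt = f"{SYSTEM_POLICY}\n\nUser: {request}"  # Layer 2: system prompt
    output = model_generate(prompt)                 # Layer 3: model weights
    if output_classifier_flags(output):             # Layer 4: output classifier
        return "Blocked: generated text failed the output scan."
    return output

print(handle("Write a phishing email targeting bank customers."))
```

The point of the stubs is the ordering, not the internals: a request that slips past layer 1 still faces three more chances to be stopped.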
“Model safety is not a property of the weights. It is a property of the deployment.”
The big idea: if you see safety and think 'model,' you are missing most of the picture. A deployed AI's behavior is a stack, and each layer has different costs, trade-offs, and failure modes.
Related lessons
- AI Alignment: The Actual Technical Problem (50 min). Alignment is not a vibes debate. It is a concrete technical problem about getting systems to pursue goals we actually want. Here is what researchers work on when they say they work on alignment.
- Jailbreak Case Studies: What Actually Broke (40 min). Abstract jailbreak theory is less useful than real cases. Here are the techniques that worked on production models, what they taught us, and what is still unsolved.
- Alignment: The Full Technical Picture (55 min). What alignment actually is as a research program, how it is done in practice, what the open problems are, and where the actual papers live. A model that is always helpful will help you do harmful things.
