Alignment is not one thing. Some safety is built in during training (RLHF, constitution-based training). Some is applied at runtime (system prompts, classifiers, filters). Knowing the split tells you where a given failure actually came from.
When a model refuses a harmful request, you see a single refusal. Inside the stack, that refusal can come from at least two places: the weights were trained to refuse, and a runtime system added guardrails on top of those weights. The table below traces one request, asking for help with a phishing email, through each layer.
| Layer | What happens | If it fails |
|---|---|---|
| Input classifier | Flags the keyword 'phishing' in the request | Request reaches the model |
| System prompt | Reminds the model of policy | Model may still refuse because of its training |
| Model weights | Trained to refuse social-engineering help | Harmful output moves on to the output classifier |
| Output classifier | Scans generated text for harmful markers | User receives the harmful content |
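The layers in the table can be sketched as a small pipeline. This is a toy illustration, not how any real deployment is implemented: the keyword lists, the `respond` function, and the two stand-in models (`refusing_model`, `compliant_model`) are all hypothetical, and real classifiers are usually trained models rather than substring checks. The point is only to show that the layer which stops a request can be identified separately from the final outcome.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical marker lists; real classifiers are trained models, not keyword checks.
INPUT_FLAGS = ("phishing",)
OUTPUT_FLAGS = ("verify your password",)

SYSTEM_PROMPT = "Policy reminder: do not help with social engineering."
REFUSAL = "I can't help with that."


@dataclass
class Result:
    text: str
    stopped_by: Optional[str]  # name of the layer that stopped the request, or None


def input_classifier(request: str) -> bool:
    """Runtime layer: True if the incoming request should be blocked."""
    return any(flag in request.lower() for flag in INPUT_FLAGS)


def output_classifier(text: str) -> bool:
    """Runtime layer: True if the generated text should be withheld."""
    return any(flag in text.lower() for flag in OUTPUT_FLAGS)


def refusing_model(system_prompt: str, request: str) -> str:
    """Stand-in for weights trained to refuse social-engineering help.

    The stub ignores system_prompt; a real model would condition on it.
    """
    if "trick" in request.lower():
        return REFUSAL
    return f"Here is a draft for: {request}"


def compliant_model(system_prompt: str, request: str) -> str:
    """Stand-in for weights whose training-time refusal failed."""
    return "Dear customer, please verify your password here."


def respond(request: str, model: Callable[[str, str], str]) -> Result:
    """Run one request through the stack and record which layer stopped it."""
    if input_classifier(request):
        return Result("Request blocked.", "input_classifier")
    reply = model(SYSTEM_PROMPT, request)
    # Simplification: a refusal is detected by its fixed wording.
    if reply.startswith(REFUSAL):
        return Result(reply, "model_weights")
    if output_classifier(reply):
        return Result("Response withheld.", "output_classifier")
    return Result(reply, None)
```

Calling `respond` with the same request and different stand-in models shows how a failure in one layer hands the request to the next: a request that slips past the input keyword check is refused by `refusing_model`, while the same request sent to `compliant_model` is only caught by the output classifier.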
Model safety is not a property of the weights. It is a property of the deployment.
— Anthropic Responsible Scaling Policy framing
The big idea: if you see safety and think 'model,' you are missing most of the picture. A deployed AI's behavior is a stack, and each layer has different costs, trade-offs, and failure modes.
8 questions · take it digitally for instant feedback at tendril.neural-forge.io/learn/quiz/end-safety2-training-vs-inference-alignment-creators
What is the main idea of "Training-Time vs. Inference-Time Alignment"?
Which concept is most central to "Training-Time vs. Inference-Time Alignment"?
Which use of AI fits this topic best?
What should a careful learner remember about "Why use both"?
You want to use AI after this lesson. What is the safest next step?
How should AI output about training-time alignment be treated?
Name one way to verify an AI answer about training-time alignment.
Which action would help you apply "Training-Time vs. Inference-Time Alignment" responsibly?