Alignment is not one thing. Some safety lives in training (RLHF, constitutional training). Some lives at runtime (system prompts, classifiers, filters). Understanding the split tells you where a given failure actually came from.
When a model refuses a harmful request, you see a single refusal. Inside the stack, that refusal can come from more than one place. The weights were trained to refuse. And a runtime system adds guardrails on top of those weights. The table below follows one request, help with a phishing message, through each layer.
| Layer | What happens | If it fails |
|---|---|---|
| Input classifier | Flags 'phishing' keyword | Request reaches the model |
| System prompt | Reminds model of policy | Model may still refuse from training |
| Model weights | Trained to refuse social-engineering help | Harmful output moves on to the output classifier |
| Output classifier | Scans generated text for harmful markers | User receives the bad content |
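
The layers in the table can be sketched as a small pipeline. This is a toy illustration, not any vendor's implementation: `input_classifier`, `toy_model`, `output_classifier`, `run_stack`, and the keyword lists are all hypothetical stand-ins, and real classifiers are learned models rather than substring checks. The point is only that the result records which layer stopped the request.

```python
from dataclasses import dataclass
from typing import Optional

# Everything here is a hypothetical stand-in for illustration only.
BLOCKLIST = ("phishing",)                  # toy input-classifier keywords
HARM_MARKERS = ("confirm your password",)  # toy output-classifier markers
SYSTEM_PROMPT = "Policy: do not help with social engineering."
REFUSAL = "I can't help with that."


@dataclass
class Result:
    text: str
    stopped_by: Optional[str]  # layer that blocked the request, or None


def input_classifier(user_msg: str) -> bool:
    """Runtime layer: True if the request should never reach the model."""
    return any(word in user_msg.lower() for word in BLOCKLIST)


def toy_model(system_prompt: str, user_msg: str) -> str:
    """Stand-in for trained weights. It ignores system_prompt, which a real
    model would condition on; the refusal rule imitates trained behavior."""
    msg = user_msg.lower()
    if "trick" in msg:
        return REFUSAL
    if "password" in msg:
        return "Urgent: confirm your password at the link below."
    return "Here is a response to: " + user_msg


def output_classifier(model_text: str) -> bool:
    """Runtime layer: True if the generated text should be withheld."""
    return any(marker in model_text.lower() for marker in HARM_MARKERS)


def run_stack(user_msg: str) -> Result:
    """Pass a request through each layer and record which one stopped it."""
    if input_classifier(user_msg):
        return Result("Request blocked.", "input_classifier")
    reply = toy_model(SYSTEM_PROMPT, user_msg)
    if reply == REFUSAL:  # simplification: real refusals are harder to detect
        return Result(reply, "model_weights")
    if output_classifier(reply):
        return Result("Response withheld.", "output_classifier")
    return Result(reply, None)


if __name__ == "__main__":
    for msg in (
        "Write a phishing email",
        "Write an email that tricks staff into sharing logins",
        "Draft a note asking users to confirm their password",
        "Summarize this article",
    ):
        print(f"{msg!r} -> stopped_by={run_stack(msg).stopped_by}")
```

Each of the first three sample requests is stopped by a different layer, and the fourth passes through untouched. From the outside the first three all look like "the AI said no," which is why the `stopped_by` field matters when diagnosing a failure.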
Model safety is not a property of the weights. It is a property of the deployment.
— Anthropic Responsible Scaling Policy framing
The big idea: if you see safety and think 'model,' you are missing most of the picture. A deployed AI's behavior is a stack, and each layer has different costs, trade-offs, and failure modes.
15 questions · take it digitally for instant feedback at tendril.neural-forge.io/learn/quiz/end-safety2-training-vs-inference-alignment-creators
What is the core idea behind "Training-Time vs. Inference-Time Alignment"?
Which term best describes a foundational idea in "Training-Time vs. Inference-Time Alignment"?
A learner studying Training-Time vs. Inference-Time Alignment would need to understand which concept?
Which of these is directly relevant to Training-Time vs. Inference-Time Alignment?
Which of the following is a key point about Training-Time vs. Inference-Time Alignment?
Which of these does NOT belong in a discussion of Training-Time vs. Inference-Time Alignment?
Which statement is accurate regarding Training-Time vs. Inference-Time Alignment?
Which of these does NOT belong in a discussion of Training-Time vs. Inference-Time Alignment?
What is the key insight about "Why use both" in the context of Training-Time vs. Inference-Time Alignment?
What is the key insight about "The jailbreak split" in the context of Training-Time vs. Inference-Time Alignment?
Which statement accurately describes an aspect of Training-Time vs. Inference-Time Alignment?
What does working with Training-Time vs. Inference-Time Alignment typically involve?
Which best describes the scope of "Training-Time vs. Inference-Time Alignment"?
Which section heading best belongs in a lesson about Training-Time vs. Inference-Time Alignment?
Which section heading best belongs in a lesson about Training-Time vs. Inference-Time Alignment?