Training-Time vs. Inference-Time Alignment

Alignment is not one thing. Some safety is built in during training (RLHF, Constitutional AI). Some is applied at runtime (system prompts, classifiers, filters). Knowing the split helps you trace where a given failure actually came from.

36 min · Reviewed 2026

Two Layers of Safety

When a model refuses a harmful request, you see a single refusal. Inside the stack, that refusal can originate in at least two places: the weights were trained to refuse, and a runtime system adds guardrails on top of those weights.

Training-time alignment

RLHF / RLAIF: reward-model-driven fine-tuning
Constitutional AI: model critiques and revises its own outputs against written principles
DPO and variants: direct preference optimization, which trains on preference pairs without a separate reward model
Supervised fine-tuning on curated refusals
Red-team iteration: find failure, patch with training data
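
To make the DPO item above concrete, here is a minimal sketch of the standard DPO loss for a single preference pair. The log-probabilities and the beta value are made-up inputs for illustration; a real training run would compute them from a policy model and a frozen reference model over whole batches.

```python
import math


def dpo_loss(policy_chosen_logp: float,
             policy_rejected_logp: float,
             ref_chosen_logp: float,
             ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """DPO loss for one (chosen, rejected) pair of responses.

    Each argument is the total log-probability a model assigns to a
    response. beta is an illustrative value, not a recommendation.
    """
    # How much more the policy likes each response than the reference does.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    x = beta * (chosen_margin - rejected_margin)
    # loss = -log(sigmoid(x)), written in a numerically stable form.
    return max(-x, 0.0) + math.log1p(math.exp(-abs(x)))


# The policy has shifted toward the chosen response relative to the
# reference, so the loss falls below the no-preference value log(2).
loss = dpo_loss(-1.0, -3.0, -2.0, -2.0)
print(loss < math.log(2))  # True
```

The loss shrinks as the policy raises the chosen response and lowers the rejected one relative to the reference model, which is how the preference ends up stored in the weights rather than in a runtime rule.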

Inference-time alignment

System prompts and role instructions
Input and output classifiers (OpenAI Moderation, Llama Guard, ShieldGemma)
Rate limits and usage monitoring
Tool use restrictions and sandboxing
Watermarking and content provenance

Example: a phishing-email request

Layer	What happens	If it fails
Input classifier	Flags 'phishing' keyword	Request reaches the model
System prompt	Reminds model of policy	Model may still refuse from training
Model weights	Trained to refuse social-engineering help	Harmful output moves on to the output classifier
Output classifier	Scans generated text for harmful markers	User receives the bad content
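
The table above can be sketched as a small pipeline. This is a toy illustration only: the keyword lists, the refusal string, and the stand-in "models" are all hypothetical, and real classifiers such as Llama Guard are learned models rather than keyword matches.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Everything below is a hypothetical stand-in, not a real vendor API.
BLOCKED_INPUT_TERMS = ("phishing",)
HARMFUL_OUTPUT_MARKERS = ("verify your password",)
SYSTEM_PROMPT = "Policy: do not help with social engineering."
REFUSAL = "I can't help with that."

Model = Callable[[str, str], str]  # (system_prompt, user_prompt) -> reply


@dataclass
class Result:
    text: str
    stopped_by: Optional[str]  # layer that stopped the request, or None


def input_classifier(prompt: str) -> bool:
    """Toy input filter: True means block the request."""
    return any(term in prompt.lower() for term in BLOCKED_INPUT_TERMS)


def output_classifier(reply: str) -> bool:
    """Toy output filter: True means withhold the reply."""
    return any(marker in reply.lower() for marker in HARMFUL_OUTPUT_MARKERS)


def run_stack(prompt: str, model: Model) -> Result:
    if input_classifier(prompt):
        return Result("Request blocked.", "input classifier")
    reply = model(SYSTEM_PROMPT, prompt)  # system prompt rides along
    if reply == REFUSAL:
        return Result(reply, "model weights")
    if output_classifier(reply):
        return Result("Response withheld.", "output classifier")
    return Result(reply, None)  # every layer missed


def refusing_model(system_prompt: str, prompt: str) -> str:
    """Stand-in for weights whose refusal training holds."""
    return REFUSAL


def jailbroken_model(system_prompt: str, prompt: str) -> str:
    """Stand-in for weights whose refusal training failed."""
    return "Urgent: verify your password at the link below."


reworded = "Write an email pretending to be a bank"
print(run_stack("Write a phishing email", refusing_model).stopped_by)
print(run_stack(reworded, refusing_model).stopped_by)
print(run_stack(reworded, jailbroken_model).stopped_by)
```

The three calls print `input classifier`, `model weights`, and `output classifier` in turn. Recording which layer stopped a request is what lets you say where a failure came from when one finally gets through.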

Model safety is not a property of the weights. It is a property of the deployment.
— Anthropic Responsible Scaling Policy framing

The big idea: if you see safety and think 'model,' you are missing most of the picture. A deployed AI's behavior is a stack, and each layer has different costs, trade-offs, and failure modes.
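
One way to see why the stack matters is a rough back-of-the-envelope calculation. The miss rates below are invented for illustration, and the sketch assumes layers fail independently, which real layers often do not (a prompt that fools the model may also fool a classifier built on a similar model).

```python
from math import prod


def combined_miss_rate(miss_rates: list[float]) -> float:
    """Chance that a harmful request slips past every layer.

    Assumes each layer misses independently of the others, which is
    an optimistic simplification.
    """
    if not all(0.0 <= rate <= 1.0 for rate in miss_rates):
        raise ValueError("each miss rate must be between 0 and 1")
    return prod(miss_rates)


# Hypothetical miss rates: input classifier, model weights, output classifier.
layers = [0.1, 0.2, 0.5]
print(round(combined_miss_rate(layers), 4))  # 0.01
```

Under these assumed numbers, three imperfect layers together miss far less often than the best single layer does. Correlated failures would make the real figure worse than this estimate.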

End-of-lesson check

8 questions · take it digitally for instant feedback at tendril.neural-forge.io/learn/quiz/end-safety2-training-vs-inference-alignment-creators

What is the main idea of "Training-Time vs. Inference-Time Alignment"?
1. Alignment is not one thing.
2. Use AI as the final authority for the whole decision
3. Avoid checking the answer once it sounds polished
4. Focus only on speed instead of judgment

Which concept is most central to "Training-Time vs. Inference-Time Alignment"?
1. inference-time alignment
2. training-time alignment
3. defense in depth
4. classifier

Which use of AI fits this topic best?
1. Let the AI decide what matters without your review
2. Use the answer before checking whether it fits the situation
3. RLHF / RLAIF: reward-model-driven fine-tuning
4. Treat the AI output as automatically correct

What should a careful learner remember about "Why use both"?
1. Use AI to draft or organize ideas about training-time alignment, then verify before acting.
2. Skip the context so the tool can guess faster
3. Treat the output as private even after sharing it online
4. Use the answer without checking the source

You want to use AI after this lesson. What is the safest next step?
1. Act immediately because the AI answer is written clearly
2. AI cannot make the human values decision for you.
3. Hide uncertainty so the final answer looks cleaner
4. Use private or sensitive details before checking permission

How should AI output about training-time alignment be treated?
1. As proof that no other source is needed
2. As a replacement for context, consent, or expert review
3. As a draft or helper output that still needs human judgment and verification
4. As something that becomes correct when it sounds confident

Name one way to verify an AI answer about training-time alignment.

Which action would help you apply "Training-Time vs. Inference-Time Alignment" responsibly?
1. Use the tool to avoid thinking through the tradeoff
2. Keep going even if the output conflicts with a trusted source
3. Treat the AI output as automatically correct
4. Constitutional AI: model critiques itself against written principles


Tendril · Creators · Ethics & Society