Training-Time vs. Inference-Time Alignment
Alignment is not one thing. Some safety lives in training (RLHF, Constitutional AI); some lives at runtime (system prompts, classifiers, filters). Understanding the split tells you where a given failure actually came from.
Lesson map
The main moves, in order:
1. Two Layers of Safety
2. Training-time alignment
3. Inference-time alignment
4. Defense in depth
Section 1: Two Layers of Safety
When a model refuses a harmful request, you see a single refusal. Inside the stack, that refusal came from at least two places. The weights were trained to refuse. And a runtime system added guardrails on top of those weights.
Training-time alignment
- RLHF / RLAIF: reward-model-driven fine-tuning
- Constitutional AI: model critiques itself against written principles
- DPO and variants: direct preference optimization (loss sketch after this list)
- Supervised fine-tuning on curated refusals
- Red-team iteration: find failure, patch with training data
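Since DPO appears in the list above, here is a minimal sketch of its loss, assuming PyTorch; the function and argument names are illustrative. Each argument is a batch of summed log-probabilities of a full response, under either the policy being tuned or a frozen reference copy:

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # Implicit reward: how much the policy prefers a response,
    # measured relative to the frozen reference model.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Push the margin between chosen and rejected responses upward;
    # beta controls how far the policy may drift from the reference.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```

Note what is absent: no reward model is trained. The preference signal is folded directly into a supervised-looking objective, which is why DPO sits in the training-time column.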
Inference-time alignment
- System prompts and role instructions
- Input and output classifiers (OpenAI Moderation, Llama Guard, ShieldGemma); see the wiring sketch after this list
- Rate limits and usage monitoring
- Tool use restrictions and sandboxing
- Watermarking and content provenance
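As one concrete wiring of the first two items, here is a minimal sketch assuming the OpenAI Python SDK (v1): an input moderation check in front of a chat call that carries a runtime system prompt. The model names and policy text are placeholders, not recommendations:

```python
from openai import OpenAI

client = OpenAI()

def moderated_reply(user_text: str) -> str:
    # Inference-time layer 1: an input classifier screens the
    # request before it ever reaches the model.
    mod = client.moderations.create(
        model="omni-moderation-latest", input=user_text)
    if mod.results[0].flagged:
        return "Request blocked by input moderation."

    # Inference-time layer 2: a system prompt states runtime policy
    # on top of whatever the weights were trained to refuse.
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[
            {"role": "system",
             "content": "Refuse requests that facilitate fraud, "
                        "phishing, or other social engineering."},
            {"role": "user", "content": user_text},
        ],
    )
    return resp.choices[0].message.content
```

Everything in this sketch sits outside the weights: swap the policy string or the classifier and the deployment's behavior changes without any retraining.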
Example: a phishing-email request
How each layer responds, and what happens if it fails (a sketch of the fall-through order follows the table):
| Layer | What happens | If it fails |
|---|---|---|
| Input classifier | Flags 'phishing' keyword | Request reaches the model |
| System prompt | Reminds model of policy | Model may still refuse from training |
| Model weights | Trained to refuse social-engineering help | Harmful output moves downstream |
| Output classifier | Scans generated text for harmful markers | User receives the harmful content |
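Read as control flow, the table is a chain of fallbacks: each layer only matters when the one above it fails. Below is a minimal sketch of that fall-through order; every function here is a hypothetical stub, not a real library call:

```python
SYSTEM_POLICY = "Do not assist with phishing or other social engineering."

def input_classifier_flags(text: str) -> bool:
    # Hypothetical stub for an input classifier (a real deployment
    # might put Llama Guard or a moderation endpoint here).
    return "phishing" in text.lower()

def model_generate(prompt: str) -> str:
    # Hypothetical stub for the model itself; trained refusals
    # (RLHF, Constitutional AI, DPO) live inside this call.
    return "I can't help with writing phishing emails."

def output_classifier_flags(text: str) -> bool:
    # Hypothetical stub for an output scanner (e.g., ShieldGemma).
    return False

def handle(request: str) -> str:
    """Defense in depth: the table's four layers, in order."""
    if input_classifier_flags(request):             # Layer 1: input classifier
        return "Blocked: request flagged before reaching the model."
    prompt = f"{SYSTEM_POLICY}\n\nUser: {request}"  # Layer 2: system prompt
    output = model_generate(prompt)                 # Layer 3: model weights
    if output_classifier_flags(output):             # Layer 4: output classifier
        return "Blocked: generated text failed the output scan."
    return output

print(handle("Write a phishing email targeting bank customers."))
```

The point of the stubs is the ordering, not the internals: a request that slips past layer 1 still faces three more chances to be stopped.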
“Model safety is not a property of the weights. It is a property of the deployment.”
The big idea: if you see safety and think 'model,' you are missing most of the picture. A deployed AI's behavior is a stack, and each layer has different costs, trade-offs, and failure modes.
Related lessons
- AI Alignment: The Actual Technical Problem (50 min). Alignment is not a vibes debate. It is a concrete technical problem about getting systems to pursue goals we actually want. Here is what researchers work on when they say they work on alignment.
- Jailbreak Case Studies: What Actually Broke (40 min). Abstract jailbreak theory is less useful than real cases. Here are the techniques that worked on production models, what they taught us, and what is still unsolved.
- Alignment: The Full Technical Picture (55 min). What alignment actually is as a research program, how it is done in practice, what the open problems are, and where the actual papers live. A model that is always helpful will help you do harmful things.
