Safety Classifiers And Refusals On Frontier Models
Frontier models refuse some requests. Sometimes correctly, sometimes too aggressively. Understanding how refusals work changes how you prompt.
Refusals are policy, not capability
When a frontier model declines to help, it usually is not because it cannot. It is because a safety classifier — either built into the model's training or layered on top — flagged the request as policy-violating. Knowing where in the stack the refusal happens helps you adjust.
Layers that produce refusals (a code sketch follows this list)
1. Pre-classifier — input is checked against a policy classifier before reaching the model
2. Trained refusal — the model itself was trained to decline some requests
3. Post-classifier — the output is checked before being returned to the user
4. System prompt enforcement — the deployment's system prompt may add stricter rules
5. User-tier policy — your account or tier may have additional restrictions
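To make the stack concrete, here is a minimal sketch of how these layers might compose in a deployment. Every name in it (check_input_policy, generate, check_output_policy, answer) is a hypothetical placeholder, not a real vendor API; production classifiers are learned models running server-side, not keyword lists.

```python
# Minimal sketch of a layered refusal stack. All names are hypothetical
# placeholders; real deployments run learned classifiers server-side.

REFUSAL = "I can't help with that."

def check_input_policy(prompt: str) -> bool:
    """Layer 1: pre-classifier. Screens the request before the model sees it."""
    banned = ["example-banned-topic"]  # stand-in for a learned classifier
    return not any(term in prompt.lower() for term in banned)

def generate(system_prompt: str, prompt: str) -> str:
    """Layers 2 and 4: the model call. Trained refusals (layer 2) and the
    deployment's system prompt (layer 4) can both produce a decline here."""
    return f"[model output for {prompt!r} under {system_prompt!r}]"  # placeholder

def check_output_policy(output: str) -> bool:
    """Layer 3: post-classifier. Screens the output before the user sees it."""
    return "blocked-marker" not in output  # stand-in for a learned classifier

def answer(system_prompt: str, prompt: str, user_tier: str) -> str:
    if user_tier == "restricted":        # layer 5: user-tier policy
        return REFUSAL
    if not check_input_policy(prompt):   # layer 1: refused before any output
        return REFUSAL
    output = generate(system_prompt, prompt)
    if not check_output_policy(output):  # layer 3: refused after generation
        return REFUSAL
    return output
```

Note where each layer sits relative to the model call. That ordering is exactly what the comparison table below exploits for diagnosis.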
Compare the options
| Refusal type | How to spot it | What to try |
|---|---|---|
| Pre-classifier | Refused before any output | Rephrase the request |
| Trained refusal | Model explains its concern | Provide more legitimate context |
| Post-classifier | Refused after partial output | Reformat or split the task |
| System prompt | Same prompt works on raw API | Loosen your system prompt |
| User tier | Refused only for some users | Check tier limits |
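Read top-down, the table is a decision procedure. Here is one hedged way to encode it, with the caveat that real refusals rarely announce which layer produced them, so each boolean is something you infer from observed behavior:

```python
# Heuristic mapping from observable symptoms to the likely refusal layer,
# following the comparison table above. Purely illustrative.

def diagnose_refusal(
    refused_only_for_some_users: bool,
    works_on_raw_api: bool,
    refused_after_partial_output: bool,
    model_explained_concern: bool,
    refused_before_any_output: bool,
) -> str:
    if refused_only_for_some_users:
        return "user tier: check your account's tier limits"
    if works_on_raw_api:
        return "system prompt: loosen your deployment's system prompt"
    if refused_after_partial_output:
        return "post-classifier: reformat or split the task"
    if model_explained_concern:
        return "trained refusal: provide more legitimate context"
    if refused_before_any_output:
        return "pre-classifier: rephrase the request"
    return "unclear: gather more evidence before re-prompting"
```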
Applied exercise
1. Find a recent refusal you got from a frontier model
2. Diagnose: which layer of the stack produced it?
3. Reframe the request with legitimate context (a retry sketch follows this list)
4. If still refused, escalate to your vendor's enterprise support — sometimes they can add an allowlist for your account
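For batch workflows, step 3 can be automated as a single retry with added context. A minimal sketch, assuming a hypothetical client.complete(prompt) call and a crude keyword check for refusals (real refusal detection is harder than string matching):

```python
# Sketch of a retry-with-context loop for step 3 of the exercise.
# `client.complete` and `looks_like_refusal` are hypothetical stand-ins,
# not a real vendor SDK.

def looks_like_refusal(text: str) -> bool:
    markers = ("i can't help", "i cannot assist", "against my guidelines")
    return any(m in text.lower() for m in markers)

def ask_with_context(client, prompt: str, legitimate_context: str) -> str:
    first = client.complete(prompt)
    if not looks_like_refusal(first):
        return first
    # Reframe once: state who you are and why the request is legitimate.
    reframed = (
        f"{legitimate_context}\n\n"
        f"With that context, please help with the following:\n{prompt}"
    )
    second = client.complete(reframed)
    if looks_like_refusal(second):
        raise RuntimeError("Still refused; escalate to vendor support.")
    return second
```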
Key terms in this lesson: safety classifier, refusal, policy violation.
The big idea: refusals are not bugs to bypass. They are signals to understand and route around legitimately.
Related lessons
- Local Safety Guardrails: Classifiers Around the Main Model. A local model stack can use small classifiers and policy checks around the main model instead of trusting one prompt to do everything.
- Building A Custom GPT For A Specific Workflow. A Custom GPT is just a packaged system prompt with files and tools attached. The hard part is scoping it tightly enough to be useful instead of generic.
- ChatGPT Memory: When To Enable, When To Turn It Off. Memory is supposed to make ChatGPT feel personal. It also quietly accumulates context that can pollute later conversations or leak into the wrong workspace.
