Responsible Scaling Policies Explained
RSPs are the frontier labs' self-imposed rules for what capability thresholds trigger which safeguards. Here is what they commit to, what they hedge on, and what the enforcement problem is.
Lesson map
What this lesson covers, in order:
1. A Self-Imposed Brake
2. Responsible Scaling Policy
3. ASL
4. Preparedness Framework
Section 1
A Self-Imposed Brake
A Responsible Scaling Policy (RSP) is a frontier lab's public commitment to pause or add safeguards when models cross defined capability thresholds. Anthropic introduced the concept in September 2023. OpenAI published its analog, the Preparedness Framework, in December 2023. Google DeepMind followed with its Frontier Safety Framework in May 2024. Meta, xAI, and Cohere now have some version of their own.
The Anthropic RSP architecture (v2.2, 2025)
Anthropic's RSP defines AI Safety Levels (ASL), modeled on the biosafety levels (BSL) used for dangerous pathogens. Each level specifies a capability threshold and the safeguards it requires; the sketch after the list below shows how the safeguards accumulate.
- ASL-1: No meaningful catastrophic risk (2018-era models, narrow chess AIs)
- ASL-2: Early signs of dangerous capability but unreliable (most current LLMs)
- ASL-3: Substantial uplift on CBRN misuse or meaningful autonomous AI R&D. Triggered May 2025 for Anthropic's frontier models. Requires stronger weight security and deployment restrictions.
- ASL-4: Not yet defined concretely; associated with qualitative escalations in autonomy and misuse potential
- ASL-5: Placeholder for far-future concerns
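To make the ladder concrete, here is a minimal Python sketch of its accumulate-as-you-climb structure. The level numbers follow the RSP; the safeguard strings and the function name are illustrative placeholders, not Anthropic's actual required measures.

```python
# A toy ASL-style safeguard ladder. Level numbers follow the RSP; the
# safeguard strings are illustrative placeholders, not Anthropic's
# actual required measures.
ASL_LADDER = {
    1: ["basic release hygiene"],
    2: ["misuse filtering", "vendor-grade weight security"],
    3: ["hardened weight security", "deployment restrictions"],
    4: ["TBD: escalated autonomy and security controls"],  # not yet defined
    5: ["TBD: far-future placeholder"],
}

def required_safeguards(asl_level: int) -> list[str]:
    """Safeguards accumulate: a model at ASL-n must satisfy every rung
    up to and including n, so a higher level only ever adds measures."""
    return [
        safeguard
        for level in sorted(ASL_LADDER)
        if level <= asl_level
        for safeguard in ASL_LADDER[level]
    ]

print(required_safeguards(3))  # everything from levels 1 through 3
```

The design point the sketch captures: escalation never removes a lower rung's protections, it only stacks new ones on top.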
What capabilities actually trigger the next level?
Crossing any one of these can be enough to escalate; see the sketch after the list.
1. CBRN uplift: can the model meaningfully help a novice with a bioweapon?
2. Cyber uplift: does it enable attacks that were previously infeasible?
3. Autonomous replication: can it copy itself across machines without human help?
4. AI R&D acceleration: can it fully automate junior-researcher AI work?
5. Persuasion: does it meaningfully exceed human baselines in influence operations?
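A hypothetical sketch of that any-single-trigger rule: escalation is a max over independent checks, not an average, so strong showings on four axes cannot offset a fifth. It assumes each evaluation reduces to a pass/fail verdict; real evaluations are messier, each threshold is defined in eval-specific terms, and the trigger names below are shorthand, not official identifiers.

```python
# Hypothetical trigger check: crossing any single capability threshold
# is enough to escalate to the next safety level.
TRIGGERS = (
    "cbrn_uplift",
    "cyber_uplift",
    "autonomous_replication",
    "ai_rnd_acceleration",
    "persuasion",
)

def next_level_triggered(eval_results: dict[str, bool]) -> bool:
    """True if any capability evaluation crossed its threshold."""
    return any(eval_results.get(trigger, False) for trigger in TRIGGERS)

# Example: only the CBRN evaluation crosses its threshold -> still escalates.
results = dict.fromkeys(TRIGGERS, False)
results["cbrn_uplift"] = True
assert next_level_triggered(results)
```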
OpenAI's Preparedness Framework
OpenAI rates models across four risk categories (cybersecurity, CBRN, persuasion, model autonomy) on a low/medium/high/critical scale. Models scoring high can be deployed only with mitigations in place; models scoring critical cannot be trained further until mitigations work. The pre-mitigation score is what matters: mitigating a critical model back down to medium does not clear you to train the next one without review. The sketch below makes the gating explicit.
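That logic splits into two predicates: post-mitigation scores gate deployment, and pre-mitigation scores gate further training. A minimal sketch of that reading; the category names come from the framework, but the function names and the exact comparison rules are my assumptions, not OpenAI's published procedure.

```python
from enum import IntEnum

class Severity(IntEnum):
    LOW = 0
    MEDIUM = 1
    HIGH = 2
    CRITICAL = 3

CATEGORIES = ("cybersecurity", "cbrn", "persuasion", "model_autonomy")

def can_deploy(post_mitigation: dict[str, Severity]) -> bool:
    # One reading of the lesson text: a "high" model deploys only once
    # mitigations bring it down, i.e. post-mitigation scores sit at
    # medium or below in every category. (Assumption, not OpenAI's text.)
    return all(post_mitigation[c] <= Severity.MEDIUM for c in CATEGORIES)

def can_train_further(pre_mitigation: dict[str, Severity]) -> bool:
    # "Critical cannot be trained further until mitigations work": the
    # pre-mitigation score gates the next run, so mitigating a critical
    # model back to medium does not by itself clear further training.
    return all(pre_mitigation[c] < Severity.CRITICAL for c in CATEGORIES)

pre = dict.fromkeys(CATEGORIES, Severity.MEDIUM)
pre["model_autonomy"] = Severity.CRITICAL
post = dict.fromkeys(CATEGORIES, Severity.MEDIUM)

print(can_deploy(post))        # True: every post-mitigation score <= medium
print(can_train_further(pre))  # False: a pre-mitigation critical blocks it
```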
Compare: the three big-lab frameworks
| Dimension | Anthropic RSP | OpenAI Preparedness | DeepMind FSF |
|---|---|---|---|
| Structure | Levels (ASL-1..5) | Categories × severity | Critical Capability Levels |
| Triggered by | Capability thresholds | Pre-mitigation scores | Evaluation outcomes |
| Can pause training? | Yes | Yes (critical) | Yes |
| External input | Board, policy team | Safety Advisory Group | Internal review |
| Weight security level | Escalates with ASL | Escalates with severity | Escalates |
What RSPs get right
- Specificity: concrete capability thresholds, not vague commitments
- Pre-commitment: easier to honor restrictions announced before a race
- Safeguard ladders: security and deployment measures scale with capability
- Publicness: the document can be evaluated and critiqued by outsiders
- Interoperability: the Seoul Summit's Frontier AI Safety Commitments use RSP-style language
What critics hammer
- Self-imposed: no external enforcement, can be amended by the company at will
- Revisable mid-race: Anthropic moved some thresholds between RSP versions
- No peer consistency: what triggers ASL-3 at Anthropic is not what triggers high at OpenAI
- Eval quality: thresholds are only as good as the evaluations probing them — and evals are still immature
- Incentive problem: the company running the evals has a business interest in passing
The relationship to law
The EU AI Act refers to GPAI Code of Practice signatories and treats RSP-style commitments as a presumption of compliance for some systemic-risk obligations. The UK AISI's MOUs with labs include pre-release evaluation rights that rely on the labs' own classifications. US Executive Order 14110 (2023) required dual-use foundation model reporting; the Trump administration rescinded it in early 2025, though some evaluation requirements were preserved. Soft commitments and hard law are converging, slowly.
“A commitment is not a guarantee. It is a promise plus a mechanism for catching yourself when you are about to break it.”
The big idea: RSPs are the frontier labs' admission that some capabilities should change how a model is handled. They are imperfect, self-enforced, and strictly better than nothing. Knowing their architecture lets you read any new safety announcement against the actual document it cites.
Related lessons
- The EU AI Act: The Global Floor, Whether You Like It or Not (45 min). The EU AI Act is the most sweeping AI law in the world. It will set the compliance floor for anyone who ships globally. Here is the architecture, the timeline, and what it gets right and wrong.
- Labor and AI: What the Data Actually Says (45 min). Most predictions about AI and jobs are either panic or dismissal. Here is what the best evidence through 2025 actually shows, including what is overstated.
- Constitutional AI: A Deep Dive on Anthropic's Approach (45 min). What a constitution actually contains, how the training loop works, where the research is now, and the honest trade-offs.
