Weak-to-Strong Generalization
What if you have to supervise a student smarter than you? OpenAI's 2023 paper asked that question by using GPT-2 to train GPT-4. The results were surprising.
Lesson map
What this lesson covers
Learning path
The main moves in order
1. The Inverse Tutor Problem
2. Weak-to-strong training
3. Generalization
4. Superalignment
Section 1
The Inverse Tutor Problem
Human teachers will eventually be the weak party. If we want to keep supervising models that are better than us at most tasks, we need to know: does a stronger model, trained on weaker labels, stay stuck at the weak supervisor's ceiling, or can it generalize past it?
The experiment
In December 2023, OpenAI published Weak-to-Strong Generalization. They simulated the future problem today by casting GPT-2 as the weak supervisor and GPT-4 as the strong student: GPT-2 generated labels, sometimes wrong ones, and GPT-4 was fine-tuned on them.
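The paper summarizes such runs with a "performance gap recovered" (PGR) metric: how much of the gap between the weak supervisor and the strong model's ceiling the weakly-trained student closes. A minimal sketch (accuracies and function name here are illustrative):

```python
def performance_gap_recovered(weak_acc, w2s_acc, ceiling_acc):
    """Fraction of the weak-to-ceiling gap recovered by the student.

    PGR = 0 means the student stayed stuck at the weak supervisor's level;
    PGR = 1 means it matched its own ceiling despite the weak labels.
    """
    return (w2s_acc - weak_acc) / (ceiling_acc - weak_acc)

# Hypothetical numbers: weak supervisor at 60%, student trained on
# weak labels at 80%, student trained on ground truth at 90%.
pgr = performance_gap_recovered(0.60, 0.80, 0.90)
```

Here the student recovers about two thirds of the gap, which is the qualitative shape of the paper's headline result.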
- Baseline: GPT-4 with perfect labels performs at ceiling
- Weak-label training: GPT-4 performance drops, but stays well above GPT-2's level
- With training tweaks (auxiliary confidence loss): GPT-4 recovers much of the ceiling gap
- The strong model partially 'knows better' than its teacher
Why this is not a solution
1. Results are uneven across tasks, especially ones requiring long chains of reasoning
2. A performance gap persists in harder domains like chess endgames
3. The method assumes the strong model already acquired the relevant knowledge in pretraining, an assumption that may not hold for genuinely novel future capabilities
4. Honesty and deception might behave differently than raw task performance
“Weak-to-strong is not the answer. It is evidence that the shape of the answer might exist.”
The big idea: smarter students might partially teach themselves from weaker teachers. That is encouraging but far from sufficient for the real superhuman case.