Weak-to-Strong Generalization

What if you have to supervise a student smarter than you? OpenAI's 2023 paper posed that question by using a GPT-2-level model to supervise GPT-4. The student ended up outperforming its teacher.

38 min · Reviewed 2026

The Inverse Tutor Problem

Human teachers will eventually be the weak party. If we want to keep supervising models that are better than us at most tasks, we need to know: does a stronger model, trained on weaker labels, stay stuck at the weak supervisor's ceiling, or can it generalize past it?

The experiment

In December 2023, OpenAI published "Weak-to-Strong Generalization: Eliciting Strong Capabilities With Weak Supervision." The authors simulated the future problem today, using a GPT-2-level model as the weak supervisor and GPT-4 as the strong student. The weak supervisor generated labels, some of them wrong, and GPT-4 was fine-tuned on them.
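To make the setup concrete, here is a toy sketch in Python. It is not the paper's code and uses no language models: the "weak supervisor" is a hypothetical labeler that only sees a noisy copy of a single number, and the "strong student" is a threshold classifier that sees the clean number but is trained only on the supervisor's labels. Every function name and parameter below is an illustrative assumption.

```python
import random


def true_label(x):
    """Ground truth for the toy task: is x positive?"""
    return 1 if x > 0 else 0


def weak_label(x, rng, noise=1.0):
    """Hypothetical weak supervisor: it only sees a noisy copy of x."""
    return 1 if x + rng.gauss(0.0, noise) > 0 else 0


def fit_threshold(xs, labels, candidates):
    """'Fine-tune' the student: pick the threshold that best matches the labels."""
    def disagreements(t):
        return sum((1 if x > t else 0) != y for x, y in zip(xs, labels))
    return min(candidates, key=disagreements)


def accuracy(preds, truth):
    return sum(p == t for p, t in zip(preds, truth)) / len(truth)


def run_toy_experiment(n_train=4000, n_test=2000, seed=0):
    rng = random.Random(seed)
    train_x = [rng.uniform(-3, 3) for _ in range(n_train)]
    test_x = [rng.uniform(-3, 3) for _ in range(n_test)]
    test_y = [true_label(x) for x in test_x]
    candidates = [i / 10 for i in range(-30, 31)]

    # The weak supervisor labels the training set, with mistakes near the boundary.
    weak_train_labels = [weak_label(x, rng) for x in train_x]

    # Same student, trained once on weak labels and once on ground truth.
    t_w2s = fit_threshold(train_x, weak_train_labels, candidates)
    t_ceiling = fit_threshold(train_x, [true_label(x) for x in train_x], candidates)

    def predict(t):
        return [1 if x > t else 0 for x in test_x]

    return {
        "weak": accuracy([weak_label(x, rng) for x in test_x], test_y),
        "weak_to_strong": accuracy(predict(t_w2s), test_y),
        "strong_ceiling": accuracy(predict(t_ceiling), test_y),
    }


if __name__ == "__main__":
    for name, acc in run_toy_experiment().items():
        print(f"{name}: {acc:.3f}")
```

In this toy, the student tends to beat its supervisor because the supervisor's errors are roughly symmetric noise around the true boundary, so they partly cancel when the threshold is fit. Real weak labels can be systematically wrong, so the toy illustrates the setup only and says nothing about how well the effect holds for real models.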

Baseline (strong ceiling): GPT-4 fine-tuned on ground-truth labels sets the upper bound
Weak-label training: GPT-4 performance drops, but stays well above the GPT-2-level supervisor
With an auxiliary confidence loss: GPT-4 recovers much of the gap between the weak supervisor and the strong ceiling
The strong model partially 'knows better' than its teacher
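The "gap" in the list above can be made precise. The paper reports a metric called performance gap recovered (PGR): the share of the distance between weak-supervisor performance and the strong ceiling that the weakly supervised student closes. The sketch below computes it and also shows a simplified, binary version of an auxiliary confidence loss, which mixes the weak label with the student's own hardened prediction. The function names, the fixed 0.5 threshold, and the alpha value are assumptions made here for illustration, not the paper's settings.

```python
import math


def performance_gap_recovered(weak_acc, weak_to_strong_acc, strong_ceiling_acc):
    """Fraction of the weak-to-ceiling gap closed by the weakly supervised student."""
    gap = strong_ceiling_acc - weak_acc
    if gap <= 0:
        raise ValueError("strong ceiling must exceed weak performance")
    return (weak_to_strong_acc - weak_acc) / gap


def binary_cross_entropy(p, target, eps=1e-12):
    """Cross-entropy between predicted probability p and a soft or hard target."""
    p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))


def confidence_loss(p_student, weak_target, alpha, threshold=0.5):
    """Mix the weak label with the student's own hardened prediction.

    Simplified sketch: alpha and the fixed threshold are illustrative knobs.
    alpha = 0 reduces to plain training on the weak label.
    """
    hardened = 1.0 if p_student > threshold else 0.0
    return ((1.0 - alpha) * binary_cross_entropy(p_student, weak_target)
            + alpha * binary_cross_entropy(p_student, hardened))
```

With hypothetical accuracies of 0.60 for the weak supervisor, 0.80 for the weakly supervised student, and 0.90 for the strong ceiling, PGR is 0.20 / 0.30, or about 0.67. For the loss, a student that predicts 0.9 while the weak label says 0 is penalized less with alpha = 0.5 than with alpha = 0, which is the sense in which the extra term lets the student confidently disagree with its teacher.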

Why this is not a solution

Results are uneven across tasks, especially ones requiring long reasoning
Performance gap persists for harder domains like chess puzzles
The method assumes the strong model already has the right knowledge from pretraining, which is not guaranteed for future capabilities
Honesty and deception might behave differently than raw performance

Weak-to-strong is not the answer. It is evidence that the shape of the answer might exist.
— Collin Burns, paper co-author, interview (paraphrased)

The big idea: smarter students might partially teach themselves from weaker teachers. That is encouraging but far from sufficient for the real superhuman case.

End-of-lesson check

15 questions · take it digitally for instant feedback at tendril.neural-forge.io/learn/quiz/end-safety2-weak-to-strong-creators

What is the core idea behind "Weak-to-Strong Generalization"?
1. What if you have to supervise a student smarter than you? OpenAI's 2023 paper asked that question by using GPT-2 to train GPT-4. The results were surprising.
2. Red-team findings — summarized at best, rarely reproducible
3. Hiroshima Process
4. Third-party auditing of safety practices
Which term best describes a foundational idea in "Weak-to-Strong Generalization"?
1. superalignment
2. weak-to-strong
3. auxiliary loss
4. capability elicitation
A learner studying Weak-to-Strong Generalization would need to understand which concept?
1. weak-to-strong
2. auxiliary loss
3. superalignment
4. capability elicitation
Which of these is directly relevant to Weak-to-Strong Generalization?
1. weak-to-strong
2. superalignment
3. capability elicitation
4. auxiliary loss
Which of the following is a key point about Weak-to-Strong Generalization?
1. Baseline: GPT-4 with perfect labels performs at ceiling
2. Weak-label training: GPT-4 performance drops, but stays well above GPT-2's level
3. With training tweaks (auxiliary confidence loss): GPT-4 recovers much of the ceiling gap
4. The strong model partially 'knows better' than its teacher
Which of these does NOT belong in a discussion of Weak-to-Strong Generalization?
1. Weak-label training: GPT-4 performance drops, but stays well above GPT-2's level
2. Red-team findings — summarized at best, rarely reproducible
3. Baseline: GPT-4 with perfect labels performs at ceiling
4. With training tweaks (auxiliary confidence loss): GPT-4 recovers much of the ceiling gap
Which statement is accurate regarding Weak-to-Strong Generalization?
1. Performance gap persists for harder domains like chess endgames
2. The method assumes the strong model already has the right knowledge in pretraining — not strongly ex…
3. Results are uneven across tasks, especially ones requiring long reasoning
4. Honesty and deception might behave differently than raw performance
Which of these does NOT belong in a discussion of Weak-to-Strong Generalization?
1. Performance gap persists for harder domains like chess endgames
2. The method assumes the strong model already has the right knowledge in pretraining — not strongly ex…
3. Red-team findings — summarized at best, rarely reproducible
4. Results are uneven across tasks, especially ones requiring long reasoning
What is the key insight about "Why this is a clue" in the context of Weak-to-Strong Generalization?
1. The strong model does not simply copy the weak model's mistakes.
2. Red-team findings — summarized at best, rarely reproducible
3. Hiroshima Process
4. Third-party auditing of safety practices
What is the key insight about "The superalignment question" in the context of Weak-to-Strong Generalization?
1. Red-team findings — summarized at best, rarely reproducible
2. OpenAI's superalignment team, built around this research, was dissolved in May 2024 after internal disputes.
3. Hiroshima Process
4. Third-party auditing of safety practices
Which statement accurately describes an aspect of Weak-to-Strong Generalization?
1. Red-team findings — summarized at best, rarely reproducible
2. Hiroshima Process
3. Human teachers will eventually be the weak party. If we want to keep supervising models that are better than us at most tasks, we need to kn…
4. Third-party auditing of safety practices
What does working with Weak-to-Strong Generalization typically involve?
1. Red-team findings — summarized at best, rarely reproducible
2. Hiroshima Process
3. Third-party auditing of safety practices
4. In December 2023, OpenAI published Weak-to-Strong Generalization. They simulated the future problem today by using GPT-2 as the weak supervi…
Which of the following is true about Weak-to-Strong Generalization?
1. The big idea: smarter students might partially teach themselves from weaker teachers.
2. Red-team findings — summarized at best, rarely reproducible
3. Hiroshima Process
4. Third-party auditing of safety practices
Which best describes the scope of "Weak-to-Strong Generalization"?
1. It is unrelated to ethics workflows
2. It focuses on What if you have to supervise a student smarter than you? OpenAI's 2023 paper asked that question by
3. It applies only to the opposite beginner tier
4. It was deprecated in 2024 and no longer relevant
Which section heading best belongs in a lesson about Weak-to-Strong Generalization?
1. Red-team findings — summarized at best, rarely reproducible
2. Hiroshima Process
3. The experiment
4. Third-party auditing of safety practices
