What if you have to supervise a student smarter than you? A 2023 OpenAI paper tested that question by fine-tuning GPT-4 on labels produced by a GPT-2-level model. The student often ended up better than its teacher.
Human teachers may eventually be the weak party. If we want to keep supervising models that are better than us at most tasks, we need to know: does a stronger model, trained on weaker labels, stay stuck at the weak supervisor's ceiling, or can it generalize past it?
In December 2023, OpenAI published "Weak-to-Strong Generalization". The authors simulated the future problem today by letting a small model stand in for the human: a GPT-2-level model was the weak supervisor and GPT-4 was the strong student. The weak model generated labels, some of them wrong, and GPT-4 was fine-tuned on them.
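One way to put a number on "stuck at the ceiling or past it" is the metric the paper calls performance gap recovered (PGR): the share of the gap between the weak supervisor and a strong model trained on correct labels that the weakly supervised student closes. The sketch below is a minimal illustration, not the paper's code. The function name and all three accuracy values are hypothetical, chosen only to show the arithmetic.

```python
def performance_gap_recovered(weak: float, weak_to_strong: float, strong_ceiling: float) -> float:
    """Fraction of the weak-to-ceiling gap closed by the weakly supervised student.

    weak            accuracy of the weak supervisor on held-out data
    weak_to_strong  accuracy of the strong student fine-tuned on weak labels
    strong_ceiling  accuracy of the strong model fine-tuned on ground-truth labels
    """
    gap = strong_ceiling - weak
    if gap <= 0:
        raise ValueError("strong ceiling must be higher than the weak supervisor's accuracy")
    return (weak_to_strong - weak) / gap


# Hypothetical accuracies, for illustration only (not results from the paper).
weak_acc = 0.60      # weak supervisor
student_acc = 0.72   # strong student trained on the weak supervisor's labels
ceiling_acc = 0.80   # strong model trained on correct labels

pgr = performance_gap_recovered(weak_acc, student_acc, ceiling_acc)
print(f"PGR = {pgr:.2f}")  # prints "PGR = 0.60"
```

A PGR of 0 means the student only matched its weak supervisor. A PGR of 1 means it did as well as if it had been trained on correct labels. Anything in between is partial generalization past the teacher.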
Weak-to-strong is not the answer. It is evidence that the shape of the answer might exist.
— Collin Burns, paper co-author, interview (paraphrased)
The big idea: smarter students might partially teach themselves from weaker teachers. That is encouraging but far from sufficient for the real superhuman case.
8 questions · take it digitally for instant feedback at tendril.neural-forge.io/learn/quiz/end-safety2-weak-to-strong-creators
What is the main idea of "Weak-to-Strong Generalization"?
Which concept is most central to "Weak-to-Strong Generalization"?
Which use of AI fits this topic best?
What should a careful learner remember about "Why this is a clue"?
You want to use AI after this lesson. What is the safest next step?
How should AI output about weak-to-strong be treated?
Name one way to verify an AI answer about weak-to-strong.
Which action would help you apply "Weak-to-Strong Generalization" responsibly?