Probing asks a simple question: given a model's hidden state, can a small classifier predict some property? The answer tells you what the model represents, whether or not it uses that information.
Take a labeled dataset, such as sentences marked 'positive sentiment' vs 'negative sentiment', or 'factual' vs 'hallucinated'. Extract the model's hidden activations on each input. Train a small classifier (a probe) to predict the label from the activations. If the probe predicts held-out labels well above chance, that information is decodable from the model's activations.
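The recipe above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the activations are synthetic stand-ins generated by a hypothetical helper, `make_synthetic_activations`, where a real experiment would extract hidden states from a model. The probe is a logistic regression trained with simple SGD, and the shuffled-label run is one common sanity check for what "succeeds" means.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def make_synthetic_activations(n, dim, rng):
    """ASSUMPTION: stand-in for real hidden states.
    The label shifts each activation vector along one fixed random direction."""
    direction = [rng.gauss(0, 1) for _ in range(dim)]
    data = []
    for _ in range(n):
        label = rng.randint(0, 1)
        sign = 1.0 if label == 1 else -1.0
        x = [rng.gauss(0, 1) + sign * d for d in direction]
        data.append((x, label))
    return data


def train_linear_probe(data, dim, epochs=20, lr=0.05):
    """Fit a linear probe (logistic regression) with per-example SGD."""
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        for x, y in data:
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            g = p - y  # gradient of the log loss with respect to the logit
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b


def accuracy(w, b, data):
    hits = 0
    for x, y in data:
        z = sum(wi * xi for wi, xi in zip(w, x)) + b
        hits += int((z > 0) == (y == 1))
    return hits / len(data)


rng = random.Random(0)
dim = 16
data = make_synthetic_activations(400, dim, rng)

# Train on one split, evaluate on a held-out split.
train, test = data[:300], data[300:]
w, b = train_linear_probe(train, dim)
probe_acc = accuracy(w, b, test)

# Sanity check: same activations, labels shuffled so they carry no signal.
labels = [y for _, y in data]
rng.shuffle(labels)
shuffled = [(x, y) for (x, _), y in zip(data, labels)]
w_c, b_c = train_linear_probe(shuffled[:300], dim)
control_acc = accuracy(w_c, b_c, shuffled[300:])

print(f"probe accuracy:   {probe_acc:.2f}")
print(f"control accuracy: {control_acc:.2f}")
```

If the probe scores far above the shuffled-label control on held-out data, the property is linearly decodable from these activations. That is still a statement about what is represented, not about what the model uses.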
Probing tells you what is in the water. It does not tell you what the fish is doing.
— Anna Rogers, on interpretability research (paraphrased)
The big idea: probing is the cheapest, oldest tool in the interpretability kit. It is also still one of the most useful, especially when paired with intervention experiments.
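A toy sketch can show why intervention matters alongside probing. Everything here is hypothetical: `toy_head` stands in for whatever downstream computation reads the activations, and the two directions are chosen by hand rather than found by a probe. The intervention removes one direction from each activation vector and measures how much the output moves.

```python
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def project_out(x, direction):
    """Remove the component of x that lies along `direction`."""
    coeff = dot(x, direction) / dot(direction, direction)
    return [xi - coeff * di for xi, di in zip(x, direction)]


def toy_head(x):
    # ASSUMPTION: a hypothetical downstream readout that only uses coordinate 0.
    return 2.0 * x[0]


rng = random.Random(0)
acts = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(200)]

used_direction = [1.0, 0.0, 0.0, 0.0]    # the head reads this direction
unused_direction = [0.0, 1.0, 0.0, 0.0]  # present in the activations, ignored by the head


def mean_output_change(direction):
    """Average absolute change in the head's output after ablating `direction`."""
    total = 0.0
    for x in acts:
        total += abs(toy_head(x) - toy_head(project_out(x, direction)))
    return total / len(acts)


effect_used = mean_output_change(used_direction)
effect_unused = mean_output_change(unused_direction)

print(f"ablate used direction:   {effect_used:.3f}")
print(f"ablate unused direction: {effect_unused:.3f}")
```

A probe could decode either coordinate from these activations, but only ablating the first one changes the output. In a real model the directions would come from a trained probe and the readout would be the model's own later layers.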
8 questions · take it digitally for instant feedback at tendril.neural-forge.io/learn/quiz/end-safety2-probing-creators
What is the main idea of "Probing: Linear, Nonlinear, and Contrast"?
Which concept is most central to "Probing: Linear, Nonlinear, and Contrast"?
Which use of AI fits this topic best?
What should a careful learner remember about "Why linear matters most"?
You want to use AI after this lesson. What is the safest next step?
How should AI output about probing be treated?
Name one way to verify an AI answer about probing.
Which action would help you apply "Probing: Linear, Nonlinear, and Contrast" responsibly?