A feature is a direction in activation space that corresponds to a concept. Finding them — naming them, ranking them, connecting them — is one of the central activities of interpretability research.
Informally, a feature is anything the network appears to 'represent' in its activations. More formally, it's a direction in activation space whose magnitude correlates with the presence of some input property — a word, a concept, a category, a sentiment.
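To make "a direction whose magnitude correlates with an input property" concrete, here is a minimal NumPy sketch on synthetic activations. Everything in it is illustrative: the hidden size, the planted `true_dir`, and the use of a difference-in-means estimate (one of the discovery methods quizzed below) are assumptions, not any particular paper's setup.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16  # toy hidden size

# Hypothetical ground-truth feature direction (unknown in real models).
true_dir = rng.normal(size=d)
true_dir /= np.linalg.norm(true_dir)

# Synthetic activations: 'positive' inputs contain the feature, 'negative' don't.
pos = rng.normal(size=(100, d)) + 2.0 * true_dir
neg = rng.normal(size=(100, d))

# Difference-in-means: estimate the feature direction from the two groups.
est_dir = pos.mean(axis=0) - neg.mean(axis=0)
est_dir /= np.linalg.norm(est_dir)

# A feature's 'magnitude' on a new activation is its projection onto the direction.
x = rng.normal(size=d) + 2.0 * true_dir
score = float(x @ est_dir)

print(float(est_dir @ true_dir))  # cosine with the planted direction; should be near 1.0
```

The same projection `x @ est_dir` is what a linear probe scores, and adding a multiple of `est_dir` to activations is the basic recipe behind steering vectors.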
Every feature we name is a hypothesis. The test is whether intervening on it does what we expect.
— Neel Nanda, DeepMind
The big idea: features are the unit of analysis in modern interpretability. Learning to read them is the closest thing the field has to learning to read code.
15 questions · take it digitally for instant feedback at tendril.neural-forge.io/learn/quiz/end-safety2-feature-discovery-creators
In the context of LLM interpretability, what is a 'feature'?
What does 'activation maximization' refer to in feature discovery?
How do sparse autoencoders contribute to feature discovery?
What is the 'probing' method for finding features?
The 'difference in means' method for feature discovery involves:
What are 'steering vectors' used for in interpretability research?
Which line of research identified 'induction heads' as attention patterns that implement in-context learning?
What did Nanda et al. 2023 accomplish in interpretability research?
Marks and Tegmark 2023 identified directions in LLM activations that correspond to:
What are 'refusal features' in LLM interpretability?
A key caveat about named features in interpretability is that:
'Code-vulnerability features' in LLM research refer to:
The lesson describes features as playing what role in modern interpretability?
A researcher finds a feature that fires strongly on examples of deceptive statements from training data. What important caution does this illustrate?
Which feature discovery method is best described as 'optimizing input to maximize activation'?