A feature is a direction in activation space that corresponds to a concept. Finding them — naming them, ranking them, connecting them — is one of the central activities of interpretability research.
Informally, a feature is anything the network appears to 'represent' in its activations. More formally, it is a direction in activation space such that the size of an activation's projection onto that direction correlates with the presence of some input property: a word, a concept, a category, a sentiment.
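To make the formal definition concrete, here is a minimal sketch in Python with NumPy. Everything in it is an assumption for illustration: the activations are synthetic rather than taken from a real model, and `feature_activation`, `direction`, and `present` are hypothetical names. It projects each activation vector onto a candidate direction and checks how well the projection tracks a toy input property.

```python
import numpy as np

def feature_activation(acts, direction):
    """Project each activation vector onto the unit-normalized direction."""
    unit = direction / np.linalg.norm(direction)
    return acts @ unit

rng = np.random.default_rng(0)
d_model, n = 16, 200

# Hypothetical candidate feature direction (random here, not learned).
direction = rng.normal(size=d_model)
unit = direction / np.linalg.norm(direction)

# Toy labels: 1 if the input property is present, 0 otherwise.
present = rng.integers(0, 2, size=n)

# Synthetic activations: Gaussian noise, plus a push along the
# direction whenever the property is present.
acts = rng.normal(size=(n, d_model)) + 3.0 * present[:, None] * unit

# One scalar per input: how strongly the feature "fires".
scores = feature_activation(acts, direction)

# Pearson correlation between the projection and the property.
corr = np.corrcoef(scores, present)[0, 1]
print(f"correlation between projection and property: {corr:.2f}")
```

A high correlation on data like this is only evidence that the direction tracks the property. It does not show that the network uses the direction.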
Every feature we name is a hypothesis. The test is whether intervening on it does what we expect.
— Neel Nanda, DeepMind
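The intervention test in the quote can be sketched the same way. This is a toy illustration under stated assumptions, not any particular library's API: `ablate` and `steer` are hypothetical helpers, and the activations are random rather than read from a model. Ablation removes the component of each activation along the feature direction, and steering adds a fixed amount of it.

```python
import numpy as np

def ablate(acts, direction):
    """Remove the component of each activation along the direction."""
    unit = direction / np.linalg.norm(direction)
    return acts - np.outer(acts @ unit, unit)

def steer(acts, direction, alpha):
    """Add alpha units of the direction to every activation."""
    unit = direction / np.linalg.norm(direction)
    return acts + alpha * unit

rng = np.random.default_rng(0)
acts = rng.normal(size=(5, 8))   # stand-in for real model activations
direction = rng.normal(size=8)   # hypothetical feature direction
unit = direction / np.linalg.norm(direction)

ablated = ablate(acts, direction)
steered = steer(acts, direction, alpha=4.0)

# After ablation the projection onto the feature is (numerically) zero.
print(np.round(ablated @ unit, 6))
# After steering each projection has increased by exactly alpha.
print(np.round(steered @ unit - acts @ unit, 6))
```

In a real experiment the edited activations would be fed back into the rest of the network, and the hypothesis would be judged by whether the output changes in the way the feature's name predicts.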
The big idea: features are the unit of analysis in modern interpretability. Learning to read them is the closest thing the field has to learning to read code.
8 questions · take it digitally for instant feedback at tendril.neural-forge.io/learn/quiz/end-safety2-feature-discovery-creators
What is the main idea of "Feature Discovery in LLMs"?
Which concept is most central to "Feature Discovery in LLMs"?
Which use of AI fits this topic best?
What should a careful learner remember about "Landmark examples"?
You want to use AI after this lesson. What is the safest next step?
How should AI output about features be treated?
Name one way to verify an AI answer about features.
Which action would help you apply "Feature Discovery in LLMs" responsibly?