A feature is a direction in activation space that corresponds to a concept. Finding them — naming them, ranking them, connecting them — is one of the central activities of interpretability research.
Informally, a feature is anything the network appears to 'represent' in its activations. More formally, it's a direction in activation space whose magnitude correlates with the presence of some input property — a word, a concept, a category, a sentiment.
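To make "a direction whose magnitude correlates with an input property" concrete, here is a minimal NumPy sketch on synthetic activations. Everything in it is illustrative: the hidden size, the planted `true_dir`, and the use of a difference-in-means estimate (one of the discovery methods quizzed below) are assumptions, not any particular paper's setup.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16  # toy hidden size

# Hypothetical ground-truth feature direction (unknown in real models).
true_dir = rng.normal(size=d)
true_dir /= np.linalg.norm(true_dir)

# Synthetic activations: 'positive' inputs contain the feature, 'negative' don't.
pos = rng.normal(size=(100, d)) + 2.0 * true_dir
neg = rng.normal(size=(100, d))

# Difference-in-means: estimate the feature direction from the two groups.
est_dir = pos.mean(axis=0) - neg.mean(axis=0)
est_dir /= np.linalg.norm(est_dir)

# A feature's 'magnitude' on a new activation is its projection onto the direction.
x = rng.normal(size=d) + 2.0 * true_dir
score = float(x @ est_dir)

print(float(est_dir @ true_dir))  # cosine with the planted direction; should be near 1.0
```

The same projection `x @ est_dir` is what a linear probe scores, and adding a multiple of `est_dir` to activations is the basic recipe behind steering vectors.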
Every feature we name is a hypothesis. The test is whether intervening on it does what we expect.
— Neel Nanda, DeepMind
The big idea: features are the unit of analysis in modern interpretability. Learning to read them is the closest thing the field has to learning to read code.
15 questions · take it digitally for instant feedback at tendril.neural-forge.io/learn/quiz/end-safety2-feature-discovery-creators
In the context of LLM interpretability, what is a 'feature'?
What does 'activation maximization' refer to in feature discovery?
How do sparse autoencoders contribute to feature discovery?
What is the 'probing' method for finding features?
The 'difference in means' method for feature discovery involves:
What are 'steering vectors' used for in interpretability research?
Which line of research identified 'induction heads' as attention patterns that implement in-context learning?
What did Nanda et al. 2023 accomplish in interpretability research?
Marks and Tegmark 2023 identified directions in LLM activations that correspond to:
What are 'refusal features' in LLM interpretability?
A key caveat about named features in interpretability is that:
'Code-vulnerability features' in LLM research refer to:
The lesson describes features as playing what role in modern interpretability?
A researcher finds a feature that fires strongly on examples of deceptive statements from training data. What important caution does this illustrate?
Which feature discovery method is best described as 'optimizing input to maximize activation'?