Logit Lens: Peeking at Predictions Mid-Forward-Pass
A transformer processes a token through many layers before outputting a prediction. The logit lens shows you what the model would predict if it stopped at each layer along the way.
25 min · Reviewed 2026
A Diagnostic Probe for the Residual Stream
Transformers build up predictions layer by layer. Each layer reads the residual stream, a running hidden state, and writes a correction back to it. The logit lens technique, popularized in a 2020 LessWrong post by nostalgebraist, applies the model's final unembedding matrix to the intermediate activations at each layer, as if the prediction were made there.
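As a concrete illustration, here is a minimal sketch of that idea, assuming the Hugging Face transformers GPT-2 implementation (where the final layer norm is model.transformer.ln_f and the unembedding is model.lm_head); the prompt and model choice are illustrative, not from the original post.

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Minimal logit-lens sketch, assuming the Hugging Face GPT-2 implementation:
# the final layer norm is model.transformer.ln_f and the unembedding
# (tied to the token embeddings) is model.lm_head.
tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

prompt = "The Eiffel Tower is in the city of"
inputs = tok(prompt, return_tensors="pt")

with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# out.hidden_states holds the residual stream after the embedding layer and
# after each transformer block (13 tensors for GPT-2 small).
def lens_logits(hidden_state):
    """Apply the final layer norm and unembedding to an intermediate state."""
    return model.lm_head(model.transformer.ln_f(hidden_state))
```

Applying the final layer norm before the unembedding mirrors what the model itself does at the last layer; without it, the intermediate logits are harder to compare across layers.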
What you see
Early layers: predictions close to the input token or simple patterns
Middle layers: predictions related to the general category or topic
Later layers: predictions refine toward the correct next token
Near the end: the final answer crystallizes (the sketch below makes this progression visible)
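Continuing the sketch above, you can see this progression by decoding the top prediction at the last token position for every layer. The exact tokens printed depend on the model and prompt; this is only an illustration.

```python
# Decode the top prediction at the last position for every layer.
# Early layers tend to echo the input or simple patterns; later layers
# converge on the model's final answer.
for layer, h in enumerate(out.hidden_states):
    logits = lens_logits(h[0, -1])          # residual stream at the last token
    top_id = int(logits.argmax())
    print(f"layer {layer:2d}: {tok.decode(top_id)!r}")
```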
Variants and refinements
Tuned lens (Belrose et al. 2023): train small translator networks per layer for a more accurate readout
Logit difference: compare predictions for target vs. distractor tokens (sketched in code after this list)
Direct logit attribution: decompose which components contributed to a prediction
Patchscopes: use a stronger model to interpret activations from a weaker one
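Of these variants, the logit difference is the easiest to sketch. Continuing the same assumed GPT-2 setup, the idea is to track the gap between the logits of a target token and a distractor token across layers; the specific tokens below are hypothetical choices for illustration.

```python
# Track the logit gap between a target and a distractor token across layers.
# A growing gap shows where the model commits to the target over the distractor.
# The token strings here are illustrative, not prescribed by the lesson.
target_id = tok(" Paris", add_special_tokens=False)["input_ids"][0]
distractor_id = tok(" London", add_special_tokens=False)["input_ids"][0]

for layer, h in enumerate(out.hidden_states):
    logits = lens_logits(h[0, -1])
    gap = (logits[target_id] - logits[distractor_id]).item()
    print(f"layer {layer:2d}: logit(target) - logit(distractor) = {gap:+.2f}")
```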
The big idea: the logit lens is one of the cheapest interpretability tools. One line of code gives you a new window into how a transformer thinks.
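In the GPT-2 setup assumed above, that one line is just the final layer norm followed by the unembedding, applied to an intermediate hidden state:

```python
# The whole trick in one line, in the assumed GPT-2 setup
# (hidden_states[6] is an arbitrary middle layer chosen for illustration).
mid_layer_logits = model.lm_head(model.transformer.ln_f(out.hidden_states[6]))
```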
End-of-lesson check
15 questions · take it digitally for instant feedback at tendril.neural-forge.io/learn/quiz/end-safety2-logit-lens-builders
What does the logit lens technique do?
Trains new weights for each layer of the transformer
Removes the residual stream to simplify processing
Replaces the attention mechanism with direct connections
Applies the model's final unembedding matrix to intermediate layer activations
What would you expect to see when applying the logit lens to early layers of a transformer?
Predictions close to the input token or simple patterns
Complete mathematical gibberish with no token structure
Predictions about unrelated random topics
The exact final correct answer
A transformer predicts token A at layer 10 but predicts the correct token B at layer 25. What does this demonstrate?
The early layers contained the correct answer but chose not to output it
The residual stream was erased between layers 10 and 25
The model is malfunctioning and producing inconsistent results
Later processing corrected or added information to reach the right answer
If a wrong prediction stays wrong from early layers through late layers, what can you conclude?
Early layers should be removed to fix the issue
The model has learned the wrong thing completely
The logit lens technique has failed permanently
The necessary information was never present in the residual stream or was erased
How does the 'tuned lens' improve upon the basic logit lens?
It uses a completely different type of neural network
It trains small translator networks per layer for more accurate reading
It removes the need for any matrix operations
It applies the same unembedding to all layers without adjustment
What is the 'residual stream' in a transformer architecture?
The final output layer that produces token predictions
The initial token embedding that starts the forward pass
A running hidden state that each layer reads from and writes corrections to
A type of attention head that connects distant tokens
If the logit lens reads what appears to be gibberish from early layers, what might be happening?
Early layers contain absolutely no usable information
The residual stream has been deleted
The representations have changed basis across layers, making the lens inapplicable
The model has completely failed and produces no meaningful output
What is the purpose of 'logit difference' when using interpretability tools?
Subtracting the input embeddings from the final output
Adding the logit values from all layers together
Comparing the model's prediction scores for a target token versus a distractor token
Multiplying predictions across multiple models
What are 'patchscopes' in interpretability research?
A method for connecting different transformer layers directly
A technique for removing unnecessary model parameters
Using a stronger model to interpret or make sense of activations from a weaker model
Patching software bugs in transformer architectures
Why is the logit lens described as 'one of the cheapest interpretability tools'?
It can only be used on small models
It requires only a single line of code to implement
It uses no computational resources whatsoever
It requires no access to model weights
What assumption does the logit lens make about early-layer activations?
That the final unembedding matrix can be meaningfully applied to them
That they contain perfect representations of the final answer
That they are identical to late-layer activations
That they contain no information whatsoever
In the middle layers of a transformer, what type of predictions does the logit lens typically reveal?
Predictions that are always wrong
The exact correct next token in all cases
Random predictions unrelated to the input
Predictions related to the general category or topic of the input
What does 'direct logit attribution' do?
Decomposes which model components contributed to a particular prediction
Removes the logit layer entirely from the model
Creates new logits for different tasks
Changes the logits to improve model accuracy
According to the text, who first popularized the logit lens technique?
Researchers at a major AI laboratory
The developers of GPT models
A user named nostalgebraist on the LessWrong forum
The authors of the original transformer paper
What happens in the layers 'near the end' of a transformer according to the logit lens observations?
Information starts being erased from the stream
The final answer crystallizes into its precise form