Logit Lens: Peeking at Predictions Mid-Forward-Pass
A transformer processes a token through many layers before outputting a prediction. The logit lens shows you what the model would predict if it stopped at each layer along the way.
25 min · Reviewed 2026
A Diagnostic Probe for the Residual Stream
Transformers build up predictions layer by layer. Each layer reads the residual stream, a running hidden state, and writes a correction back to it. The logit lens technique, popularized in a 2020 LessWrong post by nostalgebraist, applies the model's final unembedding matrix to the intermediate activations at each layer, as if the prediction were made there.
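As a concrete illustration, here is a minimal sketch of that idea, assuming the Hugging Face transformers GPT-2 implementation (where the final layer norm is model.transformer.ln_f and the unembedding is model.lm_head); the prompt and model choice are illustrative, not from the original post.

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Minimal logit-lens sketch, assuming the Hugging Face GPT-2 implementation:
# the final layer norm is model.transformer.ln_f and the unembedding
# (tied to the token embeddings) is model.lm_head.
tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

prompt = "The Eiffel Tower is in the city of"
inputs = tok(prompt, return_tensors="pt")

with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# out.hidden_states holds the residual stream after the embedding layer and
# after each transformer block (13 tensors for GPT-2 small).
def lens_logits(hidden_state):
    """Apply the final layer norm and unembedding to an intermediate state."""
    return model.lm_head(model.transformer.ln_f(hidden_state))
```

Applying the final layer norm before the unembedding mirrors what the model itself does at the last layer; without it, the intermediate logits are harder to compare across layers.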
What you see
Early layers: predictions close to the input token or simple patterns
Middle layers: predictions related to the general category or topic
Later layers: predictions refine toward the correct next token
Near the end: the final answer crystallizes (the sketch below makes this progression visible)
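Continuing the sketch above, you can see this progression by decoding the top prediction at the last token position for every layer. The exact tokens printed depend on the model and prompt; this is only an illustration.

```python
# Decode the top prediction at the last position for every layer.
# Early layers tend to echo the input or simple patterns; later layers
# converge on the model's final answer.
for layer, h in enumerate(out.hidden_states):
    logits = lens_logits(h[0, -1])          # residual stream at the last token
    top_id = int(logits.argmax())
    print(f"layer {layer:2d}: {tok.decode(top_id)!r}")
```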
Variants and refinements
Tuned lens (Belrose et al. 2023): train small translator networks per layer for a more accurate readout
Logit difference: compare predictions for target vs. distractor tokens (sketched in code after this list)
Direct logit attribution: decompose which components contributed to a prediction
Patchscopes: use a stronger model to interpret activations from a weaker one
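Of these variants, the logit difference is the easiest to sketch. Continuing the same assumed GPT-2 setup, the idea is to track the gap between the logits of a target token and a distractor token across layers; the specific tokens below are hypothetical choices for illustration.

```python
# Track the logit gap between a target and a distractor token across layers.
# A growing gap shows where the model commits to the target over the distractor.
# The token strings here are illustrative, not prescribed by the lesson.
target_id = tok(" Paris", add_special_tokens=False)["input_ids"][0]
distractor_id = tok(" London", add_special_tokens=False)["input_ids"][0]

for layer, h in enumerate(out.hidden_states):
    logits = lens_logits(h[0, -1])
    gap = (logits[target_id] - logits[distractor_id]).item()
    print(f"layer {layer:2d}: logit(target) - logit(distractor) = {gap:+.2f}")
```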
The big idea: the logit lens is one of the cheapest interpretability tools. One line of code gives you a new window into how a transformer thinks.
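In the GPT-2 setup assumed above, that one line is just the final layer norm followed by the unembedding, applied to an intermediate hidden state:

```python
# The whole trick in one line, in the assumed GPT-2 setup
# (hidden_states[6] is an arbitrary middle layer chosen for illustration).
mid_layer_logits = model.lm_head(model.transformer.ln_f(out.hidden_states[6]))
```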
End-of-lesson check
15 questions · take it digitally for instant feedback at tendril.neural-forge.io/learn/quiz/end-safety2-logit-lens-builders
What does the logit lens technique do?
Trains new weights for each layer of the transformer
Removes the residual stream to simplify processing
Replaces the attention mechanism with direct connections
Applies the model's final unembedding matrix to intermediate layer activations
What would you expect to see when applying the logit lens to early layers of a transformer?
Predictions close to the input token or simple patterns
Complete mathematical gibberish with no token structure
Predictions about unrelated random topics
The exact final correct answer
A transformer predicts token A at layer 10 but predicts the correct token B at layer 25. What does this demonstrate?
The early layers contained the correct answer but chose not to output it
The residual stream was erased between layers 10 and 25
The model is malfunctioning and producing inconsistent results
Later processing corrected or added information to reach the right answer
If a wrong prediction stays wrong from early layers through late layers, what can you conclude?
Early layers should be removed to fix the issue
The model has learned the wrong thing completely
The logit lens technique has failed permanently
The necessary information was never present in the residual stream or was erased
How does the 'tuned lens' improve upon the basic logit lens?
It uses a completely different type of neural network
It trains small translator networks per layer for more accurate reading
It removes the need for any matrix operations
It applies the same unembedding to all layers without adjustment
What is the 'residual stream' in a transformer architecture?
The final output layer that produces token predictions
The initial token embedding that starts the forward pass
A running hidden state that each layer reads from and writes corrections to
A type of attention head that connects distant tokens
If the logit lens reads what appears to be gibberish from early layers, what might be happening?
Early layers contain absolutely no usable information
The residual stream has been deleted
The representations have changed basis across layers, making the lens inapplicable
The model has completely failed and produces no meaningful output
What is the purpose of 'logit difference' when using interpretability tools?
Subtracting the input embeddings from the final output
Adding the logit values from all layers together
Comparing the model's prediction scores for a target token versus a distractor token
Multiplying predictions across multiple models
What are 'patchscopes' in interpretability research?
A method for connecting different transformer layers directly
A technique for removing unnecessary model parameters
Using a stronger model to interpret or make sense of activations from a weaker model
Patching software bugs in transformer architectures
Why is the logit lens described as 'one of the cheapest interpretability tools'?
It can only be used on small models
It requires only a single line of code to implement
It uses no computational resources whatsoever
It requires no access to model weights
What assumption does the logit lens make about early-layer activations?
That the final unembedding matrix can be meaningfully applied to them
That they contain perfect representations of the final answer
That they are identical to late-layer activations
That they contain no information whatsoever
In the middle layers of a transformer, what type of predictions does the logit lens typically reveal?
Predictions that are always wrong
The exact correct next token in all cases
Random predictions unrelated to the input
Predictions related to the general category or topic of the input
What does 'direct logit attribution' do?
Decomposes which model components contributed to a particular prediction
Removes the logit layer entirely from the model
Creates new logits for different tasks
Changes the logits to improve model accuracy
According to the text, who first popularized the logit lens technique?
Researchers at a major AI laboratory
The developers of GPT models
A user named nostalgebraist on the LessWrong forum
The authors of the original transformer paper
What happens in the layers 'near the end' of a transformer according to the logit lens observations?
Information starts being erased from the stream
The final answer crystallizes into its precise form