Lesson 3 of 2116
Transformers Under the Hood
Attention, positional encoding, residual streams. A walk through the architecture that powers every frontier language model today.
Lesson map
What this lesson covers
Learning path
The main moves in order
1. Why the Transformer Won
2. Transformer
3. Attention
4. Self-attention
Section 1
Why the Transformer Won
Before 2017, recurrent networks processed text one token at a time, making long-range dependencies hard and training slow. The transformer replaced recurrence with self-attention, which looks at all tokens in parallel. The result: massive speedups and dramatically better modeling of long context.
The core mechanism: self-attention
For each token, the model computes three vectors: a query (Q), a key (K), and a value (V). The query for token i asks, "which other tokens matter to me?" Each key answers, "how relevant am I?" The attention weights are the scaled dot products of queries and keys, passed through a softmax, and the output is the weighted sum of the values.
Attention from first principles. Everything else in a transformer is plumbing around this.
```python
# Scaled dot-product attention, the math heart of transformers
import torch
import torch.nn.functional as F

def attention(Q, K, V, mask=None):
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float('-inf'))
    weights = F.softmax(scores, dim=-1)
    return weights @ V

# Q, K, V shapes: (batch, heads, seq_len, d_head)
```

Multi-head attention
Running attention once gives you one perspective on the sequence. Transformers run it many times in parallel, each head with its own learned projections. Different heads end up focusing on syntax, coreference, long-range structure, and so on. The outputs are concatenated and projected.
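As a sketch of how the heads fit together (class name, dimensions, and hyperparameters here are illustrative, not taken from any particular model): each role gets one learned projection, the result is split into heads, attention runs per head, and the heads are concatenated and projected back.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadSelfAttention(nn.Module):
    def __init__(self, d_model=64, n_heads=4):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        # One learned projection per role; heads are split from the output
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x):
        B, T, D = x.shape
        # Project, then reshape to (batch, heads, seq_len, d_head)
        def split(t):
            return t.view(B, T, self.n_heads, self.d_head).transpose(1, 2)
        Q, K, V = split(self.q_proj(x)), split(self.k_proj(x)), split(self.v_proj(x))
        scores = Q @ K.transpose(-2, -1) / self.d_head ** 0.5
        weights = F.softmax(scores, dim=-1)
        out = weights @ V                           # (B, heads, T, d_head)
        out = out.transpose(1, 2).reshape(B, T, D)  # concatenate heads
        return self.out_proj(out)

x = torch.randn(2, 10, 64)
mha = MultiHeadSelfAttention()
print(mha(x).shape)  # torch.Size([2, 10, 64])
```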
The full block
1. LayerNorm on the input
2. Multi-head self-attention
3. Residual connection (add the input back)
4. LayerNorm again
5. Feed-forward MLP (two linear layers with an activation)
6. Residual connection
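The steps above can be sketched as a minimal pre-norm block, here using PyTorch's built-in `nn.MultiheadAttention` for brevity (all hyperparameters are illustrative):

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    def __init__(self, d_model=64, n_heads=4):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        # Feed-forward MLP: two linear layers with an activation in between
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x):
        h = self.ln1(x)
        attn_out, _ = self.attn(h, h, h)  # self-attention: Q = K = V = h
        x = x + attn_out                  # residual connection 1
        x = x + self.mlp(self.ln2(x))     # residual connection 2
        return x

blk = Block()
x = torch.randn(2, 7, 64)
print(blk(x).shape)  # torch.Size([2, 7, 64])
```

Because each sub-layer only adds its output back onto the input, the stack of blocks repeatedly edits the same residual stream of vectors.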
Positional encoding
Attention alone is permutation-invariant — it does not know token order. Original transformers added sinusoidal positional encodings to token embeddings. Modern models mostly use rotary positional embeddings (RoPE) applied inside attention, which generalize better to longer contexts.
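The original sinusoidal scheme fits in a few lines. This is a minimal sketch of the 2017 formula (the helper name is ours): even dimensions get a sine, odd dimensions a cosine, at geometrically spaced frequencies.

```python
import torch

def sinusoidal_pe(seq_len, d_model):
    # pe[pos, 2i] = sin(pos / 10000^(2i/d_model)); pe[pos, 2i+1] = cos(same)
    pos = torch.arange(seq_len).unsqueeze(1).float()
    i = torch.arange(0, d_model, 2).float()
    angles = pos / (10000 ** (i / d_model))
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(angles)
    pe[:, 1::2] = torch.cos(angles)
    return pe

pe = sinusoidal_pe(10, 64)
print(pe.shape)  # torch.Size([10, 64])
```

The table of encodings is simply added to the token embeddings before the first block, giving attention a way to distinguish positions.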
Decoder-only and the causal mask
GPT-style models use decoder-only transformers with a causal mask that prevents any token from attending to future tokens. This is what makes autoregressive generation possible — predict the next token, append, predict again.
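The causal mask itself is just a lower-triangular matrix. A small sketch (sequence length is made up for the example) shows that masked positions receive exactly zero attention weight:

```python
import torch
import torch.nn.functional as F

T = 4
mask = torch.tril(torch.ones(T, T))  # 1 on/below the diagonal, 0 above
# Setting masked scores to -inf makes softmax assign them zero weight,
# so token i can only attend to tokens 0..i.
scores = torch.zeros(T, T).masked_fill(mask == 0, float('-inf'))
weights = F.softmax(scores, dim=-1)
print(weights)
```

With uniform scores, row i comes out uniform over positions 0..i and exactly zero on every future position.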
Compare the options
| Variant | Use case | Attention pattern |
|---|---|---|
| Encoder-only (BERT) | Classification, embeddings | Bidirectional |
| Decoder-only (GPT, Claude, Llama) | Text generation | Causal (left-to-right) |
| Encoder-decoder (T5) | Translation, summarization | Cross-attention |
Recent architectural tweaks
- Grouped-query attention (GQA) shares key/value heads across groups of query heads, shrinking the KV cache at inference
- Mixture of Experts (MoE) routes each token to a small subset of expert MLPs
- FlashAttention reorders the computation to speed up attention on long sequences
- State space models like Mamba challenge pure attention
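To illustrate the GQA idea from the list above (all dimensions here are invented for the example): eight query heads share two key/value heads, cutting the KV cache fourfold; the K/V heads are repeated to line up with the query heads before standard attention.

```python
import torch

B, T, n_q_heads, n_kv_heads, d_head = 2, 6, 8, 2, 16
Q = torch.randn(B, n_q_heads, T, d_head)
K = torch.randn(B, n_kv_heads, T, d_head)   # only 2 heads cached
V = torch.randn(B, n_kv_heads, T, d_head)

group = n_q_heads // n_kv_heads             # 4 query heads per KV head
K = K.repeat_interleave(group, dim=1)       # expand to (B, 8, T, d_head)
V = V.repeat_interleave(group, dim=1)

weights = torch.softmax(Q @ K.transpose(-2, -1) / d_head ** 0.5, dim=-1)
out = weights @ V
print(out.shape)  # torch.Size([2, 8, 6, 16])
```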
“Attention is all you need — but the plumbing around it is where the engineering is.”
The big idea: a transformer is a stack of blocks that repeatedly edit a shared stream of vectors using attention and MLPs. Every frontier LLM is a variation on that theme.
