Attention, positional encoding, residual streams. A walk through the architecture that powers every frontier language model today.
Before 2017, recurrent networks processed text one token at a time, making long-range dependencies hard and training slow. The transformer replaced recurrence with self-attention, which looks at all tokens in parallel. The result: massive speedups and dramatically better modeling of long context.
For each token, the model computes three vectors: a query (Q), a key (K), and a value (V). The query for token i asks, which other tokens matter for me? The keys answer, how relevant am I? The attention weights come from the dot products of each query with every key, scaled by the square root of the head dimension and softmaxed. The output is a weighted sum of values.
```python
# Scaled dot-product attention, the math heart of transformers
import torch
import torch.nn.functional as F

def attention(Q, K, V, mask=None):
    # Q, K, V shapes: (batch, heads, seq_len, d_head)
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float('-inf'))
    weights = F.softmax(scores, dim=-1)
    return weights @ V
```

Attention from first principles. Everything else in a transformer is plumbing around this.

Running attention once gives you one perspective on the sequence. Transformers run it many times in parallel, each head with its own learned projections. Different heads end up focusing on syntax, coreference, long-range structure, and so on. The outputs are concatenated and projected.
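The multi-head wiring can be sketched as a module like the following. This is a minimal illustration, not the lesson's reference implementation; the class and attribute names are our own.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, n_heads):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        # One learned projection per role; each head reads its own slice.
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x, mask=None):
        b, t, _ = x.shape
        # Project, then split the model dimension into (heads, d_head).
        def split(z):
            return z.view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        Q, K, V = split(self.q_proj(x)), split(self.k_proj(x)), split(self.v_proj(x))
        scores = Q @ K.transpose(-2, -1) / self.d_head ** 0.5
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float('-inf'))
        out = F.softmax(scores, dim=-1) @ V
        # Concatenate the heads, then mix them with a final projection.
        out = out.transpose(1, 2).contiguous().view(b, t, -1)
        return self.out_proj(out)
```

Note that the per-head attention is exactly the scaled dot-product function above; the heads differ only in their learned projections.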
Attention alone is permutation-invariant — it does not know token order. Original transformers added sinusoidal positional encodings to token embeddings. Modern models mostly use rotary positional embeddings (RoPE) applied inside attention, which generalize better to longer contexts.
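The original sinusoidal scheme can be sketched as follows; the function name is illustrative, and d_model is assumed even.

```python
import torch

def sinusoidal_positions(seq_len, d_model):
    # Positions (seq_len, 1) against a geometric ladder of frequencies.
    pos = torch.arange(seq_len).unsqueeze(1).float()
    i = torch.arange(0, d_model, 2).float()
    freq = 1.0 / (10000 ** (i / d_model))
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(pos * freq)  # even dimensions: sine
    pe[:, 1::2] = torch.cos(pos * freq)  # odd dimensions: cosine
    return pe
```

These encodings are simply added to the token embeddings before the first block, giving each position a unique, smoothly varying signature.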
GPT-style models use decoder-only transformers with a causal mask that prevents any token from attending to future tokens. This is what makes autoregressive generation possible — predict the next token, append, predict again.
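A causal mask is just a lower-triangular matrix: position i may attend to positions up to and including i. A small sketch of building one and applying it the way the attention function above does:

```python
import torch

seq_len = 5
# Lower-triangular matrix: 1 = attention allowed, 0 = future position.
mask = torch.tril(torch.ones(seq_len, seq_len))

scores = torch.randn(seq_len, seq_len)
scores = scores.masked_fill(mask == 0, float('-inf'))
weights = torch.softmax(scores, dim=-1)
# Each row sums to 1 and places zero weight on future positions.
```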
| Variant | Use case | Attention pattern |
|---|---|---|
| Encoder-only (BERT) | Classification, embeddings | Bidirectional |
| Decoder-only (GPT, Claude, Llama) | Text generation | Causal (left-to-right) |
| Encoder-decoder (T5) | Translation, summarization | Cross-attention |
> Attention is all you need — but the plumbing around it is where the engineering is.
>
> — A transformer implementer
The big idea: a transformer is a stack of blocks that repeatedly edit a shared stream of vectors using attention and MLPs. Every frontier LLM is a variation on that theme.
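One block of that stack can be sketched as below, using PyTorch's built-in attention for brevity. This is a pre-norm layout with a 4x MLP expansion, a common pattern but only one of many variations; the class name is ours.

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    def __init__(self, d_model, n_heads):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x):
        # Each sublayer reads the residual stream, computes an update,
        # and adds it back; the stream itself is never overwritten.
        h = self.ln1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        x = x + self.mlp(self.ln2(x))
        return x
```

Stacking dozens of these blocks, plus embeddings at the bottom and an unembedding at the top, is essentially the whole model.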
15 questions · take it digitally for instant feedback at tendril.neural-forge.io/learn/quiz/end-creators-transformers-under-the-hood