Attention, positional encoding, residual streams. A walk through the architecture that powers every frontier language model today.
Before 2017, recurrent networks processed text one token at a time, making long-range dependencies hard and training slow. The transformer replaced recurrence with self-attention, which looks at all tokens in parallel. The result: massive speedups and dramatically better modeling of long context.
For each token, the model computes three vectors: a query (Q), a key (K), and a value (V). The query for token i asks, which other tokens matter for me? The keys answer, how relevant am I? The attention weights come from the dot products of each query with every key, scaled by the square root of the head dimension and softmaxed. The output is a weighted sum of values.
```python
# Scaled dot-product attention, the math heart of transformers
import torch
import torch.nn.functional as F

def attention(Q, K, V, mask=None):
    # Q, K, V shapes: (batch, heads, seq_len, d_head)
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float('-inf'))
    weights = F.softmax(scores, dim=-1)
    return weights @ V
```

Attention from first principles. Everything else in a transformer is plumbing around this.

Running attention once gives you one perspective on the sequence. Transformers run it many times in parallel, each head with its own learned projections. Different heads end up focusing on syntax, coreference, long-range structure, and so on. The outputs are concatenated and projected.
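The multi-head wiring can be sketched as a module like the following. This is a minimal illustration, not the lesson's reference implementation; the class and attribute names are our own.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, n_heads):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        # One learned projection per role; each head reads its own slice.
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x, mask=None):
        b, t, _ = x.shape
        # Project, then split the model dimension into (heads, d_head).
        def split(z):
            return z.view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        Q, K, V = split(self.q_proj(x)), split(self.k_proj(x)), split(self.v_proj(x))
        scores = Q @ K.transpose(-2, -1) / self.d_head ** 0.5
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float('-inf'))
        out = F.softmax(scores, dim=-1) @ V
        # Concatenate the heads, then mix them with a final projection.
        out = out.transpose(1, 2).contiguous().view(b, t, -1)
        return self.out_proj(out)
```

Note that the per-head attention is exactly the scaled dot-product function above; the heads differ only in their learned projections.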
Attention alone is permutation-invariant — it does not know token order. Original transformers added sinusoidal positional encodings to token embeddings. Modern models mostly use rotary positional embeddings (RoPE) applied inside attention, which generalize better to longer contexts.
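The original sinusoidal scheme can be sketched as follows; the function name is illustrative, and d_model is assumed even.

```python
import torch

def sinusoidal_positions(seq_len, d_model):
    # Positions (seq_len, 1) against a geometric ladder of frequencies.
    pos = torch.arange(seq_len).unsqueeze(1).float()
    i = torch.arange(0, d_model, 2).float()
    freq = 1.0 / (10000 ** (i / d_model))
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(pos * freq)  # even dimensions: sine
    pe[:, 1::2] = torch.cos(pos * freq)  # odd dimensions: cosine
    return pe
```

These encodings are simply added to the token embeddings before the first block, giving each position a unique, smoothly varying signature.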
GPT-style models use decoder-only transformers with a causal mask that prevents any token from attending to future tokens. This is what makes autoregressive generation possible — predict the next token, append, predict again.
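A causal mask is just a lower-triangular matrix: position i may attend to positions up to and including i. A small sketch of building one and applying it the way the attention function above does:

```python
import torch

seq_len = 5
# Lower-triangular matrix: 1 = attention allowed, 0 = future position.
mask = torch.tril(torch.ones(seq_len, seq_len))

scores = torch.randn(seq_len, seq_len)
scores = scores.masked_fill(mask == 0, float('-inf'))
weights = torch.softmax(scores, dim=-1)
# Each row sums to 1 and places zero weight on future positions.
```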
| Variant | Use case | Attention pattern |
|---|---|---|
| Encoder-only (BERT) | Classification, embeddings | Bidirectional |
| Decoder-only (GPT, Claude, Llama) | Text generation | Causal (left-to-right) |
| Encoder-decoder (T5) | Translation, summarization | Cross-attention |
> Attention is all you need — but the plumbing around it is where the engineering is.
>
> — A transformer implementer
The big idea: a transformer is a stack of blocks that repeatedly edit a shared stream of vectors using attention and MLPs. Every frontier LLM is a variation on that theme.
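One block of that stack can be sketched as below, using PyTorch's built-in attention for brevity. This is a pre-norm layout with a 4x MLP expansion, a common pattern but only one of many variations; the class name is ours.

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    def __init__(self, d_model, n_heads):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x):
        # Each sublayer reads the residual stream, computes an update,
        # and adds it back; the stream itself is never overwritten.
        h = self.ln1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        x = x + self.mlp(self.ln2(x))
        return x
```

Stacking dozens of these blocks, plus embeddings at the bottom and an unembedding at the top, is essentially the whole model.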
15 questions · take it digitally for instant feedback at tendril.neural-forge.io/learn/quiz/end-creators-transformers-under-the-hood