Lesson 3 of 2116
Transformers Under the Hood
Attention, positional encoding, residual streams. A walk through the architecture that powers every frontier language model today.
Lesson map
What this lesson covers
Learning path
The main moves in order
1. Why the Transformer Won
2. Transformer
3. Attention
4. Self-attention
Section 1
Why the Transformer Won
Before 2017, recurrent networks processed text one token at a time, making long-range dependencies hard and training slow. The transformer replaced recurrence with self-attention, which looks at all tokens in parallel. The result: massive speedups and dramatically better modeling of long context.
The core mechanism: self-attention
For each token, the model computes three vectors: a query (Q), a key (K), and a value (V). The query for token i asks, "which other tokens matter to me?" Each key answers, "how relevant am I?" The attention weights are the scaled dot products of queries and keys, passed through a softmax, and the output is the weighted sum of the values.
Attention from first principles. Everything else in a transformer is plumbing around this.
```python
# Scaled dot-product attention, the math heart of transformers
import torch
import torch.nn.functional as F

def attention(Q, K, V, mask=None):
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float('-inf'))
    weights = F.softmax(scores, dim=-1)
    return weights @ V

# Q, K, V shapes: (batch, heads, seq_len, d_head)
```

Multi-head attention
Running attention once gives you one perspective on the sequence. Transformers run it many times in parallel, each head with its own learned projections. Different heads end up focusing on syntax, coreference, long-range structure, and so on. The outputs are concatenated and projected.
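As a sketch of how the heads fit together (class name, dimensions, and hyperparameters here are illustrative, not taken from any particular model): each role gets one learned projection, the result is split into heads, attention runs per head, and the heads are concatenated and projected back.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadSelfAttention(nn.Module):
    def __init__(self, d_model=64, n_heads=4):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        # One learned projection per role; heads are split from the output
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x):
        B, T, D = x.shape
        # Project, then reshape to (batch, heads, seq_len, d_head)
        def split(t):
            return t.view(B, T, self.n_heads, self.d_head).transpose(1, 2)
        Q, K, V = split(self.q_proj(x)), split(self.k_proj(x)), split(self.v_proj(x))
        scores = Q @ K.transpose(-2, -1) / self.d_head ** 0.5
        weights = F.softmax(scores, dim=-1)
        out = weights @ V                           # (B, heads, T, d_head)
        out = out.transpose(1, 2).reshape(B, T, D)  # concatenate heads
        return self.out_proj(out)

x = torch.randn(2, 10, 64)
mha = MultiHeadSelfAttention()
print(mha(x).shape)  # torch.Size([2, 10, 64])
```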
The full block
1. LayerNorm on the input
2. Multi-head self-attention
3. Residual connection (add the input back)
4. LayerNorm again
5. Feed-forward MLP (two linear layers with an activation)
6. Residual connection
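The steps above can be sketched as a minimal pre-norm block, here using PyTorch's built-in `nn.MultiheadAttention` for brevity (all hyperparameters are illustrative):

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    def __init__(self, d_model=64, n_heads=4):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        # Feed-forward MLP: two linear layers with an activation in between
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x):
        h = self.ln1(x)
        attn_out, _ = self.attn(h, h, h)  # self-attention: Q = K = V = h
        x = x + attn_out                  # residual connection 1
        x = x + self.mlp(self.ln2(x))     # residual connection 2
        return x

blk = Block()
x = torch.randn(2, 7, 64)
print(blk(x).shape)  # torch.Size([2, 7, 64])
```

Because each sub-layer only adds its output back onto the input, the stack of blocks repeatedly edits the same residual stream of vectors.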
Positional encoding
Attention alone is permutation-invariant — it does not know token order. Original transformers added sinusoidal positional encodings to token embeddings. Modern models mostly use rotary positional embeddings (RoPE) applied inside attention, which generalize better to longer contexts.
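The original sinusoidal scheme fits in a few lines. This is a minimal sketch of the 2017 formula (the helper name is ours): even dimensions get a sine, odd dimensions a cosine, at geometrically spaced frequencies.

```python
import torch

def sinusoidal_pe(seq_len, d_model):
    # pe[pos, 2i] = sin(pos / 10000^(2i/d_model)); pe[pos, 2i+1] = cos(same)
    pos = torch.arange(seq_len).unsqueeze(1).float()
    i = torch.arange(0, d_model, 2).float()
    angles = pos / (10000 ** (i / d_model))
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(angles)
    pe[:, 1::2] = torch.cos(angles)
    return pe

pe = sinusoidal_pe(10, 64)
print(pe.shape)  # torch.Size([10, 64])
```

The table of encodings is simply added to the token embeddings before the first block, giving attention a way to distinguish positions.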
Decoder-only and the causal mask
GPT-style models use decoder-only transformers with a causal mask that prevents any token from attending to future tokens. This is what makes autoregressive generation possible — predict the next token, append, predict again.
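The causal mask itself is just a lower-triangular matrix. A small sketch (sequence length is made up for the example) shows that masked positions receive exactly zero attention weight:

```python
import torch
import torch.nn.functional as F

T = 4
mask = torch.tril(torch.ones(T, T))  # 1 on/below the diagonal, 0 above
# Setting masked scores to -inf makes softmax assign them zero weight,
# so token i can only attend to tokens 0..i.
scores = torch.zeros(T, T).masked_fill(mask == 0, float('-inf'))
weights = F.softmax(scores, dim=-1)
print(weights)
```

With uniform scores, row i comes out uniform over positions 0..i and exactly zero on every future position.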
Compare the options
| Variant | Use case | Attention pattern |
|---|---|---|
| Encoder-only (BERT) | Classification, embeddings | Bidirectional |
| Decoder-only (GPT, Claude, Llama) | Text generation | Causal (left-to-right) |
| Encoder-decoder (T5) | Translation, summarization | Cross-attention |
Recent architectural tweaks
- Grouped-query attention (GQA) shares key/value heads across groups of query heads, shrinking the KV cache at inference
- Mixture of Experts (MoE) routes each token to a small subset of expert MLPs
- FlashAttention reorders the computation to speed up attention on long sequences
- State space models like Mamba challenge pure attention
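To illustrate the GQA idea from the list above (all dimensions here are invented for the example): eight query heads share two key/value heads, cutting the KV cache fourfold; the K/V heads are repeated to line up with the query heads before standard attention.

```python
import torch

B, T, n_q_heads, n_kv_heads, d_head = 2, 6, 8, 2, 16
Q = torch.randn(B, n_q_heads, T, d_head)
K = torch.randn(B, n_kv_heads, T, d_head)   # only 2 heads cached
V = torch.randn(B, n_kv_heads, T, d_head)

group = n_q_heads // n_kv_heads             # 4 query heads per KV head
K = K.repeat_interleave(group, dim=1)       # expand to (B, 8, T, d_head)
V = V.repeat_interleave(group, dim=1)

weights = torch.softmax(Q @ K.transpose(-2, -1) / d_head ** 0.5, dim=-1)
out = weights @ V
print(out.shape)  # torch.Size([2, 8, 6, 16])
```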
“Attention is all you need — but the plumbing around it is where the engineering is.”
The big idea: a transformer is a stack of blocks that repeatedly edit a shared stream of vectors using attention and MLPs. Every frontier LLM is a variation on that theme.
