Eight Google authors replaced recurrence with attention and quietly launched the modern AI era.
In June 2017, eight authors from Google Brain and Google Research, led by Ashish Vaswani, posted Attention Is All You Need on arXiv. The paper proposed the Transformer, an architecture with no recurrence and no convolutions. Just attention, feed-forward layers, and residual connections.
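To make "just attention" concrete, here is a minimal NumPy sketch of scaled dot-product attention, the core operation the paper builds on. It deliberately omits the learned query/key/value projections, multiple heads, masking, and layer normalization, and the shapes and variable names are illustrative only, not the paper's reference implementation.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Each output row is a weighted average of V, weighted by Q-K similarity."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # (seq, seq) pairwise similarities
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)  # softmax over each row
    return weights @ V                              # (seq, d_v) blended value vectors

# Toy self-attention over 4 tokens with 8-dimensional embeddings (made-up sizes).
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
out = scaled_dot_product_attention(x, x, x)         # self-attention: Q = K = V = x
print(out.shape)                                    # (4, 8)
```

Because every token attends to every other token in one matrix multiply, there is no step-by-step recurrence to wait on, which is what lets training parallelize across the whole sequence.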
The immediate target was machine translation, where Transformers beat the best recurrent models on English-to-German while training several times faster. Within two years, the paper would be one of the most cited in AI history.
The Transformer became the common substrate for everything that followed. BERT from Google in 2018 used the encoder. GPT from OpenAI used the decoder. Vision Transformers, Whisper, AlphaFold, Stable Diffusion, and the current generation of chat models all rest on it.
We propose a new simple network architecture, the Transformer, based solely on attention mechanisms.
— Vaswani et al., 2017
The big idea: one clean architectural choice, attention without recurrence, unlocked a decade of scaling. Every LLM you use today traces directly to this paper.
15 questions · take it digitally for instant feedback at tendril.neural-forge.io/learn/quiz/end-history-attention-2017-creators
In what year was the paper introducing the Transformer architecture published?
Which company employed the eight authors who published the Transformer paper?
On which task did the Transformer originally outperform the best prior models?
What are the three main components of the Transformer architecture according to the original paper?
Why does attention enable parallel training that RNNs cannot achieve?
How does attention handle long-range dependencies in a sequence compared to RNNs?
Which modality was NOT mentioned as an area where Transformers were applied?
What architectural component did BERT use from the Transformer?
What architectural component did GPT use from the Transformer?
What is the purpose of positional encoding in Transformers?
In self-attention, how does each token's representation get updated?
What does multi-head attention do differently from single-head attention?
What was the key innovation that the Transformer paper introduced?
What did the quote 'We propose a new simple network architecture, the Transformer' emphasize about the model?
Which model family that powers modern chat applications directly descends from the Transformer?