Lesson 319 of 2116
Attention Is All You Need, 2017
Eight Google authors replaced recurrence with attention and quietly launched the modern AI era.
Lesson map
What this lesson covers
Learning path
The main moves in order
1. The Paper That Rewired NLP
2. Transformer
3. Self-attention
4. Vaswani
Section 1
The Paper That Rewired NLP
In June 2017, eight authors from Google Brain and Google Research, led by Ashish Vaswani, posted Attention Is All You Need on arXiv. The paper proposed the Transformer, an architecture with no recurrence and no convolutions. Just attention, feed-forward layers, and residual connections.
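The core operation is scaled dot-product attention, which the paper writes as softmax(QKᵀ / √d_k)V. Below is a minimal single-head sketch in NumPy, just to make the shape of the computation concrete; the full Transformer adds multi-head projections, positional encodings, residual connections, layer normalization, and feed-forward layers on top of this.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # softmax(Q K^T / sqrt(d_k)) V  -- the attention formula from the paper
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                      # query-key similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)  # softmax over keys
    return weights @ V                                   # weighted sum of values

# Toy self-attention: 4 positions, 8-dimensional vectors (sizes are illustrative)
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
out = scaled_dot_product_attention(x, x, x)              # Q = K = V = x
print(out.shape)  # (4, 8)
```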
The immediate target was machine translation, where Transformers beat the best recurrent models on English-to-German while training several times faster. Within two years, the paper would be one of the most cited in AI history.
Why Transformers took over
1. Parallel training: unlike RNNs, attention processes all positions at once, making full use of GPUs (see the sketch after this list)
2. Long-range dependencies: attention sees the whole sequence, not just a compressed state
3. Scaling: performance improved smoothly with more parameters, more data, and more compute
4. Generality: the same architecture worked for text, images, audio, code, and proteins
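To make the parallelism point concrete, here is a hypothetical toy contrast (not code from the paper): a recurrent update has to loop over positions one at a time, while an attention step is a single matrix product over the whole sequence.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 6, 8                        # toy sequence length and model width
x = rng.normal(size=(T, d))

# Recurrent style: each hidden state depends on the previous one,
# so the T positions must be processed sequentially.
W = rng.normal(size=(d, d)) * 0.1
h = np.zeros(d)
for t in range(T):
    h = np.tanh(x[t] @ W + h)      # sequential bottleneck

# Attention style: every position attends to every other position in one
# batched matrix product, so all T outputs can be computed in parallel.
scores = x @ x.T / np.sqrt(d)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights = weights / weights.sum(axis=-1, keepdims=True)
out = weights @ x                  # shape (T, d), computed all at once
```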
The Transformer became the common substrate for everything that followed. BERT from Google in 2018 used the encoder. GPT from OpenAI used the decoder. Vision Transformers, Whisper, AlphaFold, Stable Diffusion, and the current generation of chat models all rest on it.
“We propose a new simple network architecture, the Transformer, based solely on attention mechanisms.”
The big idea: one clean architectural choice, attention without recurrence, unlocked a decade of scaling. Every LLM you use today traces directly to this paper.
Related lessons
Keep going
Creators · 55 min
Transformers Under the Hood
Attention, positional encoding, residual streams. A walk through the architecture that powers every frontier language model today.
Creators · 55 min
The Three Ingredients: Data, Compute, Algorithms (Capstone)
Every AI breakthrough of the past decade rests on three interacting ingredients. Synthesize everything you have learned into one working model.
Creators · 45 min
Uncertainty Quantification in LLMs
A model that says 'I am 95 percent sure' and is wrong 40 percent of the time is miscalibrated. Measuring that gap is uncertainty quantification.
