Eight Google authors replaced recurrence with attention and quietly launched the modern AI era.
In June 2017, eight authors from Google Brain and Google Research, led by Ashish Vaswani, posted Attention Is All You Need on arXiv. The paper proposed the Transformer, an architecture with no recurrence and no convolutions. Just attention, feed-forward layers, and residual connections.
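To make "just attention" concrete, here is a minimal NumPy sketch of scaled dot-product attention, the core operation the paper builds on. It deliberately omits the learned query/key/value projections, multiple heads, masking, and layer normalization, and the shapes and variable names are illustrative only, not the paper's reference implementation.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Each output row is a weighted average of V, weighted by Q-K similarity."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # (seq, seq) pairwise similarities
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)  # softmax over each row
    return weights @ V                              # (seq, d_v) blended value vectors

# Toy self-attention over 4 tokens with 8-dimensional embeddings (made-up sizes).
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
out = scaled_dot_product_attention(x, x, x)         # self-attention: Q = K = V = x
print(out.shape)                                    # (4, 8)
```

Because every token attends to every other token in one matrix multiply, there is no step-by-step recurrence to wait on, which is what lets training parallelize across the whole sequence.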
The immediate target was machine translation, where Transformers beat the best recurrent models on English-to-German while training several times faster. Within two years, the paper would be one of the most cited in AI history.
The Transformer became the common substrate for everything that followed. BERT from Google in 2018 used the encoder. GPT from OpenAI used the decoder. Vision Transformers, Whisper, AlphaFold, Stable Diffusion, and the current generation of chat models all rest on it.
We propose a new simple network architecture, the Transformer, based solely on attention mechanisms.
— Vaswani et al., 2017
The big idea: one clean architectural choice, attention without recurrence, unlocked a decade of scaling. Every LLM you use today traces directly to this paper.
15 questions · take it digitally for instant feedback at tendril.neural-forge.io/learn/quiz/end-history-attention-2017-creators
In what year was the paper introducing the Transformer architecture published?
Which company employed the eight authors who published the Transformer paper?
On which task did the Transformer originally outperform the best prior models?
What are the three main components of the Transformer architecture according to the original paper?
Why does attention enable parallel training that RNNs cannot achieve?
How does attention handle long-range dependencies in a sequence compared to RNNs?
Which modality was NOT mentioned as an area where Transformers were applied?
What architectural component did BERT use from the Transformer?
What architectural component did GPT use from the Transformer?
What is the purpose of positional encoding in Transformers?
In self-attention, how does each token's representation get updated?
What does multi-head attention do differently from single-head attention?
What was the key innovation that the Transformer paper introduced?
What did the quote 'We propose a new simple network architecture, the Transformer' emphasize about the model?
Which model family that powers modern chat applications directly descends from the Transformer?