The Attention Revolution
Transformer Architecture Deep Dive: Lesson 1 of 28
In 2017, eight researchers at Google published "Attention Is All You Need", and the deep-learning landscape changed direction. Within five years, the dominant models in NLP, vision, speech, and biology (including protein folding) shared the same architectural backbone: the transformer. This course is a practitioner's deep dive into how attention works, why it beat what came before, and how to build a transformer yourself. This first lesson is the historical and conceptual setup: understanding why attention mattered makes the rest of the course click.
1. The Problem Attention Solved
Before 2017, the dominant sequence models were recurrent: RNNs, LSTMs, and GRUs. Two structural problems limited them:
- Sequential bottleneck: token n could only be processed after token n-1. With no parallelism within a sequence, GPUs were badly underutilised during training.
- Long-range dependencies: information from token 1 had to traverse 99 recurrent steps to influence token 100. Even with LSTM gating, long-range signal degraded.
Attention sidestepped both. Every token attends to every other token in parallel; the path between any two tokens is a single step, however far apart they sit in the sequence; and the entire sequence is processed at once.
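Both RNN limitations can be seen in a toy example. The sketch below is a scalar recurrence (a plain tanh RNN, not an LSTM), and the weights `w_in` and `w_rec` are hand-picked assumptions for illustration, not learned values. The loop shows why step t has to wait for step t-1, and the finite-difference helper estimates how much the first input still affects the last hidden state.

```python
import math

def rnn_states(inputs, w_in=0.5, w_rec=0.9):
    """Toy scalar RNN: h_t = tanh(w_in * x_t + w_rec * h_{t-1}).

    w_in and w_rec are arbitrary illustrative weights (assumptions).
    """
    h = 0.0
    states = []
    for x in inputs:
        # Sequential bottleneck: this step needs h from the previous step.
        h = math.tanh(w_in * x + w_rec * h)
        states.append(h)
    return states

def influence_of_first_token(seq_len, eps=1e-4):
    """Finite-difference estimate of d(h_last) / d(x_first)."""
    base = [1.0] + [0.0] * (seq_len - 1)
    bumped = [1.0 + eps] + [0.0] * (seq_len - 1)
    return (rnn_states(bumped)[-1] - rnn_states(base)[-1]) / eps

if __name__ == "__main__":
    for n in (5, 20, 100):
        print(n, influence_of_first_token(n))
```

In this toy setup the influence of the first input shrinks by a factor of at most `w_rec` per step, so it is far smaller at length 100 than at length 5. Real LSTMs slow this decay with gating, but the signal still has to pass through every intermediate step.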
2. The Attention Idea, in Plain English
At each position, ask: "which other positions in the sequence are most relevant to me right now, and what do they say?" Then combine what those positions say, weighted by their relevance.
For each query position i:
1. Compare query_i against every key_j in the sequence
2. Score each comparison (high if j is relevant to i)
3. Soft-weight (softmax) the scores to get α_{i,1..n}
4. Output_i = Σ_j α_{i,j} · value_j
Three tensors fuel this: queries ("what am I looking for?"), keys ("what do I match against?"), and values ("what do I deliver?"). Lesson 2 derives the exact equation.
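The four steps above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions: it scores with an unscaled dot product (the exact scaled form is left to Lesson 2), it works on nested lists instead of tensors, and the names `attend`, `softmax`, and `dot` are hypothetical helpers made up for this example.

```python
import math

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(queries, keys, values):
    """Return (outputs, weights) for unscaled dot-product attention."""
    outputs, weights = [], []
    dim = len(values[0])
    for q in queries:
        scores = [dot(q, k) for k in keys]   # steps 1-2: compare and score
        alpha = softmax(scores)              # step 3: soft-weight the scores
        # step 4: weighted sum of the value vectors
        out = [sum(a * v[d] for a, v in zip(alpha, values)) for d in range(dim)]
        outputs.append(out)
        weights.append(alpha)
    return outputs, weights

if __name__ == "__main__":
    keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
    queries = [[4.0, -4.0]]  # points along key 0, away from key 1
    outputs, weights = attend(queries, keys, values)
    print(weights[0])
    print(outputs[0])
```

Each row of `weights` sums to 1, and the query here puts most of its weight on the first key, so the output is pulled towards the first value vector.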
3. The Conceptual Lineage
| Year | Paper / model | Contribution |
|---|---|---|
| 2014 | Sutskever et al. — Seq2Seq | Encoder-decoder LSTMs for translation |
| 2015 | Bahdanau et al. — Neural Machine Translation by Jointly Learning to Align and Translate | Soft attention added between encoder and decoder; removed the fixed-vector bottleneck |
| 2015 | Luong et al. | Compared attention scoring functions (dot, general, concat) and global vs. local attention |
| 2016 | Cheng et al. — Long Short-Term Memory-Networks for Machine Reading | Self-attention within a sequence (called "intra-attention") |
| 2017 | Vaswani et al. — Attention Is All You Need | Removed recurrence entirely; pure-attention architecture |
| 2018 | BERT, GPT-1 | Pretrain + fine-tune at scale |
| 2020 | GPT-3 | In-context learning; foundation models |
| 2022-26 | ChatGPT, Claude, GPT-4, LLaMA, Gemma, Qwen | Transformers as the default across most modalities |
Each step pruned what wasn't needed. The 2017 transformer is the cleanest expression: drop the recurrence, keep the attention.
4. Why Attention Won (Three Reasons)
- Parallelism: every position in a sequence is processed simultaneously. An RNN needs O(seq_len) sequential steps per sequence during training; a transformer layer needs O(1). GPUs saturate, and wall-clock training time on the same data can drop several-fold, depending on hardware and sequence length.
- Distance-agnostic: token 1 and token 100 are one attention step apart, versus 99 recurrent steps in an RNN. Long-range patterns are easier to learn.
- Composability: the same attention block, stacked many times, scales smoothly. Parameter count grows linearly with depth, so doubling the number of layers roughly doubles the parameters in the stack. Training stays stable up to hundreds of layers (with normalisation tricks). Empirically, transformers have shown some of the cleanest "more parameters → better" scaling behaviour of any architecture family.
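The linear relationship between depth and parameter count can be checked with a rough count. The sketch below assumes a standard block layout that this lesson has not yet introduced: four d_model × d_model attention projections (query, key, value, output) and a two-layer feed-forward network with hidden width `ffn_mult * d_model`. Biases, layer-norm parameters, and embeddings are ignored, and `block_params` and `stack_params` are hypothetical helper names.

```python
def block_params(d_model, ffn_mult=4):
    """Approximate weight count for one transformer block (assumed layout)."""
    attention = 4 * d_model * d_model                 # Q, K, V, output projections
    feed_forward = 2 * d_model * (ffn_mult * d_model)  # up- and down-projection
    return attention + feed_forward

def stack_params(n_layers, d_model, ffn_mult=4):
    """Blocks are identical, so the stack grows linearly with depth."""
    return n_layers * block_params(d_model, ffn_mult)

if __name__ == "__main__":
    for layers in (6, 12, 24):
        print(layers, stack_params(layers, d_model=512))
```

With the default `ffn_mult=4`, one block comes to 12 × d_model² weights, so depth only multiplies that figure. Width is different: doubling `d_model` quadruples the per-block count.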
5. What Attention Costs
6. The Models You'll Recognise
| Model | Year | What it gets right |
|---|---|---|
| Transformer (encoder-decoder) | 2017 | The original; translation |
| BERT | 2018 | Encoder-only; bidirectional context; MLM pretraining |
| GPT-1/2/3/4 | 2018-23 | Decoder-only; causal LM; in-context learning |
| T5 | 2019 | "Everything is text-to-text" |
| ViT (Vision Transformer) | 2020 | Transformers on image patches; ended the unchallenged dominance of CNNs in vision |
| CLIP | 2021 | Image-text contrastive; foundation model for retrieval |
| LLaMA / Mistral / Qwen / Gemma | 2023-26 | Open-weight decoder-only LLMs |
| AlphaFold 2 | 2021 | Protein structure; transformer + structural priors |
| Whisper | 2022 | Encoder-decoder for speech-to-text |
| Stable Diffusion / DALL-E | 2022+ | Transformer-conditioned diffusion for image generation |
Six application areas (text, vision, speech, biology, multimodal, generation), one architecture. That's the second-order significance of "Attention Is All You Need": it was not just an NLP paper but the start of an architectural standard.
7. The Course Map
- Section 1 (this one): the attention mechanism — math, intuition, hands-on implementation.
- Section 2: the full transformer architecture — positional encoding, multi-head, layer norm, residuals.
- Section 3: BERT and the encoder-only family.
- Section 4: GPT and the decoder-only family; scaling laws; RLHF.
- Section 5: modern variants — vision transformers, Flash Attention, MoE.
- Section 6: build a transformer end-to-end and train a mini-GPT from scratch as the capstone.
8. Prerequisites Check
This course assumes you've finished the Deep Learning with PyTorch course (or equivalent). You should be comfortable with:
- Tensors, autograd, the standard PyTorch training loop.
- Linear layers, activations, layer normalisation.
- Embeddings, dropout, the Adam family of optimisers.
- RNN / LSTM at a conceptual level (you don't need to remember the gating equations; you do need to remember "they processed sequences one step at a time").
If any of those feel shaky, the Deep Learning with PyTorch course (especially Sections 1-2) is the refresher to take first. Otherwise, Lesson 2 begins the deep dive into scaled dot-product attention.