AIMaks

The Attention Revolution

In 2017, eight researchers at Google published "Attention Is All You Need", and the direction of deep learning changed. Within five years, most leading models in NLP, vision, speech, and computational biology (including protein structure prediction) shared the same architectural backbone: the transformer. This course is a practitioner's deep dive into how attention works, why it beat what came before, and how to build a transformer yourself. This first lesson is the historical and conceptual setup: understanding why attention mattered makes the rest of the course click.

1. The Problem Attention Solved

Before 2017, the dominant sequence models were recurrent: RNNs, LSTMs, and GRUs. Two structural problems limited them:

  • Sequential bottleneck: token n could only be processed after token n-1. With no parallelism within a sequence, GPUs were badly underused during training.
  • Long-range dependencies: information from token 1 had to traverse 99 recurrent steps to influence token 100. Even with LSTM gating, long-range signal degraded.

Attention sidestepped both. Every token attends to every other token in parallel; the "distance" between any two tokens is a single step regardless of position; entire sequences are processed at once.
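The two problems can be seen in a small sketch. The recurrence below is a toy scalar rule with made-up weights (`w_h`, `w_x`) and no gating, so it is an illustration under those assumptions rather than a real LSTM; `first_token_influence` is a hypothetical helper that nudges only the first input and measures how far the final state moves.

```python
import math

def rnn_states(inputs, w_h=0.5, w_x=0.5):
    """Toy scalar recurrence: h_t = tanh(w_h * h_{t-1} + w_x * x_t)."""
    h, states = 0.0, []
    for x in inputs:
        # Step t cannot start until step t-1 has produced h.
        h = math.tanh(w_h * h + w_x * x)
        states.append(h)
    return states

def pairwise_scores(inputs):
    """All-pairs scores: entry (i, j) depends only on inputs i and j,
    so every entry could be computed independently of the others."""
    return [[xi * xj for xj in inputs] for xi in inputs]

def first_token_influence(length, eps=1e-3):
    """Change in the final state when only the first input is nudged by eps."""
    base = [0.5] * length
    bumped = [0.5 + eps] + [0.5] * (length - 1)
    return abs(rnn_states(bumped)[-1] - rnn_states(base)[-1])

tokens = [0.2, -1.0, 0.7, 0.4]
states = rnn_states(tokens)        # built one step at a time
scores = pairwise_scores(tokens)   # no ordering constraint between entries

short_range = first_token_influence(5)
long_range = first_token_influence(50)
```

With these toy weights, `long_range` comes out far smaller than `short_range`: the first token's effect shrinks at every recurrent step. Gated cells reduce this decay but do not remove the step-by-step dependency in the loop.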

2. The Attention Idea, in Plain English

At each position, ask: "which other positions in the sequence are most relevant to me right now, and what do they say?" Then combine what those positions say, weighted by their relevance.

For each query position i:
  1. Compare query_i against every key_j in the sequence
  2. Score each comparison (high if j is relevant to i)
  3. Soft-weight (softmax) the scores to get α_{i,1..n}
  4. Output_i = Σ_j α_{i,j} · value_j

Three tensors fuel this: queries ("what am I looking for?"), keys ("what do I match against?"), and values ("what do I deliver?"). Lesson 2 derives the exact equation.
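The four steps can be tried out in plain Python. This is a minimal sketch, not the exact equation that Lesson 2 derives: it assumes a plain dot product as the comparison score, applies no scaling, and uses made-up toy vectors. The names `attend`, `softmax`, and `dot` are illustrative.

```python
import math

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    """Turn raw scores into non-negative weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(queries, keys, values):
    """Return one output vector per query position, plus the weights used."""
    outputs, all_weights = [], []
    dim = len(values[0])
    for q in queries:
        # Steps 1-2: compare this query against every key and score it.
        scores = [dot(q, k) for k in keys]
        # Step 3: softmax the scores into weights alpha_{i,1..n}.
        weights = softmax(scores)
        # Step 4: output_i is the weighted sum of the value vectors.
        out = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# Toy sequence: 3 positions, 2-dimensional vectors (made-up numbers).
queries = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]

outputs, weights = attend(queries, keys, values)
```

Each row of `weights` sums to 1, and the first query puts equal weight on positions 0 and 2 because their keys score the same against it. The outer loop over queries is only for readability; every iteration is independent of the others, which is what lets real implementations compute all positions at once.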

3. The Conceptual Lineage

| Year | Paper / model | Contribution |
|---|---|---|
| 2014 | Sutskever et al., Seq2Seq | Encoder-decoder LSTMs for translation |
| 2015 | Bahdanau et al., "Neural Machine Translation by Jointly Learning to Align and Translate" | Soft attention added between encoder and decoder; removed the fixed-vector bottleneck |
| 2015 | Luong et al. | Variants of attention scoring (dot-product, multiplicative, additive) |
| 2016 | Cheng et al., "Long Short-Term Memory-Networks for Machine Reading" | Self-attention within a sequence (called "intra-attention") |
| 2017 | Vaswani et al., "Attention Is All You Need" | Removed recurrence entirely; pure-attention architecture |
| 2018 | BERT, GPT-1 | Pretrain + fine-tune at scale |
| 2020 | GPT-3 | In-context learning; foundation models |
| 2022-26 | ChatGPT, Claude, GPT-4, LLaMA, Gemma, Qwen | Transformers as the default for every modality |

Each step up to 2017 moved more of the work onto attention and left less for the recurrence. The 2017 transformer is the cleanest expression: drop the recurrence, keep the attention.

4. Why Attention Won (Three Reasons)

  1. Parallelism: every position is processed simultaneously. Along the sequence axis, RNN training needs O(seq_len) sequential steps per layer; transformer training needs O(1). GPUs stay saturated, and wall-clock training time on the same data drops by roughly 5-10×.
  2. Distance-agnostic: token 1 and token 100 are one attention step apart; in an RNN they are 99 recurrent steps apart. Long-range patterns are easier to learn.
  3. Composability: the same attention block, stacked many times, scales smoothly. Parameter count grows linearly with depth, so doubling the depth roughly doubles the parameters in the stack. Training stays stable up to hundreds of layers (with normalisation tricks). Empirically, transformers exhibit some of the cleanest "more parameters → better" scaling behaviour of any architecture family.
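The arithmetic behind these three reasons can be checked with a short sketch. The function names are illustrative, and the parameter count rests on stated assumptions: each block holds four square attention projections plus a two-matrix feed-forward layer of width `d_ff = 4 * d_model` (a common choice, assumed here), with biases, normalisation weights, and embeddings ignored.

```python
def sequential_steps(seq_len, architecture):
    """Dependent steps along the sequence axis for one layer."""
    if architecture == "rnn":
        return seq_len   # each token waits for the previous one
    if architecture == "attention":
        return 1         # all positions are computed together
    raise ValueError(f"unknown architecture: {architecture}")

def path_length(i, j, architecture):
    """Steps a signal needs to travel from position i to position j."""
    if architecture == "rnn":
        return abs(i - j)
    if architecture == "attention":
        return 0 if i == j else 1
    raise ValueError(f"unknown architecture: {architecture}")

def block_params(d_model, d_ff=None):
    """Approximate weight count of one block (assumptions in the lead-in)."""
    if d_ff is None:
        d_ff = 4 * d_model                   # assumed feed-forward width
    attention = 4 * d_model * d_model        # query, key, value, output projections
    feed_forward = 2 * d_model * d_ff        # up-projection and down-projection
    return attention + feed_forward

def stack_params(depth, d_model):
    """Weights in a stack of identical blocks: linear in depth."""
    return depth * block_params(d_model)

rnn_hops = path_length(1, 100, "rnn")              # 99
attention_hops = path_length(1, 100, "attention")  # 1
params_12 = stack_params(12, 512)
params_24 = stack_params(24, 512)                  # exactly twice params_12
```

Under these assumptions one block has `12 * d_model**2` weights, so depth scales the stack linearly while width scales it quadratically.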

5. What Attention Costs

6. The Models You'll Recognise

| Model | Year | What it gets right |
|---|---|---|
| Transformer (encoder-decoder) | 2017 | The original; translation |
| BERT | 2018 | Encoder-only; bidirectional context; MLM pretraining |
| GPT-1/2/3/4 | 2018-23 | Decoder-only; causal LM; in-context learning |
| T5 | 2019 | "Everything is text-to-text" |
| ViT (Vision Transformer) | 2020 | Transformers on image patches; challenged CNN dominance in vision |
| CLIP | 2021 | Image-text contrastive; foundation model for retrieval |
| LLaMA / Mistral / Qwen / Gemma | 2023-26 | Open-weight decoder-only LLMs |
| AlphaFold 2 | 2021 | Protein structure; transformer + structural priors |
| Whisper | 2022 | Encoder-decoder for speech-to-text |
| Stable Diffusion / DALL-E | 2022+ | Transformer-conditioned diffusion for image generation |

Six application areas (text, vision, speech, biology, multimodal learning, image generation), one architecture. That is the second-order significance of "Attention Is All You Need": not just an NLP paper, but an architectural standard.

7. The Course Map

  1. Section 1 (this one): the attention mechanism — math, intuition, hands-on implementation.
  2. Section 2: the full transformer architecture — positional encoding, multi-head, layer norm, residuals.
  3. Section 3: BERT and the encoder-only family.
  4. Section 4: GPT and the decoder-only family; scaling laws; RLHF.
  5. Section 5: modern variants — vision transformers, Flash Attention, MoE.
  6. Section 6: build a transformer end-to-end and train a mini-GPT from scratch as the capstone.

8. Prerequisites Check

This course assumes you've finished the Deep Learning with PyTorch course (or equivalent). You should be comfortable with:

  • Tensors, autograd, the standard PyTorch training loop.
  • Linear layers, activations, layer normalisation.
  • Embeddings, dropout, the Adam family of optimisers.
  • RNN / LSTM at a conceptual level (you don't need to remember the gating equations; you do need to remember "they processed sequences one step at a time").

If any of those feel shaky, the deep-learning-pytorch course (especially Sections 1-2) is the prerequisite refresher. Otherwise, Lesson 2 begins the deep dive into scaled dot-product attention.

9. The Mental Model
