Scaled Dot-Product Attention
Lesson 2 of 28 · Transformer Architecture Deep Dive
Lesson 1 set up why attention exists. This lesson is the exact math: scaled dot-product attention as defined in the 2017 paper ("Attention Is All You Need"), with every term motivated. The equation is short, a single line with five elements, but every symbol carries weight, and understanding why each piece is there is what separates using transformers from understanding them.
1. The Equation
Attention(Q, K, V) = softmax( Q · Kᵀ / √d_k ) · V
Five elements:
- Q ∈ ℝ^{n × d_k} — queries (one per position).
- K ∈ ℝ^{n × d_k} — keys (one per position).
- V ∈ ℝ^{n × d_v} — values (one per position).
- √d_k — scaling factor (a scalar).
- softmax — applied row-wise to produce attention weights summing to 1 per query.
Output shape: n × d_v — one vector per query position, the attention-weighted combination of values.
2. Step by Step
# Inputs:
Q : (batch, n, d_k)
K : (batch, n, d_k)
V : (batch, n, d_v)
# Step 1: similarity scores via dot product
scores = Q @ K.transpose(-2, -1) # (batch, n, n)
# Step 2: scale
scores = scores / sqrt(d_k)
# Step 3: softmax over the last axis (each row sums to 1)
attention = softmax(scores, dim=-1) # (batch, n, n)
# Step 4: weighted combination of values
output = attention @ V # (batch, n, d_v)
Four lines of PyTorch-style pseudocode (sqrt and softmax stand in for math.sqrt and torch.softmax). Lesson 3 turns this into runnable code; here we focus on why each step exists.
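For readers who want to check the shapes right away, the four steps can be sketched as a small function. This is a minimal sketch, not the course's Lesson 3 implementation: the function name and the tensor sizes are illustrative assumptions.

```python
import math
import torch

def scaled_dot_product_attention(Q, K, V):
    """Return (output, attention) for Q, K of shape (batch, n, d_k) and V of shape (batch, n, d_v)."""
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1)           # Step 1: (batch, n, n)
    scores = scores / math.sqrt(d_k)           # Step 2: scale
    attention = torch.softmax(scores, dim=-1)  # Step 3: each row sums to 1
    output = attention @ V                     # Step 4: (batch, n, d_v)
    return output, attention

torch.manual_seed(0)
batch, n, d_k, d_v = 2, 5, 16, 8   # illustrative sizes, not from the lesson
Q = torch.randn(batch, n, d_k)
K = torch.randn(batch, n, d_k)
V = torch.randn(batch, n, d_v)

output, attention = scaled_dot_product_attention(Q, K, V)
print(output.shape)     # torch.Size([2, 5, 8])
print(attention.shape)  # torch.Size([2, 5, 5])
```

PyTorch 2.x also ships a fused `torch.nn.functional.scaled_dot_product_attention`; the explicit version above is only meant to keep each step visible.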
3. Why Q, K, V — Three Tensors, Not One?
In self-attention, all three derive from the same input X by linear projections:
Q = X · W_Q # what each position wants
K = X · W_K # what each position offers as a "key"
V = X · W_V # what each position contributes if attended to
Three different projections of the same source give the network the freedom to put queries, keys, and values in different subspaces. If the same vector served as both query and key, the score matrix would be symmetric (q_i · k_j = q_j · k_i), so position i could not score j highly without j scoring i equally highly. Separate W_Q and W_K break that symmetry, and a separate W_V lets what a position contributes differ from what it is matched on.
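The symmetry point can be checked numerically. In the sketch below the projection matrices are random stand-ins for learned weights, and the sizes are illustrative assumptions.

```python
import torch

torch.manual_seed(0)
n, d_model, d_k = 4, 8, 8          # illustrative sizes
X = torch.randn(n, d_model)

# Same vector used as both query and key: scores = X Xᵀ, which is symmetric.
shared_scores = X @ X.T
shared_is_symmetric = torch.allclose(shared_scores, shared_scores.T, atol=1e-5)

# Separate projections (random here; learned in a real model).
W_Q = torch.randn(d_model, d_k)
W_K = torch.randn(d_model, d_k)
Q, K = X @ W_Q, X @ W_K
projected_scores = Q @ K.T
projected_is_symmetric = torch.allclose(projected_scores, projected_scores.T, atol=1e-5)

print(shared_is_symmetric)     # True
print(projected_is_symmetric)  # False
```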
4. The Dot Product as Similarity
q · k is large when q and k point in similar directions and small or negative otherwise; it also grows with the vectors' magnitudes. It is effectively an unnormalized similarity score. After the softmax, a query that "matches" certain keys puts high attention weight on those keys' corresponding values.
Geometric intuition: the keys live in some embedding space. Each query is a probe into that space; the dot product ranks which keys it most aligns with; the softmax turns rankings into a probability distribution.
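A tiny hand-built example makes the ranking concrete. The vectors below are made up for illustration, and the √d_k scaling is left out so the raw dot products stay easy to read.

```python
import torch

query = torch.tensor([1.0, 0.0])
keys = torch.tensor([
    [2.0, 0.0],    # same direction as the query
    [0.0, 2.0],    # orthogonal to the query
    [-2.0, 0.0],   # opposite direction
])

scores = keys @ query                    # dot products: 2, 0, -2
weights = torch.softmax(scores, dim=0)   # roughly 0.87, 0.12, 0.02

print(scores)
print(weights)
```

The aligned key takes most of the weight, the orthogonal key a little, and the opposed key almost none.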
5. The √d_k Scaling — Why?
This is the most-asked question about the 2017 paper. The answer:
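When q and k have independent components ~𝒩(0, 1), the dot product q · k = Σ q_i · k_i is a sum of d_k independent zero-mean terms, each with variance 1. Variances of independent terms add, so Var(q · k) = d_k and std(q · k) = √d_k. (The Central Limit Theorem adds that the sum is approximately normal for large d_k; the variance itself only needs independence.)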
Without scaling, dot products grow with d_k. As they grow, softmax saturates — the largest score gets nearly all the weight, others get ~0. Gradients through softmax in the saturated regime are vanishingly small. Training stalls.
Dividing by √d_k keeps the variance roughly 1 regardless of d_k, keeping softmax in the well-behaved regime.
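The variance argument is easy to verify empirically. The sketch below samples random q and k with standard-normal components (the assumption made above; trained queries and keys need not follow it) and then compares the largest softmax weight with and without scaling.

```python
import torch

torch.manual_seed(0)
num_samples = 10_000

def dot_product_stds(d_k):
    """Empirical std of q·k before and after dividing by sqrt(d_k)."""
    q = torch.randn(num_samples, d_k)
    k = torch.randn(num_samples, d_k)
    dots = (q * k).sum(dim=-1)
    return dots.std().item(), (dots / d_k ** 0.5).std().item()

for d_k in (16, 64, 256):
    raw_std, scaled_std = dot_product_stds(d_k)
    print(f"d_k={d_k:3d}  std(q.k)={raw_std:6.2f}  std(q.k / sqrt(d_k))={scaled_std:5.2f}")

# Saturation: one query against 10 keys at d_k = 256.
d_k = 256
q = torch.randn(d_k)
keys = torch.randn(10, d_k)
scores = keys @ q
max_weight_unscaled = torch.softmax(scores, dim=0).max().item()
max_weight_scaled = torch.softmax(scores / d_k ** 0.5, dim=0).max().item()
print(f"largest weight unscaled: {max_weight_unscaled:.3f}, scaled: {max_weight_scaled:.3f}")
```

The raw standard deviation should track √d_k (about 4, 8, and 16) while the scaled one stays near 1, and the unscaled softmax typically concentrates far more weight on a single key.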
6. The Attention Matrix
The intermediate softmax(QKᵀ / √d_k) is the attention matrix: an (n × n) matrix where entry (i, j) is "how much position i attends to position j". Rows sum to 1.
| Property | Implication |
|---|---|
| Row-stochastic | Each row is a probability distribution |
| Asymmetric in general | i attending to j ≠ j attending to i (different Q and K) |
| Inspectable | You can plot the matrix and see which tokens attend to which |
| O(n²) in memory | Source of the long-context cost |
Visualizing attention matrices is the standard way to interpret what a trained transformer is doing. In Lesson 3 you'll build one and plot it.
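Before plotting, the matrix can be read directly as numbers. The sketch below uses hypothetical tokens and random stand-ins for learned queries and keys, so the pattern it prints is meaningless; it only shows how to read rows and columns.

```python
import torch

torch.manual_seed(0)
tokens = ["the", "cat", "sat", "down"]   # hypothetical tokens
n, d_k = len(tokens), 8
Q = torch.randn(n, d_k)   # random stand-ins for learned queries
K = torch.randn(n, d_k)   # random stand-ins for learned keys

A = torch.softmax(Q @ K.T / d_k ** 0.5, dim=-1)   # (n, n) attention matrix

row_sums = A.sum(dim=-1)   # one per query: always 1
col_sums = A.sum(dim=0)    # one per key: not constrained to 1

for i, tok in enumerate(tokens):
    j = A[i].argmax().item()
    print(f"{tok!r} attends most to {tokens[j]!r} (weight {A[i, j].item():.2f})")
print("row sums:", row_sums)
print("column sums:", col_sums)
```

Row i answers "where does position i look?"; column j answers "how much total attention does position j receive?", which can be well above or below 1.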
7. Masking
Two cases where some attention weights must be forced to 0:
- Causal masking — for autoregressive models (GPT), token i can only attend to tokens 1..i, not the future.
- Padding masking — when batches mix sequences of different lengths, the padding tokens shouldn't influence anything.
Implementation: set the relevant scores to -∞ (or a very large negative number) before the softmax. exp(-∞) = 0, so the softmax assigns these positions zero attention weight while still summing to 1 over the unmasked positions.
scores = Q @ K.transpose(-2, -1) / sqrt(d_k)
scores = scores.masked_fill(mask == 0, float("-inf"))
attention = softmax(scores, dim=-1)
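Both mask types can be built and combined in a few lines. This is a sketch under assumptions: the convention is 1 = may attend, 0 = masked (matching `mask == 0` above), and the batch with a padded final position is hypothetical.

```python
import math
import torch

torch.manual_seed(0)
n, d_k = 4, 8
Q = torch.randn(1, n, d_k)
K = torch.randn(1, n, d_k)

# Causal mask: 1 where the key index is <= the query index.
causal_mask = torch.tril(torch.ones(n, n))              # (n, n)

# Padding mask: hypothetical batch whose last position is padding.
is_real_token = torch.tensor([[1, 1, 1, 0]])            # (batch, n)
padding_mask = is_real_token[:, None, :]                # (batch, 1, n), broadcasts over queries

mask = causal_mask * padding_mask                       # (batch, n, n)

scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)
scores = scores.masked_fill(mask == 0, float("-inf"))
attention = torch.softmax(scores, dim=-1)
print(attention[0])
```

One caveat: if every score in a row is masked, the softmax of an all -∞ row is NaN, so each query needs at least one unmasked key (here position 0 is always available).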
8. Cross-Attention vs Self-Attention
Same equation, different sources for Q vs K, V:
| Variant | Q from | K, V from | Use case |
|---|---|---|---|
| Self-attention | X | X (same) | Encoder; decoder self-attn within target |
| Cross-attention | Decoder hidden state | Encoder hidden state | Decoder attending to source (translation, conditioning) |
| Causal self-attention | X | X with causal mask | Decoder; autoregressive generation |
Lesson 4 compares self-attention and cross-attention in depth.
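The main practical difference in cross-attention is that the two sequences can have different lengths, so the attention matrix is no longer square. A minimal shape sketch, assuming m target positions, n source positions, and random stand-ins for the learned projections (all names and sizes here are illustrative):

```python
import math
import torch

torch.manual_seed(0)
m, n, d_model, d_k, d_v = 3, 6, 16, 8, 8   # 3 target positions, 6 source positions
decoder_states = torch.randn(m, d_model)   # stand-in for decoder hidden states
encoder_states = torch.randn(n, d_model)   # stand-in for encoder hidden states

W_Q = torch.randn(d_model, d_k)
W_K = torch.randn(d_model, d_k)
W_V = torch.randn(d_model, d_v)

Q = decoder_states @ W_Q    # queries come from the decoder: (m, d_k)
K = encoder_states @ W_K    # keys come from the encoder:    (n, d_k)
V = encoder_states @ W_V    # values come from the encoder:  (n, d_v)

attention = torch.softmax(Q @ K.T / math.sqrt(d_k), dim=-1)   # (m, n)
output = attention @ V                                         # (m, d_v)

print(attention.shape)  # torch.Size([3, 6])
print(output.shape)     # torch.Size([3, 8])
```

The output still has one row per query, so its length follows the decoder, not the encoder.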
9. Cost Accounting
For sequence length n and head dimension d:
- Q · Kᵀ: O(n² · d) FLOPs, O(n²) memory.
- softmax: O(n²) FLOPs, O(n²) memory.
- · V: O(n² · d) FLOPs, O(n · d) memory for the output.
Memory grows as n², so it is the first resource to run out at long contexts. Flash Attention (Lesson 22) addresses this by computing attention in tiles without materialising the full n × n matrix; the FLOP count stays O(n² · d).
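To put numbers on the quadratic growth, the arithmetic can be sketched as follows. The helper names are hypothetical, and the defaults are assumptions: 2 bytes per element (fp16/bf16), a single head, batch size 1, d_v equal to d, and one multiply-add counted as 2 FLOPs.

```python
def attention_matrix_bytes(n, bytes_per_element=2, num_heads=1, batch=1):
    """Bytes needed to hold the full (n x n) attention matrix."""
    return batch * num_heads * n * n * bytes_per_element

def attention_matmul_flops(n, d):
    """FLOPs for Q·Kᵀ plus attention·V, counting one multiply-add as 2 FLOPs."""
    return 2 * (2 * n * n * d)

for n in (1_024, 8_192, 65_536):
    mib = attention_matrix_bytes(n) / 2**20
    gflops = attention_matmul_flops(n, d=64) / 1e9
    print(f"n={n:6d}  attention matrix: {mib:6.0f} MiB  matmuls: {gflops:7.1f} GFLOPs")
```

Under these assumptions a single head's matrix takes 2 MiB at n = 1,024, 128 MiB at n = 8,192, and 8 GiB at n = 65,536, before multiplying by heads, layers, and batch size.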