Transformer Architecture Deep Dive

Scaled Dot-Product Attention

Lesson 2 of 28 · Attention Mechanisms · 35 min read · video

Lesson 1 set up why attention exists. This lesson covers the exact math: scaled dot-product attention as defined in the 2017 paper ("Attention Is All You Need"), with every term motivated. The equation fits on one line and has five elements, but every symbol carries weight, and understanding why each piece is there is what separates using transformers from understanding them.

1. The Equation

code

Attention(Q, K, V) = softmax( Q · Kᵀ / √d_k ) · V

Five elements:

Q ∈ ℝ^{n × d_k} — queries (one per position).
K ∈ ℝ^{n × d_k} — keys (one per position).
V ∈ ℝ^{n × d_v} — values (one per position).
√d_k — scaling factor (a scalar).
softmax — applied row-wise to produce attention weights summing to 1 per query.

Output shape: n × d_v — one vector per query position, the attention-weighted combination of values.

2. Step by Step

code

import math
import torch

# Inputs (shapes):
#   Q : (batch, n, d_k)
#   K : (batch, n, d_k)
#   V : (batch, n, d_v)
d_k = Q.size(-1)

# Step 1: similarity scores via dot product
scores = Q @ K.transpose(-2, -1)              # (batch, n, n)

# Step 2: scale
scores = scores / math.sqrt(d_k)

# Step 3: softmax over the last axis (each row sums to 1)
attention = torch.softmax(scores, dim=-1)     # (batch, n, n)

# Step 4: weighted combination of values
output = attention @ V                        # (batch, n, d_v)

Four lines of PyTorch. Lesson 3 turns this into runnable code; here we focus on why each step exists.
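To trace the four steps by hand before touching tensors, here is a minimal sketch in plain Python: nested lists instead of tensors, no batch dimension, no PyTorch. The function name `attention` and the toy values are illustrative assumptions, not the implementation Lesson 3 builds.

```python
import math

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention on nested lists (no batch dimension).

    Q: n x d_k, K: n x d_k, V: n x d_v. Returns (output, weights).
    """
    d_k = len(Q[0])
    d_v = len(V[0])
    # Steps 1 and 2: dot-product scores, divided by sqrt(d_k)
    scores = [[sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
              for q in Q]
    # Step 3: row-wise softmax
    weights = [softmax(row) for row in scores]
    # Step 4: each output row is a weighted sum of the rows of V
    output = [[sum(w * v[c] for w, v in zip(w_row, V)) for c in range(d_v)]
              for w_row in weights]
    return output, weights

# Toy example: n = 3 positions, d_k = 2, d_v = 2 (values chosen arbitrarily)
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

output, weights = attention(Q, K, V)
for row in weights:
    print([round(w, 3) for w in row])
```

Each printed row is one query's distribution over the three positions. The first query, [1, 0], scores keys 1 and 3 equally, so those two weights come out identical.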

3. Why Q, K, V — Three Tensors, Not One?

In self-attention, all three derive from the same input X by linear projections:

code

Q = X · W_Q     # what each position wants
K = X · W_K     # what each position offers as a "key"
V = X · W_V     # what each position contributes if attended to

Three different projections of the same source let the network place queries, keys, and values in different subspaces. If the same vector served as both query and key, the score matrix Q · Kᵀ would be symmetric: position i would score j exactly as j scores i. Separate W_Q and W_K break that symmetry, and a separate W_V decouples what a position is matched on from what it contributes once attended to.
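The symmetry argument can be checked numerically. The sketch below uses small made-up matrices (`W_shared`, `W_Q`, `W_K` are illustrative values, not learned weights) and compares the pre-softmax score matrix with one shared projection against separate projections.

```python
def matmul(A, B):
    """Multiply two matrices given as nested lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def is_symmetric(A, tol=1e-12):
    n = len(A)
    return all(abs(A[i][j] - A[j][i]) <= tol for i in range(n) for j in range(n))

# Toy input: 3 positions, model width 2 (arbitrary values)
X = [[1.0, 2.0], [0.0, 1.0], [3.0, -1.0]]

# Case 1: one shared projection used for both queries and keys
W_shared = [[1.0, 0.5], [-0.5, 2.0]]
P = matmul(X, W_shared)
scores_shared = matmul(P, transpose(P))          # (X W)(X W)^T

# Case 2: separate query and key projections
W_Q = [[1.0, 0.5], [-0.5, 2.0]]
W_K = [[0.0, 1.0], [2.0, 1.0]]
scores_separate = matmul(matmul(X, W_Q), transpose(matmul(X, W_K)))

print(is_symmetric(scores_shared))    # True
print(is_symmetric(scores_separate))  # False
```

The check is on the scores before softmax. Row-wise softmax can already make the final weights differ between (i, j) and (j, i), but with a shared projection the underlying scores cannot.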

4. The Dot Product as Similarity

q · k is large when q and k point in similar directions and have large norms, near zero when they are orthogonal, and negative when they point in opposite directions. It acts as an unnormalized similarity score. After the softmax, a query that matches certain keys puts most of its attention weight on those keys' values.

Geometric intuition: the keys live in some embedding space. Each query is a probe into that space; the dot product ranks which keys it most aligns with; the softmax turns the scores into a probability distribution.
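A small numeric illustration of the ranking, assuming one query and three hand-picked keys (the labels and values are made up). Scaling by √d_k is left out here to isolate the effect of the dot product itself.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

query = [1.0, 0.0]
keys = {
    "aligned":    [2.0, 0.0],    # same direction as the query
    "orthogonal": [0.0, 2.0],    # perpendicular to the query
    "opposite":   [-2.0, 0.0],   # opposite direction
}

scores = {name: dot(query, k) for name, k in keys.items()}
weights = dict(zip(scores, softmax(list(scores.values()))))

for name in keys:
    print(f"{name:>10}: score = {scores[name]:+.1f}, weight = {weights[name]:.3f}")
```

The scores are +2, 0 and -2, and the softmax weights come out at roughly 0.867, 0.117 and 0.016: the aligned key dominates, but the others keep a nonzero share.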

5. The √d_k Scaling — Why?

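This is one of the most common questions about the 2017 paper. The answer:

Suppose the components of q and k are independent with mean 0 and variance 1, for example ~𝒩(0, 1). Then q · k = Σ q_i · k_i is a sum of d_k independent zero-mean terms, each with variance 1. Variances of independent terms add, so Var(q · k) = d_k and std(q · k) = √d_k. (The Central Limit Theorem adds that the sum is approximately Gaussian for large d_k; the variance itself follows from independence alone.)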
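Spelled out term by term, under the assumption that all components of q and k are mutually independent with mean 0 and variance 1:

```latex
\begin{aligned}
\mathbb{E}[q_i k_i] &= \mathbb{E}[q_i]\,\mathbb{E}[k_i] = 0, \\
\operatorname{Var}(q_i k_i) &= \mathbb{E}[q_i^2]\,\mathbb{E}[k_i^2] - \bigl(\mathbb{E}[q_i k_i]\bigr)^2 = 1 \cdot 1 - 0 = 1, \\
\operatorname{Var}(q \cdot k) &= \sum_{i=1}^{d_k} \operatorname{Var}(q_i k_i) = d_k, \\
\operatorname{Var}\!\left(\frac{q \cdot k}{\sqrt{d_k}}\right) &= \frac{d_k}{d_k} = 1 .
\end{aligned}
```

The third line uses independence across i, so that the variance of the sum is the sum of the variances. For trained networks the assumption holds only approximately, which is why the scaling should be read as keeping scores at roughly unit variance.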

Without scaling, the typical magnitude of the scores grows like √d_k. As scores grow, softmax saturates: the largest score gets nearly all the weight and the others get ~0. Gradients through a saturated softmax are vanishingly small, so training slows or stalls.

Dividing by √d_k keeps the variance roughly 1 regardless of d_k, keeping softmax in the well-behaved regime.
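The variance claim is easy to check empirically. The sketch below is a simulation under the stated assumption of independent 𝒩(0, 1) components; the sample size, seed, and d_k = 64 are arbitrary choices. It estimates the variance of q · k with and without scaling, then compares how peaked the softmax is on the same eight scores.

```python
import math
import random
import statistics

def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def sample_dot(d_k):
    """Dot product of two random vectors with independent N(0, 1) components."""
    return sum(random.gauss(0, 1) * random.gauss(0, 1) for _ in range(d_k))

random.seed(0)
d_k = 64
dots = [sample_dot(d_k) for _ in range(4000)]

var_raw = statistics.pvariance(dots)
var_scaled = statistics.pvariance([x / math.sqrt(d_k) for x in dots])
print(f"variance unscaled: {var_raw:.1f} (theory: {d_k})")
print(f"variance scaled:   {var_scaled:.2f} (theory: 1)")

# Effect on softmax: the same 8 scores, with and without the 1/sqrt(d_k) factor
raw_scores = dots[:8]
peak_raw = max(softmax(raw_scores))
peak_scaled = max(softmax([s / math.sqrt(d_k) for s in raw_scores]))
print(f"largest weight unscaled: {peak_raw:.3f}, scaled: {peak_scaled:.3f}")
```

The largest softmax weight can only shrink when all scores are divided by a constant greater than 1, so the unscaled peak is always at least as large as the scaled one; how close it gets to 1 depends on the sampled scores.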

6. The Attention Matrix

The intermediate softmax(QKᵀ / √d_k) is the attention matrix — an (n × n) matrix where entry (i, j) is "how much position i attends to position j". Rows sum to 1.

Property	Implication
Row-stochastic	Each row is a probability distribution
Asymmetric in general	i attending to j ≠ j attending to i (different Q and K)
Inspectable	You can plot the matrix and see which tokens attend to which
O(n²) in memory	Source of the long-context cost

Visualizing attention matrices is a common first step for interpreting what a trained transformer is doing. In Lesson 3 you'll build one and plot it.
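Before plotting anything, the properties in the table can be checked directly on a matrix. The weights below are hypothetical, made up for illustration and not taken from a trained model; the token labels are likewise placeholders.

```python
tokens = ["the", "cat", "sat"]

# Hypothetical attention weights (illustration only).
# Entry A[i][j] is how much token i attends to token j.
A = [
    [0.70, 0.20, 0.10],
    [0.25, 0.60, 0.15],
    [0.10, 0.55, 0.35],
]
n = len(A)

# Row-stochastic: every row is a probability distribution
row_sums = [sum(row) for row in A]

# Asymmetric in general: A[i][j] need not equal A[j][i]
asymmetric = any(A[i][j] != A[j][i] for i in range(n) for j in range(i + 1, n))

# Inspectable: for each token, find the token it attends to most
strongest = {tokens[i]: tokens[max(range(n), key=lambda j: A[i][j])]
             for i in range(n)}

# Crude text rendering of the matrix, one row per query token
for tok, row in zip(tokens, A):
    print(f"{tok:>4} | " + " ".join(f"{w:.2f}" for w in row))
print(strongest)
```

In this made-up matrix "sat" attends most to "cat", while "cat" attends most to itself, which is the kind of asymmetry the table describes.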

7. Masking

Two cases where some attention weights must be forced to 0:

Causal masking — for autoregressive models (GPT), token i can only attend to tokens 1..i, not the future.
Padding masking — when batches mix sequences of different lengths, the padding tokens shouldn't influence anything.

Implementation: set the relevant scores to -∞ (or a very large negative number) before the softmax. exp(-∞) = 0, so the softmax assigns these positions zero attention weight while still summing to 1 over the unmasked positions.

code

import math
import torch

scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)
# mask: 1 where attention is allowed, 0 where it is blocked
scores = scores.masked_fill(mask == 0, float("-inf"))
attention = torch.softmax(scores, dim=-1)
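To see the effect of a causal mask on actual numbers, here is a plain-Python sketch with lists standing in for tensors. The scores are made-up values, and the list comprehension plays the role of `masked_fill`.

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

n = 4
# Made-up pre-softmax scores (already scaled); any finite values would do
scores = [[0.5 * (i + 1) - 0.25 * j for j in range(n)] for i in range(n)]

# Causal mask: 1 where position i may attend to position j (j <= i), else 0
mask = [[1 if j <= i else 0 for j in range(n)] for i in range(n)]

# Plain-list equivalent of masked_fill(mask == 0, -inf)
masked = [[s if keep else float("-inf") for s, keep in zip(s_row, m_row)]
          for s_row, m_row in zip(scores, mask)]

attention = [softmax(row) for row in masked]
for row in attention:
    print([round(w, 3) for w in row])
```

The result is lower-triangular: the first position attends only to itself, and every row still sums to 1. One caveat applies to the PyTorch version as well: a row in which every position is masked has no finite score left, and the softmax then produces NaN, so padding masks need at least one unmasked key per query.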

8. Cross-Attention vs Self-Attention

Same equation, different sources for Q vs K, V:

Variant	Q from	K, V from	Use case
Self-attention	X	X (same)	Encoder; decoder self-attn within target
Cross-attention	Decoder hidden state	Encoder hidden state	Decoder attending to source (translation, conditioning)
Causal self-attention	X	X with causal mask	Decoder; autoregressive generation

Lesson 4 compares self-attention and cross-attention in depth.
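One practical consequence is that in cross-attention the number of queries need not match the number of keys and values, so the attention matrix is rectangular. A minimal plain-Python sketch, with made-up hidden states and the W_Q, W_K, W_V projections omitted (treated as identity) to keep the focus on where Q, K and V come from:

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention on nested lists. Returns (output, weights)."""
    d_k = len(Q[0])
    scores = [[sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
              for q in Q]
    weights = [softmax(row) for row in scores]
    output = [[sum(w * v[c] for w, v in zip(w_row, V)) for c in range(len(V[0]))]
              for w_row in weights]
    return output, weights

# Made-up hidden states: 2 decoder positions, 3 encoder positions, width 2
decoder_states = [[1.0, 0.0], [0.0, 1.0]]
encoder_states = [[1.0, 1.0], [2.0, 0.0], [0.0, 2.0]]

# Self-attention: Q, K, V all come from the same sequence
self_out, self_w = attention(decoder_states, decoder_states, decoder_states)

# Cross-attention: Q from the decoder, K and V from the encoder
cross_out, cross_w = attention(decoder_states, encoder_states, encoder_states)

print(len(self_w), "x", len(self_w[0]))    # 2 x 2
print(len(cross_w), "x", len(cross_w[0]))  # 2 x 3
```

The output always has one row per query, so both calls return two output vectors; only the width of the attention matrix changes.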

9. Cost Accounting

For sequence length n and head dimension d:

Q · Kᵀ: O(n² · d) FLOPs, O(n²) memory.
softmax: O(n²) FLOPs, O(n²) memory.
attention · V: O(n² · d) FLOPs, O(n · d) memory for the output.

Memory grows as n², and it is usually the first resource to run out at long contexts. FlashAttention (Lesson 22) addresses this by computing attention in tiles without materialising the full n × n matrix.
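To put numbers on the quadratic term, here is a back-of-the-envelope sketch. The helper `attention_cost` is hypothetical, and it assumes 2 FLOPs per multiply-add, 2-byte elements (for example fp16), one head, one sequence, and a fully materialised attention matrix; softmax FLOPs are ignored.

```python
def attention_cost(n, d, bytes_per_element=2):
    """Rough per-head, per-sequence cost of naive attention.

    Assumes 2 FLOPs per multiply-add; softmax FLOPs are ignored.
    """
    flops_qk = 2 * n * n * d                     # Q @ K^T
    flops_av = 2 * n * n * d                     # attention @ V
    matrix_bytes = n * n * bytes_per_element     # the n x n attention matrix
    output_bytes = n * d * bytes_per_element     # the n x d output
    return {
        "flops": flops_qk + flops_av,
        "matrix_bytes": matrix_bytes,
        "output_bytes": output_bytes,
    }

for n in (1024, 8192, 32768):
    c = attention_cost(n, d=64)
    print(f"n = {n:>6}: attention matrix = {c['matrix_bytes'] / 2**20:>6.0f} MiB, "
          f"output = {c['output_bytes'] / 2**20:.3f} MiB")
```

Under these assumptions the attention matrix takes 2 MiB at n = 1024, 128 MiB at n = 8192 and 2048 MiB at n = 32768, per head and per sequence, while the output stays at 4 MiB or less. Multiplying by the number of heads and the batch size shows why the n × n matrix is the part worth avoiding.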

10. The Mental Model
