
Self-Attention in Transformers

A detailed walkthrough of scaled dot-product self-attention, including Q/K/V projections, score scaling, softmax weights, and multi-head intuition.

Advanced · 120 s · 8 frames
[Animation: "Self-Attention Mechanism: How transformers understand context". Example sentence: "The model learns context", shown as the tokens The / model / learns / context, with the strongest attention link highlighted.]
Goal: contextualize each token using all other tokens.
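As a rough illustration of that goal, here is a minimal single-head sketch of scaled dot-product self-attention over the four example tokens. It uses only the Python standard library so the arithmetic stays visible. The embeddings and projection matrices are random placeholders rather than learned parameters, and the names `self_attention`, `matmul`, `rand_matrix`, `d_model`, and `d_k` are illustrative choices, not taken from the walkthrough.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(s - peak) for s in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention.

    x is n_tokens x d_model; w_q, w_k, w_v are d_model x d_k projections.
    Returns (output, weights), where weights[i][j] is how much token i
    attends to token j.
    """
    q = matmul(x, w_q)  # queries
    k = matmul(x, w_k)  # keys
    v = matmul(x, w_v)  # values
    d_k = len(q[0])
    k_t = [list(col) for col in zip(*k)]  # transpose of K
    # Dot-product scores, scaled by sqrt(d_k) before the softmax.
    scores = [[s / math.sqrt(d_k) for s in row] for row in matmul(q, k_t)]
    weights = [softmax(row) for row in scores]  # each row sums to 1
    return matmul(weights, v), weights


def rand_matrix(rows, cols):
    """Random Gaussian matrix; a stand-in for learned values (assumption)."""
    return [[random.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


random.seed(0)
tokens = ["The", "model", "learns", "context"]
d_model, d_k = 8, 4  # toy sizes chosen for readability

x = rand_matrix(len(tokens), d_model)  # placeholder token embeddings
w_q, w_k, w_v = (rand_matrix(d_model, d_k) for _ in range(3))

output, weights = self_attention(x, w_q, w_k, w_v)
for token, row in zip(tokens, weights):
    print(f"{token:>8}", [round(w, 3) for w in row])
```

Each printed row is one token's attention distribution over the whole sentence, itself included, and the matching row of `output` is that token's context-aware representation. With random weights the pattern is meaningless. In a trained model the projections are learned so that the largest weights land on the most relevant tokens.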

Why Self-Attention?