Self-Attention Mechanism Explained Visually
What is Self-Attention?
Self-attention is the mechanism that lets each token in a sequence look at every other token to decide what's relevant. Every token is projected into three roles: a Query (what am I looking for?), a Key (what do I contain?), and a Value (what information do I carry?). Attention scores are computed by comparing each query against all keys, then used to create a weighted sum of values. Multi-head attention runs this process in parallel with different learned projections, allowing the model to attend to different types of relationships simultaneously.
Why Attention?
Before transformers, language models read text sequentially — one word at a time, left to right. Long-range connections faded. Consider this sentence:
"The trophy wouldn't fit in the bag because it was too big."
What does "it" refer to — the trophy or the bag? A sequential model has already moved on by the time it processes "it". Attention solves this by letting every token look at every other token simultaneously and decide what matters.
Three superpowers
Long-range dependencies — Attention can connect "it" to "trophy" regardless of how many words lie between them. Distance stops being a barrier.
Context-sensitive meaning — "bank" means something different near "river" than near "money". Attention lets a token's representation shift based on what surrounds it.
Parallelism — Unlike RNNs that process tokens one at a time, attention computes all token relationships at once. Every modern GPU is built to love this kind of work.
Why it matters in practice
Attention is the reason transformer models outperform everything that came before them. It powers:
- Translation — connecting words in one language to their counterparts across a sentence
- Summarization — identifying which sentences are most relevant to the main idea
- Code completion — linking a variable declaration to where it's used 50 lines later
- Retrieval — matching a question to the most relevant passage in a document
Attention is the mechanism that made transformers possible. Every modern LLM — GPT, Gemini, Claude — is built on this. You're about to learn exactly how it works, step by step.
Q, K, V: Three Roles
Every token generates three vectors from its embedding. These aren't arbitrary — each has a specific job in the attention mechanism.
The three vectors
Query (Q) — "What am I looking for?" When processing "sat", the query encodes something like: I'm a verb, I need a subject.
Key (K) — "What do I contain?" Each token broadcasts what kind of thing it is. "cat" has a key that says: I'm a noun, I'm an agent.
Value (V) — "What information do I carry?" If I get attention, this is what I contribute to the output.
The library analogy
Walk into a library with a query — you're searching for books on machine learning. Every book has a key on its spine (title, author, subject). You compare your query against every key to find matches. When you find a match, you read the book's value (its actual content).
Attention works the same way. Every token in a sentence simultaneously plays searcher (Q) and searchee (K, V).
Where W_Q, W_K, W_V come from
These weight matrices are learned during training — they start random and get updated by gradient descent over billions of examples. The model discovers, on its own, what Q, K, V representations make useful attention patterns.
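A minimal numpy sketch of these projections. All dimensions, weights, and inputs below are made up for illustration; in a real model X comes from the embedding table and the W matrices are trained.

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, d_k = 8, 4   # embedding size and head size (illustrative, not real-model values)
n_tokens = 6          # e.g. "the cat sat on the mat"

# Token embeddings (in a real model these come from the embedding table)
X = rng.normal(size=(n_tokens, d_model))

# W_Q, W_K, W_V start random and are updated by gradient descent during training
W_Q = rng.normal(size=(d_model, d_k))
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_k))

# Every token produces three distinct vectors from the same embedding
Q = X @ W_Q   # "what am I looking for?"
K = X @ W_K   # "what do I contain?"
V = X @ W_V   # "what information do I carry?"

print(Q.shape, K.shape, V.shape)  # (6, 4) (6, 4) (6, 4)
```

Note that all three come from the same row of X; only the learned projection differs.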
Try clicking different tokens on the right panel to see their Q, K, and V vectors. Notice how each token produces three distinct vectors from the same embedding.
Q, K, V are not intuitive at first — they become clear when you see scores computed in the next step. The "library" analogy only gets you so far; the real insight is that dot products between Q and K measure alignment, which determines how information flows.
Computing Attention Scores
Now the mechanism kicks in. Every Query gets compared against every Key via dot product — a single number that measures how well they align.
score(i, j) = Q_i · K_j
A high dot product means the two vectors point in the same direction — strong alignment, strong attention. A low or negative dot product means weak alignment.
Why divide by √d?
Raw dot products grow with the vector dimension d: each score is a sum of d products, so its typical magnitude scales like √d. In a model with d = 512, raw scores can be very large — which pushes softmax into a regime where the largest value dominates completely and the output becomes almost a hard argmax.
Dividing by √d keeps the scores in a stable range where softmax can blend from multiple sources, not just crown one winner.
Try the ÷√d_k Scaling toggle on the right. Flip it OFF and watch the Scores cells brighten dramatically — then jump to the Softmax stage and see almost all the mass collapse to a single column per row. That collapse is what scaling prevents: when softmax saturates, the gradient through this layer is ≈ 0 and the model can't learn to blend across tokens.
Softmax normalization
After scaling, softmax converts each row of scores into a probability distribution:
weight(i, j) = exp(score_scaled(i,j)) / Σ_k exp(score_scaled(i,k))
Each row sums to exactly 1.0. These weights tell you: when computing the output for token i, how much does each token j contribute?
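A quick sketch of row-wise softmax with toy scores (the numbers are illustrative):

```python
import numpy as np

def softmax(scores):
    # Subtract the row max before exponentiating; this is the standard
    # numerical-stability trick and does not change the result
    z = scores - scores.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

scores = np.array([[2.0, 1.0, 0.1],
                   [0.5, 0.5, 0.5]])
weights = softmax(scores)
print(weights.sum(axis=-1))  # [1. 1.] — each row sums to exactly 1.0
```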
The complete formula
Everything so far combines into one expression:
Attention(Q, K, V) = softmax(QKᵀ / √d) · V
This is the core of every transformer. Q, K, V are matrices — so this is computed for all tokens in parallel.
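The whole formula fits in a few lines of numpy. The shapes and inputs below are toy values for illustration:

```python
import numpy as np

def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V, computed for all tokens at once."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # compare every query to every key
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                            # weighted sum of values

rng = np.random.default_rng(0)
Q = rng.normal(size=(6, 4))
K = rng.normal(size=(6, 4))
V = rng.normal(size=(6, 4))

out = attention(Q, K, V)
print(out.shape)  # (6, 4): one context-aware vector per token
```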
The scores matrix is the heart of attention. Row i shows how token i distributes its attention across the sentence. Column j shows how much attention token j receives from everyone else. Watch the heatmap on the right — "sat" attends most to "cat" because the verb is seeking its subject.
From Scores to Output
The softmax weights tell us how much each token matters. Now we use those weights to blend the Value vectors together.
output_i = Σ_j (weight_ij × V_j)
How context flows
If "sat" attends 60% to "cat" and 30% to "mat" and 10% to everything else, its output vector is approximately:
output["sat"] = 0.60 × V["cat"] + 0.30 × V["mat"] + 0.10 × ...
The verb's output now carries information about its subject ("cat") and the location phrase ("mat"). The model didn't need grammar rules — it learned which tokens matter from data alone.
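Using hypothetical two-dimensional value vectors (the numbers are invented purely to make the arithmetic visible), the blend looks like:

```python
import numpy as np

# Toy value vectors — hypothetical, chosen for easy arithmetic
V = {"cat": np.array([1.0, 0.0]),
     "mat": np.array([0.0, 1.0]),
     "the": np.array([0.5, 0.5])}

# Attention weights for "sat" from the example above
weights = {"cat": 0.60, "mat": 0.30, "the": 0.10}

# output_i = sum_j weight_ij * V_j
output_sat = sum(w * V[tok] for tok, w in weights.items())
print(output_sat)  # [0.65 0.35] — mostly "cat", partly "mat"
```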
Before and after attention
Before attention: every token only knows its own embedding — a static representation from the embedding table, the same regardless of context.
After attention: every token's representation has been updated to carry information from all the tokens it attended to. "it" in "The trophy wouldn't fit because it was too big" now carries information about "trophy", resolving the reference.
What the output feeds into
These output vectors — one per token, same count as input — flow through the rest of the transformer layer:
Each column is one token's vector moving through the layer. Notice how the attention stage has crossing lines — that's the context mixing, where each output draws information from multiple inputs. The residual and feed-forward stages process each token independently (straight vertical lines).
A "32-layer model" means this entire block — attention + residual + feed-forward — is stacked 32 times as one unit. Each repetition is called a "layer" or "transformer block." Layer 1's output becomes layer 2's input, which feeds into layer 3, and so on.
Why not just one block?
A single attention layer can only capture simple, direct relationships — "cat" is near "sat." But language requires multi-hop reasoning: understanding "The trophy wouldn't fit in the bag because it was too big" requires connecting "it" → "trophy" → "big" → "wouldn't fit" across several reasoning steps.
Each layer builds on the previous layer's context-enriched output:
- Early layers (1–5): basic patterns — word pairs, adjacent tokens, simple syntax
- Middle layers (10–20): phrases and clauses — "sat on the mat" as a location, subject-verb agreement
- Deep layers (30+): abstract reasoning — coreference ("it" = "trophy"), logical inference, world knowledge
One layer can't do all of this — just like you can't understand a paragraph by reading one word at a time without building up meaning. More layers = more steps of reasoning.
- Llama-2 7B: 32 blocks stacked
- GPT-3 175B: 96 blocks stacked
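The stacking itself is just function composition. The block below is a stand-in, not a real transformer block — it only demonstrates that each layer's output becomes the next layer's input and that the shape is preserved throughout:

```python
import numpy as np

def transformer_block(x):
    """Placeholder for attention + residual + feed-forward (hypothetical)."""
    return x + 0.1 * np.tanh(x)   # real blocks mix context across tokens here

x = np.ones((6, 8))               # one vector per token
n_layers = 32                     # e.g. Llama-2 7B stacks 32 blocks
for _ in range(n_layers):
    x = transformer_block(x)      # layer i's output is layer i+1's input

print(x.shape)  # (6, 8) — shape is preserved through every layer
```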
The output has the same shape as the input — same number of tokens, same vector dimensions — but each vector is now context-aware. This is the fundamental transformation: static embedding in, context-enriched representation out. Everything downstream benefits from it.
Multi-Head Attention
One attention head captures one type of relationship. But language has many: syntactic dependencies, semantic similarity, positional proximity, coreference. One head can't learn all of them simultaneously — different relationship types require different representations of Q, K, V.
Multi-head attention runs N independent attention computations in parallel, each with its own learned weight matrices W_Q, W_K, W_V.
What each head learns
In practice, different heads specialize in different patterns:
- Positional head — attends to nearby tokens; learns local syntax and phrase structure
- Syntactic head — connects subjects to verbs, nouns to their modifiers
- Semantic head — links words with related meanings regardless of distance
- Copy head — attends to identical or nearly identical tokens; useful for repetition and reference
These labels come from human analysis of trained models — the model doesn't get told what to learn. It discovers useful patterns entirely from the training signal.
How outputs combine
Each head produces an output matrix the same shape as a single-head output. All N heads are concatenated along the feature dimension, then projected through a final linear layer:
MultiHead(Q, K, V) = Concat(head_1, …, head_N) × W_O
The projection W_O lets the model combine patterns from all heads into a single unified representation.
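A sketch in numpy, assuming 2 heads of size 4 over an 8-dimensional model (all shapes and weights are illustrative):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, heads, W_O):
    outs = []
    for W_Q, W_K, W_V in heads:                 # each head has its own projections
        Q, K, V = X @ W_Q, X @ W_K, X @ W_V
        scores = Q @ K.T / np.sqrt(Q.shape[-1])
        outs.append(softmax(scores) @ V)        # one single-head output per head
    # Concatenate along the feature dimension, then project through W_O
    return np.concatenate(outs, axis=-1) @ W_O

rng = np.random.default_rng(0)
d_model, n_heads = 8, 2
d_k = d_model // n_heads
heads = [tuple(rng.normal(size=(d_model, d_k)) for _ in range(3))
         for _ in range(n_heads)]
W_O = rng.normal(size=(n_heads * d_k, d_model))
X = rng.normal(size=(6, d_model))

out = multi_head_attention(X, heads, W_O)
print(out.shape)  # (6, 8) — same shape as the input
```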
Scale in practice
GPT-3 uses 96 heads per layer. Each head independently discovers its own relationship pattern from data alone.
Heads vs layers — don't confuse them
- Multiple heads = parallel within one layer (horizontal). All heads run at the same time on the same input, each looking for a different pattern. Their outputs are concatenated into one result.
- Multiple layers = sequential across the model (vertical). Each layer takes the previous layer's output, runs multi-head attention + feed-forward, and passes the result to the next layer.
GPT-3 has 96 heads × 96 layers = 9,216 independent attention computations per forward pass. Heads work in parallel (width), layers work in sequence (depth).
Multiple heads capture syntax, semantics, position, coreference, and patterns humans haven't named — all in parallel. This is a core reason transformers generalize so well: they're not limited to one inductive bias about what relationships matter.
Causal Masking
Not all transformers read the whole sentence. There are two fundamental architectures:
- BERT — bidirectional. Every token sees every other token (full grid). Designed for understanding: classification, NER, question answering.
- GPT — causal (autoregressive). Each token sees only past tokens (lower triangle). Designed for generation: complete this sentence, predict the next word.
How does GPT enforce this? With a causal mask.
How the mask works
Before softmax, set every score where j > i to −∞:
masked_score(i, j) = score(i, j) if j ≤ i
= −∞ if j > i
After softmax, exp(−∞) = 0 — future tokens get exactly zero weight. They can't contribute to the output.
The lower-triangular pattern
The result is a lower-triangular attention matrix:
- Token 0 sees only itself
- Token 1 sees tokens 0–1
- Token 2 sees tokens 0–2
- Token 5 ("mat") sees tokens 0–5
No token can peek ahead.
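In numpy, the mask is an upper-triangular boolean matrix applied before softmax (toy random scores for illustration):

```python
import numpy as np

n = 6
scores = np.random.default_rng(0).normal(size=(n, n))

# Mask every future position (j > i) with -inf before softmax
mask = np.triu(np.ones((n, n), dtype=bool), k=1)
scores[mask] = -np.inf

# Softmax: exp(-inf) = 0, so future tokens get exactly zero weight
z = scores - scores.max(axis=-1, keepdims=True)
w = np.exp(z)
w /= w.sum(axis=-1, keepdims=True)

print(np.allclose(w.sum(axis=-1), 1.0))   # True — rows still sum to 1
print(np.allclose(np.triu(w, k=1), 0.0))  # True — no token attends to the future
```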
Why this enables generation
When generating "The cat sat on the ___", the model predicts each new token using only what came before. The causal mask guarantees this constraint during training — so at inference time, the model has only ever learned to predict the next token from past context.
On the right, toggle the Causal Mask and click any token to inspect its row. Watch the row sum stay at 1.000 even as masked cells go dark — the unmasked cells grow to compensate.
Every time you chat with GPT or Claude, causal masking is at work. The model generates one token at a time, attending only to what came before. The mask is not a limitation — it's the mechanism that makes autoregressive generation possible.
Further reading
- Jay Alammar — The Illustrated Transformer — visual walkthrough of masked attention in GPT
- Lilian Weng — Attention? Attention! — mathematical formulation of self-attention and masking