Self-Attention Mechanism Explained Visually

What is Self-Attention?

Self-attention is the mechanism that lets each token in a sequence look at every other token to decide what's relevant. Every token is projected into three roles: a Query (what am I looking for?), a Key (what do I contain?), and a Value (what information do I carry?). Attention scores are computed by comparing each query against all keys, then used to create a weighted sum of values. Multi-head attention runs this process in parallel with different learned projections, allowing the model to attend to different types of relationships simultaneously.

Why Attention?

Before transformers, language models read text sequentially — one word at a time, left to right. Long-range connections faded: by the end of a long sentence, information from its beginning was largely lost. Consider this sentence:

"The trophy wouldn't fit in the bag because it was too big."

What does "it" refer to — the trophy or the bag? A sequential model has already moved on by the time it processes "it". Attention solves this by letting every token look at every other token simultaneously and decide what matters.

Each token (its position in the sentence) → looks at all tokens (simultaneously) → decides what matters (learned weights) → updated representation (a context-aware vector)

Three superpowers

Long-range dependencies — Attention can connect "it" to "trophy" regardless of how many words lie between them. Distance stops being a barrier.

Context-sensitive meaning — "bank" means something different near "river" than near "money". Attention lets a token's representation shift based on what surrounds it.

Parallelism — Unlike RNNs that process tokens one at a time, attention computes all token relationships at once. Every modern GPU is built to love this kind of work.

Why it matters in practice

Attention is the reason transformer models outperform everything that came before them. It powers:

  • Translation — connecting words in one language to their counterparts across a sentence
  • Summarization — identifying which sentences are most relevant to the main idea
  • Code completion — linking a variable declaration to where it's used 50 lines later
  • Retrieval — matching a question to the most relevant passage in a document

Attention is the mechanism that made transformers possible. Every modern LLM — GPT, Gemini, Claude — is built on this. You're about to learn exactly how it works, step by step.

Q, K, V: Three Roles

Every token generates three vectors from its embedding. These aren't arbitrary — each has a specific job in the attention mechanism.

Embedding (token vector):
  • × W_Q → Q (Query)
  • × W_K → K (Key)
  • × W_V → V (Value)

The three vectors

Query (Q) — "What am I looking for?" When processing "sat", the query encodes something like: I'm a verb, I need a subject.

Key (K) — "What do I contain?" Each token broadcasts what kind of thing it is. "cat" has a key that says: I'm a noun, I'm an agent.

Value (V) — "What information do I carry?" If I get attention, this is what I contribute to the output.

The library analogy

Walk into a library with a query — you're searching for books on machine learning. Every book has a key on its spine (title, author, subject). You compare your query against every key to find matches. When you find a match, you read the book's value (its actual content).

Attention works the same way. Every token in a sentence simultaneously plays searcher (Q) and searchee (K, V).

Where W_Q, W_K, W_V come from

These weight matrices are learned during training — they start random and get updated by gradient descent over billions of examples. The model discovers, on its own, what Q, K, V representations make useful attention patterns.
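
Here is a minimal NumPy sketch of that projection step. The sizes, the random weight matrices, and the six-token input are illustrative stand-ins, not values from any trained model:

    import numpy as np

    rng = np.random.default_rng(0)

    d_model, d_head = 8, 4                  # toy sizes, chosen only for illustration
    x = rng.normal(size=(6, d_model))       # six token embeddings ("The cat sat on the mat")

    # Learned projections; random stand-ins here for what training would discover
    W_Q = rng.normal(size=(d_model, d_head))
    W_K = rng.normal(size=(d_model, d_head))
    W_V = rng.normal(size=(d_model, d_head))

    Q = x @ W_Q   # what each token is looking for
    K = x @ W_K   # what each token contains
    V = x @ W_V   # what each token contributes when attended to

    print(Q.shape, K.shape, V.shape)   # (6, 4) each: three distinct vectors per token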

Try clicking different tokens on the right panel to see their Q, K, and V vectors. Notice how each token produces three distinct vectors from the same embedding.

Q, K, V are not intuitive at first — they become clear when you see scores computed in the next step. The "library" analogy only gets you so far; the real insight is that dot products between Q and K measure alignment, which determines how information flows.

Computing Attention Scores

Now the mechanism kicks in. Every Query gets compared against every Key via dot product — a single number that measures how well they align.

score(i, j) = Q_i · K_j

A high dot product means the two vectors point in the same direction — strong alignment, strong attention. A low or negative dot product means weak alignment.

Q_i · K_j (dot product) → raw score (one per pair) → ÷ √d (scaling) → softmax (normalization) → attention weight (each row sums to 1.0)

Why divide by √d?

Raw dot products grow with the vector dimension d. In a model with d = 512, raw scores can be very large — which pushes softmax into a regime where the largest value dominates completely and the output becomes almost a hard argmax.

Dividing by √d keeps the scores in a stable range where softmax can blend from multiple sources, not just crown one winner.
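
You can see the effect directly with random vectors. This sketch, assuming unit-variance components and purely illustrative sizes, shows the spread of raw dot products growing with d while the scaled scores stay put:

    import numpy as np

    rng = np.random.default_rng(0)

    for d in (8, 64, 512):
        q = rng.normal(size=(10_000, d))
        k = rng.normal(size=(10_000, d))
        raw = (q * k).sum(axis=1)        # 10,000 raw dot products at dimension d
        scaled = raw / np.sqrt(d)        # the same scores after the ÷ √d step
        print(d, round(raw.std(), 1), round(scaled.std(), 2))
    # raw spread grows like √d; the scaled spread stays near 1,
    # keeping softmax out of its winner-take-all regime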

Try the ÷√d_k Scaling toggle on the right. Flip it OFF and watch the Scores cells brighten dramatically — then jump to the Softmax stage and see almost all the mass collapse to a single column per row. That collapse is what scaling prevents: when softmax saturates, the gradient through this layer is ≈ 0 and the model can't learn to blend across tokens.

Softmax normalization

After scaling, softmax converts each row of scores into a probability distribution:

weight(i, j) = exp(score_scaled(i,j)) / Σ_k exp(score_scaled(i,k))

Each row sums to exactly 1.0. These weights tell you: when computing the output for token i, how much does each token j contribute?

The complete formula

Everything so far combines into one expression:

Attention(Q, K, V) = softmax(QKᵀ / √d) · V

This is the core of every transformer. Q, K, V are matrices — so this is computed for all tokens in parallel.
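
As a sketch, the whole formula fits in a few lines of NumPy. The toy sizes and random Q, K, V are illustrative; a real model would also be batched and multi-headed:

    import numpy as np

    def softmax(x, axis=-1):
        x = x - x.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
        e = np.exp(x)
        return e / e.sum(axis=axis, keepdims=True)

    def attention(Q, K, V):
        d = Q.shape[-1]
        scores = Q @ K.T / np.sqrt(d)       # (n, n) scaled scores, one per query/key pair
        weights = softmax(scores, axis=-1)  # each row is a distribution over the keys
        return weights @ V, weights         # context-aware outputs, plus weights to inspect

    rng = np.random.default_rng(0)
    Q, K, V = (rng.normal(size=(6, 4)) for _ in range(3))
    out, weights = attention(Q, K, V)
    print(out.shape)              # (6, 4): one output vector per token
    print(weights.sum(axis=-1))   # every row sums to 1.0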

The scores matrix is the heart of attention. Row i shows how token i distributes its attention across the sentence. Column j shows how much attention token j receives from everyone else. Watch the heatmap on the right — "sat" attends most to "cat" because the verb is seeking its subject.

From Scores to Output

The softmax weights tell us how much each token matters. Now we use those weights to blend the Value vectors together.

output_i = Σ_j (weight_ij × V_j)

Softmax weights (row i) × Value vectors (V_0 … V_n) → weighted sum (per token) → output (context-aware vector)

How context flows

If "sat" attends 60% to "cat" and 30% to "mat" and 10% to everything else, its output vector is approximately:

output["sat"] = 0.60 × V["cat"] + 0.30 × V["mat"] + 0.10 × ...

The verb's output now carries information about its subject ("cat") and the location phrase ("mat"). The model didn't need grammar rules — it learned which tokens matter from data alone.
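
The same blend as a tiny numeric sketch, with made-up weights and two-dimensional values chosen only to make the arithmetic visible:

    import numpy as np

    weights = np.array([0.60, 0.30, 0.10])   # attention row for "sat" (illustrative)
    values = np.array([[1.0, 0.0],           # V["cat"]
                       [0.0, 1.0],           # V["mat"]
                       [0.5, 0.5]])          # everything else, lumped into one vector
    output_sat = weights @ values            # weighted sum of the value vectors
    print(output_sat)                        # [0.65 0.35]: mostly "cat", some "mat"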

Before and after attention

Before attention: every token only knows its own embedding — a static representation from the embedding table, the same regardless of context.

After attention: every token's representation has been updated to carry information from all the tokens it attended to. "it" in "The trophy wouldn't fit because it was too big" now carries information about "trophy", resolving the reference.

What the output feeds into

These output vectors — one per token, same count as input — flow through the rest of the transformer layer:

Input embeddings ("The cat sat on the mat") → Attention (context mixing) → + Residual (add & norm) → Feed-Forward (transform) → Output (to the next layer)

Each column is one token's vector moving through the layer. Notice how the attention stage has crossing lines — that's the context mixing, where each output draws information from multiple inputs. The residual and feed-forward stages process each token independently (straight vertical lines).

A "32-layer model" means this entire block — attention + residual + feed-forward — is stacked 32 times as one unit. Each repetition is called a "layer" or "transformer block." Layer 1's output becomes layer 2's input, which feeds into layer 3, and so on.

Why not just one block?

A single attention layer can only capture simple, direct relationships — "cat" is near "sat." But language requires multi-hop reasoning: understanding "The trophy wouldn't fit in the bag because it was too big" requires connecting "it" → "trophy" → "big" → "wouldn't fit" across several reasoning steps.

Each layer builds on the previous layer's context-enriched output:

  • Early layers (1–5): basic patterns — word pairs, adjacent tokens, simple syntax
  • Middle layers (10–20): phrases and clauses — "sat on the mat" as a location, subject-verb agreement
  • Deep layers (30+): abstract reasoning — coreference ("it" = "trophy"), logical inference, world knowledge

One layer can't do all of this — just like you can't understand a paragraph by reading one word at a time without building up meaning. More layers = more steps of reasoning.

  • Llama-2 7B: 32 blocks stacked
  • GPT-3 175B: 96 blocks stacked

The output has the same shape as the input — same number of tokens, same vector dimensions — but each vector is now context-aware. This is the fundamental transformation: static embedding in, context-enriched representation out. Everything downstream benefits from it.

Multi-Head Attention

One attention head captures one type of relationship. But language has many: syntactic dependencies, semantic similarity, positional proximity, coreference. One head can't learn all of them simultaneously — different relationship types require different representations of Q, K, V.

Multi-head attention runs N independent attention computations in parallel, each with its own learned weight matrices W_Q, W_K, W_V.

What each head learns

In practice, different heads specialize in different patterns:

  • Positional head — attends to nearby tokens; learns local syntax and phrase structure
  • Syntactic head — connects subjects to verbs, nouns to their modifiers
  • Semantic head — links words with related meanings regardless of distance
  • Copy head — attends to identical or nearly identical tokens; useful for repetition and reference

These labels come from human analysis of trained models — the model doesn't get told what to learn. It discovers useful patterns entirely from the training signal.

How outputs combine

Each head produces an output matrix the same shape as a single-head output. All N heads are concatenated along the feature dimension, then projected through a final linear layer:

MultiHead(Q, K, V) = Concat(head_1, …, head_N) × W_O

The projection W_O lets the model combine patterns from all heads into a single unified representation.
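
A minimal single-example sketch of the split, per-head attention, concatenate, and project pattern. The weight matrices are random stand-ins and the head count is a toy value:

    import numpy as np

    def softmax(x):
        x = x - x.max(axis=-1, keepdims=True)
        e = np.exp(x)
        return e / e.sum(axis=-1, keepdims=True)

    def multi_head_attention(x, W_Q, W_K, W_V, W_O, num_heads):
        n, d_model = x.shape
        d_head = d_model // num_heads
        heads = []
        for h in range(num_heads):
            cols = slice(h * d_head, (h + 1) * d_head)   # this head's slice of each projection
            Q, K, V = x @ W_Q[:, cols], x @ W_K[:, cols], x @ W_V[:, cols]
            weights = softmax(Q @ K.T / np.sqrt(d_head))
            heads.append(weights @ V)                    # one (n, d_head) output per head
        # Concatenate along the feature dimension, then project with W_O
        return np.concatenate(heads, axis=-1) @ W_O

    rng = np.random.default_rng(0)
    d_model, num_heads = 8, 2
    x = rng.normal(size=(6, d_model))
    W_Q, W_K, W_V, W_O = (rng.normal(size=(d_model, d_model)) for _ in range(4))
    print(multi_head_attention(x, W_Q, W_K, W_V, W_O, num_heads).shape)   # (6, 8)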

Scale in practice

GPT-3 uses 96 heads per layer. Each head independently discovers its own relationship pattern from data alone.

Heads vs layers — don't confuse them

In the diagram: layers 1–3 stack vertically (depth, sequential), and within each layer, heads H1–H4 sit side by side (width, parallel).
  • Multiple heads = parallel within one layer (horizontal). All heads run at the same time on the same input, each looking for a different pattern. Their outputs are concatenated into one result.
  • Multiple layers = sequential across the model (vertical). Each layer takes the previous layer's output, runs multi-head attention + feed-forward, and passes the result to the next layer.

GPT-3 has 96 heads × 96 layers = 9,216 independent attention computations per forward pass. Heads work in parallel (width), layers work in sequence (depth).

Multiple heads capture syntax, semantics, position, coreference, and patterns humans haven't named — all in parallel. This is a core reason transformers generalize so well: they're not limited to one inductive bias about what relationships matter.

Causal Masking

Not all transformers read the whole sentence. There are two fundamental architectures:

  • BERT — bidirectional. Every token sees every other token (full grid). Designed for understanding: classification, NER, question answering.
  • GPT — causal (autoregressive). Each token sees only past tokens (lower triangle). Designed for generation: complete this sentence, predict the next word.

How does GPT enforce this? With a causal mask.

How the mask works

Before softmax, set every score where j > i to −∞:

masked_score(i, j) = score(i, j)   if j ≤ i
                   = −∞             if j > i

After softmax, exp(−∞) = 0 — future tokens get exactly zero weight. They can't contribute to the output.
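
A small sketch of the mask in NumPy, using random scores for a four-token toy sequence:

    import numpy as np

    def softmax(x):
        x = x - x.max(axis=-1, keepdims=True)
        e = np.exp(x)
        return e / e.sum(axis=-1, keepdims=True)

    rng = np.random.default_rng(0)
    n = 4
    scores = rng.normal(size=(n, n))                    # toy scaled scores for all pairs

    future = np.triu(np.ones((n, n), dtype=bool), k=1)  # True wherever j > i
    masked = np.where(future, -np.inf, scores)          # future positions become −∞

    weights = softmax(masked)
    print(np.round(weights, 2))    # lower-triangular: zeros above the diagonal
    print(weights.sum(axis=-1))    # each row still sums to 1.0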

Scores matrix (all pairs) → apply mask (future = −∞) → softmax (exp(−∞) → 0) → triangular pattern (past only)

The lower-triangular pattern

The result is a lower-triangular attention matrix:

  • Token 0 sees only itself
  • Token 1 sees tokens 0–1
  • Token 2 sees tokens 0–2
  • Token 5 ("mat") sees tokens 0–5

No token can peek ahead.

Why this enables generation

When generating "The cat sat on the ___", the model predicts each new token using only what came before. The causal mask guarantees this constraint during training — so at inference time, the model has only ever learned to predict the next token from past context.

On the right, toggle the Causal Mask and click any token to inspect its row. Watch the row sum stay at 1.000 even as masked cells go dark — the unmasked cells grow to compensate.

Every time you chat with GPT or Claude, causal masking is at work. The model generates one token at a time, attending only to what came before. The mask is not a limitation — it's the mechanism that makes autoregressive generation possible.

Further reading

  • Jay Alammar — The Illustrated Transformer — visual walkthrough of masked attention in GPT
  • Lilian Weng — Attention? Attention! — mathematical formulation of self-attention and masking