Self-Attention Mechanism Explained Visually
What is Self-Attention?
Self-attention is the mechanism that lets each token in a sequence look at every other token to decide what's relevant. Every token is projected into three roles: a Query (what am I looking for?), a Key (what do I contain?), and a Value (what information do I carry?). Attention scores are computed by comparing each query against all keys, then used to create a weighted sum of values. Multi-head attention runs this process in parallel with different learned projections, allowing the model to attend to different types of relationships simultaneously.
Why Attention?
Before transformers, language models read text sequentially — one word at a time, left to right. Long-range connections faded. Consider this sentence:
"The trophy wouldn't fit in the bag because it was too big."
What does "it" refer to — the trophy or the bag? A sequential model has already moved on by the time it processes "it". Attention solves this by letting every token look at every other token simultaneously and decide what matters.
Three superpowers
Long-range dependencies — Attention can connect "it" to "trophy" regardless of how many words lie between them. Distance stops being a barrier.
Context-sensitive meaning — "bank" means something different near "river" than near "money". Attention lets a token's representation shift based on what surrounds it.
Parallelism — Unlike RNNs that process tokens one at a time, attention computes all token relationships at once. Every modern GPU is built to love this kind of work.
Why it matters in practice
Attention is the reason transformer models outperform everything that came before them. It powers:
- Translation — connecting words in one language to their counterparts across a sentence
- Summarization — identifying which sentences are most relevant to the main idea
- Code completion — linking a variable declaration to where it's used 50 lines later
- Retrieval — matching a question to the most relevant passage in a document
Attention is the mechanism that made transformers possible. Every modern LLM — GPT, Gemini, Claude — is built on this. You're about to learn exactly how it works, step by step.
Q, K, V: Three Roles
Every token generates three vectors from its embedding. These aren't arbitrary — each has a specific job in the attention mechanism.
The three vectors
Query (Q) — "What am I looking for?" When processing "sat", the query encodes something like: I'm a verb, I need a subject.
Key (K) — "What do I contain?" Each token broadcasts what kind of thing it is. "cat" has a key that says: I'm a noun, I'm an agent.
Value (V) — "What information do I carry?" If I get attention, this is what I contribute to the output.
The library analogy
Walk into a library with a query — you're searching for books on machine learning. Every book has a key on its spine (title, author, subject). You compare your query against every key to find matches. When you find a match, you read the book's value (its actual content).
Attention works the same way. Every token in a sentence simultaneously plays searcher (Q) and searchee (K, V).
Where W_Q, W_K, W_V come from
These weight matrices are learned during training — they start random and get updated by gradient descent over billions of examples. The model discovers, on its own, what Q, K, V representations make useful attention patterns.
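A minimal numpy sketch of these projections. All dimensions, weights, and inputs below are made up for illustration; in a real model X comes from the embedding table and the W matrices are trained.

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, d_k = 8, 4   # embedding size and head size (illustrative, not real-model values)
n_tokens = 6          # e.g. "the cat sat on the mat"

# Token embeddings (in a real model these come from the embedding table)
X = rng.normal(size=(n_tokens, d_model))

# W_Q, W_K, W_V start random and are updated by gradient descent during training
W_Q = rng.normal(size=(d_model, d_k))
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_k))

# Every token produces three distinct vectors from the same embedding
Q = X @ W_Q   # "what am I looking for?"
K = X @ W_K   # "what do I contain?"
V = X @ W_V   # "what information do I carry?"

print(Q.shape, K.shape, V.shape)  # (6, 4) (6, 4) (6, 4)
```

Note that all three come from the same row of X; only the learned projection differs.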
Try clicking different tokens on the right panel to see their Q, K, and V vectors. Notice how each token produces three distinct vectors from the same embedding.
Q, K, V are not intuitive at first — they become clear when you see scores computed in the next step. The "library" analogy only gets you so far; the real insight is that dot products between Q and K measure alignment, which determines how information flows.
Computing Attention Scores
Now the mechanism kicks in. Every Query gets compared against every Key via dot product — a single number that measures how well they align.
score(i, j) = Q_i · K_j
A high dot product means the two vectors point in the same direction — strong alignment, strong attention. A low or negative dot product means weak alignment.
Why divide by √d?
Raw dot products grow with the vector dimension d: each score is a sum of d products, so its typical magnitude scales like √d. In a model with d = 512, raw scores can be very large — which pushes softmax into a regime where the largest value dominates completely and the output becomes almost a hard argmax.
Dividing by √d keeps the scores in a stable range where softmax can blend from multiple sources, not just crown one winner.
Try the ÷√d_k Scaling toggle on the right. Flip it OFF and watch the Scores cells brighten dramatically — then jump to the Softmax stage and see almost all the mass collapse to a single column per row. That collapse is what scaling prevents: when softmax saturates, the gradient through this layer is ≈ 0 and the model can't learn to blend across tokens.
Softmax normalization
After scaling, softmax converts each row of scores into a probability distribution:
weight(i, j) = exp(score_scaled(i,j)) / Σ_k exp(score_scaled(i,k))
Each row sums to exactly 1.0. These weights tell you: when computing the output for token i, how much does each token j contribute?
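A quick sketch of row-wise softmax with toy scores (the numbers are illustrative):

```python
import numpy as np

def softmax(scores):
    # Subtract the row max before exponentiating; this is the standard
    # numerical-stability trick and does not change the result
    z = scores - scores.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

scores = np.array([[2.0, 1.0, 0.1],
                   [0.5, 0.5, 0.5]])
weights = softmax(scores)
print(weights.sum(axis=-1))  # [1. 1.] — each row sums to exactly 1.0
```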
The complete formula
Everything so far combines into one expression:
Attention(Q, K, V) = softmax(QKᵀ / √d) · V
This is the core of every transformer. Q, K, V are matrices — so this is computed for all tokens in parallel.
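The whole formula fits in a few lines of numpy. The shapes and inputs below are toy values for illustration:

```python
import numpy as np

def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V, computed for all tokens at once."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # compare every query to every key
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                            # weighted sum of values

rng = np.random.default_rng(0)
Q = rng.normal(size=(6, 4))
K = rng.normal(size=(6, 4))
V = rng.normal(size=(6, 4))

out = attention(Q, K, V)
print(out.shape)  # (6, 4): one context-aware vector per token
```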
The scores matrix is the heart of attention. Row i shows how token i distributes its attention across the sentence. Column j shows how much attention token j receives from everyone else. Watch the heatmap on the right — "sat" attends most to "cat" because the verb is seeking its subject.
From Scores to Output
The softmax weights tell us how much each token matters. Now we use those weights to blend the Value vectors together.
output_i = Σ_j (weight_ij × V_j)
How context flows
If "sat" attends 60% to "cat" and 30% to "mat" and 10% to everything else, its output vector is approximately:
output["sat"] = 0.60 × V["cat"] + 0.30 × V["mat"] + 0.10 × ...
The verb's output now carries information about its subject ("cat") and the location phrase ("mat"). The model didn't need grammar rules — it learned which tokens matter from data alone.
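Using hypothetical two-dimensional value vectors (the numbers are invented purely to make the arithmetic visible), the blend looks like:

```python
import numpy as np

# Toy value vectors — hypothetical, chosen for easy arithmetic
V = {"cat": np.array([1.0, 0.0]),
     "mat": np.array([0.0, 1.0]),
     "the": np.array([0.5, 0.5])}

# Attention weights for "sat" from the example above
weights = {"cat": 0.60, "mat": 0.30, "the": 0.10}

# output_i = sum_j weight_ij * V_j
output_sat = sum(w * V[tok] for tok, w in weights.items())
print(output_sat)  # [0.65 0.35] — mostly "cat", partly "mat"
```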
Before and after attention
Before attention: every token only knows its own embedding — a static representation from the embedding table, the same regardless of context.
After attention: every token's representation has been updated to carry information from all the tokens it attended to. "it" in "The trophy wouldn't fit because it was too big" now carries information about "trophy", resolving the reference.
What the output feeds into
These output vectors — one per token, same count as input — flow through the rest of the transformer layer:
Each column is one token's vector moving through the layer. Notice how the attention stage has crossing lines — that's the context mixing, where each output draws information from multiple inputs. The residual and feed-forward stages process each token independently (straight vertical lines).
A "32-layer model" means this entire block — attention + residual + feed-forward — is stacked 32 times as one unit. Each repetition is called a "layer" or "transformer block." Layer 1's output becomes layer 2's input, which feeds into layer 3, and so on.
Why not just one block?
A single attention layer can only capture simple, direct relationships — "cat" is near "sat." But language requires multi-hop reasoning: understanding "The trophy wouldn't fit in the bag because it was too big" requires connecting "it" → "trophy" → "big" → "wouldn't fit" across several reasoning steps.
Each layer builds on the previous layer's context-enriched output:
- Early layers (1–5): basic patterns — word pairs, adjacent tokens, simple syntax
- Middle layers (10–20): phrases and clauses — "sat on the mat" as a location, subject-verb agreement
- Deep layers (30+): abstract reasoning — coreference ("it" = "trophy"), logical inference, world knowledge
One layer can't do all of this — just like you can't understand a paragraph by reading one word at a time without building up meaning. More layers = more steps of reasoning.
- Llama-2 7B: 32 blocks stacked
- GPT-3 175B: 96 blocks stacked
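The stacking itself is just function composition. The block below is a stand-in, not a real transformer block — it only demonstrates that each layer's output becomes the next layer's input and that the shape is preserved throughout:

```python
import numpy as np

def transformer_block(x):
    """Placeholder for attention + residual + feed-forward (hypothetical)."""
    return x + 0.1 * np.tanh(x)   # real blocks mix context across tokens here

x = np.ones((6, 8))               # one vector per token
n_layers = 32                     # e.g. Llama-2 7B stacks 32 blocks
for _ in range(n_layers):
    x = transformer_block(x)      # layer i's output is layer i+1's input

print(x.shape)  # (6, 8) — shape is preserved through every layer
```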
The output has the same shape as the input — same number of tokens, same vector dimensions — but each vector is now context-aware. This is the fundamental transformation: static embedding in, context-enriched representation out. Everything downstream benefits from it.
Multi-Head Attention
One attention head captures one type of relationship. But language has many: syntactic dependencies, semantic similarity, positional proximity, coreference. One head can't learn all of them simultaneously — different relationship types require different representations of Q, K, V.
Multi-head attention runs N independent attention computations in parallel, each with its own learned weight matrices W_Q, W_K, W_V.
What each head learns
In practice, different heads specialize in different patterns:
- Positional head — attends to nearby tokens; learns local syntax and phrase structure
- Syntactic head — connects subjects to verbs, nouns to their modifiers
- Semantic head — links words with related meanings regardless of distance
- Copy head — attends to identical or nearly identical tokens; useful for repetition and reference
These labels come from human analysis of trained models — the model doesn't get told what to learn. It discovers useful patterns entirely from the training signal.
How outputs combine
Each head produces an output matrix the same shape as a single-head output. All N heads are concatenated along the feature dimension, then projected through a final linear layer:
MultiHead(Q, K, V) = Concat(head_1, …, head_N) × W_O
The projection W_O lets the model combine patterns from all heads into a single unified representation.
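A sketch in numpy, assuming 2 heads of size 4 over an 8-dimensional model (all shapes and weights are illustrative):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, heads, W_O):
    outs = []
    for W_Q, W_K, W_V in heads:                 # each head has its own projections
        Q, K, V = X @ W_Q, X @ W_K, X @ W_V
        scores = Q @ K.T / np.sqrt(Q.shape[-1])
        outs.append(softmax(scores) @ V)        # one single-head output per head
    # Concatenate along the feature dimension, then project through W_O
    return np.concatenate(outs, axis=-1) @ W_O

rng = np.random.default_rng(0)
d_model, n_heads = 8, 2
d_k = d_model // n_heads
heads = [tuple(rng.normal(size=(d_model, d_k)) for _ in range(3))
         for _ in range(n_heads)]
W_O = rng.normal(size=(n_heads * d_k, d_model))
X = rng.normal(size=(6, d_model))

out = multi_head_attention(X, heads, W_O)
print(out.shape)  # (6, 8) — same shape as the input
```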
Scale in practice
GPT-3 uses 96 heads per layer. Each head independently discovers its own relationship pattern from data alone.
Heads vs layers — don't confuse them
- Multiple heads = parallel within one layer (horizontal). All heads run at the same time on the same input, each looking for a different pattern. Their outputs are concatenated into one result.
- Multiple layers = sequential across the model (vertical). Each layer takes the previous layer's output, runs multi-head attention + feed-forward, and passes the result to the next layer.
GPT-3 has 96 heads × 96 layers = 9,216 independent attention computations per forward pass. Heads work in parallel (width), layers work in sequence (depth).
Multiple heads capture syntax, semantics, position, coreference, and patterns humans haven't named — all in parallel. This is a core reason transformers generalize so well: they're not limited to one inductive bias about what relationships matter.
Causal Masking
Not all transformers read the whole sentence. There are two fundamental architectures:
- BERT — bidirectional. Every token sees every other token (full grid). Designed for understanding: classification, NER, question answering.
- GPT — causal (autoregressive). Each token sees only past tokens (lower triangle). Designed for generation: complete this sentence, predict the next word.
How does GPT enforce this? With a causal mask.
How the mask works
Before softmax, set every score where j > i to −∞:
masked_score(i, j) = score(i, j) if j ≤ i
= −∞ if j > i
After softmax, exp(−∞) = 0 — future tokens get exactly zero weight. They can't contribute to the output.
The lower-triangular pattern
The result is a lower-triangular attention matrix:
- Token 0 sees only itself
- Token 1 sees tokens 0–1
- Token 2 sees tokens 0–2
- Token 5 ("mat") sees tokens 0–5
No token can peek ahead.
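In numpy, the mask is an upper-triangular boolean matrix applied before softmax (toy random scores for illustration):

```python
import numpy as np

n = 6
scores = np.random.default_rng(0).normal(size=(n, n))

# Mask every future position (j > i) with -inf before softmax
mask = np.triu(np.ones((n, n), dtype=bool), k=1)
scores[mask] = -np.inf

# Softmax: exp(-inf) = 0, so future tokens get exactly zero weight
z = scores - scores.max(axis=-1, keepdims=True)
w = np.exp(z)
w /= w.sum(axis=-1, keepdims=True)

print(np.allclose(w.sum(axis=-1), 1.0))   # True — rows still sum to 1
print(np.allclose(np.triu(w, k=1), 0.0))  # True — no token attends to the future
```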
Why this enables generation
When generating "The cat sat on the ___", the model predicts each new token using only what came before. The causal mask guarantees this constraint during training — so at inference time, the model has only ever learned to predict the next token from past context.
On the right, toggle the Causal Mask and click any token to inspect its row. Watch the row sum stay at 1.000 even as masked cells go dark — the unmasked cells grow to compensate.
Every time you chat with GPT or Claude, causal masking is at work. The model generates one token at a time, attending only to what came before. The mask is not a limitation — it's the mechanism that makes autoregressive generation possible.
Further reading
- Jay Alammar — The Illustrated Transformer — visual walkthrough of masked attention in GPT
- Lilian Weng — Attention? Attention! — mathematical formulation of self-attention and masking