Transformer Block Architecture Explained

The Transformer Block

What is a Transformer Block?

A transformer block is the repeating unit that makes up the entire model. Each block contains two sub-layers: a multi-head self-attention layer (which lets tokens communicate) and a feed-forward network (which processes each token independently). Both sub-layers use residual connections (adding the input back to the output) and layer normalization to keep training stable. GPT-3 stacks 96 of these blocks; LLaMA 2 70B uses 80. The depth — number of stacked blocks — determines how many rounds of "thinking" the model can do.

A transformer is a stack of identically structured blocks. GPT-2 has 12; GPT-3 has 96; Llama 2 7B has 32. Each block sees the same shape of data and refines it — every one running the same pattern.

One Block, Two Sub-Layers

Every transformer block contains exactly two sub-layers:

  1. Multi-Head Attention (MHA) — lets each token gather information from all other tokens
  2. Feed-Forward Network (FFN) — transforms each token independently, applying learned knowledge

But attention and FFN don't work alone. Each one has two helpers:

  • Layer Normalization — as data passes through 96 blocks, numbers can grow huge or shrink to near zero. LayerNorm rescales them to a healthy range before each sub-layer — like adjusting the volume so the signal stays clear.
  • Residual connection — data enters a sub-layer (e.g., attention), but a copy of the data before the sub-layer is kept aside. After the sub-layer finishes, its output is added to that saved copy. So the result is: original + what the sub-layer learned. Nothing from the original is lost — the sub-layer only adds new information on top. Why? Without this, a 96-layer model would gradually lose the original signal — each layer would overwrite the previous one. The residual connection guarantees that information can flow straight from the first layer to the last, untouched.

Look at the right panel. The diagram shows this structure — data flows top to bottom. The dashed lines on the left show where the original input bypasses the sub-layer and gets added back at the "+" boxes. Click any box to read what it computes.

The 2-Line Block

Andrej Karpathy's nanoGPT captures the entire block in two lines:

x = x + attn(ln1(x))
x = x + ffn(ln2(x))

That's it. Every major LLM — GPT-4, Claude, Gemini, Llama — runs this pattern billions of times per forward pass. The elegance is in the simplicity: normalize, transform, add back.

Input x: [batch, seq, d_model]
Attention + Residual: x = x + attn(ln1(x))
FFN + Residual: x = x + ffn(ln2(x))
Output x: same shape, richer content
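
To make the pattern concrete, here is a minimal sketch in PyTorch. The attention and FFN here are stand-in nn.Linear modules rather than real sub-layers — the point is only that the two residual lines preserve the [batch, seq, d_model] shape:

import torch
import torch.nn as nn

# Toy sizes chosen purely for illustration
batch, seq, d_model = 2, 16, 768
x = torch.randn(batch, seq, d_model)

ln1, ln2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)
attn = nn.Linear(d_model, d_model)   # stand-in for multi-head attention
ffn  = nn.Linear(d_model, d_model)   # stand-in for the feed-forward network

# The two-line block
x = x + attn(ln1(x))
x = x + ffn(ln2(x))

print(x.shape)   # torch.Size([2, 16, 768]) — same shape in, same shape out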

Stacking Blocks

A single block captures limited patterns. Stack many blocks and the model builds up understanding layer by layer — each block receives the previous block's output, which (thanks to the residual stream) still carries the contributions of every block before it:

deeper layers → more abstract understanding

  • Layers 1–5: word pairs, basic grammar ("cat" is near "sat")
  • Layers 6–15: phrases, subject–verb agreement ("the cat sat" = location phrase)
  • Layers 16–30: semantics, coreference ("it" refers to "trophy")
  • Layers 30+: reasoning, world knowledge ("too big to fit" = logical inference)

Each block inherits everything from the blocks before it — the "add back" pattern (residual connections) means original information is never lost, while new understanding accumulates on top with each layer.

Understanding one block means understanding all of them. The same two lines — attention residual, FFN residual — repeat unchanged from the first block to the last, whether a model has 12 layers or 96.

Layer Normalization

Before each sub-layer runs, the input passes through a normalization step. Without it, activation magnitudes drift as data passes through 96 blocks — numbers grow too large or shrink toward zero, gradients explode or vanish, and the model becomes untrainable.

How Layer Normalization Works

For each token vector independently, Layer Normalization:

  1. Computes the mean μ across all d_model dimensions
  2. Computes the standard deviation σ
  3. Subtracts the mean and divides by σ — centering at zero, unit variance
  4. Applies learned scale γ (gamma) and shift β (beta) to restore representational capacity
LN(x) = γ · (x − μ) / σ + β

The key detail: normalization happens per token, not per batch. Each token's d_model-dimensional vector is normalized independently. This is what distinguishes LayerNorm from BatchNorm — it works correctly with any sequence length and batch size, including batch size 1.
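
A minimal sketch of the computation, written with plain tensor ops rather than nn.LayerNorm so the per-token behavior is explicit (the eps constant is the usual small value added for numerical stability):

import torch

def layer_norm(x, gamma, beta, eps=1e-5):
    # x: [batch, seq, d_model]; statistics are taken over the last axis only,
    # so every token vector is normalized independently of the batch.
    mu = x.mean(dim=-1, keepdim=True)
    var = x.var(dim=-1, keepdim=True, unbiased=False)
    return gamma * (x - mu) / torch.sqrt(var + eps) + beta

d_model = 768
x = torch.randn(2, 16, d_model)
out = layer_norm(x, gamma=torch.ones(d_model), beta=torch.zeros(d_model))
print(out.mean(-1)[0, 0], out.std(-1)[0, 0])   # ~0 and ~1 for each token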

Analogy: Auto-Adjusting Brightness

Think of each token's vector as a photo. Layer Normalization is auto-adjust — it standardizes brightness and contrast regardless of how dark or washed-out the original was. The learned γ and β parameters let the model re-apply its preferred "exposure" after normalization.

RMSNorm — The Modern Variant

Modern LLMs — Llama, Mistral, Falcon — replace LayerNorm with RMSNorm:

RMSNorm(x) = γ · x / RMS(x)    where RMS(x) = √(mean(x²))

Two changes from LayerNorm:

  • No mean subtraction — skip centering entirely
  • No β shift — only one set of learned parameters (γ)

RMSNorm is faster and uses less memory. Empirically it matches LayerNorm quality while cutting the normalization compute nearly in half.
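
For comparison, a matching sketch of RMSNorm under the same assumptions — no mean subtraction, no β, a single learned scale:

import torch

def rms_norm(x, gamma, eps=1e-5):
    # Only the root-mean-square of each token vector is computed — no mean, no shift.
    rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + eps)
    return gamma * x / rms

d_model = 768
x = torch.randn(2, 16, d_model)
print(rms_norm(x, gamma=torch.ones(d_model)).shape)   # torch.Size([2, 16, 768])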

RMSNorm saves compute per layer. Multiply that saving across 96 layers and billions of tokens per training run, and the difference is substantial. This is why most frontier-scale models since Llama have adopted it.

Residual Connections

Look at the + in x = x + attn(ln1(x)). That addition is a residual connection — and it is arguably the most important architectural decision in the transformer.

Adding a Delta, Not Replacing

Without the residual:

x = attn(ln1(x))   # sub-layer replaces x entirely

With the residual:

x = x + attn(ln1(x))   # sub-layer adds a change to x

The sub-layer only needs to learn what to change — the delta Δ. The original information flows through untouched. This is a much easier learning problem than reconstructing x from scratch.

Input x: full token vector
Sub-layer(x): computes delta Δ
x + Δ: residual add
Output: original + change

Gradient Flow: The Highway

Deep networks have a vanishing gradient problem. Backpropagation multiplies gradients through each layer — after 96 multiplications, the gradient reaching the first block can be nearly zero, and the early layers learn nothing.

Residual connections solve this by creating a gradient highway. The gradient can flow directly through the addition operation without any transformation — it passes through every block unchanged. Deep training becomes feasible.
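
A tiny autograd experiment illustrates the highway. The "sub-layer" here is just multiplication by a near-zero weight — a deliberately contrived worst case — so its own gradient contribution is almost nothing:

import torch

w = torch.tensor(1e-6)                    # a "sub-layer" that barely passes any signal
x = torch.tensor(1.0, requires_grad=True)

# Without the residual: dy/dx = 1e-6 — the gradient almost vanishes.
(w * x).backward()
print(x.grad)            # tensor(1.0000e-06)

x.grad = None

# With the residual: dy/dx = 1 + 1e-6 — the identity path keeps it alive.
(x + w * x).backward()
print(x.grad)            # tensor(1.0000)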

This insight came from ResNets in 2015 (He et al.) — a computer vision architecture. The transformer borrowed it directly.

Analogy: Tracked Changes

Writing a document in "tracked changes" mode: every edit is layered on top of the original text, not replacing it. You can always see what changed and roll back. Each transformer block makes tracked changes to the representation — the original signal is never lost.

Without residual connections, training 96 layers would be nearly impossible — gradients would vanish long before reaching the early blocks. The residual connection is the reason that "more layers = more capable" actually works at scale.

Further reading

  • Deep Residual Learning (He et al., 2015) — the original ResNet paper that introduced residual connections
  • Understanding the Difficulty of Training Deep Feedforward Neural Networks (Glorot & Bengio, 2010) — the vanishing gradient problem explained
  • The Vanishing Gradient Problem (Wikipedia) — accessible overview of why deep networks struggle without skip connections

The Feed-Forward Network

Attention and FFN do two very different jobs:

  • Attention = a group meeting. Every token talks to every other token and gathers information: "who is relevant to me?" After attention, the token for "sat" now knows about "cat" and "mat."
  • FFN = individual thinking. Each token goes back to its own desk and processes what it just learned — alone, with no further communication. The token for "sat" takes its gathered context and thinks: "given that I know about cat and mat, what should my updated representation be?"

Attention is communication. FFN is computation.

Expand → Activate → Contract

The FFN is a two-layer MLP with a specific shape pattern:

Input:    [batch, seq, 768]     # d_model
Expand:   [batch, seq, 3072]    # 4 × d_model
Activate: [batch, seq, 3072]    # non-linearity
Contract: [batch, seq, 768]     # back to d_model
Input [768]: d_model dimensions
Expand [3072]: × W₁ + b₁
GELU: non-linearity
Contract [768]: × W₂ + b₂
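
A sketch of this MLP in PyTorch, using the GPT-2 sizes (768 → 3072 → 768) from above; the class and variable names are illustrative, not any library's API:

import torch
import torch.nn as nn

class FeedForward(nn.Module):
    def __init__(self, d_model=768):
        super().__init__()
        self.expand = nn.Linear(d_model, 4 * d_model)    # 768 -> 3072
        self.act = nn.GELU()                             # non-linearity
        self.contract = nn.Linear(4 * d_model, d_model)  # 3072 -> 768

    def forward(self, x):
        # Applied to every token independently — no communication between positions.
        return self.contract(self.act(self.expand(x)))

x = torch.randn(2, 16, 768)
print(FeedForward()(x).shape)    # torch.Size([2, 16, 768])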

Why Expand Then Contract?

The expanded space is a temporary workspace. With 3072 dimensions, the model can form rich intermediate combinations that couldn't be expressed in 768 dimensions. Then it contracts back — keeping only what's useful, discarding the rest.

Contraction back to d_model is not optional. It's what keeps the transformer shape-preserving: every block outputs the same dimensions it takes as input, so blocks can stack indefinitely.

FFN as Knowledge Storage

Research from Geva et al. (2021) showed that FFN layers function as key-value memory banks. The first weight matrix W₁ acts as keys — patterns to match. The second matrix W₂ acts as values — what information to retrieve when a pattern matches.

Factual associations like "Paris is the capital of France" are stored in FFN neurons. Attention routes information; FFN stores and retrieves it.

Activation Functions: ReLU → GELU → SwiGLU

The activation function sits between the expand and contract steps. It decides which signals pass through and which get suppressed. Toggle the curves below to compare:

ReLU: max(0, x), sharp cutoff at 0
GELU: smooth curve, lets small negatives through
SwiGLU: learned gating built on Swish/SiLU (x × sigmoid(x))

Each generation solved a problem with the previous one:

  • ReLU (2012) — the breakthrough that made deep learning work. Everything negative becomes exactly zero: max(0, x). Fast and simple. But it has a flaw: whenever a neuron's input is negative, its output and its gradient are exactly zero — a neuron stuck in that region "dies" and stops learning. In a 96-layer model, many neurons die.

  • GELU (2016, used in GPT/BERT) — fixes the "dying neuron" problem. Instead of a hard cutoff at zero, it smoothly curves — small negative values still get through slightly. This means gradients are never exactly zero, so every neuron keeps learning. Toggle ReLU off and GELU on in the graph above to see the smooth transition near x=0.

  • SwiGLU (2020, used in Llama/Mistral/PaLM) — fixes a different problem: in ReLU and GELU, the activation function is fixed — it applies the same transformation regardless of input. SwiGLU adds a learned gate: a second branch of weights decides which dimensions to keep and which to suppress, adapting per input. The model can learn "for this token, dimensions 5–10 matter; for that token, dimensions 50–60 matter." More parameters, but measurably better quality (see the sketch after this list).
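
A sketch of a SwiGLU-style FFN as used in Llama-family models — the class and weight names here (SwiGLUFFN, w_gate, w_up, w_down) are illustrative, not the exact Llama code. The gate branch passes through SiLU (x × sigmoid(x)) and scales the value branch element-wise; the hidden size is typically shrunk to roughly 8/3 × d_model so the third matrix doesn't inflate the parameter count:

import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFFN(nn.Module):
    def __init__(self, d_model, d_hidden):
        super().__init__()
        self.w_gate = nn.Linear(d_model, d_hidden, bias=False)   # gate branch
        self.w_up = nn.Linear(d_model, d_hidden, bias=False)     # value branch
        self.w_down = nn.Linear(d_hidden, d_model, bias=False)   # contract back

    def forward(self, x):
        # SiLU(gate) * value: a learned, per-dimension gate instead of a fixed curve
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))

x = torch.randn(2, 16, 768)
print(SwiGLUFFN(768, 2048)(x).shape)    # torch.Size([2, 16, 768])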

The FFN contains roughly 2/3 of a transformer's total parameters. In GPT-3, the two FFN weight matrices per block dominate the 175B parameter count. This layer is where the model's "knowledge" is stored — the attention mechanism retrieves context, but FFN is the library.

Pre-Norm vs Post-Norm

There are two ways to place Layer Normalization inside a residual block. The ordering looks like a minor implementation detail, but at 96 layers deep it determines whether training converges at all.

Two Orderings

Post-Norm (original 2017 paper):

x = LN(x + sublayer(x))   # normalize after residual add

Pre-Norm (modern standard):

x = x + sublayer(LN(x))   # normalize before sublayer

The difference: in Post-Norm, the normalization sits outside the residual addition and sees the combined signal. In Pre-Norm, normalization sits inside — the residual path bypasses it entirely.

Why Pre-Norm Won

In Pre-Norm, the residual connection carries the unmodified input directly to the addition — LN never touches the highway. Gradient magnitude is preserved across every block.

In Post-Norm, every gradient must pass through the normalization operation at each layer boundary. This destabilizes training at depth, requiring careful learning rate warm-up schedules and lower initial rates. At 96 layers, the instability compounds.

Pre-Norm trains reliably without special warm-up schedules. Every major modern LLM — GPT-2, GPT-3, Llama, Mistral, Falcon, Claude — uses Pre-Norm.

Karpathy's Code

In nanoGPT, Karpathy writes it directly:

x = x + self.attn(self.ln_1(x))
x = x + self.mlp(self.ln_2(x))

ln_1 and ln_2 are called inside the sub-layer calls — LN runs first, then the sub-layer, then the residual add. That's Pre-Norm. The names ln_1 and ln_2 reflect that there are two norms per block, one per sub-layer.
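
Fleshed out into a full module, the block looks roughly like this — a sketch in the spirit of nanoGPT's Block class, assuming GPT-2-small sizes (d_model=768, 12 heads) and using torch.nn.MultiheadAttention as a stand-in for nanoGPT's custom causal-attention class:

import torch
import torch.nn as nn

class Block(nn.Module):
    def __init__(self, d_model=768, n_heads=12):
        super().__init__()
        self.ln_1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln_2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(                      # expand -> GELU -> contract
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x):
        # Causal mask: True marks positions a token may NOT attend to (the future).
        seq = x.size(1)
        mask = torch.triu(torch.ones(seq, seq, dtype=torch.bool, device=x.device), 1)
        h = self.ln_1(x)                               # Pre-Norm: LN runs first
        x = x + self.attn(h, h, h, attn_mask=mask, need_weights=False)[0]
        x = x + self.mlp(self.ln_2(x))                 # second norm, then residual add
        return x

x = torch.randn(2, 16, 768)
print(Block()(x).shape)    # torch.Size([2, 16, 768])

Stack some number of these modules in an nn.ModuleList and you have the body of a GPT — 12 copies for GPT-2 small, 96 for GPT-3.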

Toggle the Pre-Norm / Post-Norm switch in the right panel to see where the normalization boxes appear in the data flow.

A seemingly trivial ordering difference — normalize before or after — but at 96 layers deep it determines whether the model trains at all. Pre-Norm is now the universal standard; Post-Norm is a historical curiosity.

Modern Variants & Scale

Modern Variants & Scale

The core transformer block has not changed since "Attention Is All You Need" in 2017. What has changed are the components inside it — each swap improving quality, speed, or both.

What Changed: 2017 → Today

Five components swapped out, same overall structure:

  • Normalization: LayerNorm (γ, β) → RMSNorm (γ only) — simpler, faster, same quality
  • FFN activation: ReLU → SwiGLU (gated) — learned gating, better quality
  • Position encoding: Sinusoidal (absolute) → RoPE (rotary, relative) — handles longer sequences
  • Attention: Standard (O(n²) memory) → Flash Attention (IO-aware) — 2-5× faster, less memory
  • Norm placement: Post-Norm → Pre-Norm — trains stably at 96+ layers

RoPE (Rotary Position Embedding) encodes relative position by rotating query and key vectors — it naturally extrapolates to longer sequences than seen during training. Covered in the Embeddings module.

Flash Attention is not a new algorithm — it computes the same attention scores — but it reorganizes memory access to tile computations within GPU SRAM instead of writing intermediate results to HBM. The result: 2–5× faster, far less memory. Every production LLM uses it.
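
In PyTorch 2.x, for example, these fused kernels are reachable through torch.nn.functional.scaled_dot_product_attention, which computes the standard attention result but dispatches to a Flash-Attention-style implementation when hardware and dtypes allow — a minimal sketch:

import torch
import torch.nn.functional as F

# q, k, v: [batch, n_heads, seq, head_dim]
q = torch.randn(1, 12, 1024, 64)
k = torch.randn(1, 12, 1024, 64)
v = torch.randn(1, 12, 1024, 64)

# Same math as softmax(q @ k.T / sqrt(d)) @ v, but computed tile-by-tile
# in fast on-chip memory when a fused kernel is available.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)    # torch.Size([1, 12, 1024, 64])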

Weight tying shares the embedding matrix and the output projection matrix (they are both d_vocab × d_model). The output projection maps the final hidden state back to logits over the vocabulary. Using the same weights as the input embedding eliminates one full copy of that large matrix — saving hundreds of millions of parameters in large models.
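
A minimal sketch of the idea (the variable names tok_emb and lm_head are illustrative; nanoGPT does the equivalent with its wte embedding and lm_head):

import torch.nn as nn

vocab_size, d_model = 50257, 768
tok_emb = nn.Embedding(vocab_size, d_model)            # input: token id -> vector
lm_head = nn.Linear(d_model, vocab_size, bias=False)   # output: vector -> logits

# Both weights have shape [vocab_size, d_model]; point them at the same tensor.
lm_head.weight = tok_emb.weight

print(lm_head.weight is tok_emb.weight)   # True — one matrix, two roles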

Parameter Counting

A single transformer block contains approximately 12d² parameters (where d is d_model):

  • Attention: Q, K, V, O projection matrices → ~4d²
  • FFN: two weight matrices with 4× expansion → ~8d²

For GPT-3 (d = 12288):

12 × 12288² ≈ 1.81B parameters per block
1.81B × 96 blocks ≈ 174B parameters

The remaining ~1B comes from the token and position embeddings (layer norms contribute a negligible amount) — the blocks dominate.
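
The same arithmetic as a quick sanity check, using only the sizes quoted above:

d_model, n_blocks = 12288, 96          # GPT-3 sizes

attn = 4 * d_model**2                  # Q, K, V, O projections
ffn = 2 * (d_model * 4 * d_model)      # W1 (d -> 4d) and W2 (4d -> d)
per_block = attn + ffn                 # = 12 * d_model**2

print(per_block / 1e9)                 # ≈ 1.81B per block
print(per_block * n_blocks / 1e9)      # ≈ 174B across all blocks; GPT-3 is ~175B total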

The transformer block's architecture has been stable since 2017. What changed were the components within: RMSNorm for speed, SwiGLU for quality, RoPE for length generalization, Flash Attention for efficiency. The two-line structure remains unchanged.

Further Reading

  • Andrej Karpathy — nanoGPT — minimal GPT implementation; the model.py Block class is 20 lines
  • Jay Alammar — The Illustrated GPT-2 — visual walk-through of GPT-2's block structure and parameter shapes
  • Lilian Weng — The Transformer Family v2 — comprehensive survey of architectural variants through 2023