Transformer Block Architecture Explained

The Transformer Block

What is a Transformer Block?

A transformer block is the repeating unit that makes up the entire model. Each block contains two sub-layers: a multi-head self-attention layer (which lets tokens communicate) and a feed-forward network (which processes each token independently). Both sub-layers use residual connections (adding the input back to the output) and layer normalization to keep training stable. GPT-3 stacks 96 of these blocks; LLaMA 2 70B uses 80. The depth — number of stacked blocks — determines how many rounds of "thinking" the model can do.

A transformer is a stack of identically structured blocks. GPT-2 has 12; GPT-3 has 96; Llama 2 7B has 32. Each block sees the same shape of data and refines it — every one running the same pattern.

One Block, Two Sub-Layers

Every transformer block contains exactly two sub-layers:

  1. Multi-Head Attention (MHA) — lets each token gather information from all other tokens
  2. Feed-Forward Network (FFN) — transforms each token independently, applying learned knowledge

But attention and FFN don't work alone. Each one has two helpers:

  • Layer Normalization — as data passes through 96 blocks, numbers can grow huge or shrink to near zero. LayerNorm rescales them to a healthy range before each sub-layer — like adjusting the volume so the signal stays clear.
  • Residual connection — data enters a sub-layer (e.g., attention), but a copy of the data before the sub-layer is kept aside. After the sub-layer finishes, its output is added to that saved copy. So the result is: original + what the sub-layer learned. Nothing from the original is lost — the sub-layer only adds new information on top. Why? Without this, a 96-layer model would gradually lose the original signal — each layer would overwrite the previous one. The residual connection guarantees that information can flow straight from the first layer to the last, untouched.

Look at the right panel. The diagram shows this structure — data flows top to bottom. The dashed lines on the left show where the original input bypasses the sub-layer and gets added back at the "+" boxes. Click any box to read what it computes.

The 2-Line Block

Andrej Karpathy's nanoGPT captures the entire block in two lines:

x = x + attn(ln1(x))
x = x + ffn(ln2(x))

That's it. Every major LLM — GPT-4, Claude, Gemini, Llama — runs this pattern billions of times per forward pass. The elegance is in the simplicity: normalize, transform, add back.

Input x: [batch, seq, d_model]
Attention + Residual: x = x + attn(ln1(x))
FFN + Residual: x = x + ffn(ln2(x))
Output x: same shape, richer content
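
To make the pattern concrete, here is a minimal sketch in PyTorch. The attention and FFN here are stand-in nn.Linear modules rather than real sub-layers — the point is only that the two residual lines preserve the [batch, seq, d_model] shape:

import torch
import torch.nn as nn

# Toy sizes chosen purely for illustration
batch, seq, d_model = 2, 16, 768
x = torch.randn(batch, seq, d_model)

ln1, ln2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)
attn = nn.Linear(d_model, d_model)   # stand-in for multi-head attention
ffn  = nn.Linear(d_model, d_model)   # stand-in for the feed-forward network

# The two-line block
x = x + attn(ln1(x))
x = x + ffn(ln2(x))

print(x.shape)   # torch.Size([2, 16, 768]) — same shape in, same shape out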

Stacking Blocks

A single block captures limited patterns. Stack many blocks and the model builds up understanding layer by layer — each block receives the previous block's output, which (thanks to the residual stream) still carries the contributions of every block before it:

deeper layers → more abstract understanding

  • Layers 1–5: word pairs, basic grammar ("cat" is near "sat")
  • Layers 6–15: phrases, subject–verb agreement ("the cat sat" = location phrase)
  • Layers 16–30: semantics, coreference ("it" refers to "trophy")
  • Layers 30+: reasoning, world knowledge ("too big to fit" = logical inference)

Each block inherits everything from the blocks before it — the "add back" pattern (residual connections) means original information is never lost, while new understanding accumulates on top with each layer.

Understanding one block means understanding all of them. The same two lines — attention residual, FFN residual — repeat unchanged from the first block to the last, whether a model has 12 layers or 96.

Layer Normalization

Before each sub-layer runs, the input passes through a normalization step. Without it, activation magnitudes drift as data passes through 96 blocks — numbers grow too large or shrink toward zero, gradients explode or vanish, and the model becomes untrainable.

How Layer Normalization Works

For each token vector independently, Layer Normalization:

  1. Computes the mean μ across all d_model dimensions
  2. Computes the standard deviation σ
  3. Subtracts the mean and divides by σ — centering at zero, unit variance
  4. Applies learned scale γ (gamma) and shift β (beta) to restore representational capacity
LN(x) = γ · (x − μ) / σ + β

The key detail: normalization happens per token, not per batch. Each token's d_model-dimensional vector is normalized independently. This is what distinguishes LayerNorm from BatchNorm — it works correctly with any sequence length and batch size, including batch size 1.
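
A minimal sketch of the computation, written with plain tensor ops rather than nn.LayerNorm so the per-token behavior is explicit (the eps constant is the usual small value added for numerical stability):

import torch

def layer_norm(x, gamma, beta, eps=1e-5):
    # x: [batch, seq, d_model]; statistics are taken over the last axis only,
    # so every token vector is normalized independently of the batch.
    mu = x.mean(dim=-1, keepdim=True)
    var = x.var(dim=-1, keepdim=True, unbiased=False)
    return gamma * (x - mu) / torch.sqrt(var + eps) + beta

d_model = 768
x = torch.randn(2, 16, d_model)
out = layer_norm(x, gamma=torch.ones(d_model), beta=torch.zeros(d_model))
print(out.mean(-1)[0, 0], out.std(-1)[0, 0])   # ~0 and ~1 for each token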

Analogy: Auto-Adjusting Brightness

Think of each token's vector as a photo. Layer Normalization is auto-adjust — it standardizes brightness and contrast regardless of how dark or washed-out the original was. The learned γ and β parameters let the model re-apply its preferred "exposure" after normalization.

RMSNorm — The Modern Variant

Modern LLMs — Llama, Mistral, Falcon — replace LayerNorm with RMSNorm:

RMSNorm(x) = γ · x / RMS(x)    where RMS(x) = √(mean(x²))

Two changes from LayerNorm:

  • No mean subtraction — skip centering entirely
  • No β shift — only one set of learned parameters (γ)

RMSNorm is faster and uses less memory. Empirically it matches LayerNorm quality while cutting the normalization compute nearly in half.
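
For comparison, a matching sketch of RMSNorm under the same assumptions — no mean subtraction, no β, a single learned scale:

import torch

def rms_norm(x, gamma, eps=1e-5):
    # Only the root-mean-square of each token vector is computed — no mean, no shift.
    rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + eps)
    return gamma * x / rms

d_model = 768
x = torch.randn(2, 16, d_model)
print(rms_norm(x, gamma=torch.ones(d_model)).shape)   # torch.Size([2, 16, 768])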

RMSNorm saves compute per layer. Multiply that saving across 96 layers and billions of tokens per training run, and the difference is substantial. This is why most frontier-scale models since Llama have adopted it.

Residual Connections

Look at the + in x = x + attn(ln1(x)). That addition is a residual connection — and it is arguably the most important architectural decision in the transformer.

Adding a Delta, Not Replacing

Without the residual:

x = attn(ln1(x))   # sub-layer replaces x entirely

With the residual:

x = x + attn(ln1(x))   # sub-layer adds a change to x

The sub-layer only needs to learn what to change — the delta Δ. The original information flows through untouched. This is a much easier learning problem than reconstructing x from scratch.

Input x: full token vector
Sub-layer(x): computes delta Δ
x + Δ: residual add
Output: original + change

Gradient Flow: The Highway

Deep networks have a vanishing gradient problem. Backpropagation multiplies gradients through each layer — after 96 multiplications, the gradient reaching the first block can be nearly zero, and the early layers learn nothing.

Residual connections solve this by creating a gradient highway. The gradient can flow directly through the addition operation without any transformation — it passes through every block unchanged. Deep training becomes feasible.
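
A tiny autograd experiment illustrates the highway. The "sub-layer" here is just multiplication by a near-zero weight — a deliberately contrived worst case — so its own gradient contribution is almost nothing:

import torch

w = torch.tensor(1e-6)                    # a "sub-layer" that barely passes any signal
x = torch.tensor(1.0, requires_grad=True)

# Without the residual: dy/dx = 1e-6 — the gradient almost vanishes.
(w * x).backward()
print(x.grad)            # tensor(1.0000e-06)

x.grad = None

# With the residual: dy/dx = 1 + 1e-6 — the identity path keeps it alive.
(x + w * x).backward()
print(x.grad)            # tensor(1.0000)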

This insight came from ResNets in 2015 (He et al.) — a computer vision architecture. The transformer borrowed it directly.

Analogy: Tracked Changes

Writing a document in "tracked changes" mode: every edit is layered on top of the original text, not replacing it. You can always see what changed and roll back. Each transformer block makes tracked changes to the representation — the original signal is never lost.

Without residual connections, training 96 layers would be nearly impossible — gradients would vanish long before reaching the early blocks. The residual connection is the reason that "more layers = more capable" actually works at scale.

Further reading

  • Deep Residual Learning (He et al., 2015) — the original ResNet paper that introduced residual connections
  • Understanding the Difficulty of Training Deep Feedforward Neural Networks (Glorot & Bengio, 2010) — the vanishing gradient problem explained
  • The Vanishing Gradient Problem (Wikipedia) — accessible overview of why deep networks struggle without skip connections

The Feed-Forward Network

Attention and FFN do two very different jobs:

  • Attention = a group meeting. Every token talks to every other token and gathers information: "who is relevant to me?" After attention, the token for "sat" now knows about "cat" and "mat."
  • FFN = individual thinking. Each token goes back to its own desk and processes what it just learned — alone, with no further communication. The token for "sat" takes its gathered context and thinks: "given that I know about cat and mat, what should my updated representation be?"

Attention is communication. FFN is computation.

Expand → Activate → Contract

The FFN is a two-layer MLP with a specific shape pattern:

Input:    [batch, seq, 768]     # d_model
Expand:   [batch, seq, 3072]    # 4 × d_model
Activate: [batch, seq, 3072]    # non-linearity
Contract: [batch, seq, 768]     # back to d_model
Input [768]: d_model dimensions
Expand [3072]: × W₁ + b₁
GELU: non-linearity
Contract [768]: × W₂ + b₂
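
A sketch of this MLP in PyTorch, using the GPT-2 sizes (768 → 3072 → 768) from above; the class and variable names are illustrative, not any library's API:

import torch
import torch.nn as nn

class FeedForward(nn.Module):
    def __init__(self, d_model=768):
        super().__init__()
        self.expand = nn.Linear(d_model, 4 * d_model)    # 768 -> 3072
        self.act = nn.GELU()                             # non-linearity
        self.contract = nn.Linear(4 * d_model, d_model)  # 3072 -> 768

    def forward(self, x):
        # Applied to every token independently — no communication between positions.
        return self.contract(self.act(self.expand(x)))

x = torch.randn(2, 16, 768)
print(FeedForward()(x).shape)    # torch.Size([2, 16, 768])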

Why Expand Then Contract?

The expanded space is a temporary workspace. With 3072 dimensions, the model can form rich intermediate combinations that couldn't be expressed in 768 dimensions. Then it contracts back — keeping only what's useful, discarding the rest.

Contraction back to d_model is not optional. It's what keeps the transformer shape-preserving: every block outputs the same dimensions it takes as input, so blocks can stack indefinitely.

FFN as Knowledge Storage

Research from Geva et al. (2021) showed that FFN layers function as key-value memory banks. The first weight matrix W₁ acts as keys — patterns to match. The second matrix W₂ acts as values — what information to retrieve when a pattern matches.

Factual associations like "Paris is the capital of France" are stored in FFN neurons. Attention routes information; FFN stores and retrieves it.

Activation Functions: ReLU → GELU → SwiGLU

The activation function sits between the expand and contract steps. It decides which signals pass through and which get suppressed. Toggle the curves below to compare:

ReLU: max(0, x), sharp cutoff at 0
GELU: smooth curve, lets small negatives through
SwiGLU: learned gating built on Swish/SiLU (x × sigmoid(x))

Each generation solved a problem with the previous one:

  • ReLU (2012) — the breakthrough that made deep learning work. Everything negative becomes exactly zero: max(0, x). Fast and simple. But it has a flaw: whenever a neuron's input is negative, its output and its gradient are exactly zero — a neuron stuck in that region "dies" and stops learning. In a 96-layer model, many neurons die.

  • GELU (2016, used in GPT/BERT) — fixes the "dying neuron" problem. Instead of a hard cutoff at zero, it smoothly curves — small negative values still get through slightly. This means gradients are never exactly zero, so every neuron keeps learning. Toggle ReLU off and GELU on in the graph above to see the smooth transition near x=0.

  • SwiGLU (2020, used in Llama/Mistral/PaLM) — fixes a different problem: in ReLU and GELU, the activation function is fixed — it applies the same transformation regardless of input. SwiGLU adds a learned gate: a second branch of weights decides which dimensions to keep and which to suppress, adapting per input. The model can learn "for this token, dimensions 5–10 matter; for that token, dimensions 50–60 matter." More parameters, but measurably better quality (see the sketch after this list).
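
A sketch of a SwiGLU-style FFN as used in Llama-family models — the class and weight names here (SwiGLUFFN, w_gate, w_up, w_down) are illustrative, not the exact Llama code. The gate branch passes through SiLU (x × sigmoid(x)) and scales the value branch element-wise; the hidden size is typically shrunk to roughly 8/3 × d_model so the third matrix doesn't inflate the parameter count:

import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFFN(nn.Module):
    def __init__(self, d_model, d_hidden):
        super().__init__()
        self.w_gate = nn.Linear(d_model, d_hidden, bias=False)   # gate branch
        self.w_up = nn.Linear(d_model, d_hidden, bias=False)     # value branch
        self.w_down = nn.Linear(d_hidden, d_model, bias=False)   # contract back

    def forward(self, x):
        # SiLU(gate) * value: a learned, per-dimension gate instead of a fixed curve
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))

x = torch.randn(2, 16, 768)
print(SwiGLUFFN(768, 2048)(x).shape)    # torch.Size([2, 16, 768])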

The FFN contains roughly 2/3 of a transformer's total parameters. In GPT-3, the two FFN weight matrices per block dominate the 175B parameter count. This layer is where the model's "knowledge" is stored — the attention mechanism retrieves context, but FFN is the library.

Pre-Norm vs Post-Norm

There are two ways to place Layer Normalization inside a residual block. The ordering looks like a minor implementation detail, but at 96 layers deep it determines whether training converges at all.

Two Orderings

Post-Norm (original 2017 paper):

x = LN(x + sublayer(x))   # normalize after residual add

Pre-Norm (modern standard):

x = x + sublayer(LN(x))   # normalize before sublayer

The difference: in Post-Norm, the normalization sits outside the residual addition and sees the combined signal. In Pre-Norm, normalization sits inside — the residual path bypasses it entirely.

Why Pre-Norm Won

In Pre-Norm, the residual connection carries the unmodified input directly to the addition — LN never touches the highway. Gradient magnitude is preserved across every block.

In Post-Norm, every gradient must pass through the normalization operation at each layer boundary. This destabilizes training at depth, requiring careful learning rate warm-up schedules and lower initial rates. At 96 layers, the instability compounds.

Pre-Norm trains reliably without special warm-up schedules. Every major modern LLM — GPT-2, GPT-3, Llama, Mistral, Falcon, Claude — uses Pre-Norm.

Karpathy's Code

In nanoGPT, Karpathy writes it directly:

x = x + self.attn(self.ln_1(x))
x = x + self.mlp(self.ln_2(x))

ln_1 and ln_2 are called inside the sub-layer calls — LN runs first, then the sub-layer, then the residual add. That's Pre-Norm. The names ln_1 and ln_2 reflect that there are two norms per block, one per sub-layer.
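
Fleshed out into a full module, the block looks roughly like this — a sketch in the spirit of nanoGPT's Block class, assuming GPT-2-small sizes (d_model=768, 12 heads) and using torch.nn.MultiheadAttention as a stand-in for nanoGPT's custom causal-attention class:

import torch
import torch.nn as nn

class Block(nn.Module):
    def __init__(self, d_model=768, n_heads=12):
        super().__init__()
        self.ln_1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln_2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(                      # expand -> GELU -> contract
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x):
        # Causal mask: True marks positions a token may NOT attend to (the future).
        seq = x.size(1)
        mask = torch.triu(torch.ones(seq, seq, dtype=torch.bool, device=x.device), 1)
        h = self.ln_1(x)                               # Pre-Norm: LN runs first
        x = x + self.attn(h, h, h, attn_mask=mask, need_weights=False)[0]
        x = x + self.mlp(self.ln_2(x))                 # second norm, then residual add
        return x

x = torch.randn(2, 16, 768)
print(Block()(x).shape)    # torch.Size([2, 16, 768])

Stack some number of these modules in an nn.ModuleList and you have the body of a GPT — 12 copies for GPT-2 small, 96 for GPT-3.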

Toggle the Pre-Norm / Post-Norm switch in the right panel to see where the normalization boxes appear in the data flow.

A seemingly trivial ordering difference — normalize before or after — but at 96 layers deep it determines whether the model trains at all. Pre-Norm is now the universal standard; Post-Norm is a historical curiosity.

Modern Variants & Scale

Modern Variants & Scale

The core transformer block has not changed since "Attention Is All You Need" in 2017. What has changed are the components inside it — each swap improving quality, speed, or both.

What Changed: 2017 → Today

Five components swapped out, same overall structure:

  • Normalization: LayerNorm (γ, β) → RMSNorm (γ only) — simpler, faster, same quality
  • FFN activation: ReLU → SwiGLU (gated) — learned gating, better quality
  • Position encoding: Sinusoidal (absolute) → RoPE (rotary, relative) — handles longer sequences
  • Attention: Standard (O(n²) memory) → Flash Attention (IO-aware) — 2-5× faster, less memory
  • Norm placement: Post-Norm → Pre-Norm — trains stably at 96+ layers

RoPE (Rotary Position Embedding) encodes relative position by rotating query and key vectors — it naturally extrapolates to longer sequences than seen during training. Covered in the Embeddings module.

Flash Attention is not a new algorithm — it computes the same attention scores — but it reorganizes memory access to tile computations within GPU SRAM instead of writing intermediate results to HBM. The result: 2–5× faster, far less memory. Every production LLM uses it.
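
In PyTorch 2.x, for example, these fused kernels are reachable through torch.nn.functional.scaled_dot_product_attention, which computes the standard attention result but dispatches to a Flash-Attention-style implementation when hardware and dtypes allow — a minimal sketch:

import torch
import torch.nn.functional as F

# q, k, v: [batch, n_heads, seq, head_dim]
q = torch.randn(1, 12, 1024, 64)
k = torch.randn(1, 12, 1024, 64)
v = torch.randn(1, 12, 1024, 64)

# Same math as softmax(q @ k.T / sqrt(d)) @ v, but computed tile-by-tile
# in fast on-chip memory when a fused kernel is available.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)    # torch.Size([1, 12, 1024, 64])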

Weight tying shares the embedding matrix and the output projection matrix (they are both d_vocab × d_model). The output projection maps the final hidden state back to logits over the vocabulary. Using the same weights as the input embedding eliminates one full copy of that large matrix — saving hundreds of millions of parameters in large models.
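
A minimal sketch of the idea (the variable names tok_emb and lm_head are illustrative; nanoGPT does the equivalent with its wte embedding and lm_head):

import torch.nn as nn

vocab_size, d_model = 50257, 768
tok_emb = nn.Embedding(vocab_size, d_model)            # input: token id -> vector
lm_head = nn.Linear(d_model, vocab_size, bias=False)   # output: vector -> logits

# Both weights have shape [vocab_size, d_model]; point them at the same tensor.
lm_head.weight = tok_emb.weight

print(lm_head.weight is tok_emb.weight)   # True — one matrix, two roles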

Parameter Counting

A single transformer block contains approximately 12d² parameters (where d is d_model):

  • Attention: Q, K, V, O projection matrices → ~4d²
  • FFN: two weight matrices with 4× expansion → ~8d²

For GPT-3 (d = 12288):

12 × 12288² ≈ 1.81B parameters per block
1.81B × 96 blocks ≈ 174B parameters

The remaining ~1B comes from the token and position embeddings (layer norms contribute a negligible amount) — the blocks dominate.
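
The same arithmetic as a quick sanity check, using only the sizes quoted above:

d_model, n_blocks = 12288, 96          # GPT-3 sizes

attn = 4 * d_model**2                  # Q, K, V, O projections
ffn = 2 * (d_model * 4 * d_model)      # W1 (d -> 4d) and W2 (4d -> d)
per_block = attn + ffn                 # = 12 * d_model**2

print(per_block / 1e9)                 # ≈ 1.81B per block
print(per_block * n_blocks / 1e9)      # ≈ 174B across all blocks; GPT-3 is ~175B total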

The transformer block's architecture has been stable since 2017. What changed were the components within: RMSNorm for speed, SwiGLU for quality, RoPE for length generalization, Flash Attention for efficiency. The two-line structure remains unchanged.

Further Reading

  • Andrej Karpathy — nanoGPT — minimal GPT implementation; the model.py Block class is 20 lines
  • Jay Alammar — The Illustrated GPT-2 — visual walk-through of GPT-2's block structure and parameter shapes
  • Lilian Weng — The Transformer Family v2 — comprehensive survey of architectural variants through 2023