GPU Memory Hierarchy — Registers to HBM
The Memory Ladder
What is the GPU Memory Hierarchy?
The GPU memory hierarchy is a series of storage levels, each with a different tradeoff between speed and size. Data that's closer to the compute cores is faster to access but much smaller. Data that's farther away is larger but slower.
Understanding this hierarchy is the key to GPU performance — most "slow" GPU code is slow because it accesses the wrong memory level too often.
The Building Analogy
Think of GPU memory like floors in a building:
- Registers — your desk. Instant access, but you can only keep a few things on it.
- Shared memory — the office supply room down the hall. Fast walk, shared with your team (block).
- L2 cache — the floor's storage closet. Automatic — the system decides what goes here.
- HBM (Global memory) — the warehouse across town. Huge capacity, but driving there takes time.
- PCIe / NVLink — ordering from another city (CPU) or a neighbor's warehouse (another GPU).
Each level down: bigger capacity, slower access, farther from compute.
On the right panel: Click any level in the ladder to see its size, bandwidth, and latency. Toggle the GPU selector (A100/H100/B200) to see how specs change across generations. Use the "Where does it live?" buttons to see where model weights, KV cache, and activations are stored.
Why This Matters
The speed gap between levels is enormous:
- Registers: ~0 cycles (instant)
- Shared memory: ~5 cycles
- L2 cache: ~30 cycles
- HBM: ~400 cycles
That's roughly a 400× gap between the fastest level and main memory. GPU optimization is fundamentally about keeping data in the fastest level possible and minimizing trips to HBM. The next four steps explore each level in detail.
Registers & Register Spill
Registers: The Fastest Memory
Each thread has its own private registers — the fastest memory on the GPU. When your kernel declares a variable like float x = 3.14;, that value lives in a register. Access is instant: zero extra cycles, because the compute units read directly from the register file.
Recall from Module 2: an SM (Streaming Multiprocessor) is the physical compute unit on the GPU — each SM runs multiple thread blocks. A warp is a group of 32 threads that execute in lockstep. The SM's register file is divided among all active warps, and each warp's registers stay permanently allocated — this is why warp switching has zero cost.
How Many Registers?
Modern GPUs give each thread up to 255 registers. That sounds like a lot, but the compiler uses registers for everything: local variables, intermediate calculations, loop counters, array indices. A complex kernel can easily use 60-80 registers per thread.
The Spill Problem
What happens when a kernel needs more registers than available? The overflow goes to local memory — which, despite the friendly name, is actually stored in HBM (global memory).
This is called register spill, and it's devastating for performance: a variable that should take 0 cycles to access now takes ~400 cycles.
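To make spill concrete, here is a minimal sketch of a kernel shape that typically spills (the kernel name and sizes are illustrative): a large, dynamically indexed per-thread array can't be mapped onto registers, so the compiler demotes it to local memory.

```cuda
// Illustrative spill-prone kernel (hypothetical example).
// A 64-float per-thread array indexed with a runtime value needs
// addressable storage, so ptxas places it in "local" memory (i.e., HBM).
__global__ void spill_prone(const int* idx, float* out) {
    float scratch[64];                      // heavy per-thread pressure
    for (int i = 0; i < 64; ++i)
        scratch[i] = i * 0.5f;
    out[threadIdx.x] = scratch[idx[threadIdx.x] % 64];  // dynamic index
}
```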
Try it: On the right panel, toggle between "Normal (4 variables)" and "Too many (12 variables)." Watch the overflow spill to HBM and the latency spike from 0 to 400 cycles.
Avoiding Register Spill
- Keep kernels simple — fewer variables per thread means fewer registers needed
- Reuse variables — instead of declaring new ones, overwrite variables you no longer need
- Check register usage — nvcc --ptxas-options=-v shows how many registers each kernel uses (see the sample output after this list)
- Use shared memory — if register pressure is high, store some data in shared memory instead (Step 3)
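For reference, this is roughly what the compiler reports; the exact wording varies by CUDA version, and the numbers here are illustrative:

```text
$ nvcc --ptxas-options=-v kernel.cu -c
ptxas info    : Compiling entry function '_Z6kernelPf' for 'sm_80'
ptxas info    : Function properties for _Z6kernelPf
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 40 registers, 360 bytes cmem[0]
```

Non-zero "spill stores" / "spill loads" is the telltale sign that registers overflowed into local memory.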
The tradeoff: using fewer registers per thread allows more warps to fit on the SM (higher occupancy), which improves latency hiding (Module 2, Step 4).
Shared Memory & L1 Cache
Shared Memory: Fast and Programmable
Shared memory is a fast on-chip SRAM shared by all threads in a block. Unlike registers (private to each thread) or HBM (accessible by all threads on the GPU), shared memory is scoped to one block — threads in the same block can read and write to it, but threads in different blocks cannot.
It's declared with the __shared__ keyword:
__shared__ float tile[16][16];
Why use it? Data reuse. If multiple threads in a block need the same data from HBM, you load it once into shared memory and let everyone read from there. One HBM read instead of many.
The Cooperative Loading Pattern
The key pattern for shared memory is cooperative loading:
- Each thread loads one element from HBM into shared memory
- Call __syncthreads() — a barrier that waits until every thread in the block has finished loading
- All threads read from shared memory — fast, reusable, no HBM trips (see the sketch after this list)
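In code, the pattern looks like this; a minimal sketch (kernel name and tile size are illustrative) of a block cooperatively staging data, then reusing it:

```cuda
// Each thread loads one element, the block synchronizes, then every
// thread reads its neighbors from fast on-chip shared memory.
__global__ void smooth_1d(const float* in, float* out, int n) {
    __shared__ float tile[256];            // assumes blockDim.x == 256
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    if (i < n)
        tile[threadIdx.x] = in[i];         // 1. one HBM read per thread
    __syncthreads();                       // 2. wait for the whole block

    // 3. reuse from shared memory (interior threads only, for brevity)
    if (i < n && threadIdx.x > 0 && threadIdx.x < blockDim.x - 1)
        out[i] = (tile[threadIdx.x - 1] + tile[threadIdx.x] +
                  tile[threadIdx.x + 1]) / 3.0f;
}
```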
Try it: On the right panel, click "Play animation" to watch the load → sync → compute phases. Notice how data flows from HBM to shared memory once, then threads read from shared memory many times. The bandwidth counter shows: 16 HBM reads → 1 HBM read.
The L1 Cache / Shared Memory Tradeoff
Here's something surprising: shared memory and L1 cache are the same physical SRAM. On modern NVIDIA GPUs, each SM has ~96-256 KB of on-chip SRAM that's dynamically partitioned between shared memory (you control) and L1 cache (automatic, hardware-managed).
Think of it like a desk with a fixed surface area. You can use more space for your notebook (shared memory — you decide what goes there) or leave it clear for papers the system automatically places there (L1 cache). But you can't have both at full size.
Try it: Drag the SRAM partition slider on the right panel. Watch how increasing shared memory shrinks the L1 cache and vice versa. The total always adds up to the same amount.
Why this matters: if your kernel uses lots of shared memory, the L1 cache shrinks, which can slow down other memory accesses that rely on caching. It's a tradeoff you need to be aware of when tuning performance.
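CUDA lets you express a preference for the split as a per-kernel hint via cudaFuncSetAttribute. A minimal sketch, where my_kernel is a placeholder for your own kernel:

```cuda
// Advisory hint: dedicate ~50% of the SM's unified SRAM to shared
// memory for this kernel, leaving the rest as L1 cache. The hardware
// rounds to the nearest split it actually supports.
cudaFuncSetAttribute(my_kernel,   // hypothetical kernel symbol
                     cudaFuncAttributePreferredSharedMemoryCarveout,
                     50);         // percent of the unified SRAM
```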
HBM: Where Your Model Lives
HBM (High Bandwidth Memory) is the GPU's main memory. It's called "high bandwidth" because it's much faster than CPU RAM (~50 GB/s) — an A100's HBM delivers ~2 TB/s, and a B200 reaches ~8 TB/s.
But despite the name, HBM is still 10-100× slower than shared memory. It's the biggest memory on the GPU, but every access takes ~400 cycles. GPU optimization is largely about minimizing HBM accesses.
What Lives in HBM?
Everything big:
- Model weights — a 7B parameter model at FP16 (2 bytes per parameter) = 14 GB. A 70B model = 140 GB.
- KV cache — stores key/value pairs from previous tokens during generation. Grows with sequence length and batch size. Can easily reach 10-30 GB.
- Activations — intermediate results during the forward pass. Proportional to batch size and model width.
During computation, tiles of this data are loaded into shared memory or registers for fast access. This is the cooperative loading pattern from Step 3 — and the tiling optimization in Module 6.
Will My Model Fit?
This is the most practical question GPU engineers ask. Let's calculate:
Llama 2 7B at FP16:
- Weights: 7B × 2 bytes = 14 GB
- KV cache (seq=2048, batch=32): ~8 GB
- Activations: ~4 GB
- Total: ~26 GB → fits on A100 (80 GB) with room to spare
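The same arithmetic as a back-of-envelope host-side check. The KV-cache constants below (32 layers, 8 KV heads of dimension 128, FP16) are assumptions chosen to roughly reproduce the ~8 GB figure; the right values depend on the model's attention configuration:

```cuda
// Rough "will it fit?" check (plain host code; shapes are assumptions).
#include <cstdio>

int main() {
    double weights_gb = 7e9 * 2 / 1e9;        // 7B params x 2 bytes (FP16)
    long long tokens  = 2048LL * 32;          // seq_len x batch
    // Per-token KV bytes: 2 tensors x layers x kv_heads x head_dim x 2 bytes
    double kv_per_token = 2.0 * 32 * 8 * 128 * 2;
    double kv_gb = kv_per_token * tokens / 1e9;
    printf("weights %.0f GB + KV cache %.1f GB\n", weights_gb, kv_gb);
    return 0;   // prints: weights 14 GB + KV cache 8.6 GB
}
```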
Llama 2 70B at FP16:
- Weights: 70B × 2 bytes = 140 GB
- That alone doesn't fit on a single A100 (80 GB) or H100 (80 GB)
- Options: multi-GPU (Distributed Training track) or quantization (LLM Internals track)
Try it: On the right panel, toggle between 7B/13B/70B and watch the stacked bar fill up. Drag the sequence length slider to see KV cache grow. Try 70B on A100 — it overflows. Switch to B200 (192 GB) — it fits.
HBM Bandwidth: The Real Bottleneck
For LLM inference, HBM bandwidth — not compute — is usually the bottleneck. During token generation (decode phase), the GPU reads the entire model weights for each token but does very little math per byte loaded. This makes decode memory-bandwidth-bound.
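A quick sanity check of why: generating one token with a 7B FP16 model means streaming ~14 GB of weights from HBM. At the A100's ~2 TB/s, that read alone takes ~7 ms, capping single-stream decode at roughly 140 tokens/s no matter how fast the ALUs are. Batching amortizes the weight reads across many sequences, which is why serving systems batch aggressively.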
Module 4 (Roofline Model) will formalize this with the compute-bound vs memory-bound framework.
NVLink & PCIe
Connecting GPUs to the World
So far we've talked about memory inside one GPU. But how does the GPU connect to the CPU, and to other GPUs?
Two links matter:
PCIe: GPU ↔ CPU
PCIe (Peripheral Component Interconnect Express) connects the GPU to the CPU over the motherboard. This is how data gets onto the GPU in the first place — recall the Host-Device pattern from Module 2: cudaMemcpy copies data over this link.
PCIe 5.0 delivers ~128 GB/s (H100, B200). Older PCIe 4.0 delivers ~64 GB/s (A100). Fast for a bus, but slow compared to HBM (~2-8 TB/s). This is why you want to minimize CPU↔GPU data transfers.
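As a refresher, here is the Host-Device copy in code; a minimal sketch with illustrative buffer names and size:

```cuda
// Host-side setup: stage 1 GiB on the GPU (error checks omitted).
float *h_buf = nullptr, *d_buf = nullptr;
size_t bytes = 1ULL << 30;         // 1 GiB, illustrative
cudaMallocHost(&h_buf, bytes);     // pinned host memory: faster PCIe DMA
cudaMalloc(&d_buf, bytes);
// This copy crosses PCIe: ~8 ms at 128 GB/s. Reading the same 1 GiB
// from HBM at 2 TB/s takes ~0.5 ms, so copy once and reuse on-device.
cudaMemcpy(d_buf, h_buf, bytes, cudaMemcpyHostToDevice);
```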
NVLink: GPU ↔ GPU
NVLink is NVIDIA's direct GPU-to-GPU interconnect. Instead of routing through the CPU over PCIe, NVLink provides a high-bandwidth direct link:
- NVLink 3.0 (A100): 600 GB/s
- NVLink 4.0 (H100): 900 GB/s
- NVLink 5.0 (B200): 1,800 GB/s
That's roughly 7-14× faster than PCIe, depending on the generation. For multi-GPU workloads like distributed training, this difference is critical — GPUs need to exchange gradients, activations, and KV cache shards constantly.
On the right panel: Hover over the PCIe and NVLink links to highlight them. Toggle between A100/H100/B200 to see how interconnect bandwidth has grown across generations. Notice how NVLink bandwidth has tripled between A100 and B200.
Why Multi-GPU Communication Matters
When a model is too large for one GPU (like a 70B model), it must be split across multiple GPUs. Those GPUs need to communicate constantly:
- Tensor parallelism — splits each layer across GPUs. Every forward pass requires exchanging partial results via NVLink.
- Pipeline parallelism — splits layers across GPUs. Activations flow from one GPU to the next.
- Data parallelism — each GPU has a full copy but processes different data. Gradients are averaged via AllReduce.
All of these patterns are bottlenecked by interconnect bandwidth. Faster NVLink = faster training. The Distributed Training track covers these patterns in depth.
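A rough worked example: a ring AllReduce moves about 2× the gradient size through each GPU's links, so averaging 14 GB of FP16 gradients means ~28 GB of traffic per step. That's ~31 ms at H100's 900 GB/s NVLink, versus ~440 ms over PCIe 4.0 at 64 GB/s, on every single training step.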
This completes the memory hierarchy. The next module introduces the Roofline Model — a framework for answering "is my kernel compute-bound or memory-bound?"