GPU Memory Hierarchy — Registers to HBM
The Memory Ladder
What is the GPU Memory Hierarchy?
The GPU memory hierarchy is a series of storage levels, each with a different tradeoff between speed and size. Data that's closer to the compute cores is faster to access but much smaller. Data that's farther away is larger but slower.
Understanding this hierarchy is the key to GPU performance — most "slow" GPU code is slow because it accesses the wrong memory level too often.
The Building Analogy
Think of GPU memory like floors in a building:
- Registers — your desk. Instant access, but you can only keep a few things on it.
- Shared memory — the office supply room down the hall. Fast walk, shared with your team (block).
- L2 cache — the floor's storage closet. Automatic — the system decides what goes here.
- HBM (Global memory) — the warehouse across town. Huge capacity, but driving there takes time.
- PCIe / NVLink — ordering from another city (CPU) or a neighbor's warehouse (another GPU).
Each level down: bigger capacity, slower access, farther from compute.
On the right panel: Click any level in the ladder to see its size, bandwidth, and latency. Toggle the GPU selector (A100/H100/B200) to see how specs change across generations. Use the "Where does it live?" buttons to see where model weights, KV cache, and activations are stored.
Why This Matters
The speed gap between levels is enormous:
- Registers: ~0 cycles (instant)
- Shared memory: ~5 cycles
- L2 cache: ~30 cycles
- HBM: ~400 cycles
That's roughly a 400× gap between the fastest level and main memory. GPU optimization is fundamentally about keeping data in the fastest level possible and minimizing trips to HBM. The next four steps explore each level in detail.
Registers & Register Spill
Registers: The Fastest Memory
Each thread has its own private registers — the fastest memory on the GPU. When your kernel declares a variable like float x = 3.14;, that value lives in a register. Access is instant: zero extra cycles, because the compute units read directly from the register file.
Recall from Module 2: an SM (Streaming Multiprocessor) is the physical compute unit on the GPU — each SM runs multiple thread blocks. A warp is a group of 32 threads that execute in lockstep. The SM's register file is divided among all active warps, and each warp's registers stay permanently allocated — this is why warp switching has zero cost.
How Many Registers?
Modern GPUs give each thread up to 255 registers. That sounds like a lot, but the compiler uses registers for everything: local variables, intermediate calculations, loop counters, array indices. A complex kernel can easily use 60-80 registers per thread.
The Spill Problem
What happens when a kernel needs more registers than available? The overflow goes to local memory — which, despite the friendly name, is actually stored in HBM (global memory).
This is called register spill, and it's devastating for performance: a variable that should take 0 cycles to access now takes ~400 cycles.
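To make spill concrete, here is a minimal sketch of a kernel shape that typically spills (the kernel name and sizes are illustrative): a large, dynamically indexed per-thread array can't be mapped onto registers, so the compiler demotes it to local memory.

```cuda
// Illustrative spill-prone kernel (hypothetical example).
// A 64-float per-thread array indexed with a runtime value needs
// addressable storage, so ptxas places it in "local" memory (i.e., HBM).
__global__ void spill_prone(const int* idx, float* out) {
    float scratch[64];                      // heavy per-thread pressure
    for (int i = 0; i < 64; ++i)
        scratch[i] = i * 0.5f;
    out[threadIdx.x] = scratch[idx[threadIdx.x] % 64];  // dynamic index
}
```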
Try it: On the right panel, toggle between "Normal (4 variables)" and "Too many (12 variables)." Watch the overflow spill to HBM and the latency spike from 0 to 400 cycles.
Avoiding Register Spill
- Keep kernels simple — fewer variables per thread means fewer registers needed
- Reuse variables — instead of declaring new ones, overwrite variables you no longer need
- Check register usage — nvcc --ptxas-options=-v shows how many registers each kernel uses (see the sample output after this list)
- Use shared memory — if register pressure is high, store some data in shared memory instead (Step 3)
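For reference, this is roughly what the compiler reports; the exact wording varies by CUDA version, and the numbers here are illustrative:

```text
$ nvcc --ptxas-options=-v kernel.cu -c
ptxas info    : Compiling entry function '_Z6kernelPf' for 'sm_80'
ptxas info    : Function properties for _Z6kernelPf
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 40 registers, 360 bytes cmem[0]
```

Non-zero "spill stores" / "spill loads" is the telltale sign that registers overflowed into local memory.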
The tradeoff: using fewer registers per thread allows more warps to fit on the SM (higher occupancy), which improves latency hiding (Module 2, Step 4).
Shared Memory & L1 Cache
Shared Memory: Fast and Programmable
Shared memory is a fast on-chip SRAM shared by all threads in a block. Unlike registers (private to each thread) or HBM (accessible by all threads on the GPU), shared memory is scoped to one block — threads in the same block can read and write to it, but threads in different blocks cannot.
It's declared with the __shared__ keyword:
__shared__ float tile[16][16];
Why use it? Data reuse. If multiple threads in a block need the same data from HBM, you load it once into shared memory and let everyone read from there. One HBM read instead of many.
The Cooperative Loading Pattern
The key pattern for shared memory is cooperative loading:
- Each thread loads one element from HBM into shared memory
- Call __syncthreads() — a barrier that waits until every thread in the block has finished loading
- All threads read from shared memory — fast, reusable, no HBM trips (see the sketch after this list)
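In code, the pattern looks like this; a minimal sketch (kernel name and tile size are illustrative) of a block cooperatively staging data, then reusing it:

```cuda
// Each thread loads one element, the block synchronizes, then every
// thread reads its neighbors from fast on-chip shared memory.
__global__ void smooth_1d(const float* in, float* out, int n) {
    __shared__ float tile[256];            // assumes blockDim.x == 256
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    if (i < n)
        tile[threadIdx.x] = in[i];         // 1. one HBM read per thread
    __syncthreads();                       // 2. wait for the whole block

    // 3. reuse from shared memory (interior threads only, for brevity)
    if (i < n && threadIdx.x > 0 && threadIdx.x < blockDim.x - 1)
        out[i] = (tile[threadIdx.x - 1] + tile[threadIdx.x] +
                  tile[threadIdx.x + 1]) / 3.0f;
}
```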
Try it: On the right panel, click "Play animation" to watch the load → sync → compute phases. Notice how data flows from HBM to shared memory once, then threads read from shared memory many times. The bandwidth counter shows: 16 HBM reads → 1 HBM read.
The L1 Cache / Shared Memory Tradeoff
Here's something surprising: shared memory and L1 cache are the same physical SRAM. On modern NVIDIA GPUs, each SM has ~96-256 KB of on-chip SRAM that's dynamically partitioned between shared memory (you control) and L1 cache (automatic, hardware-managed).
Think of it like a desk with a fixed surface area. You can use more space for your notebook (shared memory — you decide what goes there) or leave it clear for papers the system automatically places there (L1 cache). But you can't have both at full size.
Try it: Drag the SRAM partition slider on the right panel. Watch how increasing shared memory shrinks the L1 cache and vice versa. The total always adds up to the same amount.
Why this matters: if your kernel uses lots of shared memory, the L1 cache shrinks, which can slow down other memory accesses that rely on caching. It's a tradeoff you need to be aware of when tuning performance.
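CUDA lets you express a preference for the split as a per-kernel hint via cudaFuncSetAttribute. A minimal sketch, where my_kernel is a placeholder for your own kernel:

```cuda
// Advisory hint: dedicate ~50% of the SM's unified SRAM to shared
// memory for this kernel, leaving the rest as L1 cache. The hardware
// rounds to the nearest split it actually supports.
cudaFuncSetAttribute(my_kernel,   // hypothetical kernel symbol
                     cudaFuncAttributePreferredSharedMemoryCarveout,
                     50);         // percent of the unified SRAM
```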
HBM: Where Your Model Lives
HBM (High Bandwidth Memory) is the GPU's main memory. It's called "high bandwidth" because it's much faster than CPU RAM (~50 GB/s) — an A100's HBM delivers ~2 TB/s, and a B200 reaches ~8 TB/s.
But despite the name, HBM is still 10-100× slower than shared memory. It's the biggest memory on the GPU, but every access takes ~400 cycles. GPU optimization is largely about minimizing HBM accesses.
What Lives in HBM?
Everything big:
- Model weights — a 7B parameter model at FP16 (2 bytes per parameter) = 14 GB. A 70B model = 140 GB.
- KV cache — stores key/value pairs from previous tokens during generation. Grows with sequence length and batch size. Can easily reach 10-30 GB.
- Activations — intermediate results during the forward pass. Proportional to batch size and model width.
During computation, tiles of this data are loaded into shared memory or registers for fast access. This is the cooperative loading pattern from Step 3 — and the tiling optimization in Module 6.
Will My Model Fit?
This is the most practical question GPU engineers ask. Let's calculate:
Llama 2 7B at FP16:
- Weights: 7B × 2 bytes = 14 GB
- KV cache (seq=2048, batch=32): ~8 GB
- Activations: ~4 GB
- Total: ~26 GB → fits on A100 (80 GB) with room to spare
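The same arithmetic as a back-of-envelope host-side check. The KV-cache constants below (32 layers, 8 KV heads of dimension 128, FP16) are assumptions chosen to roughly reproduce the ~8 GB figure; the right values depend on the model's attention configuration:

```cuda
// Rough "will it fit?" check (plain host code; shapes are assumptions).
#include <cstdio>

int main() {
    double weights_gb = 7e9 * 2 / 1e9;        // 7B params x 2 bytes (FP16)
    long long tokens  = 2048LL * 32;          // seq_len x batch
    // Per-token KV bytes: 2 tensors x layers x kv_heads x head_dim x 2 bytes
    double kv_per_token = 2.0 * 32 * 8 * 128 * 2;
    double kv_gb = kv_per_token * tokens / 1e9;
    printf("weights %.0f GB + KV cache %.1f GB\n", weights_gb, kv_gb);
    return 0;   // prints: weights 14 GB + KV cache 8.6 GB
}
```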
Llama 2 70B at FP16:
- Weights: 70B × 2 bytes = 140 GB
- That alone doesn't fit on a single A100 (80 GB) or H100 (80 GB)
- Options: multi-GPU (Distributed Training track) or quantization (LLM Internals track)
Try it: On the right panel, toggle between 7B/13B/70B and watch the stacked bar fill up. Drag the sequence length slider to see KV cache grow. Try 70B on A100 — it overflows. Switch to B200 (192 GB) — it fits.
HBM Bandwidth: The Real Bottleneck
For LLM inference, HBM bandwidth — not compute — is usually the bottleneck. During token generation (decode phase), the GPU reads the entire model weights for each token but does very little math per byte loaded. This makes decode memory-bandwidth-bound.
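A quick sanity check of why: generating one token with a 7B FP16 model means streaming ~14 GB of weights from HBM. At the A100's ~2 TB/s, that read alone takes ~7 ms, capping single-stream decode at roughly 140 tokens/s no matter how fast the ALUs are. Batching amortizes the weight reads across many sequences, which is why serving systems batch aggressively.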
Module 4 (Roofline Model) will formalize this with the compute-bound vs memory-bound framework.
NVLink & PCIe
Connecting GPUs to the World
So far we've talked about memory inside one GPU. But how does the GPU connect to the CPU, and to other GPUs?
Two links matter:
PCIe: GPU ↔ CPU
PCIe (Peripheral Component Interconnect Express) connects the GPU to the CPU over the motherboard. This is how data gets onto the GPU in the first place — recall the Host-Device pattern from Module 2: cudaMemcpy copies data over this link.
PCIe 5.0 delivers ~128 GB/s (H100, B200). Older PCIe 4.0 delivers ~64 GB/s (A100). Fast for a bus, but slow compared to HBM (~2-8 TB/s). This is why you want to minimize CPU↔GPU data transfers.
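As a refresher, here is the Host-Device copy in code; a minimal sketch with illustrative buffer names and size:

```cuda
// Host-side setup: stage 1 GiB on the GPU (error checks omitted).
float *h_buf = nullptr, *d_buf = nullptr;
size_t bytes = 1ULL << 30;         // 1 GiB, illustrative
cudaMallocHost(&h_buf, bytes);     // pinned host memory: faster PCIe DMA
cudaMalloc(&d_buf, bytes);
// This copy crosses PCIe: ~8 ms at 128 GB/s. Reading the same 1 GiB
// from HBM at 2 TB/s takes ~0.5 ms, so copy once and reuse on-device.
cudaMemcpy(d_buf, h_buf, bytes, cudaMemcpyHostToDevice);
```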
NVLink: GPU ↔ GPU
NVLink is NVIDIA's direct GPU-to-GPU interconnect. Instead of routing through the CPU over PCIe, NVLink provides a high-bandwidth direct link:
- NVLink 3.0 (A100): 600 GB/s
- NVLink 4.0 (H100): 900 GB/s
- NVLink 5.0 (B200): 1,800 GB/s
That's roughly 7-14× faster than PCIe, depending on the generation. For multi-GPU workloads like distributed training, this difference is critical — GPUs need to exchange gradients, activations, and KV cache shards constantly.
On the right panel: Hover over the PCIe and NVLink links to highlight them. Toggle between A100/H100/B200 to see how interconnect bandwidth has grown across generations. Notice how NVLink bandwidth has tripled between A100 and B200.
Why Multi-GPU Communication Matters
When a model is too large for one GPU (like a 70B model), it must be split across multiple GPUs. Those GPUs need to communicate constantly:
- Tensor parallelism — splits each layer across GPUs. Every forward pass requires exchanging partial results via NVLink.
- Pipeline parallelism — splits layers across GPUs. Activations flow from one GPU to the next.
- Data parallelism — each GPU has a full copy but processes different data. Gradients are averaged via AllReduce.
All of these patterns are bottlenecked by interconnect bandwidth. Faster NVLink = faster training. The Distributed Training track covers these patterns in depth.
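A rough worked example: a ring AllReduce moves about 2× the gradient size through each GPU's links, so averaging 14 GB of FP16 gradients means ~28 GB of traffic per step. That's ~31 ms at H100's 900 GB/s NVLink, versus ~440 ms over PCIe 4.0 at 64 GB/s, on every single training step.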
This completes the memory hierarchy. The next module introduces the Roofline Model — a framework for answering "is my kernel compute-bound or memory-bound?"