
Roofline Model — Compute-Bound vs Memory-Bound

The Bottleneck Question

What is the Roofline Model?

The roofline model is a visual framework for understanding what limits your GPU kernel's performance. Every kernel hits one of two ceilings: compute (how fast the cores can crunch numbers) or memory bandwidth (how fast data arrives from HBM).

Arithmetic intensity — the ratio of computation to memory traffic — determines which ceiling you hit. It's the single most important number for predicting GPU performance.

[Figure: roofline chart — performance (y-axis) vs. arithmetic intensity in FLOPs/byte (x-axis), with the memory-bound region left of the ridge point and the compute-bound region to its right]

The Factory Analogy

Imagine a factory with two constraints:

  • Workers can do 1,000 assembly steps per hour (compute capacity)
  • The conveyor belt delivers 100 parts per hour (memory bandwidth)

If each part requires 1 assembly step, workers finish each part instantly and idle waiting for the next delivery. The conveyor belt is the bottleneck. You're memory-bound (intensity = 1 step/part).

If each part requires 100 assembly steps (cutting, welding, polishing...), the conveyor easily keeps up — parts pile up waiting for workers to finish. Workers are the bottleneck. You're compute-bound (intensity = 100 steps/part).

The ratio of work per part delivered is your arithmetic intensity. High ratio → compute-bound. Low ratio → memory-bound.

The Formula

FLOPs (Floating-point Operations) are the basic math operations GPUs perform — additions and multiplications on decimal numbers. More FLOPs = more computation.

Arithmetic intensity is:

Arithmetic Intensity = FLOPs ÷ Bytes loaded

Example: Matrix multiply C = A × B (both 1024 × 1024, FP32)

  • FLOPs: 2 × 1024³ ≈ 2.1 billion (each output element takes 1024 multiply-adds)
  • Bytes: 3 × 1024² × 4 B ≈ 12.6 MB (load A + load B + store C, FP32)
  • Arithmetic intensity: ≈ 170 FLOPs/byte → compute-bound

Example: Vector addition C[i] = A[i] + B[i]

  • FLOPs: 1 per element (just one addition)
  • Bytes: 12 per element (read A: 4 B, read B: 4 B, write C: 4 B)
  • Arithmetic intensity: 1 ÷ 12 ≈ 0.08 FLOPs/byte → memory-bound

Same GPU. Same code pattern. Completely different bottleneck.

On the right panel: Toggle between "Matrix Multiply" and "Vector Add." Watch how the compute and memory bars flip — and how the arithmetic intensity changes from 170 to 0.08.
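To make the arithmetic concrete, here is a minimal Python sketch (the function and variable names are ours, purely illustrative) that reproduces both numbers:

```python
def arithmetic_intensity(flops: float, bytes_moved: float) -> float:
    """Arithmetic intensity = FLOPs / bytes moved between HBM and the chip."""
    return flops / bytes_moved

# Matrix multiply: C = A x B, both 1024 x 1024, FP32 (4 bytes per element)
N = 1024
matmul_flops = 2 * N**3       # one multiply + one add per inner-product term
matmul_bytes = 3 * N**2 * 4   # load A, load B, store C
print(arithmetic_intensity(matmul_flops, matmul_bytes))  # ~170.7 -> compute-bound

# Vector addition: C[i] = A[i] + B[i], FP32, counted per element
print(arithmetic_intensity(1, 12))  # 1 add / 12 bytes ~ 0.083 -> memory-bound
```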

Reading the Roofline Plot

The Roofline Chart

Recall that FLOPs are a count — how many operations. GFLOPS (Giga-FLOPS) measures speed — billions of floating-point operations per second. A GPU rated at 312 TFLOPS can do 312 trillion FLOPs every second.

The roofline model plots achievable performance (y-axis, in GFLOPS) against arithmetic intensity (x-axis, in FLOPs/byte). Both axes use a log scale.

Two lines form the "roof":

The Memory Ramp

When arithmetic intensity is low, memory bandwidth limits you. Performance scales linearly with arithmetic intensity:

Performance = Bandwidth × Arithmetic Intensity

On a log-log chart, this is a straight line with slope 1. Higher bandwidth shifts the whole ramp up, so memory-bound operations run faster.

The Compute Ceiling

When arithmetic intensity is high enough, you've saturated the compute units. More data reuse doesn't help — you're already crunching as fast as the hardware allows:

Performance = Peak FLOPS

This is a flat horizontal line. No matter how high the arithmetic intensity goes, you can't exceed peak.

The Ridge Point

Where the two lines meet is the ridge point — the minimum arithmetic intensity needed to fully utilize the compute hardware.

For the A100: 312 TFLOPS ÷ 2.0 TB/s = 156 FLOPs/byte.

  • Left of the ridge (intensity < 156): memory-bound. Optimize memory access patterns, use faster memory.
  • Right of the ridge (intensity > 156): compute-bound. Use Tensor Cores, lower precision.
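The whole chart reduces to one line of arithmetic: attainable performance is the lower of the two ceilings. A minimal sketch using the A100 numbers above (names are illustrative):

```python
A100_PEAK_TFLOPS = 312  # FP16 Tensor Core peak
A100_BW_TBPS = 2.0      # HBM bandwidth

def attainable_tflops(ai: float, peak_tflops: float, bw_tbps: float) -> float:
    """Roofline: capped by the memory ramp or the compute ceiling, whichever is lower."""
    return min(peak_tflops, bw_tbps * ai)  # TB/s * FLOPs/byte = TFLOPS

print(A100_PEAK_TFLOPS / A100_BW_TBPS)                          # ridge: 156 FLOPs/byte
print(attainable_tflops(0.08, A100_PEAK_TFLOPS, A100_BW_TBPS))  # 0.16 TFLOPS (memory-bound)
print(attainable_tflops(170, A100_PEAK_TFLOPS, A100_BW_TBPS))   # 312 TFLOPS (compute-bound)
```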

On the right panel: Watch the roofline chart build up step by step — first the axes, then the memory ramp, compute ceiling, and ridge point. After it builds, hover anywhere on the chart to see the arithmetic intensity and performance values at that point.

Where ML Operations Fall

Where Do ML Operations Fall?

Now that you can read the roofline chart, let's place real ML operations on it. Each operation has a different arithmetic intensity — and therefore a different bottleneck.

Matrix Multiply (≈ 170 FLOPs/byte)

The backbone of neural networks. For a weight matrix W (D×D) and input X (B×D):

  • FLOPs: 2 × B × D × D (multiply-accumulate for every output element)
  • Bytes: load W + load X + store output

High data reuse — each loaded weight participates in many multiplications. With typical batch sizes, matmul lands solidly in the compute-bound region.

Attention QK^T (≈ 114 FLOPs/byte)

Another matrix multiply — Q (B×D) times K^T (D×B). Arithmetic intensity depends on sequence length: longer sequences mean larger matrices and higher intensity. At ≈ 114 FLOPs/byte it sits just left of the A100's ridge point (156), so it is memory-bound there; longer sequences push it into the compute-bound region.

Softmax (≈ 5 FLOPs/byte)

Reads an entire row, computes exp() for each element, sums them, divides. A handful of FLOPs per element, but reads and writes every value. Memory-bound.

LayerNorm (≈ 3 FLOPs/byte)

Similar to softmax: reads the row, computes mean and variance, normalizes. Few FLOPs per byte of data moved. Memory-bound.

Elementwise / ReLU (≈ 0.08 FLOPs/byte)

The extreme case: roughly 1 FLOP per element, but every element must be read and written. A one-input op like ReLU moves 8 bytes per element (1 ÷ 8 ≈ 0.13 FLOPs/byte); two-input ops like the vector add above move 12 (≈ 0.08). Either way, almost all time is spent waiting for memory. Extremely memory-bound.
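As a quick sanity check, here is a sketch that classifies each operation against the A100's ridge point, using the approximate intensity figures quoted above:

```python
A100_RIDGE = 156  # FLOPs/byte (312 TFLOPS / 2.0 TB/s)

# Approximate arithmetic intensities quoted above (FLOPs/byte)
ops = {
    "matmul": 170,
    "attention QK^T": 114,
    "softmax": 5,
    "layernorm": 3,
    "elementwise/ReLU": 0.08,
}

for name, ai in ops.items():
    bound = "compute-bound" if ai >= A100_RIDGE else "memory-bound"
    print(f"{name:18s} {ai:>7.2f} FLOPs/byte -> {bound}")
```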

Try it: On the right panel, select each operation and click where you think it falls on the chart — memory-bound side or compute-bound side. The chart will tell you if you're right and place the dot at the correct position.

The Pattern

Most ML operations are memory-bound. Only matrix multiplication (and attention at long sequence lengths) is compute-bound. This is why memory bandwidth matters so much for GPU performance — and why techniques like operator fusion (Module 8) exist to reduce unnecessary memory traffic.

Prefill vs Decode

Same Model, Different Bottleneck

Here's a surprising fact: the same model layer can be compute-bound or memory-bound — depending on whether it's processing a prompt or generating a token.

Recall from the LLM Internals track: prefill processes all input tokens at once (e.g., your 512-token prompt), while decode generates one new token at a time.

Prefill: Compute-Bound

During prefill, the input to each weight matrix is a batch of 512 token vectors. The matrix multiply has:

  • Input: 512 × 4096 (prompt tokens)
  • Weights: 4096 × 4096
  • FLOPs: 2 × 512 × 4096 × 4096 ≈ 17 billion
  • Bytes: ~100 MB (weights + input + output)
  • Intensity: ≈ 170 FLOPs/byte → compute-bound

The 512-token batch means lots of FLOPs per byte of weights loaded. Prefill benefits from more TFLOPS (faster compute).

Decode: Memory-Bound

During decode, only one new token is generated at a time. The same weight matrix now sees:

  • Input: 1 × 4096 (single token)
  • Weights: 4096 × 4096 (same as before)
  • FLOPs: 2 × 1 × 4096 × 4096 ≈ 34 million
  • Bytes: ~67 MB (weights dominate)
  • Intensity: ≈ 0.5 FLOPs/byte → memory-bound

You load the entire weight matrix (~67 MB) to produce a tiny output. Almost all time is spent waiting for HBM. Decode benefits from more bandwidth (faster memory).
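Here is a sketch of the same arithmetic, assuming a 4096-wide layer with FP32 weights (names are illustrative):

```python
D = 4096             # layer width
W_BYTES = D * D * 4  # FP32 weight matrix: ~67 MB

def phase_intensity(batch: int) -> float:
    """Intensity of (batch x D) times (D x D), counting weights, input, and output."""
    flops = 2 * batch * D * D
    bytes_moved = W_BYTES + 2 * batch * D * 4  # weights + input + output, FP32
    return flops / bytes_moved

print(phase_intensity(1))    # ~0.5 FLOPs/byte -> decode, memory-bound
print(phase_intensity(512))  # ~205 FLOPs/byte -> prefill, compute-bound
# (The text's ~170 figure counts slightly more traffic; either way prefill
#  lands well to the right of the A100's ridge at 156.)
```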

Why This Matters

This explains why different hardware generations help different phases:

GPU Generation Comparison

  • A100: 312 TFLOPS compute, 2.0 TB/s memory BW → ridge 156 FLOPs/byte
  • H100: 989 TFLOPS (3.2× A100), 3.35 TB/s (1.7×) → ridge 295 FLOPs/byte
  • B200: 4,500 TFLOPS (4.6× H100), 8.0 TB/s (2.4×) → ridge 563 FLOPs/byte

In both jumps, compute grew faster than bandwidth, so the ridge point moved right and prefill benefited more. Decode still speeds up in proportion to bandwidth, though: the 2.4× bandwidth jump from H100 to B200 gives decode a major speedup.

Batching (from the LLM Internals track) is the other lever: serving 32 users simultaneously takes the batch size from 1 to 32, raising arithmetic intensity and moving decode closer to the ridge point.

On the right panel: Compare the two charts side by side — same operations, dramatically different positions. Notice how matmul drops from the compute-bound region (right) to deeply memory-bound (left) between prefill and decode.

Changing the Roof

What Changes When You Upgrade?

The roofline chart isn't fixed — it changes with the hardware. Understanding how the roof shifts helps you predict which workloads benefit most from a GPU upgrade.

Moving the Compute Ceiling

More TFLOPS → the flat ceiling moves up. Operations that were compute-bound get faster. But the ridge point also moves right:

Ridge = Peak FLOPS ÷ Bandwidth

If TFLOPS grows faster than bandwidth, the ridge point moves right — meaning more operations become memory-bound. This has been the trend across GPU generations.

Moving the Memory Ramp

More bandwidth → the memory ramp shifts up and memory-bound operations get faster. The ridge point moves left, potentially shifting some operations from memory-bound to compute-bound.

Across GPU Generations

GPU     TFLOPS (FP16)   HBM BW      Ridge Point (FLOPs/byte)
A100    312             2.0 TB/s    156
H100    989             3.35 TB/s   295
B200    4,500           8.0 TB/s    563

Notice: the ridge point keeps moving right — 156 → 295 → 563. GPUs are getting more compute-heavy relative to bandwidth. More operations become memory-bound with each generation.
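The ridge points fall straight out of the table; a quick sketch to reproduce them (specs as quoted above):

```python
# (peak FP16 TFLOPS, HBM bandwidth in TB/s), as quoted in the table above
gpus = {"A100": (312, 2.0), "H100": (989, 3.35), "B200": (4500, 8.0)}

for name, (peak_tflops, bw_tbps) in gpus.items():
    ridge = peak_tflops / bw_tbps  # TFLOPS / (TB/s) = FLOPs/byte
    print(f"{name}: ridge = {ridge:.1f} FLOPs/byte")
# A100: 156.0, H100: 295.2, B200: 562.5 (~563) -- moving right each generation
```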

The Batch-Size Lever

The most powerful knob you have isn't hardware — it's batch size. Increasing batch size raises arithmetic intensity because you reuse the same weights across more inputs.
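Here is a sketch of that lever, reusing the intensity formula from the prefill/decode section (4096-wide layer, FP32 weights, A100 ridge; all names illustrative). It shows the operation marching toward, then past, the ridge as batch size grows:

```python
RIDGE_A100 = 156  # FLOPs/byte

def matmul_intensity(batch: int, d: int = 4096, bytes_per_el: int = 4) -> float:
    """Arithmetic intensity of X (batch x d) times W (d x d)."""
    flops = 2 * batch * d * d
    bytes_moved = (d * d + 2 * batch * d) * bytes_per_el  # weights + input + output
    return flops / bytes_moved

for batch in [1, 8, 32, 128, 512]:
    ai = matmul_intensity(batch)
    side = "compute-bound" if ai >= RIDGE_A100 else "memory-bound"
    print(f"batch {batch:4d}: {ai:7.1f} FLOPs/byte ({side})")
# Crosses the A100 ridge between batch 128 and 512 (around batch ~368 here).
```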

Try it: On the right panel, drag the batch-size slider from 1 (decode) to 512 (prefill). Watch the dot cross the ridge point — the moment it transitions from memory-bound to compute-bound. Then switch GPUs and see how the ridge point shifts.

Beyond the Basic Roofline

The two-line roofline model we've learned is a simplified but powerful tool. In practice, there are additional layers:

  • Multiple memory roofs: L2 cache and shared memory have their own (higher) bandwidth ceilings — Module 6 (Tiling) shows how to exploit this.
  • Tensor Core ceilings: Tensor Cores provide a separate, higher compute ceiling for matrix math — Module 7 covers this.
  • Quantization: Reducing precision (FP16 → INT8 → INT4) lowers bytes per weight, effectively raising arithmetic intensity (see the sketch below). This is covered in the LLM Internals track's Quantization module.
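To see the quantization effect numerically, here is a minimal sketch using the decode-phase matmul from earlier (one token, 4096-wide layer), counting only weight traffic since weights dominate in decode; the setup is illustrative:

```python
D = 4096
flops = 2 * 1 * D * D  # decode: batch of one token

# Bytes per weight at each precision (activation traffic ignored: weights dominate)
for label, bytes_per_weight in [("FP16", 2), ("INT8", 1), ("INT4", 0.5)]:
    ai = flops / (D * D * bytes_per_weight)
    print(f"{label}: {ai:.0f} FLOPs/byte")
# FP16: 1, INT8: 2, INT4: 4 -- every halving of weight size doubles intensity
```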

The roofline model is the mental framework you'll use throughout the rest of this track and beyond.

