
Roofline Model — Compute-Bound vs Memory-Bound

The Bottleneck Question

What is the Roofline Model?

The roofline model is a visual framework for understanding what limits your GPU kernel's performance. Every kernel hits one of two ceilings: compute (how fast the cores can crunch numbers) or memory bandwidth (how fast data arrives from HBM).

Arithmetic intensity — the ratio of computation to memory traffic — determines which ceiling you hit. It's the single most important number for predicting GPU performance.

[Figure: roofline chart — performance (y-axis) vs. arithmetic intensity in FLOPs/byte (x-axis), with the memory-bound region left of the ridge point and the compute-bound region to its right]

The Factory Analogy

Imagine a factory with two constraints:

  • Workers can do 1,000 assembly steps per hour (compute capacity)
  • The conveyor belt delivers 100 parts per hour (memory bandwidth)

If each part requires 1 assembly step, workers finish each part instantly and idle waiting for the next delivery. The conveyor belt is the bottleneck. You're memory-bound (intensity = 1 step/part).

If each part requires 100 assembly steps (cutting, welding, polishing...), the conveyor easily keeps up — parts pile up waiting for workers to finish. Workers are the bottleneck. You're compute-bound (intensity = 100 steps/part).

The ratio of work per part delivered is your arithmetic intensity. High ratio → compute-bound. Low ratio → memory-bound.

The Formula

FLOPs (Floating-point Operations) are the basic math operations GPUs perform — additions and multiplications on decimal numbers. More FLOPs = more computation.

Arithmetic intensity is:

Arithmetic Intensity = FLOPs ÷ Bytes loaded

Example: Matrix multiply C = A × B (both 1024 × 1024, FP32)

  • FLOPs: 2 × 1024³ ≈ 2.1 billion (each output element takes 1024 multiply-adds)
  • Bytes: 3 × 1024² × 4 B ≈ 12.6 MB (load A + load B + store C, FP32)
  • Arithmetic intensity: ≈ 170 FLOPs/byte → compute-bound

Example: Vector addition C[i] = A[i] + B[i]

  • FLOPs: 1 per element (just one addition)
  • Bytes: 12 per element (read A: 4 B, read B: 4 B, write C: 4 B)
  • Arithmetic intensity: 1 ÷ 12 ≈ 0.08 FLOPs/byte → memory-bound

Same GPU. Same code pattern. Completely different bottleneck.

On the right panel: Toggle between "Matrix Multiply" and "Vector Add." Watch how the compute and memory bars flip — and how the arithmetic intensity changes from 170 to 0.08.
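To make the arithmetic concrete, here is a minimal Python sketch (the function and variable names are ours, purely illustrative) that reproduces both numbers:

```python
def arithmetic_intensity(flops: float, bytes_moved: float) -> float:
    """Arithmetic intensity = FLOPs / bytes moved between HBM and the chip."""
    return flops / bytes_moved

# Matrix multiply: C = A x B, both 1024 x 1024, FP32 (4 bytes per element)
N = 1024
matmul_flops = 2 * N**3       # one multiply + one add per inner-product term
matmul_bytes = 3 * N**2 * 4   # load A, load B, store C
print(arithmetic_intensity(matmul_flops, matmul_bytes))  # ~170.7 -> compute-bound

# Vector addition: C[i] = A[i] + B[i], FP32, counted per element
print(arithmetic_intensity(1, 12))  # 1 add / 12 bytes ~ 0.083 -> memory-bound
```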

Reading the Roofline Plot

The Roofline Chart

Recall that FLOPs are a count — how many operations. GFLOPS (Giga-FLOPS) measures speed — billions of floating-point operations per second. A GPU rated at 312 TFLOPS can do 312 trillion FLOPs every second.

The roofline model plots achievable performance (y-axis, in GFLOPS) against arithmetic intensity (x-axis, in FLOPs/byte). Both axes use a log scale.

Two lines form the "roof":

The Memory Ramp

When arithmetic intensity is low, memory bandwidth limits you. Performance scales linearly with arithmetic intensity:

Performance = Bandwidth × Arithmetic Intensity

On a log-log chart, this is a straight line with slope 1. Higher bandwidth shifts the whole ramp up, so memory-bound operations run faster.

The Compute Ceiling

When arithmetic intensity is high enough, you've saturated the compute units. More data reuse doesn't help — you're already crunching as fast as the hardware allows:

Performance = Peak FLOPS

This is a flat horizontal line. No matter how high the arithmetic intensity goes, you can't exceed peak.

The Ridge Point

Where the two lines meet is the ridge point — the minimum arithmetic intensity needed to fully utilize the compute hardware.

For the A100: 312 TFLOPS ÷ 2.0 TB/s = 156 FLOPs/byte.

  • Left of the ridge (intensity < 156): memory-bound. Optimize memory access patterns, use faster memory.
  • Right of the ridge (intensity > 156): compute-bound. Use Tensor Cores, lower precision.
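The whole chart reduces to one line of arithmetic: attainable performance is the lower of the two ceilings. A minimal sketch using the A100 numbers above (names are illustrative):

```python
A100_PEAK_TFLOPS = 312  # FP16 Tensor Core peak
A100_BW_TBPS = 2.0      # HBM bandwidth

def attainable_tflops(ai: float, peak_tflops: float, bw_tbps: float) -> float:
    """Roofline: capped by the memory ramp or the compute ceiling, whichever is lower."""
    return min(peak_tflops, bw_tbps * ai)  # TB/s * FLOPs/byte = TFLOPS

print(A100_PEAK_TFLOPS / A100_BW_TBPS)                          # ridge: 156 FLOPs/byte
print(attainable_tflops(0.08, A100_PEAK_TFLOPS, A100_BW_TBPS))  # 0.16 TFLOPS (memory-bound)
print(attainable_tflops(170, A100_PEAK_TFLOPS, A100_BW_TBPS))   # 312 TFLOPS (compute-bound)
```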

On the right panel: Watch the roofline chart build up step by step — first the axes, then the memory ramp, compute ceiling, and ridge point. After it builds, hover anywhere on the chart to see the arithmetic intensity and performance values at that point.

Where ML Operations Fall

Where Do ML Operations Fall?

Now that you can read the roofline chart, let's place real ML operations on it. Each operation has a different arithmetic intensity — and therefore a different bottleneck.

Matrix Multiply (≈ 170 FLOPs/byte)

The backbone of neural networks. For a weight matrix W (D×D) and input X (B×D):

  • FLOPs: 2 × B × D × D (multiply-accumulate for every output element)
  • Bytes: load W + load X + store output

High data reuse — each loaded weight participates in many multiplications. With typical batch sizes, matmul lands solidly in the compute-bound region.

Attention QK^T (≈ 114 FLOPs/byte)

Another matrix multiply — Q (B×D) times K^T (D×B). Arithmetic intensity depends on sequence length: longer sequences mean larger matrices and higher intensity. At ≈ 114 FLOPs/byte it sits just left of the A100's ridge point (156), so it is memory-bound there; longer sequences push it into the compute-bound region.

Softmax (≈ 5 FLOPs/byte)

Reads an entire row, computes exp() for each element, sums them, divides. A handful of FLOPs per element, but reads and writes every value. Memory-bound.

LayerNorm (≈ 3 FLOPs/byte)

Similar to softmax: reads the row, computes mean and variance, normalizes. Few FLOPs per byte of data moved. Memory-bound.

Elementwise / ReLU (≈ 0.08 FLOPs/byte)

The extreme case: roughly 1 FLOP per element, but every element must be read and written. A one-input op like ReLU moves 8 bytes per element (1 ÷ 8 ≈ 0.13 FLOPs/byte); two-input ops like the vector add above move 12 (≈ 0.08). Either way, almost all time is spent waiting for memory. Extremely memory-bound.
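As a quick sanity check, here is a sketch that classifies each operation against the A100's ridge point, using the approximate intensity figures quoted above:

```python
A100_RIDGE = 156  # FLOPs/byte (312 TFLOPS / 2.0 TB/s)

# Approximate arithmetic intensities quoted above (FLOPs/byte)
ops = {
    "matmul": 170,
    "attention QK^T": 114,
    "softmax": 5,
    "layernorm": 3,
    "elementwise/ReLU": 0.08,
}

for name, ai in ops.items():
    bound = "compute-bound" if ai >= A100_RIDGE else "memory-bound"
    print(f"{name:18s} {ai:>7.2f} FLOPs/byte -> {bound}")
```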

Try it: On the right panel, select each operation and click where you think it falls on the chart — memory-bound side or compute-bound side. The chart will tell you if you're right and place the dot at the correct position.

The Pattern

Most ML operations are memory-bound. Only matrix multiplication (and attention at long sequence lengths) is compute-bound. This is why memory bandwidth matters so much for GPU performance — and why techniques like operator fusion (Module 8) exist to reduce unnecessary memory traffic.

Prefill vs Decode

Same Model, Different Bottleneck

Here's a surprising fact: the same model layer can be compute-bound or memory-bound — depending on whether it's processing a prompt or generating a token.

Recall from the LLM Internals track: prefill processes all input tokens at once (e.g., your 512-token prompt), while decode generates one new token at a time.

Prefill: Compute-Bound

During prefill, the input to each weight matrix is a batch of 512 token vectors. The matrix multiply has:

  • Input: 512 × 4096 (prompt tokens)
  • Weights: 4096 × 4096
  • FLOPs: 2 × 512 × 4096 × 4096 ≈ 17 billion
  • Bytes: ~100 MB (weights + input + output)
  • Intensity: ≈ 170 FLOPs/byte → compute-bound

The 512-token batch means lots of FLOPs per byte of weights loaded. Prefill benefits from more TFLOPS (faster compute).

Decode: Memory-Bound

During decode, only one new token is generated at a time. The same weight matrix now sees:

  • Input: 1 × 4096 (single token)
  • Weights: 4096 × 4096 (same as before)
  • FLOPs: 2 × 1 × 4096 × 4096 ≈ 34 million
  • Bytes: ~67 MB (weights dominate)
  • Intensity: ≈ 0.5 FLOPs/byte → memory-bound

You load the entire weight matrix (~67 MB) to produce a tiny output. Almost all time is spent waiting for HBM. Decode benefits from more bandwidth (faster memory).
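Here is a sketch of the same arithmetic, assuming a 4096-wide layer with FP32 weights (names are illustrative):

```python
D = 4096             # layer width
W_BYTES = D * D * 4  # FP32 weight matrix: ~67 MB

def phase_intensity(batch: int) -> float:
    """Intensity of (batch x D) times (D x D), counting weights, input, and output."""
    flops = 2 * batch * D * D
    bytes_moved = W_BYTES + 2 * batch * D * 4  # weights + input + output, FP32
    return flops / bytes_moved

print(phase_intensity(1))    # ~0.5 FLOPs/byte -> decode, memory-bound
print(phase_intensity(512))  # ~205 FLOPs/byte -> prefill, compute-bound
# (The text's ~170 figure counts slightly more traffic; either way prefill
#  lands well to the right of the A100's ridge at 156.)
```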

Why This Matters

This explains why different hardware generations help different phases:

GPU Generation Comparison

  • A100: 312 TFLOPS compute, 2.0 TB/s memory BW → ridge 156 FLOPs/byte
  • H100: 989 TFLOPS (3.2× A100), 3.35 TB/s (1.7×) → ridge 295 FLOPs/byte
  • B200: 4,500 TFLOPS (4.6× H100), 8.0 TB/s (2.4×) → ridge 563 FLOPs/byte

In both jumps, compute grew faster than bandwidth, so the ridge point moved right and prefill benefited more. Decode still speeds up in proportion to bandwidth, though: the 2.4× bandwidth jump from H100 to B200 gives decode a major speedup.

Batching (from the LLM Internals track) is the other lever: serving 32 users simultaneously takes the batch size from 1 to 32, raising arithmetic intensity and moving decode closer to the ridge point.

On the right panel: Compare the two charts side by side — same operations, dramatically different positions. Notice how matmul drops from the compute-bound region (right) to deeply memory-bound (left) between prefill and decode.

Changing the Roof

What Changes When You Upgrade?

The roofline chart isn't fixed — it changes with the hardware. Understanding how the roof shifts helps you predict which workloads benefit most from a GPU upgrade.

Moving the Compute Ceiling

More TFLOPS → the flat ceiling moves up. Operations that were compute-bound get faster. But the ridge point also moves right:

Ridge = Peak FLOPS ÷ Bandwidth

If TFLOPS grows faster than bandwidth, the ridge point moves right — meaning more operations become memory-bound. This has been the trend across GPU generations.

Moving the Memory Ramp

More bandwidth → the memory ramp shifts up and memory-bound operations get faster. The ridge point moves left, potentially shifting some operations from memory-bound to compute-bound.

Across GPU Generations

GPU     TFLOPS (FP16)   HBM BW      Ridge Point (FLOPs/byte)
A100    312             2.0 TB/s    156
H100    989             3.35 TB/s   295
B200    4,500           8.0 TB/s    563

Notice: the ridge point keeps moving right — 156 → 295 → 563. GPUs are getting more compute-heavy relative to bandwidth. More operations become memory-bound with each generation.
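The ridge points fall straight out of the table; a quick sketch to reproduce them (specs as quoted above):

```python
# (peak FP16 TFLOPS, HBM bandwidth in TB/s), as quoted in the table above
gpus = {"A100": (312, 2.0), "H100": (989, 3.35), "B200": (4500, 8.0)}

for name, (peak_tflops, bw_tbps) in gpus.items():
    ridge = peak_tflops / bw_tbps  # TFLOPS / (TB/s) = FLOPs/byte
    print(f"{name}: ridge = {ridge:.1f} FLOPs/byte")
# A100: 156.0, H100: 295.2, B200: 562.5 (~563) -- moving right each generation
```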

The Batch-Size Lever

The most powerful knob you have isn't hardware — it's batch size. Increasing batch size raises arithmetic intensity because you reuse the same weights across more inputs.
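Here is a sketch of that lever, reusing the intensity formula from the prefill/decode section (4096-wide layer, FP32 weights, A100 ridge; all names illustrative). It shows the operation marching toward, then past, the ridge as batch size grows:

```python
RIDGE_A100 = 156  # FLOPs/byte

def matmul_intensity(batch: int, d: int = 4096, bytes_per_el: int = 4) -> float:
    """Arithmetic intensity of X (batch x d) times W (d x d)."""
    flops = 2 * batch * d * d
    bytes_moved = (d * d + 2 * batch * d) * bytes_per_el  # weights + input + output
    return flops / bytes_moved

for batch in [1, 8, 32, 128, 512]:
    ai = matmul_intensity(batch)
    side = "compute-bound" if ai >= RIDGE_A100 else "memory-bound"
    print(f"batch {batch:4d}: {ai:7.1f} FLOPs/byte ({side})")
# Crosses the A100 ridge between batch 128 and 512 (around batch ~368 here).
```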

Try it: On the right panel, drag the batch-size slider from 1 (decode) to 512 (prefill). Watch the dot cross the ridge point — the moment it transitions from memory-bound to compute-bound. Then switch GPUs and see how the ridge point shifts.

Beyond the Basic Roofline

The two-line roofline model we've learned is a simplified but powerful tool. In practice, there are additional layers:

  • Multiple memory roofs: L2 cache and shared memory have their own (higher) bandwidth ceilings — Module 6 (Tiling) shows how to exploit this.
  • Tensor Core ceilings: Tensor Cores provide a separate, higher compute ceiling for matrix math — Module 7 covers this.
  • Quantization: Reducing precision (FP16 → INT8 → INT4) lowers bytes per weight, effectively raising arithmetic intensity (see the sketch below). This is covered in the LLM Internals track's Quantization module.
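To see the quantization effect numerically, here is a minimal sketch using the decode-phase matmul from earlier (one token, 4096-wide layer), counting only weight traffic since weights dominate in decode; the setup is illustrative:

```python
D = 4096
flops = 2 * 1 * D * D  # decode: batch of one token

# Bytes per weight at each precision (activation traffic ignored: weights dominate)
for label, bytes_per_weight in [("FP16", 2), ("INT8", 1), ("INT4", 0.5)]:
    ai = flops / (D * D * bytes_per_weight)
    print(f"{label}: {ai:.0f} FLOPs/byte")
# FP16: 1, INT8: 2, INT4: 4 -- every halving of weight size doubles intensity
```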

The roofline model is the mental framework you'll use throughout the rest of this track and beyond.

