Tiling & Matrix Multiplication on GPUs
Naive Matrix Multiply
What is Tiled Matrix Multiplication?
Tiled matrix multiplication is a GPU optimization that loads small blocks (tiles) of input matrices into fast shared memory, so hundreds of threads can reuse the same data instead of each fetching it from slow global memory. This reduces HBM traffic by a factor equal to the tile size (typically 16×), turning a memory-bound kernel into a nearly compute-bound one.
The Redundancy Problem
Imagine a factory where every worker walks to the warehouse to fetch the same parts — 32 workers need the same row of bolts, each making a separate trip. That's naive matmul: every thread independently reads from HBM.
Each output cell C[i][j] computes a dot product: it reads the entire row i of A and the entire column j of B. For an N×N matrix, that's 2N reads per output cell — and there are N² output cells. Total: 2N³ reads from HBM.
But adjacent outputs share inputs. C[i][0] and C[i][1] both read the same row i of A. C[0][j] and C[1][j] both read column j of B. The data is already in HBM — it's being fetched over and over.
For N=1024: 2 × 1024³ = 2 billion float reads. Most of them redundant.
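For reference, a minimal naive kernel is sketched below (assumptions: square N×N row-major float matrices, one thread per output element; the kernel name is ours, not necessarily the one used in the visualization):

```cuda
// Naive matmul: each thread reads a full row of A and a full column of B from HBM.
__global__ void matmul_naive(const float *A, const float *B, float *C, int N) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;   // output row for this thread
    int col = blockIdx.x * blockDim.x + threadIdx.x;   // output column for this thread
    if (row < N && col < N) {
        float sum = 0.0f;
        for (int k = 0; k < N; ++k)                    // 2N global-memory reads per output cell
            sum += A[row * N + k] * B[k * N + col];
        C[row * N + col] = sum;
    }
}
```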
Each thread's inner loop reads N elements from A (green) and N from B (blue). Every thread in the same row reads the same row of A — but each fetches it independently from HBM.
On the right panel: Click two adjacent output cells in C. Notice how they share an entire row of A (highlighted yellow) — that shared data is fetched from HBM twice. Tiling eliminates this redundancy.
The Tiling Idea
Load Once, Reuse Many
What if workers pooled their fetches? One trip to the warehouse, parts placed on a shared workbench, everyone grabs what they need from there. That's tiling.
Instead of each thread reading its own row/column from HBM, a thread block cooperatively loads a small tile (e.g., 16×16) of A and B into shared memory — fast, on-chip SRAM that all threads in the block can access. Then every thread reads from shared memory instead of HBM.
Recall from the Memory Hierarchy module: shared memory is ~100× faster than HBM, scoped to a single thread block, and explicitly managed with __shared__.
The Sliding Window
A single tile pair only covers part of the dot product. The trick: a dot product is just a sum, and sums can be split into chunks.
Consider computing C[2][3] in an 8×8 matrix with tile size T=4. The full dot product has 8 terms:

C[2][3] = A[2][0]·B[0][3] + A[2][1]·B[1][3] + A[2][2]·B[2][3] + ... + A[2][7]·B[7][3]
Each tile covers T=4 terms. The kernel's sum variable accumulates across tiles — it's never reset, just added to. After N/T = 2 tiles, all 8 terms have been summed. This works because addition is associative — you can group terms any way you want and the result is the same.
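Grouped the way the tiles arrive, the accumulation looks like this:
- After tile 0 (k = 0..3): sum = A[2][0]·B[0][3] + A[2][1]·B[1][3] + A[2][2]·B[2][3] + A[2][3]·B[3][3]
- After tile 1 (k = 4..7): sum = that partial sum + A[2][4]·B[4][3] + A[2][5]·B[5][3] + A[2][6]·B[6][3] + A[2][7]·B[7][3]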
The sliding window process:
- Load tile pair (A tile + B tile) from HBM → shared memory
- Compute partial dot products from shared memory → add to sum
- Slide to the next tile pair along K
- Repeat N/T times
After all tiles: sum holds the complete dot product.
The Payoff
Without tiling: each element of A is loaded N times (once for each output column that needs it).
With tile size T: each element is loaded N/T times. For T=16, that's a 16× reduction in HBM traffic. Same computation, dramatically less memory movement.
On the right panel: Watch the tile slide across the K dimension. The traffic counter at the bottom compares naive vs tiled — same output, dramatically fewer HBM reads.
Tiled Matmul Kernel
The Complete Tiled Matmul Kernel
Here's the complete kernel — the most CUDA-dense code in this track. Every line earns its place. We use 8×8 in the visualization, but real kernels use TILE=16 or 32.
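A minimal standalone version is sketched below, assuming square N×N row-major float matrices with N divisible by TILE (the kernel and variable names are ours; the code viewer on the right shows the version the visualization steps through):

```cuda
#define TILE 16   // the visualization uses 8×8; real kernels typically use 16 or 32

// C = A × B for square N×N row-major float matrices, N divisible by TILE.
__global__ void matmul_tiled(const float *A, const float *B, float *C, int N) {
    __shared__ float As[TILE][TILE];   // tile of A staged in on-chip shared memory
    __shared__ float Bs[TILE][TILE];   // tile of B staged in on-chip shared memory

    int tx = threadIdx.x, ty = threadIdx.y;
    int row = blockIdx.y * TILE + ty;  // output row this thread owns
    int col = blockIdx.x * TILE + tx;  // output column this thread owns

    float sum = 0.0f;                  // accumulates across all tile pairs, never reset

    for (int t = 0; t < N / TILE; ++t) {
        // Cooperative load: each thread fetches one element of each tile from HBM.
        As[ty][tx] = A[row * N + t * TILE + tx];
        Bs[ty][tx] = B[(t * TILE + ty) * N + col];
        __syncthreads();               // sync #1: tiles fully loaded before anyone reads them

        // Partial dot product over this tile, served entirely from shared memory.
        for (int k = 0; k < TILE; ++k)
            sum += As[ty][k] * Bs[k][tx];
        __syncthreads();               // sync #2: everyone done reading before tiles are overwritten
    }

    C[row * N + col] = sum;            // one HBM write per output element
}
```

A typical launch pairs one block per output tile with one thread per output element, e.g. matmul_tiled<<<dim3(N / TILE, N / TILE), dim3(TILE, TILE)>>>(dA, dB, dC, N), where dA, dB, dC point to device memory.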
The Double-Sync Pattern
Why two syncs? Sync #1 prevents reading a tile before it's fully loaded — if thread 5 starts computing before thread 200 finishes loading, it reads garbage. Sync #2 prevents overwriting a tile before all threads finished using it — if the loop advances and thread 5 starts loading the next tile while thread 200 is still reading the current tile, the data is corrupted. Missing either sync = race condition bug. This is the #1 source of tiling bugs in CUDA forums.
Boundary Conditions
What if N isn't divisible by TILE? Add bounds checks: if (row < N && t*TILE+tx < N) and pad with zeros. Production kernels handle this; we keep it clean here to focus on the core pattern.
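A sketch of how those guards slot into the kernel above, replacing the unguarded loads and the final store (out-of-range positions are padded with 0 so they contribute nothing to sum):

```cuda
// Guarded loads: pad with zero when the tile runs past the edge of the matrix.
As[ty][tx] = (row < N && t * TILE + tx < N) ? A[row * N + t * TILE + tx] : 0.0f;
Bs[ty][tx] = (t * TILE + ty < N && col < N) ? B[(t * TILE + ty) * N + col] : 0.0f;

// Guarded store: threads mapped past the matrix edge write nothing.
if (row < N && col < N)
    C[row * N + col] = sum;
```

The outer loop then runs (N + TILE - 1) / TILE times so the last, partially filled tile is still visited.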
On the right panel: Watch the animation phases: Load (arrows from HBM to SRAM, counter ticks) → Compute (reads from SRAM, no HBM traffic) → Slide (next tile). The code viewer below highlights the corresponding lines.
Why Tiling Works
Quantifying the Improvement
Tiling doesn't reduce computation — same number of multiply-adds. It reduces memory traffic.
Without tiling: each element of A is loaded N times (once per output column). Total HBM reads: 2N³.
With tile size T: each element loaded N/T times. Total: 2N³ / T.
| Tile Size | HBM Reads (N=1024) | Reduction | SRAM per Block |
|---|---|---|---|
| 1 (naive) | 2.0 billion | 1× | 0 bytes |
| 4 | 512 million | 4× | 128 bytes |
| 8 | 256 million | 8× | 512 bytes |
| 16 | 128 million | 16× | 2 KB |
| 32 | 64 million | 32× | 8 KB |
The Roofline Connection
On the roofline, arithmetic intensity = FLOPs / bytes moved. Same FLOPs, T× fewer bytes → T× higher arithmetic intensity. The kernel shifts rightward — from deep in memory-bound territory toward the compute ridge.
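Counting only the input reads from the table above: naive FP32 matmul performs 2N³ FLOPs while moving 2N³ × 4 bytes, an arithmetic intensity of roughly 0.25 FLOP/byte. With T=16 the same 2N³ FLOPs move 1/16 as many bytes, roughly 4 FLOP/byte.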
This is the fundamental optimization pattern: increase arithmetic intensity by reusing data in fast memory. Every GPU optimization in the next three modules is a variation of this idea.
Why Not T=64?
Shared memory is limited (~100 KB per SM). Two 64×64 float tiles take 32 KB, which fits on paper, but with one thread per output element a 64×64 tile needs 4,096 threads per block, well past the 1,024-thread limit, and the larger footprint leaves room for fewer blocks per SM. Two 128×128 tiles would need 128 KB and exceed shared memory outright. T=16 or T=32 is the practical sweet spot.
On the right panel: Drag the tile size selector. Watch the traffic bar shrink and the roofline dot shift right — larger tiles mean more data reuse, higher arithmetic intensity, closer to compute-bound.
Tree Reduction
The Second Shared-Memory Pattern
Tiling is the first fundamental shared-memory pattern: load once, reuse many. The second is tree reduction: summing N values in parallel.
Where Reduction Is Used
- softmax needs sum(exp(x)) across the sequence
- layernorm needs mean and variance (both are sums)
- loss computation needs a global sum across all tokens
- attention score normalization needs a row-wise sum
All require reducing N values to 1. Doing this sequentially — one addition at a time — wastes the GPU's massive parallelism.
The Algorithm
32 values in shared memory. Step 1: 16 threads each sum a pair. Step 2: 8 threads sum pairs of sums. Step 3: 4 threads. Step 4: 2 threads. Step 5: 1 thread holds the final sum.
log₂(32) = 5 steps — versus 31 steps sequentially.
Avoiding Warp Divergence
A naive approach (interleaved addressing): in step 1, each even-numbered thread adds its own element and its right neighbor, so thread 0 sums elements 0 and 1, thread 2 sums elements 2 and 3, while threads 1, 3, 5, 7... sit idle. Each later step doubles the stride, leaving the working threads sparser but still scattered across every warp. Half of each warp diverges from the very first step.
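In code, that naive variant looks roughly like this (only the reduction loop is shown; tid is threadIdx.x and sdata is the block's shared array):

```cuda
// Interleaved addressing: the active threads (multiples of 2*s) are scattered
// across every warp, so each warp mixes working and idle lanes at every step.
for (unsigned int s = 1; s < blockDim.x; s *= 2) {
    if (tid % (2 * s) == 0)
        sdata[tid] += sdata[tid + s];
    __syncthreads();
}
```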
The fix (sequential addressing): thread t adds elements t and t + stride, where stride halves each step (16, 8, 4, 2, 1). Now threads 0-15 are all active in step 1, threads 0-7 in step 2 — always consecutive threads, filling warps without divergence.
Each step: stride s halves, thread tid adds sdata[tid + s]. Only threads where tid < s participate — always the first threads, always consecutive, no divergence.
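A minimal sketch of the full pattern, assuming the block size is a power of two (the kernel name and the one-value-per-thread staging are our choices):

```cuda
// Each block reduces blockDim.x values to one partial sum using sequential addressing.
__global__ void block_sum(const float *in, float *out, int n) {
    extern __shared__ float sdata[];                 // one float per thread
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * blockDim.x + tid;
    sdata[tid] = (i < n) ? in[i] : 0.0f;             // stage one value per thread
    __syncthreads();

    // Stride halves each step; the active threads are always tid < s, always consecutive.
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
            sdata[tid] += sdata[tid + s];
        __syncthreads();                             // all adds finish before the next, smaller step
    }

    if (tid == 0)
        out[blockIdx.x] = sdata[0];                  // one partial sum per block
}
```

The dynamic shared-memory size goes in the third launch parameter, e.g. block_sum<<<blocks, threads, threads * sizeof(float)>>>(d_in, d_out, n).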
On the right panel: Watch which threads are active (bright) vs idle (dimmed) at each step. The active region shrinks by half — but it's always the first threads, keeping warps full. The efficiency bar shows active thread count dropping from 32 → 16 → 8 → 4 → 2 → 1.
From Tiling to FlashAttention
Two Patterns, One Principle
You now know the two fundamental patterns that power modern GPU kernels:
- Tiling — load once, reuse many. Reduces HBM traffic by T×.
- Tree reduction — parallel sum in log₂(N) steps. Powers softmax, layernorm, loss.
Every optimization in the next three modules builds on these.
FlashAttention (Module 8) applies the same tiling principle to attention: instead of materializing the full N×N attention matrix in HBM, it tiles the computation along the sequence dimension, keeping partial scores in SRAM. Same principle — keep working data in fast memory, never write intermediates to HBM.
Practice vs Theory
In practice, production code delegates matmul to cuBLAS — NVIDIA's heavily optimized matrix multiply library. Karpathy's llm.c does exactly this for the training loop. But understanding why tiling works is what lets you reason about FlashAttention, operator fusion, and every kernel optimization in Modules 7-9.
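For orientation, a minimal cuBLAS call is sketched below (assumptions: dA, dB, dC are device pointers to N×N float matrices in the column-major layout cuBLAS expects, the handle was created earlier with cublasCreate, and error handling is omitted):

```cuda
#include <cublas_v2.h>

// Sketch: C = A × B for N×N single-precision matrices already resident on the GPU.
void sgemm_cublas(cublasHandle_t handle, const float *dA, const float *dB, float *dC, int N) {
    const float alpha = 1.0f, beta = 0.0f;
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,    // no transposes
                N, N, N,                             // m, n, k
                &alpha, dA, N,                       // A and its leading dimension
                dB, N,                               // B and its leading dimension
                &beta, dC, N);                       // C and its leading dimension
}
```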
What's Next
Module 7 introduces Tensor Cores — hardware that performs a 4×4 matrix multiply-accumulate in a single instruction. Tiling determines how you feed data to Tensor Cores. The optimization stack: tiling (software) × Tensor Cores (hardware) × fusion (compiler).