Tiling & Matrix Multiplication on GPUs
Naive Matrix Multiply
What is Tiled Matrix Multiplication?
Tiled matrix multiplication is a GPU optimization that loads small blocks (tiles) of input matrices into fast shared memory, so hundreds of threads can reuse the same data instead of each fetching it from slow global memory. This reduces HBM traffic by a factor equal to the tile size (typically 16×), turning a memory-bound kernel into a nearly compute-bound one.
The Redundancy Problem
Imagine a factory where every worker walks to the warehouse to fetch the same parts — 32 workers need the same row of bolts, each making a separate trip. That's naive matmul: every thread independently reads from HBM.
Each output cell C[i][j] computes a dot product: it reads the entire row i of A and the entire column j of B. For an N×N matrix, that's 2N reads per output cell — and there are N² output cells. Total: 2N³ reads from HBM.
But adjacent outputs share inputs. C[i][0] and C[i][1] both read the same row i of A. C[0][j] and C[1][j] both read column j of B. The data is already in HBM — it's being fetched over and over.
For N=1024: 2 × 1024³ = 2 billion float reads. Most of them redundant.
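For reference, a minimal naive kernel is sketched below (assumptions: square N×N row-major float matrices, one thread per output element; the kernel name is ours, not necessarily the one used in the visualization):

```cuda
// Naive matmul: each thread reads a full row of A and a full column of B from HBM.
__global__ void matmul_naive(const float *A, const float *B, float *C, int N) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;   // output row for this thread
    int col = blockIdx.x * blockDim.x + threadIdx.x;   // output column for this thread
    if (row < N && col < N) {
        float sum = 0.0f;
        for (int k = 0; k < N; ++k)                    // 2N global-memory reads per output cell
            sum += A[row * N + k] * B[k * N + col];
        C[row * N + col] = sum;
    }
}
```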
Each thread's inner loop reads N elements from A (green) and N from B (blue). Every thread in the same row reads the same row of A — but each fetches it independently from HBM.
On the right panel: Click two adjacent output cells in C. Notice how they share an entire row of A (highlighted yellow) — that shared data is fetched from HBM twice. Tiling eliminates this redundancy.
The Tiling Idea
Load Once, Reuse Many
What if workers pooled their fetches? One trip to the warehouse, parts placed on a shared workbench, everyone grabs what they need from there. That's tiling.
Instead of each thread reading its own row/column from HBM, a thread block cooperatively loads a small tile (e.g., 16×16) of A and B into shared memory — fast, on-chip SRAM that all threads in the block can access. Then every thread reads from shared memory instead of HBM.
Recall from the Memory Hierarchy module: shared memory is ~100× faster than HBM, scoped to a single thread block, and explicitly managed with __shared__.
The Sliding Window
A single tile pair only covers part of the dot product. The trick: a dot product is just a sum, and sums can be split into chunks.
Consider computing C[2][3] in an 8×8 matrix with tile size T=4. The full dot product has 8 terms:

C[2][3] = A[2][0]·B[0][3] + A[2][1]·B[1][3] + A[2][2]·B[2][3] + ... + A[2][7]·B[7][3]
Each tile covers T=4 terms. The kernel's sum variable accumulates across tiles — it's never reset, just added to. After N/T = 2 tiles, all 8 terms have been summed. This works because addition is associative — you can group terms any way you want and the result is the same.
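Grouped the way the tiles arrive, the accumulation looks like this:
- After tile 0 (k = 0..3): sum = A[2][0]·B[0][3] + A[2][1]·B[1][3] + A[2][2]·B[2][3] + A[2][3]·B[3][3]
- After tile 1 (k = 4..7): sum = that partial sum + A[2][4]·B[4][3] + A[2][5]·B[5][3] + A[2][6]·B[6][3] + A[2][7]·B[7][3]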
The sliding window process:
- Load tile pair (A tile + B tile) from HBM → shared memory
- Compute partial dot products from shared memory → add to sum
- Slide to the next tile pair along K
- Repeat N/T times
After all tiles: sum holds the complete dot product.
The Payoff
Without tiling: each element of A is loaded N times (once for each output column that needs it).
With tile size T: each element is loaded N/T times. For T=16, that's a 16× reduction in HBM traffic. Same computation, dramatically less memory movement.
On the right panel: Watch the tile slide across the K dimension. The traffic counter at the bottom compares naive vs tiled — same output, dramatically fewer HBM reads.
Tiled Matmul Kernel
The Complete Tiled Matmul Kernel
Here's the complete kernel — the most CUDA-dense code in this track. Every line earns its place. We use 8×8 in the visualization, but real kernels use TILE=16 or 32.
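A minimal standalone version is sketched below, assuming square N×N row-major float matrices with N divisible by TILE (the kernel and variable names are ours; the code viewer on the right shows the version the visualization steps through):

```cuda
#define TILE 16   // the visualization uses 8×8; real kernels typically use 16 or 32

// C = A × B for square N×N row-major float matrices, N divisible by TILE.
__global__ void matmul_tiled(const float *A, const float *B, float *C, int N) {
    __shared__ float As[TILE][TILE];   // tile of A staged in on-chip shared memory
    __shared__ float Bs[TILE][TILE];   // tile of B staged in on-chip shared memory

    int tx = threadIdx.x, ty = threadIdx.y;
    int row = blockIdx.y * TILE + ty;  // output row this thread owns
    int col = blockIdx.x * TILE + tx;  // output column this thread owns

    float sum = 0.0f;                  // accumulates across all tile pairs, never reset

    for (int t = 0; t < N / TILE; ++t) {
        // Cooperative load: each thread fetches one element of each tile from HBM.
        As[ty][tx] = A[row * N + t * TILE + tx];
        Bs[ty][tx] = B[(t * TILE + ty) * N + col];
        __syncthreads();               // sync #1: tiles fully loaded before anyone reads them

        // Partial dot product over this tile, served entirely from shared memory.
        for (int k = 0; k < TILE; ++k)
            sum += As[ty][k] * Bs[k][tx];
        __syncthreads();               // sync #2: everyone done reading before tiles are overwritten
    }

    C[row * N + col] = sum;            // one HBM write per output element
}
```

A typical launch pairs one block per output tile with one thread per output element, e.g. matmul_tiled<<<dim3(N / TILE, N / TILE), dim3(TILE, TILE)>>>(dA, dB, dC, N), where dA, dB, dC point to device memory.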
The Double-Sync Pattern
Why two syncs? Sync #1 prevents reading a tile before it's fully loaded — if thread 5 starts computing before thread 200 finishes loading, it reads garbage. Sync #2 prevents overwriting a tile before all threads finished using it — if the loop advances and thread 5 starts loading the next tile while thread 200 is still reading the current tile, the data is corrupted. Missing either sync = race condition bug. This is the #1 source of tiling bugs in CUDA forums.
Boundary Conditions
What if N isn't divisible by TILE? Add bounds checks: if (row < N && t*TILE+tx < N) and pad with zeros. Production kernels handle this; we keep it clean here to focus on the core pattern.
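A sketch of how those guards slot into the kernel above, replacing the unguarded loads and the final store (out-of-range positions are padded with 0 so they contribute nothing to sum):

```cuda
// Guarded loads: pad with zero when the tile runs past the edge of the matrix.
As[ty][tx] = (row < N && t * TILE + tx < N) ? A[row * N + t * TILE + tx] : 0.0f;
Bs[ty][tx] = (t * TILE + ty < N && col < N) ? B[(t * TILE + ty) * N + col] : 0.0f;

// Guarded store: threads mapped past the matrix edge write nothing.
if (row < N && col < N)
    C[row * N + col] = sum;
```

The outer loop then runs (N + TILE - 1) / TILE times so the last, partially filled tile is still visited.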
On the right panel: Watch the animation phases: Load (arrows from HBM to SRAM, counter ticks) → Compute (reads from SRAM, no HBM traffic) → Slide (next tile). The code viewer below highlights the corresponding lines.
Why Tiling Works
Quantifying the Improvement
Tiling doesn't reduce computation — same number of multiply-adds. It reduces memory traffic.
Without tiling: each element of A is loaded N times (once per output column). Total HBM reads: 2N³.
With tile size T: each element loaded N/T times. Total: 2N³ / T.
| Tile Size | HBM Reads (N=1024) | Reduction | SRAM per Block |
|---|---|---|---|
| 1 (naive) | 2.0 billion | 1× | 0 bytes |
| 4 | 512 million | 4× | 128 bytes |
| 8 | 256 million | 8× | 512 bytes |
| 16 | 128 million | 16× | 2 KB |
| 32 | 64 million | 32× | 8 KB |
The Roofline Connection
On the roofline, arithmetic intensity = FLOPs / bytes moved. Same FLOPs, T× fewer bytes → T× higher arithmetic intensity. The kernel shifts rightward — from deep in memory-bound territory toward the compute ridge.
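Counting only the input reads from the table above: naive FP32 matmul performs 2N³ FLOPs while moving 2N³ × 4 bytes, an arithmetic intensity of roughly 0.25 FLOP/byte. With T=16 the same 2N³ FLOPs move 1/16 as many bytes, roughly 4 FLOP/byte.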
This is the fundamental optimization pattern: increase arithmetic intensity by reusing data in fast memory. Every GPU optimization in the next three modules is a variation of this idea.
Why Not T=64?
Shared memory is limited (~100 KB per SM). Two 64×64 float tiles take 32 KB, which fits on paper, but with one thread per output element a 64×64 tile needs 4,096 threads per block, well past the 1,024-thread limit, and the larger footprint leaves room for fewer blocks per SM. Two 128×128 tiles would need 128 KB and exceed shared memory outright. T=16 or T=32 is the practical sweet spot.
On the right panel: Drag the tile size selector. Watch the traffic bar shrink and the roofline dot shift right — larger tiles mean more data reuse, higher arithmetic intensity, closer to compute-bound.
Tree Reduction
The Second Shared-Memory Pattern
Tiling is the first fundamental shared-memory pattern: load once, reuse many. The second is tree reduction: summing N values in parallel.
Where Reduction Is Used
- softmax needs sum(exp(x)) across the sequence
- layernorm needs mean and variance (both are sums)
- loss computation needs a global sum across all tokens
- attention score normalization needs a row-wise sum
All require reducing N values to 1. Doing this sequentially — one addition at a time — wastes the GPU's massive parallelism.
The Algorithm
32 values in shared memory. Step 1: 16 threads each sum a pair. Step 2: 8 threads sum pairs of sums. Step 3: 4 threads. Step 4: 2 threads. Step 5: 1 thread holds the final sum.
log₂(32) = 5 steps — versus 31 steps sequentially.
Avoiding Warp Divergence
A naive approach (interleaved addressing): in step 1, each even-numbered thread adds its own element and its right neighbor, so thread 0 sums elements 0 and 1, thread 2 sums elements 2 and 3, while threads 1, 3, 5, 7... sit idle. Each later step doubles the stride, leaving the working threads sparser but still scattered across every warp. Half of each warp diverges from the very first step.
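In code, that naive variant looks roughly like this (only the reduction loop is shown; tid is threadIdx.x and sdata is the block's shared array):

```cuda
// Interleaved addressing: the active threads (multiples of 2*s) are scattered
// across every warp, so each warp mixes working and idle lanes at every step.
for (unsigned int s = 1; s < blockDim.x; s *= 2) {
    if (tid % (2 * s) == 0)
        sdata[tid] += sdata[tid + s];
    __syncthreads();
}
```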
The fix (sequential addressing): thread t adds elements t and t + stride, where stride halves each step (16, 8, 4, 2, 1). Now threads 0-15 are all active in step 1, threads 0-7 in step 2 — always consecutive threads, filling warps without divergence.
Each step: stride s halves, thread tid adds sdata[tid + s]. Only threads where tid < s participate — always the first threads, always consecutive, no divergence.
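A minimal sketch of the full pattern, assuming the block size is a power of two (the kernel name and the one-value-per-thread staging are our choices):

```cuda
// Each block reduces blockDim.x values to one partial sum using sequential addressing.
__global__ void block_sum(const float *in, float *out, int n) {
    extern __shared__ float sdata[];                 // one float per thread
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * blockDim.x + tid;
    sdata[tid] = (i < n) ? in[i] : 0.0f;             // stage one value per thread
    __syncthreads();

    // Stride halves each step; the active threads are always tid < s, always consecutive.
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
            sdata[tid] += sdata[tid + s];
        __syncthreads();                             // all adds finish before the next, smaller step
    }

    if (tid == 0)
        out[blockIdx.x] = sdata[0];                  // one partial sum per block
}
```

The dynamic shared-memory size goes in the third launch parameter, e.g. block_sum<<<blocks, threads, threads * sizeof(float)>>>(d_in, d_out, n).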
On the right panel: Watch which threads are active (bright) vs idle (dimmed) at each step. The active region shrinks by half — but it's always the first threads, keeping warps full. The efficiency bar shows active thread count dropping from 32 → 16 → 8 → 4 → 2 → 1.
From Tiling to FlashAttention
Two Patterns, One Principle
You now know the two fundamental patterns that power modern GPU kernels:
- Tiling — load once, reuse many. Reduces HBM traffic by T×.
- Tree reduction — parallel sum in log₂(N) steps. Powers softmax, layernorm, loss.
Every optimization in the next three modules builds on these.
FlashAttention (Module 8) applies the same tiling principle to attention: instead of materializing the full N×N attention matrix in HBM, it tiles the computation along the sequence dimension, keeping partial scores in SRAM. Same principle — keep working data in fast memory, never write intermediates to HBM.
Practice vs Theory
In practice, production code delegates matmul to cuBLAS — NVIDIA's heavily optimized matrix multiply library. Karpathy's llm.c does exactly this for the training loop. But understanding why tiling works is what lets you reason about FlashAttention, operator fusion, and every kernel optimization in Modules 7-9.
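For orientation, a minimal cuBLAS call is sketched below (assumptions: dA, dB, dC are device pointers to N×N float matrices in the column-major layout cuBLAS expects, the handle was created earlier with cublasCreate, and error handling is omitted):

```cuda
#include <cublas_v2.h>

// Sketch: C = A × B for N×N single-precision matrices already resident on the GPU.
void sgemm_cublas(cublasHandle_t handle, const float *dA, const float *dB, float *dC, int N) {
    const float alpha = 1.0f, beta = 0.0f;
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,    // no transposes
                N, N, N,                             // m, n, k
                &alpha, dA, N,                       // A and its leading dimension
                dB, N,                               // B and its leading dimension
                &beta, dC, N);                       // C and its leading dimension
}
```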
What's Next
Module 7 introduces Tensor Cores — hardware that performs a 4×4 matrix multiply-accumulate in a single instruction. Tiling determines how you feed data to Tensor Cores. The optimization stack: tiling (software) × Tensor Cores (hardware) × fusion (compiler).