
Tiling & Matrix Multiplication on GPUs

Naive Matrix Multiply

What is Tiled Matrix Multiplication?

Tiled matrix multiplication is a GPU optimization that loads small blocks (tiles) of the input matrices into fast shared memory so that hundreds of threads can reuse the same data instead of each fetching it from slow global memory. This cuts memory traffic by a factor of the tile dimension (typically 16×), turning a memory-bound kernel into a nearly compute-bound one.

The Redundancy Problem

Imagine a factory where every worker walks to the warehouse to fetch the same parts — 32 workers need the same row of bolts, each making a separate trip. That's naive matmul: every thread independently reads from HBM.

[Diagram: A×B=C with row 2 of A, column 3 of B, and the output cell C[2][3] highlighted; C[2][3] reads row 2 of A and column 3 of B.]

Each output cell C[i][j] computes a dot product: it reads the entire row i of A and the entire column j of B. For an N×N matrix, that's 2N reads per output cell — and there are N² output cells. Total: 2N³ reads from HBM.

But adjacent outputs share inputs. C[i][0] and C[i][1] both read the same row i of A. C[0][j] and C[1][j] both read column j of B. The data is already in HBM — it's being fetched over and over.

For N=1024: 2 × 1024³ = 2 billion float reads. Most of them redundant.

__global__ void naive_matmul(float* C, float* A, float* B, int N) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;   // output row this thread owns
    int col = blockIdx.x * blockDim.x + threadIdx.x;   // output column this thread owns
    float sum = 0.0f;
    for (int k = 0; k < N; k++)                        // full dot product: N reads of A, N reads of B
        sum += A[row * N + k] * B[k * N + col];
    C[row * N + col] = sum;
}
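For context, a launch that pairs this kernel with 16×16 thread blocks (one thread per output element) might look like the sketch below; the device pointers d_A, d_B, d_C are illustrative names, not part of the original code.

// Hypothetical host-side launch: one thread per output element, 16×16 threads per block.
// d_A, d_B, d_C are assumed to be N×N float buffers already allocated with cudaMalloc.
dim3 block(16, 16);
dim3 grid((N + block.x - 1) / block.x, (N + block.y - 1) / block.y);
naive_matmul<<<grid, block>>>(d_C, d_A, d_B, N);
cudaDeviceSynchronize();   // wait for the kernel so error checks and timings are meaningful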

Each thread's inner loop reads N elements from A (green) and N from B (blue). Every thread in the same row reads the same row of A — but each fetches it independently from HBM.

On the right panel: Click two adjacent output cells in C. Notice how they share an entire row of A (highlighted yellow) — that shared data is fetched from HBM twice. Tiling eliminates this redundancy.

The Tiling Idea

Load Once, Reuse Many

What if workers pooled their fetches? One trip to the warehouse, parts placed on a shared workbench, everyone grabs what they need from there. That's tiling.

[Diagram: tiling, load once into shared memory, reuse many times. A 16×16 tile of A plus a 16×16 tile of B are loaded into shared memory (SRAM); all 256 threads in the block then read from SRAM. HBM reads per step: 2 tiles (512 floats) instead of 256 full rows and columns.]

Instead of each thread reading its own row/column from HBM, a thread block cooperatively loads a small tile (e.g., 16×16) of A and B into shared memory — fast, on-chip SRAM that all threads in the block can access. Then every thread reads from shared memory instead of HBM.

Recall from the Memory Hierarchy module: shared memory is ~100× faster than HBM, per-block, and explicitly managed with __shared__.

The Sliding Window

A single tile pair only covers part of the dot product. The trick: a dot product is just a sum, and sums can be split into chunks.

Consider computing C[2][3] in an 8×8 matrix with tile size T=4. The full dot product has 8 terms:

C[2][3] = A[2][0]·B[0][3] + A[2][1]·B[1][3] + A[2][2]·B[2][3] + A[2][3]·B[3][3] ← tile 0 (k=0..3)
+ A[2][4]·B[4][3] + A[2][5]·B[5][3] + A[2][6]·B[6][3] + A[2][7]·B[7][3] ← tile 1 (k=4..7)

Each tile covers T=4 terms. The kernel's sum variable accumulates across tiles — it's never reset, just added to. After N/T = 2 tiles, all 8 terms have been summed. This works because addition is associative — you can group terms any way you want and the result is the same.

The sliding window process:

  1. Load tile pair (A tile + B tile) from HBM → shared memory
  2. Compute partial dot products from shared memory → add to sum
  3. Slide to the next tile pair along K
  4. Repeat N/T times

After all tiles: sum holds the complete dot product.
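A plain-C sketch of the same idea, outside the kernel: one output element accumulated in chunks of T terms, exactly mirroring the sliding window. The indices i, j, the tile size T, and the row-major arrays A and B are assumed for illustration.

float acc = 0.0f;                              // never reset, only added to
for (int t = 0; t < N / T; t++) {              // one iteration per tile pair
    for (int k = t * T; k < (t + 1) * T; k++)  // the T dot-product terms this tile covers
        acc += A[i * N + k] * B[k * N + j];
}
// After N/T chunks, acc equals the full dot product over k = 0..N-1.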

The Payoff

Without tiling: each element of A is loaded N times (once for each output column that needs it).

With tile size T: each element is loaded N/T times. For T=16, that's a 16× reduction in HBM traffic. Same computation, dramatically less memory movement.

On the right panel: Watch the tile slide across the K dimension. The traffic counter at the bottom compares naive vs tiled — same output, dramatically fewer HBM reads.

Tiled Matmul Kernel

The Complete Tiled Matmul Kernel

Here's the complete kernel — the most CUDA-dense code in this track. Every line earns its place. We use 8×8 in the visualization, but real kernels use TILE=16 or 32.

#define TILE 16
__global__ void tiled_matmul(float* C, float* A, float* B, int N) {
    __shared__ float tileA[TILE][TILE];   // shared workbench
    __shared__ float tileB[TILE][TILE];   // 2 × 16² × 4 B = 2 KB
    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float sum = 0.0f;
    for (int t = 0; t < N / TILE; t++) {
        // Phase 1: Load tile from HBM → shared memory
        tileA[threadIdx.y][threadIdx.x] = A[row * N + (t * TILE + threadIdx.x)];
        tileB[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * N + col];
        __syncthreads();                  // SYNC #1: wait for all loads
        // Phase 2: Compute partial dot product from SRAM
        for (int k = 0; k < TILE; k++)
            sum += tileA[threadIdx.y][k] * tileB[k][threadIdx.x];
        __syncthreads();                  // SYNC #2: wait before overwrite
    }
    C[row * N + col] = sum;
}

The Double-Sync Pattern

Why two syncs? Sync #1 prevents reading a tile before it's fully loaded — if thread 5 starts computing before thread 200 finishes loading, it reads garbage. Sync #2 prevents overwriting a tile before all threads finished using it — if the loop advances and thread 5 starts loading the next tile while thread 200 is still reading the current tile, the data is corrupted. Missing either sync = race condition bug. This is the #1 source of tiling bugs in CUDA forums.
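To make the two hazards concrete, here is the kind of interleaving (hypothetical thread IDs) that each sync rules out:

// Hazard prevented by SYNC #1: a thread computes before the tile is fully loaded.
//   thread 5:    sum += tileA[ty][k] * tileB[k][tx];   // tileB[k][tx] not yet written by thread 200
// Hazard prevented by SYNC #2: a thread starts the next load while others still read the current tile.
//   thread 5:    tileA[ty][tx] = A[...];               // already loading tile t+1
//   thread 200:  sum += tileA[ty][k] * ...;            // still summing tile t, now sees mixed data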

Boundary Conditions

What if N isn't divisible by TILE? Add bounds checks on each load and store, for example if (row < N && t * TILE + threadIdx.x < N), and pad out-of-range elements with zeros, as in the sketch below. Production kernels handle this; we keep it clean here to focus on the core pattern.
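A minimal sketch of the boundary handling, reusing the variables from the kernel above (zero padding is safe because zeros contribute nothing to the dot product):

// The tile loop must cover the ragged edge: t runs over (N + TILE - 1) / TILE steps.
// Phase 1 loads become guarded; out-of-range elements are padded with zero.
int aCol = t * TILE + threadIdx.x;
int bRow = t * TILE + threadIdx.y;
tileA[threadIdx.y][threadIdx.x] = (row < N && aCol < N) ? A[row * N + aCol] : 0.0f;
tileB[threadIdx.y][threadIdx.x] = (bRow < N && col < N) ? B[bRow * N + col] : 0.0f;
// The final store is guarded so threads outside the matrix write nothing.
if (row < N && col < N)
    C[row * N + col] = sum;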

On the right panel: Watch the animation phases: Load (arrows from HBM to SRAM, counter ticks) → Compute (reads from SRAM, no HBM traffic) → Slide (next tile). The code viewer below highlights the corresponding lines.

Why Tiling Works

Quantifying the Improvement

Tiling doesn't reduce computation — same number of multiply-adds. It reduces memory traffic.

Without tiling: each element of A is loaded N times (once per output column). Total HBM reads: 2N³.

With tile size T: each element loaded N/T times. Total: 2N³ / T.

Tile Size   | HBM Reads (N=1024) | Reduction | SRAM per Block
1 (naive)   | 2.0 billion        | 1×        | 0 bytes
4           | 512 million        | 4×        | 128 bytes
8           | 256 million        | 8×        | 512 bytes
16          | 128 million        | 16×       | 2 KB
32          | 64 million         | 32×       | 8 KB

The Roofline Connection

On the roofline, arithmetic intensity = FLOPs / bytes moved. Same FLOPs, T× fewer bytes → T× higher arithmetic intensity. The kernel shifts rightward — from deep in memory-bound territory toward the compute ridge.
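A rough worked example, counting only the input reads (the output write is comparatively small): matmul does 2N³ FLOPs. Naive, it moves 2N³ floats = 8N³ bytes, so arithmetic intensity ≈ 2N³ / 8N³ = 0.25 FLOP/byte. With tile size T it moves 2N³/T floats, so intensity ≈ T/4 FLOP/byte: about 4 FLOP/byte at T=16 and 8 FLOP/byte at T=32.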

This is the fundamental optimization pattern: increase arithmetic intensity by reusing data in fast memory. Every GPU optimization in the next three modules is a variation of this idea.

Why Not T=64?

Shared memory is limited (~100 KB per SM). Two 64×64 float tiles take 32 KB, which fits, but with one thread per tile element a 64×64 tile would need 4,096 threads per block, four times the 1,024-thread limit, so each thread would have to load and compute several elements. Two 128×128 tiles take 128 KB and exceed capacity outright. T=16 or T=32 is the practical sweet spot.

On the right panel: Drag the tile size selector. Watch the traffic bar shrink and the roofline dot shift right — larger tiles mean more data reuse, higher arithmetic intensity, closer to compute-bound.

Tree Reduction

The Second Shared-Memory Pattern

Tiling is the first fundamental shared-memory pattern: load once, reuse many. The second is tree reduction: summing N values in parallel.

Where Reduction Is Used

  • softmax needs sum(exp(x)) across the sequence
  • layernorm needs mean and variance (both are sums)
  • loss computation needs a global sum across all tokens
  • attention score normalization needs a row-wise sum

All require reducing N values to 1. Doing this sequentially — one addition at a time — wastes the GPU's massive parallelism.

The Algorithm

32 values in shared memory. Step 1: 16 threads each sum a pair. Step 2: 8 threads sum pairs of sums. Step 3: 4 threads. Step 4: 2 threads. Step 5: 1 thread holds the final sum.

log₂(32) = 5 steps — versus 31 steps sequentially.

Avoiding Warp Divergence

A naive approach: in step 1, thread t adds elements t and t+1, but only if t is even; in step 2, only if t is a multiple of 4; and so on, with the stride doubling. Active and idle threads are interleaved within every warp (threads 1, 3, 5, 7, ... do nothing from the very first step), so the warps diverge.

The fix (sequential addressing): thread t adds elements t and t + stride, where stride halves each step (16, 8, 4, 2, 1). Now threads 0-15 are all active in step 1, threads 0-7 in step 2 — always consecutive threads, filling warps without divergence.

__shared__ float sdata[256];
int tid = threadIdx.x;                     // one thread per input element
sdata[tid] = input[tid];                   // stage the value in shared memory
__syncthreads();                           // all values staged before any adds
for (int s = blockDim.x / 2; s > 0; s >>= 1) {
    if (tid < s)
        sdata[tid] += sdata[tid + s];      // fold the top half onto the bottom half
    __syncthreads();
}
// sdata[0] holds the final sum

Each step: stride s halves, thread tid adds sdata[tid + s]. Only threads where tid < s participate — always the first threads, always consecutive, no divergence.
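When the input spans more than one block, a common follow-up (not shown above) is for thread 0 of each block to publish its partial result; one sketch, assuming a device float total initialized to zero before launch:

if (tid == 0)
    atomicAdd(total, sdata[0]);   // or write sdata[0] to partial[blockIdx.x] and reduce again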

On the right panel: Watch which threads are active (bright) vs idle (dimmed) at each step. The active region shrinks by half — but it's always the first threads, keeping warps full. The efficiency bar shows active thread count dropping from 32 → 16 → 8 → 4 → 2 → 1.

From Tiling to FlashAttention

Two Patterns, One Principle

You now know the two fundamental patterns that power modern GPU kernels:

  • Tiling — load once, reuse many. Reduces HBM traffic by T×.
  • Tree reduction — parallel sum in log₂(N) steps. Powers softmax, layernorm, loss.

Every optimization in the next three modules builds on these.

Same principle, two applications:

  • Tiled Matmul: tiles A and B along the K dimension; input tiles stay in SRAM; T× fewer HBM reads.
  • FlashAttention: tiles Q, K, V along the sequence dimension; partial scores stay in SRAM; no N×N matrix in HBM.

Same principle: keep working data in fast memory.

FlashAttention (Module 8) applies the same tiling principle to attention: instead of materializing the full N×N attention matrix in HBM, it tiles the computation along the sequence dimension, keeping partial scores in SRAM. Same principle — keep working data in fast memory, never write intermediates to HBM.

Practice vs Theory

In practice, production code delegates matmul to cuBLAS — NVIDIA's heavily optimized matrix multiply library. Karpathy's llm.c does exactly this for the training loop. But understanding why tiling works is what lets you reason about FlashAttention, operator fusion, and every kernel optimization in Modules 7-9.
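For reference, delegating the same N×N multiply to cuBLAS looks roughly like the sketch below; the device pointers d_A, d_B, d_C are assumed to exist, and because cuBLAS uses column-major storage, the operands are passed in swapped order to get a row-major C = A·B.

#include <cublas_v2.h>

// Sketch: C = A · B for row-major N×N matrices already on the device.
// cuBLAS is column-major, so we compute C^T = B^T · A^T, which is why d_B comes first.
cublasHandle_t handle;
cublasCreate(&handle);
const float alpha = 1.0f, beta = 0.0f;
cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
            N, N, N,
            &alpha, d_B, N, d_A, N,
            &beta,  d_C, N);
cublasDestroy(handle);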

What's Next

Module 7 introduces Tensor Cores — hardware that performs a 4×4 matrix multiply-accumulate in a single instruction. Tiling determines how you feed data to Tensor Cores. The optimization stack: tiling (software) × Tensor Cores (hardware) × fusion (compiler).

