
GPU Memory Access Patterns — Coalescing & Bank Conflicts

Why Access Patterns Matter

What are GPU Memory Access Patterns?

Imagine a warehouse where you can only order full pallets — even if you need just one item. If 32 workers each need an item from the same pallet, the warehouse makes one trip. But if each worker needs an item from a different pallet, the warehouse makes 32 trips — delivering 32 pallets when only 32 items were needed.

GPU memory works the same way. It delivers data in fixed-size chunks called cache lines (128 bytes each). How your threads request data — whether they ask for nearby addresses or scattered ones — determines whether the hardware makes 1 trip or 32.

This is the memory access pattern: the relationship between which thread asks for which address. It's the difference between using 3% and 100% of your GPU's memory bandwidth.

Same 32 threads, same data — different access patterns: nearby addresses fit in one cache line → 1 trip, 100% utilized; scattered addresses → 32 trips, 3% utilized.

DRAM Always Fetches 128 Bytes

The GPU's memory system doesn't fetch individual bytes. DRAM hardware always fetches a full 128-byte cache line — whether you asked for 4 bytes or 128. This isn't a design choice the programmer makes; it's how the silicon works.

What you request vs what DRAM fetches: a 4 B request still triggers a full 128 B fetch, with the other 124 bytes unused. DRAM always fetches a full cache line — you pay for 128 bytes regardless.

Think of it like the warehouse pallet: you pay for the whole pallet regardless of how many items you actually need from it.

Warps: 32 Threads Acting Together

Recall from the Execution Model module: a warp is a group of 32 threads that execute the same instruction simultaneously. When a warp issues a memory access, the hardware examines all 32 addresses and groups them into the fewest possible 128-byte transactions.

  • Best case: all 32 addresses fall in a single 128-byte cache line → 1 transaction. 128 bytes fetched, 128 bytes used. 100% utilization.
  • Worst case: all 32 addresses fall in different cache lines → 32 transactions. 4,096 bytes fetched, only 128 bytes used. 3.1% utilization.

Same hardware. Same clock speed. 32x difference in effective bandwidth — purely from how threads address memory.
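To make the grouping concrete, here is a small host-side sketch (a conceptual model, not the real hardware coalescer): it counts how many distinct 128-byte lines a warp's 32 addresses touch; that count is the number of transactions.

// Conceptual model of coalescing (not the actual hardware algorithm):
// group a warp's 32 byte-addresses by which 128-byte line they fall in;
// the number of distinct lines is the number of transactions needed.
#include <set>
#include <cstdio>

int transactions_for_warp(const unsigned addrs[32]) {
    std::set<unsigned> lines;
    for (int t = 0; t < 32; ++t)
        lines.insert(addrs[t] / 128);       // 128-byte line index
    return (int)lines.size();
}

int main() {
    unsigned coalesced[32], scattered[32];
    for (int t = 0; t < 32; ++t) {
        coalesced[t] = 4 * t;               // consecutive floats: one line
        scattered[t] = 128 * t;             // one thread per line: 32 lines
    }
    printf("coalesced: %d transaction(s)\n", transactions_for_warp(coalesced));
    printf("scattered: %d transactions\n", transactions_for_warp(scattered));
    return 0;
}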

Why This Matters

In a real matmul kernel, fixing memory access patterns alone improved throughput by 6.6x — from 300 to 2,000 GFLOPS. This is the single most impactful GPU memory optimization.

On the right panel: Toggle between "Coalesced" and "Scattered." Watch how the transaction count jumps from 1 to 32 — and the utilization drops from 100% to 3%.

Coalesced Access

The Ideal Pattern: Coalesced Access

Coalesced access means that consecutive threads read consecutive memory addresses. Thread 0 reads address 0, thread 1 reads address 4 (next float), thread 2 reads address 8, and so on. All 32 addresses fit within a single 128-byte cache line.

threadIdx.x → memory address:
T0 → 0, T1 → 4, T2 → 8, T3 → 12, ..., T31 → 124
All 32 addresses fit in one 128-byte cache line → 1 transaction

In CUDA Code

The simplest coalesced pattern is a copy kernel where each thread reads and writes one element:

__global__ void copy(float* dst, float* src, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) dst[i] = src[i]; // consecutive threads → consecutive addresses
}

The key is threadIdx.x: within a warp, it goes 0, 1, 2, ... 31. Since the array index i is directly based on threadIdx.x, consecutive threads access consecutive elements. 32 floats × 4 bytes = 128 bytes = exactly one cache line.
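For context, here is a minimal host-side launch sketch for this copy kernel (illustrative sizes, no error checking): with 256 threads per block, each of the 8 warps in a block reads its own 128-byte line per load.

#include <cuda_runtime.h>

int main() {
    const int n = 1 << 20;                              // 1M floats (illustrative size)
    float *d_src, *d_dst;
    cudaMalloc((void**)&d_src, n * sizeof(float));
    cudaMalloc((void**)&d_dst, n * sizeof(float));

    int threads = 256;                                  // 8 warps per block
    int blocks = (n + threads - 1) / threads;
    copy<<<blocks, threads>>>(d_dst, d_src, n);         // fully coalesced copy
    cudaDeviceSynchronize();

    cudaFree(d_src);
    cudaFree(d_dst);
    return 0;
}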

A Note on Alignment

On older GPUs (pre-Volta), the starting address also had to be 128-byte aligned for optimal coalescing. On modern GPUs (Ampere, Hopper), the L1 cache handles misalignment automatically. Coalescing — consecutive threads to consecutive addresses — is what matters.

On the right panel: Watch the warp sweep across memory. All 32 threads are served in a single 128-byte transaction — nothing wasted.

Strided & Random Access

When Threads Skip Addresses

Strided access means consecutive threads access addresses that are separated by a fixed gap (the stride). Instead of reading elements 0, 1, 2, 3..., they read elements 0, 8, 16, 24... (stride of 8).

__global__ void strided(float* dst, float* src, int n, int stride) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) dst[i] = src[i * stride]; // threads skip addresses!
}

At stride 1, this is coalesced — 1 transaction, 100% utilization. But at stride 8, each thread's address is 32 bytes apart. The 32 threads now span 32 × 32 = 1,024 bytes — spread across 8 cache lines. The hardware must issue 8 separate 128-byte transactions, fetching 1,024 bytes to deliver just 128 bytes of useful data. 87.5% of bandwidth is wasted.

At stride 32, it's even worse: each thread hits a different cache line. 32 transactions, 4,096 bytes fetched, 3.1% utilization.
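As a rough rule of thumb (a simplifying model, not a hardware spec): with 4-byte elements and 128-byte lines, a warp whose threads are stride elements apart touches about min(32, stride) cache lines, so utilization is roughly 1 / min(32, stride). A quick sketch of that model:

#include <algorithm>
#include <cstdio>

int main() {
    // Model assumption: the warp starts at a cache-line boundary, 4-byte elements.
    for (int stride : {1, 2, 4, 8, 16, 32}) {
        int lines = std::min(32, stride);               // transactions per warp
        double util = 100.0 * 128.0 / (lines * 128.0);  // useful bytes / fetched bytes
        printf("stride %2d: %2d transactions, %5.1f%% utilization\n",
               stride, lines, util);
    }
    return 0;
}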

The Most Common Cause: Column Access

The most common real-world source of strided access is iterating down a column of a row-major matrix.

4×4 matrix stored row-major in memory: rows (0,0)…(0,3), (1,0)…(1,3), (2,0)…(2,3), (3,0)…(3,3) sit one after another, giving the flat layout 0–15.

Row access touches flat indices 0, 1, 2, 3 → consecutive ✓
Column access touches flat indices 0, 4, 8, 12 → stride 4 ✗

In a row-major layout, elements in the same row are consecutive in memory (coalesced), but elements in the same column are separated by the row width (strided). For a 1024-wide matrix, column access has stride 1024 — every thread hits a different cache line.

This is exactly what happens in a naive matrix transpose: reading rows is coalesced, but writing columns is strided. How do we fix it? The trick is to stage data through shared memory — Step 5 shows exactly how this turns a strided write into a coalesced one.

On the right panel: Drag the stride slider from 1 to 32. Watch the transaction packets fill with wasted bytes — the dashed regions are bandwidth you paid for but didn't use. The mini roofline chart below shows your kernel sliding deeper into memory-bound territory.

Shared Memory Bank Conflicts

From Global Memory to Shared Memory

So far we've focused on global memory (HBM) access patterns. Shared memory — the fast, on-chip memory that threads in a block share — has its own access concern: bank conflicts.

32 Banks, 4 Bytes Wide

Shared memory is divided into 32 banks, each 4 bytes wide. Consecutive 4-byte words map to consecutive banks, wrapping around after bank 31.

bank = (address / 4) % 32

addr 0–3     → bank 0
addr 4–7     → bank 1
addr 8–11    → bank 2
addr 12–15   → bank 3
  ⋮
addr 124–127 → bank 31
addr 128–131 → bank 0   ← wraps around

When 32 threads each access a different bank, all accesses happen in parallel — one cycle. But when multiple threads hit the same bank with different addresses, those accesses serialize.

N-Way Conflicts

If N threads access different addresses in the same bank, the accesses take N cycles instead of 1. This is an N-way bank conflict.

__shared__ float smem[1024];
float val = smem[threadIdx.x * stride]; // stride determines bank pattern

At stride 1: thread 0 → bank 0, thread 1 → bank 1, ... thread 31 → bank 31. All different banks. No conflict.

At stride 32: thread 0 → bank 0, thread 1 → bank 0, ... all threads → bank 0. 32-way conflict — 32x slower.
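A small host-side sketch (assuming the 32-bank, 4-byte-word model above) that counts how many threads of one warp land in each bank for smem[threadIdx.x * stride]; the tallest bank gives the N-way conflict:

#include <algorithm>
#include <cstdio>

int main() {
    // Note: this simple count ignores the broadcast exception described below
    // (threads reading the *same* address in a bank do not conflict).
    for (int stride : {1, 2, 32, 33}) {
        int per_bank[32] = {0};
        for (int t = 0; t < 32; ++t)
            per_bank[(t * stride) % 32]++;              // bank = word index % 32
        int worst = *std::max_element(per_bank, per_bank + 32);
        printf("stride %2d: worst bank holds %2d threads (%2d-way)\n",
               stride, worst, worst);
    }
    return 0;
}

Stride 33 comes out conflict-free even though stride 32 does not; that is exactly the effect the [32][33] padding fix below relies on.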

The Broadcast Exception

There's one important exception: when all threads read the same address, the hardware broadcasts the value to all threads for free. No conflict. Conflicts only happen when threads read different addresses in the same bank.

The Fix: Padding

The classic fix for bank conflicts in 2D shared memory arrays is to add one extra column:

Padding shifts bank assignments per row

[32][32] — conflicts (every row starts at bank 0):
  r0: B0 B1 B2 B3 …
  r1: B0 B1 B2 B3 …
  r2: B0 B1 B2 B3 …
  ↓ Column = same bank!

[32][33] — fixed (each row's bank mapping shifts by one):
  r0: B0 B1 B2 B3 …
  r1: B1 B2 B3 B4 …
  r2: B2 B3 B4 B5 …
  ↓ Column = different banks ✓

__shared__ float tile[32][33]; — the extra column shifts each row's bank mapping by 1 position. Columns that previously all mapped to the same bank now map to 32 different banks. The wasted memory (32 extra floats) is negligible compared to the 32x speedup.

On the right panel: Select different strides and watch how threads stack up in banks — taller stacks mean longer waits. Try the "Broadcast" checkbox to see the exception, then "Padding" to watch 32-way conflicts vanish.

Putting It Together

The Matrix Transpose Problem

Matrix transpose is the canonical example of memory access patterns in action. Transposing means reading rows and writing columns — but in row-major memory, column writes are strided.

Naive Transpose

out[col * N + row] = in[row * N + col];
// read: coalesced (row access) ✓ write: strided (column access) ✗

The read side is fine — consecutive threads read consecutive elements in a row. But the write side is strided: consecutive threads write to addresses separated by N (the row width). For a 1024×1024 matrix, that's stride 1024 — every write hits a different cache line.
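Spelled out as a full kernel (a sketch with illustrative names; it assumes one thread per element and, say, a 32×32 thread block):

__global__ void transpose_naive(float* out, const float* in, int N) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;    // varies fastest within a warp
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (row < N && col < N)
        out[col * N + row] = in[row * N + col];         // coalesced read, strided write
}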

Result: ~19 GB/s effective bandwidth. The hardware fetches 32x more data than needed on the write side.

Optimized Transpose: Shared Memory Staging

The fix: load a tile into shared memory (coalesced read from global), then write from shared memory in transposed order (coalesced write to global). Shared memory handles the transpose — it's fast enough that strided access there is cheap.

Optimized transpose: both global accesses coalesced

Global (input) →(coalesced read)→ Shared mem →(transpose here)→ Global (output) →(coalesced write)

Both global reads AND writes are coalesced — shared memory handles the transpose

__shared__ float tile[32][33]; // +1 padding avoids bank conflicts
int x = blockIdx.x * 32 + threadIdx.x, y = blockIdx.y * 32 + threadIdx.y;
tile[threadIdx.y][threadIdx.x] = in[y * N + x]; // coalesced global read
__syncthreads();
int x2 = blockIdx.y * 32 + threadIdx.x, y2 = blockIdx.x * 32 + threadIdx.y;
out[y2 * N + x2] = tile[threadIdx.x][threadIdx.y]; // coalesced global write

Both global reads AND global writes are now coalesced. The transpose happens in shared memory, where bank conflicts are avoided by the [32][33] padding trick from Step 4.

Result: ~100 GB/s effective bandwidth. 5x improvement — purely from fixing access patterns.
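For reference, "effective bandwidth" here is conventionally computed from the bytes that must logically move (one read plus one write per element) divided by kernel time; a sketch of that calculation, with a purely illustrative timing value:

#include <cstdio>

// Effective bandwidth = bytes that must logically move / kernel time.
// For an N×N float transpose: read N*N floats + write N*N floats.
double effective_bandwidth_gbps(long long N, double seconds) {
    double bytes = 2.0 * (double)N * (double)N * sizeof(float);
    return bytes / seconds / 1e9;                       // GB/s
}

int main() {
    double t = 1e-4;    // placeholder kernel time (100 us), not a measurement
    printf("1024x1024 transpose in %.0f us -> %.1f GB/s\n",
           t * 1e6, effective_bandwidth_gbps(1024, t));
    return 0;
}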

The Roofline Payoff

On the roofline chart, the naive kernel sits deep in memory-bound territory — it wastes most of the bytes it fetches, so it achieves only a fraction of peak bandwidth. Because the optimized kernel fetches only useful bytes, its arithmetic intensity measured against actual DRAM traffic rises, and it moves rightward, closer to the ridge point. Same arithmetic, same FLOPs — just better memory access patterns.

This is the core lesson of this module: access patterns determine where you sit on the roofline. The roofline model tells you whether you're memory-bound; this module taught you the most common reason why — and how to fix it.

What's Next

Module 6 (Tiling & Matrix Multiply) takes this further — using shared memory not just for transpose, but for the most important operation in deep learning: matrix multiplication. The tiling pattern loads data coalesced into shared memory and reuses it many times, dramatically raising arithmetic intensity.
