GPU Memory Access Patterns — Coalescing & Bank Conflicts
Why Access Patterns Matter
What are GPU Memory Access Patterns?
Imagine a warehouse where you can only order full pallets — even if you need just one item. If 32 workers each need an item from the same pallet, the warehouse makes one trip. But if each worker needs an item from a different pallet, the warehouse makes 32 trips — delivering 32 pallets when only 32 items were needed.
GPU memory works the same way. It delivers data in fixed-size chunks called cache lines (128 bytes each). How your threads request data — whether they ask for nearby addresses or scattered ones — determines whether the hardware makes 1 trip or 32.
This is the memory access pattern: the mapping between threads and the addresses they request. It's the difference between using 3% of your GPU's memory bandwidth and using all of it.
DRAM Always Fetches 128 Bytes
The GPU's memory system doesn't fetch individual bytes. DRAM hardware always fetches a full 128-byte cache line — whether you asked for 4 bytes or 128. This isn't a design choice the programmer makes; it's how the silicon works.
Diagram: what you request vs. what DRAM fetches — a full 128-byte cache line either way; you pay for 128 bytes regardless.
Think of it like the warehouse pallet: you pay for the whole pallet regardless of how many items you actually need from it.
Warps: 32 Threads Acting Together
Recall from the Execution Model module: a warp is a group of 32 threads that execute the same instruction simultaneously. When a warp issues a memory access, the hardware examines all 32 addresses and groups them into the fewest possible 128-byte transactions.
- Best case: all 32 addresses fall in a single 128-byte cache line → 1 transaction. 128 bytes fetched, 128 bytes used. 100% utilization.
- Worst case: all 32 addresses fall in different cache lines → 32 transactions. 4,096 bytes fetched, only 128 bytes used. 3.1% utilization.
Same hardware. Same clock speed. 32x difference in effective bandwidth — purely from how threads address memory.
Why This Matters
In a real matmul kernel, fixing memory access patterns alone improved throughput by 6.6x — from 300 to 2,000 GFLOPS. This is the single most impactful GPU memory optimization.
On the right panel: Toggle between "Coalesced" and "Scattered." Watch how the transaction count jumps from 1 to 32 — and the utilization drops from 100% to 3%.
Coalesced Access
The Ideal Pattern: Coalesced Access
Coalesced access means that consecutive threads read consecutive memory addresses. Thread 0 reads address 0, thread 1 reads address 4 (next float), thread 2 reads address 8, and so on. All 32 addresses fit within a single 128-byte cache line.
threadIdx.x → memory address
In CUDA Code
The simplest coalesced pattern is a copy kernel where each thread reads and writes one element:
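A minimal sketch of such a copy kernel (the exact kernel shown in this module may differ slightly):

```cuda
// Each thread copies one float. Consecutive threads compute consecutive
// values of i, so each warp's 32 loads — and its 32 stores — fall in a
// single 128-byte cache line.
__global__ void copyKernel(const float* __restrict__ in,
                           float* __restrict__ out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // 0,1,2,... within a warp
    if (i < n)
        out[i] = in[i];   // coalesced read, coalesced write
}
```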
The key is threadIdx.x: within a warp, it goes 0, 1, 2, ... 31. Since the array index i is directly based on threadIdx.x, consecutive threads access consecutive elements. 32 floats × 4 bytes = 128 bytes = exactly one cache line.
A Note on Alignment
On older GPUs (pre-Volta), the starting address also had to be 128-byte aligned for optimal coalescing. On modern GPUs (Ampere, Hopper), the L1 cache handles misalignment automatically. Coalescing — consecutive threads to consecutive addresses — is what matters.
On the right panel: Watch the warp sweep across memory. All 32 threads are served in a single 128-byte transaction — nothing wasted.
Strided & Random Access
When Threads Skip Addresses
Strided access means consecutive threads access addresses that are separated by a fixed gap (the stride). Instead of reading elements 0, 1, 2, 3..., they read elements 0, 8, 16, 24... (stride of 8).
At stride 1, this is coalesced — 1 transaction, 100% utilization. But at stride 8, each thread's address is 32 bytes apart. The 32 threads now span 32 × 32 = 1,024 bytes — spread across 8 cache lines. The hardware must issue 8 separate 128-byte transactions, fetching 1,024 bytes to deliver just 128 bytes of useful data. 87.5% of bandwidth is wasted.
At stride 32, it's even worse: each thread hits a different cache line. 32 transactions, 4,096 bytes fetched, 3.1% utilization.
The Most Common Cause: Column Access
The most common real-world source of strided access is iterating down a column of a row-major matrix.
Diagram: a 4×4 matrix stored row-major — rows are contiguous in the flat memory layout; column neighbors are 4 elements apart.
In a row-major layout, elements in the same row are consecutive in memory (coalesced), but elements in the same column are separated by the row width (strided). For a 1024-wide matrix, column access has stride 1024 — every thread hits a different cache line.
This is exactly what happens in a naive matrix transpose: reading rows is coalesced, but writing columns is strided. How do we fix it? The trick is to stage data through shared memory — Step 5 shows exactly how this turns a strided write into a coalesced one.
On the right panel: Drag the stride slider from 1 to 32. Watch the transaction packets fill with wasted bytes — the dashed regions are bandwidth you paid for but didn't use. The mini roofline chart below shows your kernel sliding deeper into memory-bound territory.
Shared Memory Bank Conflicts
From Global Memory to Shared Memory
So far we've focused on global memory (HBM) access patterns. Shared memory — the fast, on-chip memory that threads in a block share — has its own access concern: bank conflicts.
32 Banks, 4 Bytes Wide
Shared memory is divided into 32 banks, each 4 bytes wide. Consecutive 4-byte words map to consecutive banks, wrapping around after bank 31.
The formula is: bank = (address / 4) % 32
When 32 threads each access a different bank, all accesses happen in parallel — one cycle. But when multiple threads hit the same bank with different addresses, those accesses serialize.
N-Way Conflicts
If N threads access different addresses in the same bank, the accesses take N cycles instead of 1. This is an N-way bank conflict.
At stride 1: thread 0 → bank 0, thread 1 → bank 1, ... thread 31 → bank 31. All different banks. No conflict.
At stride 32: thread 0 → bank 0, thread 1 → bank 0, ... all threads → bank 0. 32-way conflict — 32x slower.
The Broadcast Exception
There's one important exception: when all threads read the same address, the hardware broadcasts the value to all threads for free. No conflict. Conflicts only happen when threads read different addresses in the same bank.
The Fix: Padding
The classic fix for bank conflicts in 2D shared memory arrays is to add one extra column:
Diagram: padding shifts bank assignments per row — with [32][32], every element of a column lands in the same bank (conflicts); with [32][33], a column spans 32 different banks (fixed).
__shared__ float tile[32][33]; — the extra column shifts each row's bank mapping by 1 position. Columns that previously all mapped to the same bank now map to 32 different banks. The wasted memory (32 extra floats) is negligible compared to the 32x speedup.
On the right panel: Select different strides and watch how threads stack up in banks — taller stacks mean longer waits. Try the "Broadcast" checkbox to see the exception, then "Padding" to watch 32-way conflicts vanish.
Putting It Together
The Matrix Transpose Problem
Matrix transpose is the canonical example of memory access patterns in action. Transposing means reading rows and writing columns — but in row-major memory, column writes are strided.
Naive Transpose
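A minimal sketch of a naive transpose kernel (names are illustrative; the module's exact code may differ):

```cuda
// Naive transpose: thread (x, y) reads in[y][x] and writes out[x][y].
// Reads are coalesced (consecutive x -> consecutive addresses), but
// writes are strided: consecutive x -> addresses N floats apart.
__global__ void transposeNaive(const float* in, float* out, int N) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < N && y < N)
        out[x * N + y] = in[y * N + x];  // strided write, stride N
}
```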
The read side is fine — consecutive threads read consecutive elements in a row. But the write side is strided: consecutive threads write to addresses separated by N (the row width). For a 1024×1024 matrix, that's stride 1024 — every write hits a different cache line.
Result: ~19 GB/s effective bandwidth. The hardware fetches 32x more data than needed on the write side.
Optimized Transpose: Shared Memory Staging
The fix: load a tile into shared memory (coalesced read from global), then write from shared memory in transposed order (coalesced write to global). Shared memory handles the transpose — it's fast enough that strided access there is cheap.
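A sketch of the staged version, assuming 32×32 tiles and one thread per tile element (a common formulation; the module's exact code may differ):

```cuda
#define TILE 32

// Tiled transpose: coalesced global read into shared memory, then a
// coalesced global write from the transposed tile. The [TILE][TILE + 1]
// padding avoids bank conflicts on the column access (Step 4).
__global__ void transposeTiled(const float* in, float* out, int N) {
    __shared__ float tile[TILE][TILE + 1];

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    if (x < N && y < N)
        tile[threadIdx.y][threadIdx.x] = in[y * N + x];  // coalesced read

    __syncthreads();  // whole tile must be loaded before reading it back

    // Swap the block coordinates; threadIdx.x still varies fastest, so
    // the global write is coalesced too.
    x = blockIdx.y * TILE + threadIdx.x;
    y = blockIdx.x * TILE + threadIdx.y;
    if (x < N && y < N)
        out[y * N + x] = tile[threadIdx.x][threadIdx.y];  // transposed
}
```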
Diagram: optimized transpose — both global accesses coalesced, with shared memory handling the transpose.
Both global reads AND global writes are now coalesced. The transpose happens in shared memory, where bank conflicts are avoided by the [32][33] padding trick from Step 4.
Result: ~100 GB/s effective bandwidth. 5x improvement — purely from fixing access patterns.
The Roofline Payoff
On the roofline chart, the naive kernel sits deep in memory-bound territory, achieving only a small fraction of peak bandwidth. The optimized kernel climbs toward the bandwidth roof. Same arithmetic, same FLOPs — just better memory access patterns.
This is the core lesson of this module: access patterns determine where you sit on the roofline. The roofline model tells you whether you're memory-bound; this module taught you the most common reason why — and how to fix it.
What's Next
Module 6 (Tiling & Matrix Multiply) takes this further — using shared memory not just for transpose, but for the most important operation in deep learning: matrix multiplication. The tiling pattern loads data coalesced into shared memory and reuses it many times, dramatically raising arithmetic intensity.