GPU Execution Model — Threads, Warps, and SMs
CUDA Hello World
What is the CUDA Execution Model?
The CUDA execution model is how NVIDIA GPUs organize and run parallel work. Instead of one thread doing everything, you launch thousands of threads that each do a tiny piece of the job simultaneously. The GPU hardware then schedules these threads across its processing units automatically.
This module walks through the full model: from your first kernel to the physical hardware that runs it.
Your First Kernel: Vector Addition
The simplest CUDA program adds two arrays element by element. On a CPU, you'd write a for-loop that processes elements one at a time. In CUDA, you create one thread for every element — if you have 1024 elements, you launch 1024 threads. Thread 0 computes C[0] = A[0] + B[0], Thread 1 computes C[1] = A[1] + B[1], and so on. All 1024 threads run the same code simultaneously, each on its own element.
On the right panel: Toggle between "Pseudocode" and "Real CUDA" to see how the same idea maps to actual GPU syntax. Hover over highlighted keywords for explanations.
What Makes This a "Kernel"?
A kernel is a function that runs on the GPU. The __global__ keyword marks it as callable from the CPU but executed on the GPU. When you launch a kernel, you specify how many threads to create — the GPU handles the rest.
The key insight: every thread runs the same code. The only difference is the thread's unique index (threadIdx.x + blockIdx.x * blockDim.x), which tells it which element to process.
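Putting the two pieces together, a minimal sketch of such a kernel might look like this (the name vecAdd and the float element type are illustrative choices, not the panel's exact code):

```cuda
// One thread per element: each thread adds exactly one pair of values.
__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    // Unique global index: position within the block plus the block's offset.
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    // Boundary check: threads past the end of the array do nothing.
    if (i < n) {
        c[i] = a[i] + b[i];
    }
}
```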
The Host-Device Pattern
CUDA programs follow a standard flow — data must cross the PCIe bus between CPU and GPU:
1. cudaMalloc — reserve memory on the GPU
2. cudaMemcpy (host → device) — copy the input data (float *a, *b) across the PCIe bus
3. kernel<<<>>>() — launch thousands of threads to do the computation
4. cudaMemcpy (device → host) — copy the results (float *c) back across the bus

Data crosses the PCIe bus twice — GPUs need large workloads to pay off.
This copy overhead is why you don't use GPUs for tiny computations — the data transfer costs more than the speedup. Module 3 (Memory Hierarchy) explores this in depth.
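For concreteness, here is a rough sketch of that host-side flow, assuming the vecAdd kernel above and omitting error checking:

```cuda
#include <cuda_runtime.h>

// Standard host-device flow for n floats already sitting in host arrays h_a, h_b, h_c.
void runVecAdd(const float *h_a, const float *h_b, float *h_c, int n) {
    size_t bytes = n * sizeof(float);
    float *d_a, *d_b, *d_c;

    cudaMalloc(&d_a, bytes);                                   // 1. reserve GPU memory
    cudaMalloc(&d_b, bytes);
    cudaMalloc(&d_c, bytes);

    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);       // 2. inputs cross the PCIe bus
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    int threadsPerBlock = 256;
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;  // ceil(n / blockDim)
    vecAdd<<<blocks, threadsPerBlock>>>(d_a, d_b, d_c, n);     // 3. launch thousands of threads

    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);       // 4. results cross back

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
}
```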
Try it: Click "Run" on the right panel to see 8 elements processed in parallel. Each thread handles exactly one A[i] + B[i] operation.
Threads, Blocks, and Grids
The Thread Hierarchy
CUDA organizes threads into a three-level logical hierarchy — this is how you structure work in your code, not how the physical hardware is built (that comes in Step 3):
- Thread — the smallest unit of execution. Each thread runs the kernel code on one piece of data.
- Block — a group of threads (typically 128 or 256) that can cooperate with each other via shared memory.
- Grid — the collection of all blocks needed to process your entire dataset.
Think of it like a school: threads are students, blocks are classrooms, and the grid is the whole school. Students in the same classroom can share materials (shared memory), but classrooms are independent.
You define these logical groups in your code. The GPU hardware then decides where to physically run them — Step 3 shows how.
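In CUDA source, those levels appear as the launch configuration. A small sketch reusing the vecAdd kernel from Step 1 (dim3 is CUDA's built-in type for 1-, 2-, or 3-dimensional sizes):

```cuda
dim3 threadsPerBlock(256);   // one block: 256 threads that can cooperate via shared memory
dim3 blocksPerGrid(16);      // the grid: 16 such blocks, 4096 threads in total
vecAdd<<<blocksPerGrid, threadsPerBlock>>>(d_a, d_b, d_c, 4096);  // one thread per element
```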
The Index Formula
Every thread needs to know which element to process. CUDA gives each thread three built-in variables — threadIdx.x (position in block), blockIdx.x (which block), and blockDim.x (threads per block) — which combine into a unique global index:
i = threadIdx.x + blockIdx.x × blockDim.x
Example grid: 4 blocks × 4 threads = 16 threads. The last thread of Block 3 (threadIdx.x = 3, blockIdx.x = 3, blockDim.x = 4) gets the global index i = 3 + 3 × 4 = 15.
Try it: On the right panel, click any thread cell to see its index calculation. Drag the N slider and change blockDim to see how the grid restructures.
Choosing Block Size
Block sizes should be a multiple of 32 (the warp size — explained in Step 4); other sizes are allowed, but they leave the last warp of every block partially empty. Common choices:
- 128 threads — safe default, good for kernels that use many registers
- 256 threads — most common, balances occupancy and resource usage
- 512 threads — when you need more shared memory cooperation
You launch ceil(N / blockDim) blocks. If N isn't evenly divisible, the last block has some threads that do nothing — they're guarded by the if (i < n) boundary check in the kernel, as in the sketch below.
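With N = 1000 and 256-thread blocks, for example, the arithmetic works out like this (a fragment that reuses the vecAdd kernel and device pointers from Step 1):

```cuda
int N = 1000;
int blockSize = 256;
int gridSize = (N + blockSize - 1) / blockSize;     // ceil(1000 / 256) = 4 blocks = 1024 threads
// Threads 1000-1023 of the last block fail the if (i < n) check and simply do nothing.
vecAdd<<<gridSize, blockSize>>>(d_a, d_b, d_c, N);
```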
Mapping to Hardware: SMs
From Code to Hardware
So far we've talked about the programmer's view: you write a grid of blocks, each containing threads. But where does this actually run on the physical chip?
Think of it like a restaurant:
- Threads are individual dishes to prepare
- Blocks are orders — a group of dishes that belong together
- SMs are kitchen stations — physical workspaces where orders get prepared
You (the programmer) submit orders. The restaurant manager (GPU scheduler) assigns orders to whichever kitchen stations are free. You don't choose which station prepares your order — and you don't need to.
What is an SM?
A Streaming Multiprocessor (SM) is the physical compute unit on an NVIDIA GPU. Each SM has its own cores, registers, and shared memory.
Here's the important distinction: threads, blocks, and grids are logical — they exist in your code as an organizational structure. SMs are physical — they're actual silicon on the chip. The GPU's job is to map your logical blocks onto its physical SMs.
When a block is assigned to an SM, all of that block's threads run on that SM together. One SM can run multiple blocks at the same time if it has enough resources.
Different GPUs have different numbers of SMs:
| GPU | SMs | Use case |
|---|---|---|
| A100 | 108 | Data center (Ampere) |
| H100 | 132 | Data center (Hopper) |
| B200 | 148 | Data center (Blackwell, latest) |
| RTX 4090 | 128 | Consumer (Ada Lovelace) |
| RTX 5090 | 170 | Consumer (Blackwell, latest) |
More SMs = more blocks can run at the same time = faster execution.
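If you're curious how many SMs your own GPU has, the runtime API reports it. A small sketch:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // query GPU 0
    printf("%s: %d SMs\n", prop.name, prop.multiProcessorCount);
    return 0;
}
```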
How Blocks Get Assigned to SMs
The GPU's block scheduler automatically assigns your thread blocks to SMs. The rules are simple:
- A block runs entirely on one SM — it never splits across two SMs
- Multiple blocks can share one SM — if the SM has enough room (registers, shared memory)
- Block execution order is undefined — Block 0 might start after Block 7
- You don't control the assignment — the hardware decides
Your code creates a grid of blocks — B0 through B5, 256 threads each. The GPU hardware then assigns each block to whichever SM has free resources and runs it there. You create blocks — the GPU decides which SM runs each one.
This is what makes CUDA programs scalable. The same code runs on a GPU with 4 SMs or 132 SMs — more SMs just means more blocks run at the same time. You never hardcode "run this block on SM #5."
Try it: On the right panel, click "Assign" to watch blocks flow into SMs. Then click "Replay" — the blocks arrive in a different order each time, proving that execution order is not guaranteed.
Occupancy: How Full Are the SMs?
Each SM has a limited amount of workspace — like a kitchen station with only so much counter space. If one order (block) uses half the counter, you can fit two orders at once. If one order uses all of it, you can only run one at a time.
Occupancy is the percentage of an SM's capacity — its warp slots, registers, and shared memory — that's actually in use. Higher occupancy means more warps (and therefore more blocks) resident on each SM at the same time.
If an SM can fit 4 blocks but you only give it 1, three-quarters of its capacity sits empty. If your blocks are too large, fewer fit on each SM, and you waste resources.
The key takeaway: you want enough blocks to keep every SM busy, and each block should be small enough that multiple blocks fit on one SM. Step 4 explains why having multiple blocks per SM is so important for performance.
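For a concrete number, the runtime API can estimate how many blocks of a specific kernel fit on one SM at once. A sketch, run from host code and assuming the vecAdd kernel from Step 1:

```cuda
int blockSize = 256;
int maxBlocksPerSM = 0;
// How many 256-thread blocks of vecAdd can be resident on one SM simultaneously,
// given the kernel's register usage (the final 0 = no dynamic shared memory).
cudaOccupancyMaxActiveBlocksPerMultiprocessor(&maxBlocksPerSM, vecAdd, blockSize, 0);
printf("vecAdd: up to %d blocks per SM at blockSize %d\n", maxBlocksPerSM, blockSize);
```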
Warps: 32 Threads in Lockstep
Warps: The Real Execution Unit
Inside each SM, threads are grouped into warps of 32 threads. A block of 256 threads = 8 warps. The warp is what actually executes on the hardware — all 32 threads run the same instruction at the same time.
(Why "warp"? The name comes from weaving — on a loom, the "warp" is a set of parallel threads held in tension together. Same idea: 32 GPU threads running in lockstep.)
This model is called SIMT (Single Instruction, Multiple Threads): one instruction, 32 threads executing it simultaneously. It's similar to SIMD (Single Instruction, Multiple Data) on CPUs, but each thread has its own registers and can branch independently (with a penalty — see Step 5).
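The warp grouping is visible from inside a kernel: each thread's warp and lane follow directly from its index. A sketch (warpId and lane are just local variable names):

```cuda
#include <cstdio>

__global__ void whoAmI() {
    int warpId = threadIdx.x / warpSize;   // which warp inside this block (warpSize is 32)
    int lane   = threadIdx.x % warpSize;   // this thread's position within its warp (0-31)
    // All threads sharing (blockIdx.x, warpId) execute each instruction together.
    printf("block %d, warp %d, lane %d\n", blockIdx.x, warpId, lane);
}
```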
The Warp Scheduler
Each SM has warp schedulers that decide which warp to run each cycle. The key insight: switching between warps has zero cost.
Why zero cost? An SM has a large register file (e.g., 65,536 registers) that's divided up among all active warps at launch time. Each warp gets its own dedicated slice — Warp 0 might use registers 0-255, Warp 1 uses 256-511, and so on. These slices stay allocated the entire time.
So when the scheduler switches from Warp 0 to Warp 1, it doesn't save or load anything — both warps' registers are already sitting in the SM. It just starts reading from a different section. No saving, no loading, no overhead.
Try it: On the right panel, switch between the "Warp Scheduler" tab (see warps sliding in and out of the execution slot) and the "Timeline" tab (see how stalls are filled by other warps). Press Play to watch the animation.
How GPUs Hide Latency
Recall from Module 1: CPUs hide memory latency with big caches and branch prediction. GPUs use a completely different strategy — they switch to another warp.
When Warp 0 requests data from HBM (which takes 200-400 cycles to arrive), the scheduler instantly switches to Warp 1. If Warp 1 also stalls, switch to Warp 2. By the time all warps have stalled, Warp 0's data has arrived — it can resume.
This is why occupancy matters: more warps on an SM means more opportunities to hide latency. If you only have 2 warps and both stall, the SM sits idle. With 16 warps, idle time nearly disappears.
The Numbers
A typical SM can hold 32-64 active warps (1024-2048 threads). Each clock cycle, its warp schedulers issue instructions from only a few of them. At any moment, each resident warp is in one of three states:
- Executing — actively running
- Stalled — waiting for memory, waiting for a dependency
- Ready — eligible to execute, waiting for a scheduler slot
Warp Divergence
When Threads Disagree
What happens when an if/else appears inside your kernel and threads in the same warp take different paths?
Since all 32 threads in a warp must execute the same instruction at the same time (SIMT), the GPU cannot run the if path for some threads and the else path for others simultaneously. Instead, it runs both paths sequentially:
- Execute the if path — threads that don't take it are masked off (sit idle, waste their cycle)
- Execute the else path — threads that took the if path are now masked
- All threads converge and continue together
This is called warp divergence, and it means a divergent if/else takes 2× the time of a uniform branch.
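As a toy example (not the panel's exact code), a branch on even vs. odd index splits every warp in half:

```cuda
__global__ void divergent(float *data) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i % 2 == 0) {
        data[i] *= 2.0f;    // pass 1: even lanes active, odd lanes masked off
    } else {
        data[i] += 1.0f;    // pass 2: odd lanes active, even lanes masked off
    }
    // Each warp spends two passes through this region instead of one.
}
```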
Try it: On the right panel, click "Step" to walk through the code line by line. Watch which threads are active (green) and which are masked (✕) at each line. Then switch to "Uniform Branch" and compare.
When Divergence Is Free
Divergence is only a problem within a warp. If all 32 threads in Warp 0 take the if path, and all 32 threads in Warp 1 take the else path — that's perfectly fine. Different warps are independent.
This means divergence patterns that split cleanly along warp boundaries have no penalty. The worst case is when threads alternate: even threads take if, odd threads take else — both paths run with only half the threads active.
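The same idea in code: a hypothetical kernel whose branch falls on warp boundaries, so all 32 lanes of each warp agree and no warp splits (assumes the block size is a multiple of 32):

```cuda
__global__ void warpAligned(float *data) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    // i / 32 is the same for all 32 threads of a warp, so the whole warp takes one path.
    if ((i / 32) % 2 == 0) {
        data[i] *= 2.0f;    // even-numbered warps
    } else {
        data[i] += 1.0f;    // odd-numbered warps
    }
}
```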
Common Divergence Patterns
| Pattern | Impact |
|---|---|
| if (i < N) | Boundary check — only the last warp is affected. Minimal cost. |
| if (threadIdx.x < 16) | First half vs. second half of each warp. Every warp diverges. Bad. |
| if (data[i] > threshold) | Data-dependent — unpredictable. Potentially bad if data is mixed. |
| if (blockIdx.x < K) | Block-level branch — all threads in a block agree. No divergence. |
Practical Advice
- Structure data so threads within a warp take the same path when possible
- Move conditional logic to the block level when you can
- Accept divergence for boundary checks (last warp only)
- Profile before optimizing — divergence is often not the bottleneck
This completes the execution model. The next module explores the memory hierarchy — why data movement, not computation, is usually the real bottleneck.