GPU Execution Model — Threads, Warps, and SMs
CUDA Hello World
What is the CUDA Execution Model?
The CUDA execution model is how NVIDIA GPUs organize and run parallel work. Instead of one thread doing everything, you launch thousands of threads that each do a tiny piece of the job simultaneously. The GPU hardware then schedules these threads across its processing units automatically.
This module walks through the full model: from your first kernel to the physical hardware that runs it.
Your First Kernel: Vector Addition
The simplest CUDA program adds two arrays element by element. On a CPU, you'd write a for-loop that processes elements one at a time. In CUDA, you create one thread for every element — if you have 1024 elements, you launch 1024 threads. Thread 0 computes C[0] = A[0] + B[0], Thread 1 computes C[1] = A[1] + B[1], and so on. All 1024 threads run the same code simultaneously, each on its own element.
On the right panel: Toggle between "Pseudocode" and "Real CUDA" to see how the same idea maps to actual GPU syntax. Hover over highlighted keywords for explanations.
What Makes This a "Kernel"?
A kernel is a function that runs on the GPU. The __global__ keyword marks it as callable from the CPU but executed on the GPU. When you launch a kernel, you specify how many threads to create — the GPU handles the rest.
The key insight: every thread runs the same code. The only difference is the thread's unique index (threadIdx.x + blockIdx.x * blockDim.x), which tells it which element to process.
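Putting the two pieces together, a minimal sketch of such a kernel might look like this (the name vecAdd and the float element type are illustrative choices, not the panel's exact code):

```cuda
// One thread per element: each thread adds exactly one pair of values.
__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    // Unique global index: position within the block plus the block's offset.
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    // Boundary check: threads past the end of the array do nothing.
    if (i < n) {
        c[i] = a[i] + b[i];
    }
}
```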
The Host-Device Pattern
CUDA programs follow a standard flow — data must cross the PCIe bus between CPU and GPU:
1. cudaMalloc — reserve memory on the GPU
2. cudaMemcpy (host → device) — copy the input data (float *a, *b) across the PCIe bus
3. kernel<<<>>>() — launch thousands of threads to do the computation
4. cudaMemcpy (device → host) — copy the results (float *c) back across the bus

Data crosses the PCIe bus twice — GPUs need large workloads to pay off.
This copy overhead is why you don't use GPUs for tiny computations — the data transfer costs more than the speedup. Module 3 (Memory Hierarchy) explores this in depth.
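For concreteness, here is a rough sketch of that host-side flow, assuming the vecAdd kernel above and omitting error checking:

```cuda
#include <cuda_runtime.h>

// Standard host-device flow for n floats already sitting in host arrays h_a, h_b, h_c.
void runVecAdd(const float *h_a, const float *h_b, float *h_c, int n) {
    size_t bytes = n * sizeof(float);
    float *d_a, *d_b, *d_c;

    cudaMalloc(&d_a, bytes);                                   // 1. reserve GPU memory
    cudaMalloc(&d_b, bytes);
    cudaMalloc(&d_c, bytes);

    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);       // 2. inputs cross the PCIe bus
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    int threadsPerBlock = 256;
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;  // ceil(n / blockDim)
    vecAdd<<<blocks, threadsPerBlock>>>(d_a, d_b, d_c, n);     // 3. launch thousands of threads

    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);       // 4. results cross back

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
}
```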
Try it: Click "Run" on the right panel to see 8 elements processed in parallel. Each thread handles exactly one A[i] + B[i] operation.
Threads, Blocks, and Grids
The Thread Hierarchy
CUDA organizes threads into a three-level logical hierarchy — this is how you structure work in your code, not how the physical hardware is built (that comes in Step 3):
- Thread — the smallest unit of execution. Each thread runs the kernel code on one piece of data.
- Block — a group of threads (typically 128 or 256) that can cooperate with each other via shared memory.
- Grid — the collection of all blocks needed to process your entire dataset.
Think of it like a school: threads are students, blocks are classrooms, and the grid is the whole school. Students in the same classroom can share materials (shared memory), but classrooms are independent.
You define these logical groups in your code. The GPU hardware then decides where to physically run them — Step 3 shows how.
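In CUDA source, those levels appear as the launch configuration. A small sketch reusing the vecAdd kernel from Step 1 (dim3 is CUDA's built-in type for 1-, 2-, or 3-dimensional sizes):

```cuda
dim3 threadsPerBlock(256);   // one block: 256 threads that can cooperate via shared memory
dim3 blocksPerGrid(16);      // the grid: 16 such blocks, 4096 threads in total
vecAdd<<<blocksPerGrid, threadsPerBlock>>>(d_a, d_b, d_c, 4096);  // one thread per element
```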
The Index Formula
Every thread needs to know which element to process. CUDA gives each thread three built-in variables — threadIdx.x (position in block), blockIdx.x (which block), and blockDim.x (threads per block) — which combine into a unique global index:
i = threadIdx.x + blockIdx.x × blockDim.x
Example grid: 4 blocks × 4 threads = 16 threads. The last thread of Block 3 (threadIdx.x = 3, blockIdx.x = 3, blockDim.x = 4) gets the global index i = 3 + 3 × 4 = 15.
Try it: On the right panel, click any thread cell to see its index calculation. Drag the N slider and change blockDim to see how the grid restructures.
Choosing Block Size
Block sizes should be a multiple of 32 (the warp size — explained in Step 4); other sizes are allowed, but they leave the last warp of every block partially empty. Common choices:
- 128 threads — safe default, good for kernels that use many registers
- 256 threads — most common, balances occupancy and resource usage
- 512 threads — when you need more shared memory cooperation
You launch ceil(N / blockDim) blocks. If N isn't evenly divisible, the last block has some threads that do nothing — they're guarded by the if (i < n) boundary check in the kernel, as in the sketch below.
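With N = 1000 and 256-thread blocks, for example, the arithmetic works out like this (a fragment that reuses the vecAdd kernel and device pointers from Step 1):

```cuda
int N = 1000;
int blockSize = 256;
int gridSize = (N + blockSize - 1) / blockSize;     // ceil(1000 / 256) = 4 blocks = 1024 threads
// Threads 1000-1023 of the last block fail the if (i < n) check and simply do nothing.
vecAdd<<<gridSize, blockSize>>>(d_a, d_b, d_c, N);
```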
Mapping to Hardware: SMs
From Code to Hardware
So far we've talked about the programmer's view: you write a grid of blocks, each containing threads. But where does this actually run on the physical chip?
Think of it like a restaurant:
- Threads are individual dishes to prepare
- Blocks are orders — a group of dishes that belong together
- SMs are kitchen stations — physical workspaces where orders get prepared
You (the programmer) submit orders. The restaurant manager (GPU scheduler) assigns orders to whichever kitchen stations are free. You don't choose which station prepares your order — and you don't need to.
What is an SM?
A Streaming Multiprocessor (SM) is the physical compute unit on an NVIDIA GPU. Each SM has its own cores, registers, and shared memory.
Here's the important distinction: threads, blocks, and grids are logical — they exist in your code as an organizational structure. SMs are physical — they're actual silicon on the chip. The GPU's job is to map your logical blocks onto its physical SMs.
When a block is assigned to an SM, all of that block's threads run on that SM together. One SM can run multiple blocks at the same time if it has enough resources.
Different GPUs have different numbers of SMs:
| GPU | SMs | Use case |
|---|---|---|
| A100 | 108 | Data center (Ampere) |
| H100 | 132 | Data center (Hopper) |
| B200 | 148 | Data center (Blackwell, latest) |
| RTX 4090 | 128 | Consumer (Ada Lovelace) |
| RTX 5090 | 170 | Consumer (Blackwell, latest) |
More SMs = more blocks can run at the same time = faster execution.
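If you're curious how many SMs your own GPU has, the runtime API reports it. A small sketch:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // query GPU 0
    printf("%s: %d SMs\n", prop.name, prop.multiProcessorCount);
    return 0;
}
```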
How Blocks Get Assigned to SMs
The GPU's block scheduler automatically assigns your thread blocks to SMs. The rules are simple:
- A block runs entirely on one SM — it never splits across two SMs
- Multiple blocks can share one SM — if the SM has enough room (registers, shared memory)
- Block execution order is undefined — Block 0 might start after Block 7
- You don't control the assignment — the hardware decides
Your code creates a grid of blocks — B0 through B5, 256 threads each. The GPU hardware then assigns each block to whichever SM has free resources and runs it there. You create blocks — the GPU decides which SM runs each one.
This is what makes CUDA programs scalable. The same code runs on a GPU with 4 SMs or 132 SMs — more SMs just means more blocks run at the same time. You never hardcode "run this block on SM #5."
Try it: On the right panel, click "Assign" to watch blocks flow into SMs. Then click "Replay" — the blocks arrive in a different order each time, proving that execution order is not guaranteed.
Occupancy: How Full Are the SMs?
Each SM has a limited amount of workspace — like a kitchen station with only so much counter space. If one order (block) uses half the counter, you can fit two orders at once. If one order uses all of it, you can only run one at a time.
Occupancy is the percentage of an SM's capacity — its warp slots, registers, and shared memory — that's actually in use. Higher occupancy means more warps (and therefore more blocks) resident on each SM at the same time.
If an SM can fit 4 blocks but you only give it 1, three-quarters of its capacity sits empty. If your blocks are too large, fewer fit on each SM, and you waste resources.
The key takeaway: you want enough blocks to keep every SM busy, and each block should be small enough that multiple blocks fit on one SM. Step 4 explains why having multiple blocks per SM is so important for performance.
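For a concrete number, the runtime API can estimate how many blocks of a specific kernel fit on one SM at once. A sketch, run from host code and assuming the vecAdd kernel from Step 1:

```cuda
int blockSize = 256;
int maxBlocksPerSM = 0;
// How many 256-thread blocks of vecAdd can be resident on one SM simultaneously,
// given the kernel's register usage (the final 0 = no dynamic shared memory).
cudaOccupancyMaxActiveBlocksPerMultiprocessor(&maxBlocksPerSM, vecAdd, blockSize, 0);
printf("vecAdd: up to %d blocks per SM at blockSize %d\n", maxBlocksPerSM, blockSize);
```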
Warps: 32 Threads in Lockstep
Warps: The Real Execution Unit
Inside each SM, threads are grouped into warps of 32 threads. A block of 256 threads = 8 warps. The warp is what actually executes on the hardware — all 32 threads run the same instruction at the same time.
(Why "warp"? The name comes from weaving — on a loom, the "warp" is a set of parallel threads held in tension together. Same idea: 32 GPU threads running in lockstep.)
This model is called SIMT (Single Instruction, Multiple Threads): one instruction, 32 threads executing it simultaneously. It's similar to SIMD (Single Instruction, Multiple Data) on CPUs, but each thread has its own registers and can branch independently (with a penalty — see Step 5).
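The warp grouping is visible from inside a kernel: each thread's warp and lane follow directly from its index. A sketch (warpId and lane are just local variable names):

```cuda
#include <cstdio>

__global__ void whoAmI() {
    int warpId = threadIdx.x / warpSize;   // which warp inside this block (warpSize is 32)
    int lane   = threadIdx.x % warpSize;   // this thread's position within its warp (0-31)
    // All threads sharing (blockIdx.x, warpId) execute each instruction together.
    printf("block %d, warp %d, lane %d\n", blockIdx.x, warpId, lane);
}
```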
The Warp Scheduler
Each SM has warp schedulers that decide which warp to run each cycle. The key insight: switching between warps has zero cost.
Why zero cost? An SM has a large register file (e.g., 65,536 registers) that's divided up among all active warps at launch time. Each warp gets its own dedicated slice — Warp 0 might use registers 0-255, Warp 1 uses 256-511, and so on. These slices stay allocated the entire time.
So when the scheduler switches from Warp 0 to Warp 1, it doesn't save or load anything — both warps' registers are already sitting in the SM. It just starts reading from a different section. No saving, no loading, no overhead.
Try it: On the right panel, switch between the "Warp Scheduler" tab (see warps sliding in and out of the execution slot) and the "Timeline" tab (see how stalls are filled by other warps). Press Play to watch the animation.
How GPUs Hide Latency
Recall from Module 1: CPUs hide memory latency with big caches and branch prediction. GPUs use a completely different strategy — they switch to another warp.
When Warp 0 requests data from HBM (which takes 200-400 cycles to arrive), the scheduler instantly switches to Warp 1. If Warp 1 also stalls, switch to Warp 2. By the time all warps have stalled, Warp 0's data has arrived — it can resume.
This is why occupancy matters: more warps on an SM means more opportunities to hide latency. If you only have 2 warps and both stall, the SM sits idle. With 16 warps, idle time nearly disappears.
The Numbers
A typical SM can hold 32-64 active warps (1024-2048 threads). Each clock cycle, its warp schedulers issue instructions from only a few of them. At any moment, each resident warp is in one of three states:
- Executing — actively running
- Stalled — waiting for memory, waiting for a dependency
- Ready — eligible to execute, waiting for a scheduler slot
Warp Divergence
When Threads Disagree
What happens when an if/else appears inside your kernel and threads in the same warp take different paths?
Since all 32 threads in a warp must execute the same instruction at the same time (SIMT), the GPU cannot run the if path for some threads and the else path for others simultaneously. Instead, it runs both paths sequentially:
- Execute the if path — threads that don't take it are masked off (sit idle, waste their cycle)
- Execute the else path — threads that took the if path are now masked
- All threads converge and continue together
This is called warp divergence, and it means a divergent if/else takes 2× the time of a uniform branch.
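As a toy example (not the panel's exact code), a branch on even vs. odd index splits every warp in half:

```cuda
__global__ void divergent(float *data) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i % 2 == 0) {
        data[i] *= 2.0f;    // pass 1: even lanes active, odd lanes masked off
    } else {
        data[i] += 1.0f;    // pass 2: odd lanes active, even lanes masked off
    }
    // Each warp spends two passes through this region instead of one.
}
```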
Try it: On the right panel, click "Step" to walk through the code line by line. Watch which threads are active (green) and which are masked (✕) at each line. Then switch to "Uniform Branch" and compare.
When Divergence Is Free
Divergence is only a problem within a warp. If all 32 threads in Warp 0 take the if path, and all 32 threads in Warp 1 take the else path — that's perfectly fine. Different warps are independent.
This means divergence patterns that split cleanly along warp boundaries have no penalty. The worst case is when threads alternate: even threads take if, odd threads take else — both paths run with only half the threads active.
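The same idea in code: a hypothetical kernel whose branch falls on warp boundaries, so all 32 lanes of each warp agree and no warp splits (assumes the block size is a multiple of 32):

```cuda
__global__ void warpAligned(float *data) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    // i / 32 is the same for all 32 threads of a warp, so the whole warp takes one path.
    if ((i / 32) % 2 == 0) {
        data[i] *= 2.0f;    // even-numbered warps
    } else {
        data[i] += 1.0f;    // odd-numbered warps
    }
}
```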
Common Divergence Patterns
| Pattern | Impact |
|---|---|
| if (i < N) | Boundary check — only the last warp is affected. Minimal cost. |
| if (threadIdx.x < 16) | First half vs. second half of each warp. Every warp diverges. Bad. |
| if (data[i] > threshold) | Data-dependent — unpredictable. Potentially bad if data is mixed. |
| if (blockIdx.x < K) | Block-level branch — all threads in a block agree. No divergence. |
Practical Advice
- Structure data so threads within a warp take the same path when possible
- Move conditional logic to the block level when you can
- Accept divergence for boundary checks (last warp only)
- Profile before optimizing — divergence is often not the bottleneck
This completes the execution model. The next module explores the memory hierarchy — why data movement, not computation, is usually the real bottleneck.