GPU & CUDA
Nine interactive modules, from the GPU execution model to Triton and torch.compile. All free.
Module 1
Why GPUs?
CPU vs GPU design philosophy, throughput vs latency, and the CUDA software stack.
Module 2
Execution Model
Threads, warps, blocks, grids, and SMs — how GPUs schedule parallel work.
Module 3
Memory Hierarchy
Registers, shared memory, L2, HBM, PCIe, and NVLink — where data lives.
Module 4
Roofline Model
Compute-bound vs memory-bound — the universal performance mental model.
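A taste of the roofline idea from Module 4, as a minimal Python sketch. The peak numbers below are illustrative assumptions (roughly datacenter-GPU class), not measurements of any specific device:

```python
# Roofline sketch: a kernel's attainable throughput is capped by
# min(peak compute, memory bandwidth * arithmetic intensity).
# PEAK_FLOPS and PEAK_BW are assumed, illustrative values.

PEAK_FLOPS = 312e12          # assumed peak FP16 throughput, FLOP/s
PEAK_BW = 1.6e12             # assumed HBM bandwidth, bytes/s
RIDGE = PEAK_FLOPS / PEAK_BW # intensity (FLOP/byte) at the ridge point

def attainable_flops(intensity):
    """Roofline bound for a kernel with the given FLOP/byte ratio."""
    return min(PEAK_FLOPS, PEAK_BW * intensity)

def classify(flops, bytes_moved):
    """Label a kernel compute-bound or memory-bound."""
    intensity = flops / bytes_moved
    bound = "compute-bound" if intensity > RIDGE else "memory-bound"
    return intensity, bound

# Elementwise add of two float32 vectors: 1 FLOP per 12 bytes moved
# (read two operands, write one result) -> far left of the ridge.
print(classify(1.0, 12.0))   # low intensity -> memory-bound
```

The ridge point is the arithmetic intensity where the two roofs meet; kernels to its left are limited by bandwidth no matter how fast the ALUs are.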
Module 5
Memory Access Patterns
Coalesced access, bank conflicts, and why memory layout matters.
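The coalescing idea from Module 5 can be sketched with a toy transaction counter. This is an illustrative model (32-thread warp, 128-byte memory transactions, float32 elements), not a simulator of real hardware:

```python
# Toy model of memory coalescing: count how many 128-byte
# transactions a warp of 32 threads needs for a given access
# stride. Parameters are illustrative assumptions.

WARP = 32       # threads per warp
ELEM = 4        # bytes per float32
SEGMENT = 128   # bytes per memory transaction

def transactions(stride):
    """Distinct 128-byte segments touched when thread t reads
    element t * stride."""
    addrs = [t * stride * ELEM for t in range(WARP)]
    return len({a // SEGMENT for a in addrs})

print(transactions(1))   # unit stride: 1 transaction for the warp
print(transactions(32))  # large stride: 32 separate transactions
```

Unit-stride (coalesced) access lets the whole warp share one transaction; a stride of 32 floats wastes 127 of every 128 bytes fetched.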
Module 6
Tiling & Matrix Multiply
The fundamental GPU optimization — data reuse via shared memory tiling.
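The data-reuse pattern behind Module 6 can be previewed in plain Python. The block loops below mirror how a CUDA kernel would stage TILE x TILE sub-matrices in shared memory; this is a sketch of the loop structure, not a performance-faithful implementation:

```python
# Tiled matrix multiply: iterate over TILE x TILE blocks so each
# loaded sub-matrix of A and B is reused TILE times. In CUDA the
# commented step is a cooperative load into shared memory.

TILE = 2

def matmul_tiled(A, B):
    n = len(A)  # assumes square n x n matrices, n divisible by TILE
    C = [[0.0] * n for _ in range(n)]
    for i0 in range(0, n, TILE):           # tile row of C
        for j0 in range(0, n, TILE):       # tile column of C
            for k0 in range(0, n, TILE):   # walk tiles along k
                # CUDA: load A[i0:i0+TILE, k0:k0+TILE] and
                # B[k0:k0+TILE, j0:j0+TILE] into shared memory once,
                # then every thread in the block reuses them.
                for i in range(i0, i0 + TILE):
                    for j in range(j0, j0 + TILE):
                        for k in range(k0, k0 + TILE):
                            C[i][j] += A[i][k] * B[k][j]
    return C
```

The arithmetic is identical to the naive triple loop; only the traversal order changes, which is exactly what makes shared-memory reuse possible on a GPU.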
Module 7
Tensor Cores & Mixed Precision
Hardware matrix-multiply units, FP16/BF16/FP8/INT8, and why matrix dimensions must align.
Module 8
Operator Fusion & FlashAttention
Fused kernels, IO-aware design, and why FlashAttention is fast.
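Why fusion (Module 8) pays off can be shown with back-of-envelope byte accounting for y = relu(x + b) over N float32 elements. The accounting below is an illustrative simplification, ignoring caches:

```python
# Bytes moved through HBM for y = relu(x + b), N float32 elements,
# comparing two separate kernels vs one fused kernel.

def bytes_unfused(n, elem=4):
    # kernel 1 (add):  read x, read b, write tmp  -> 3n
    # kernel 2 (relu): read tmp, write y          -> 2n
    return 5 * n * elem

def bytes_fused(n, elem=4):
    # one kernel: read x, read b, write y; the intermediate
    # sum never leaves registers.
    return 3 * n * elem
```

For a memory-bound elementwise chain, cutting traffic from 5n to 3n bytes is roughly a 1.7x speedup; FlashAttention applies the same IO-aware reasoning to the much larger intermediates of attention.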
Module 9
Triton & torch.compile
Python-level GPU programming and the abstraction stack.