Why GPUs? CPU vs GPU Architecture Explained
What is a GPU?
A GPU (Graphics Processing Unit) is a processor designed for massive parallelism — running thousands of simple operations simultaneously rather than a few complex ones quickly. Originally built for rendering pixels, GPUs now power most machine learning workloads because ML is dominated by matrix math, which is inherently parallel.
This module explains why GPUs exist, how they differ from CPUs, and where CUDA fits in.
The CPU: Built for Speed
A modern CPU has 8-64 cores, each one a sophisticated processor. Every core packs:
- Branch prediction — guesses which code path to take next, so the pipeline never stalls
- Out-of-order execution — reorders instructions on the fly to keep all execution units busy
- Large caches — multi-level L1/L2/L3 caches (tens to hundreds of MB in total) to hide memory latency
- Deep pipelines — 15-20+ stages to maximize clock speed
All of this complexity serves one goal: execute a single thread of instructions as fast as possible.
A CPU core is like an expert chef — extremely skilled, equipped with every tool, can tackle any recipe. But there are only a handful of them in the kitchen.
Try it: On the right panel, click "Start" to see both CPU and GPU process the same 32 tasks simultaneously. Focus on the CPU side (left) — notice how 8 cores process tasks one at a time, while the GPU finishes almost instantly.
Where CPUs Shine
CPUs excel at tasks with complex control flow — operating systems, web servers, compilers, databases. Code with lots of if/else branches, pointer chasing, and unpredictable memory access patterns is where all that branch prediction and cache hierarchy pays off.
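As a rough sketch of the contrast (a contrived example, not a benchmark), compare a branch-heavy, data-dependent loop with a batch of identical, independent operations:

```python
import numpy as np

# Branch-heavy, data-dependent work: every step depends on the previous one
# and takes an unpredictable path. This is the code CPUs are built for.
def collatz_steps(n: int) -> int:
    steps = 0
    while n != 1:
        n = n // 2 if n % 2 == 0 else 3 * n + 1  # hard-to-predict branch
        steps += 1
    return steps

# Identical, independent operations: the same arithmetic applied to every
# element, with no dependencies between elements.
x = np.random.rand(10_000)
y = x * 2.0 + 1.0  # 10,000 independent multiply-adds
```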
But what happens when your workload is 10,000 identical operations with no dependencies between them?
The GPU: Built for Throughput
A GPU takes the opposite design approach. Instead of a few dozen sophisticated cores, a modern GPU has thousands of simple cores — an NVIDIA H100 has 16,896 CUDA cores.
Each GPU core is far simpler than a CPU core:
- No branch prediction — if threads diverge, both paths execute serially
- In-order execution — instructions run in the order written
- Tiny caches per core — latency is hidden by switching between thousands of threads, not by large caches
A GPU core is like a line cook — can only do one thing, but there are 16,000 of them working simultaneously.
Latency vs Throughput
This is the fundamental tradeoff:
| | CPU | GPU |
|---|---|---|
| Cores | 8-64 complex | 10,000+ simple |
| Optimized for | Latency (fast per-task) | Throughput (fast total) |
| Latency hiding | Large caches, speculation | Massive thread parallelism |
| Best at | Complex serial logic | Identical parallel operations |
Try it: Click "Start" on the right panel. Both CPU and GPU receive the same 32 tasks simultaneously — watch the GPU finish instantly while the CPU is still crawling through them one by one.
How GPUs Hide Latency
Fetching data from memory is slow — hundreds of clock cycles. CPUs and GPUs handle this wait very differently:
CPU approach: avoid the wait. CPUs use big caches (storing data nearby so it's fast to reach) and speculation (guessing what data you'll need next and fetching it early). When data isn't cached, the core stalls — it sits idle until the data arrives.
GPU approach: do something else while waiting. When one group of threads is waiting for memory, the GPU instantly switches to another group that's ready to compute. With thousands of threads queued up, there's always someone ready to work. No one waits idle.
Think of it like a restaurant kitchen. A CPU chef waits for ingredients to arrive before starting the next dish. A GPU kitchen has 1,000 cooks — while some wait for ingredients, hundreds of others are already cooking. The kitchen never stops.
Why GPUs Dominate Machine Learning
Nearly every operation in a neural network is a matrix multiplication:
- Linear layers: output = input × weights (matmul)
- Attention: scores = Q × K^T (matmul), output = scores × V (matmul)
- Embedding lookup: implemented as a gather, but mathematically equivalent to a matmul with a one-hot (sparse) matrix
- Convolutions: can be expressed as matmul via im2col
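A minimal PyTorch sketch (the shapes and weights are made up for illustration) shows the first two bullets as literal matmuls:

```python
import torch

batch, seq, d = 2, 8, 16                   # illustrative sizes
x = torch.randn(batch, seq, d)

# Linear layer: output = input x weights
W = torch.randn(d, d)
out = x @ W                                 # (batch, seq, d)

# Attention: scores = Q x K^T, output = scores x V
Wq, Wk, Wv = (torch.randn(d, d) for _ in range(3))
Q, K, V = x @ Wq, x @ Wk, x @ Wv
scores = torch.softmax(Q @ K.transpose(-2, -1) / d**0.5, dim=-1)
attn_out = scores @ V                       # (batch, seq, d)
```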
But what is matrix multiplication, and why is it so GPU-friendly?
Try it on the right panel: Click any cell in the output matrix C. You'll see its dot product — multiply each pair from a row of A and a column of B, then add them up. Click "Animate step by step" to watch the calculation unfold.
Why GPUs Love This
Each output cell is a completely independent calculation. Cell C[0][0] doesn't need to wait for C[0][1] — they use different row-column pairs. This means all cells can be computed at the same time.
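A naive NumPy version of the algorithm (a sketch for intuition, not how real libraries implement it) makes the independence explicit:

```python
import numpy as np

A = np.random.rand(4, 4)
B = np.random.rand(4, 4)
C = np.zeros((4, 4))

# Each C[i][j] is its own dot product. It never reads or writes any other
# output cell, so all 16 iterations could run at once on 16 different cores.
for i in range(4):
    for j in range(4):
        C[i, j] = np.dot(A[i, :], B[:, j])

assert np.allclose(C, A @ B)  # matches the library matmul
```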
See the difference: Switch to the "Matmul: CPU vs GPU" tab on the right panel. CPU computes one cell at a time. GPU computes all 16 simultaneously.
A 4×4 matrix has just 16 output cells. Real neural networks use 4096×4096 matrices — that's 16 million independent operations, all running simultaneously on GPU cores.
The Numbers
TFLOPS (teraFLOPS) means trillions (1,000,000,000,000, i.e. 12 zeros) of floating-point operations per second: a measure of how much math a chip can crunch.
A single forward pass through a 7B parameter model costs roughly 14 billion floating-point operations per token (one multiply and one add per parameter). For scale:
- At FP16, an H100 GPU delivers ~990 TFLOPS of matrix math via its Tensor Cores
- A high-end CPU might deliver 1-2 TFLOPS
That's a 500-1000x difference for matrix-heavy workloads.
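Plugging those numbers in (a back-of-the-envelope sketch that ignores memory bandwidth, launch overhead, and utilization) shows where the gap comes from:

```python
flops_per_token = 14e9      # ~2 FLOPs per parameter for a 7B model
gpu_flops_per_s = 990e12    # H100 dense FP16 Tensor Core peak
cpu_flops_per_s = 2e12      # optimistic high-end CPU peak

print(f"GPU: {flops_per_token / gpu_flops_per_s * 1e6:,.0f} us/token at peak")
print(f"CPU: {flops_per_token / cpu_flops_per_s * 1e6:,.0f} us/token at peak")
print(f"Gap: {gpu_flops_per_s / cpu_flops_per_s:.0f}x")
```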
When GPUs Don't Help
Not everything benefits from GPU parallelism:
- Small workloads — launching work on a GPU has overhead (~5-10 microseconds). If the computation itself takes less than that, the CPU wins.
- Sequential algorithms — sorting, tree traversal, and graph algorithms with data dependencies between steps can't exploit massive parallelism.
- Irregular memory access — random pointer chasing kills GPU performance because it can't coalesce memory reads.
Machine learning hits the sweet spot: large, regular, parallel matrix operations.
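A quick way to see the small-workload point (a sketch assuming PyTorch and a CUDA-capable GPU; exact timings will vary by machine):

```python
import time
import torch

def avg_matmul_seconds(device: str, n: int, iters: int = 100) -> float:
    a = torch.randn(n, n, device=device)
    b = torch.randn(n, n, device=device)
    if device == "cuda":
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        a @ b
    if device == "cuda":
        torch.cuda.synchronize()  # wait until the GPU actually finishes
    return (time.perf_counter() - start) / iters

if torch.cuda.is_available():
    # Tiny matrices: launch overhead dominates, so the CPU often wins.
    print("8x8:      ", avg_matmul_seconds("cpu", 8), avg_matmul_seconds("cuda", 8))
    # Large matrices: parallel math dominates, so the GPU wins by a wide margin.
    print("4096x4096:", avg_matmul_seconds("cpu", 4096), avg_matmul_seconds("cuda", 4096))
```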
What Is CUDA?
CUDA (Compute Unified Device Architecture) is NVIDIA's programming model for running general-purpose computations on GPUs. Released in 2007, it transformed GPUs from graphics-only hardware into programmable parallel processors.
Before CUDA, running non-graphics code on a GPU required disguising computations as pixel-shading operations — a painful hack. CUDA gave developers a C-like language to write GPU code directly.
Look at the right panel: The software stack from your Python code down to GPU hardware. Hover each layer to see what it does. The green "this track" labels show which layers this track teaches.
The Software Stack
When you write model(input) in PyTorch, here's what actually happens:
1. PyTorch breaks your model into operations (matmul, softmax, layernorm...)
2. torch.compile / Triton (optional) fuses operations and generates optimized GPU code
3. CUDA libraries (cuBLAS, cuDNN) provide hand-tuned implementations of common operations
4. The CUDA driver turns compiled kernels (PTX, an assembly-like intermediate representation) into machine code and launches them
5. GPU hardware executes the kernels across thousands of cores
Most ML engineers work at level 1 — they write Python and let PyTorch handle everything below. But understanding the lower layers helps you reason about why things are fast or slow.
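At level 1 that looks like ordinary PyTorch; everything below it is triggered implicitly. A minimal sketch (the model and sizes are invented for illustration; torch.compile assumes PyTorch 2.x):

```python
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024)).to(device)
x = torch.randn(8, 1024, device=device)

# Level 1: one line of Python. Underneath, PyTorch dispatches a few kernels
# (the matmuls via cuBLAS, GELU as an elementwise kernel) through the CUDA driver.
y = model(x)

# Level 2 (optional): fuse operations and generate specialized GPU code.
compiled = torch.compile(model)
y2 = compiled(x)
```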
Why NVIDIA Dominates
NVIDIA's moat isn't the hardware — it's the ecosystem:
- CUDA — the programming model (this track)
- cuBLAS / cuDNN — optimized math and neural network libraries
- TensorRT — inference optimization engine
- NCCL — multi-GPU communication library
- Triton — Python-level kernel language (originally from OpenAI, now widely adopted)
Every major ML framework (PyTorch, TensorFlow, JAX) is built on this stack. Competing hardware (AMD ROCm, Intel oneAPI) exists but lacks a decade of ecosystem maturity.
What This Track Teaches
You don't need to write CUDA kernels. But you do need to understand:
- Why some operations are slow (they're memory-bound, not compute-bound)
- Why FlashAttention is fast (it minimizes HBM round-trips)
- Why batch size matters (it improves GPU utilization)
- Why quantization helps (lower precision = higher Tensor Core throughput)
- How to read annotated CUDA code (not write production kernels)
The next 8 modules build this understanding, starting with how GPUs actually schedule and execute work.