Why GPUs? CPU vs GPU Architecture Explained
What is a GPU?
A GPU (Graphics Processing Unit) is a processor designed for massive parallelism — running thousands of simple operations simultaneously rather than a few complex ones quickly. Originally built for rendering pixels, GPUs now power most machine learning workloads because ML is dominated by matrix math, which is inherently parallel.
This module explains why GPUs exist, how they differ from CPUs, and where CUDA fits in.
The CPU: Built for Speed
A modern CPU has 8-64 cores, each one a sophisticated processor. Every core packs:
- Branch prediction — guesses which code path to take next, so the pipeline never stalls
- Out-of-order execution — reorders instructions on the fly to keep all execution units busy
- Large caches — multi-level L1/L2/L3 caches (tens to hundreds of MB in total) to hide memory latency
- Deep pipelines — 15-20+ stages to maximize clock speed
All of this complexity serves one goal: execute a single thread of instructions as fast as possible.
A CPU core is like an expert chef — extremely skilled, equipped with every tool, can tackle any recipe. But there are only a handful of them in the kitchen.
Try it: On the right panel, click "Start" to see both CPU and GPU process the same 32 tasks simultaneously. Focus on the CPU side (left) — notice how 8 cores process tasks one at a time, while the GPU finishes almost instantly.
Where CPUs Shine
CPUs excel at tasks with complex control flow — operating systems, web servers, compilers, databases. Code with lots of if/else branches, pointer chasing, and unpredictable memory access patterns is where all that branch prediction and cache hierarchy pays off.
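As a rough sketch of the contrast (a contrived example, not a benchmark), compare a branch-heavy, data-dependent loop with a batch of identical, independent operations:

```python
import numpy as np

# Branch-heavy, data-dependent work: every step depends on the previous one
# and takes an unpredictable path. This is the code CPUs are built for.
def collatz_steps(n: int) -> int:
    steps = 0
    while n != 1:
        n = n // 2 if n % 2 == 0 else 3 * n + 1  # hard-to-predict branch
        steps += 1
    return steps

# Identical, independent operations: the same arithmetic applied to every
# element, with no dependencies between elements.
x = np.random.rand(10_000)
y = x * 2.0 + 1.0  # 10,000 independent multiply-adds
```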
But what happens when your workload is 10,000 identical operations with no dependencies between them?
The GPU: Built for Throughput
A GPU takes the opposite design approach. Instead of a few dozen sophisticated cores, a modern GPU has thousands of simple cores — an NVIDIA H100 has 16,896 CUDA cores.
Each GPU core is far simpler than a CPU core:
- No branch prediction — if threads diverge, both paths execute serially
- In-order execution — instructions run in the order written
- Tiny caches per core — latency is hidden by switching between thousands of threads, not by large caches
A GPU core is like a line cook — can only do one thing, but there are 16,000 of them working simultaneously.
Latency vs Throughput
This is the fundamental tradeoff:
| | CPU | GPU |
|---|---|---|
| Cores | 8-64 complex | 10,000+ simple |
| Optimized for | Latency (fast per-task) | Throughput (fast total) |
| Latency hiding | Large caches, speculation | Massive thread parallelism |
| Best at | Complex serial logic | Identical parallel operations |
Try it: Click "Start" on the right panel. Both CPU and GPU receive the same 32 tasks simultaneously — watch the GPU finish instantly while the CPU is still crawling through them one by one.
How GPUs Hide Latency
Fetching data from memory is slow — hundreds of clock cycles. CPUs and GPUs handle this wait very differently:
CPU approach: avoid the wait. CPUs use big caches (storing data nearby so it's fast to reach) and speculation (guessing what data you'll need next and fetching it early). When data isn't cached, the core stalls — it sits idle until the data arrives.
GPU approach: do something else while waiting. When one group of threads is waiting for memory, the GPU instantly switches to another group that's ready to compute. With thousands of threads queued up, there's always someone ready to work. No one waits idle.
Think of it like a restaurant kitchen. A CPU chef waits for ingredients to arrive before starting the next dish. A GPU kitchen has 1,000 cooks — while some wait for ingredients, hundreds of others are already cooking. The kitchen never stops.
Why GPUs Dominate Machine Learning
Nearly every operation in a neural network is a matrix multiplication:
- Linear layers: output = input × weights (matmul)
- Attention: scores = Q × K^T (matmul), output = scores × V (matmul)
- Embedding lookup: implemented as a gather, but mathematically equivalent to a matmul with a one-hot (sparse) matrix
- Convolutions: can be expressed as matmul via im2col
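A minimal PyTorch sketch (the shapes and weights are made up for illustration) shows the first two bullets as literal matmuls:

```python
import torch

batch, seq, d = 2, 8, 16                   # illustrative sizes
x = torch.randn(batch, seq, d)

# Linear layer: output = input x weights
W = torch.randn(d, d)
out = x @ W                                 # (batch, seq, d)

# Attention: scores = Q x K^T, output = scores x V
Wq, Wk, Wv = (torch.randn(d, d) for _ in range(3))
Q, K, V = x @ Wq, x @ Wk, x @ Wv
scores = torch.softmax(Q @ K.transpose(-2, -1) / d**0.5, dim=-1)
attn_out = scores @ V                       # (batch, seq, d)
```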
But what is matrix multiplication, and why is it so GPU-friendly?
Try it on the right panel: Click any cell in the output matrix C. You'll see its dot product — multiply each pair from a row of A and a column of B, then add them up. Click "Animate step by step" to watch the calculation unfold.
Why GPUs Love This
Each output cell is a completely independent calculation. Cell C[0][0] doesn't need to wait for C[0][1] — they use different row-column pairs. This means all cells can be computed at the same time.
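A naive NumPy version of the algorithm (a sketch for intuition, not how real libraries implement it) makes the independence explicit:

```python
import numpy as np

A = np.random.rand(4, 4)
B = np.random.rand(4, 4)
C = np.zeros((4, 4))

# Each C[i][j] is its own dot product. It never reads or writes any other
# output cell, so all 16 iterations could run at once on 16 different cores.
for i in range(4):
    for j in range(4):
        C[i, j] = np.dot(A[i, :], B[:, j])

assert np.allclose(C, A @ B)  # matches the library matmul
```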
See the difference: Switch to the "Matmul: CPU vs GPU" tab on the right panel. CPU computes one cell at a time. GPU computes all 16 simultaneously.
A 4×4 matrix has just 16 output cells. Real neural networks use 4096×4096 matrices — that's 16 million independent operations, all running simultaneously on GPU cores.
The Numbers
TFLOPS (teraFLOPS) means trillions (1,000,000,000,000, i.e. 12 zeros) of floating-point operations per second: a measure of how much math a chip can crunch.
A single forward pass through a 7B parameter model costs roughly 14 billion floating-point operations per token (one multiply and one add per parameter). For scale:
- At FP16, an H100 GPU delivers ~990 TFLOPS of matrix math via its Tensor Cores
- A high-end CPU might deliver 1-2 TFLOPS
That's a 500-1000x difference for matrix-heavy workloads.
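Plugging those numbers in (a back-of-the-envelope sketch that ignores memory bandwidth, launch overhead, and utilization) shows where the gap comes from:

```python
flops_per_token = 14e9      # ~2 FLOPs per parameter for a 7B model
gpu_flops_per_s = 990e12    # H100 dense FP16 Tensor Core peak
cpu_flops_per_s = 2e12      # optimistic high-end CPU peak

print(f"GPU: {flops_per_token / gpu_flops_per_s * 1e6:,.0f} us/token at peak")
print(f"CPU: {flops_per_token / cpu_flops_per_s * 1e6:,.0f} us/token at peak")
print(f"Gap: {gpu_flops_per_s / cpu_flops_per_s:.0f}x")
```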
When GPUs Don't Help
Not everything benefits from GPU parallelism:
- Small workloads — launching work on a GPU has overhead (~5-10 microseconds). If the computation itself takes less than that, the CPU wins.
- Sequential algorithms — sorting, tree traversal, and graph algorithms with data dependencies between steps can't exploit massive parallelism.
- Irregular memory access — random pointer chasing kills GPU performance because it can't coalesce memory reads.
Machine learning hits the sweet spot: large, regular, parallel matrix operations.
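A quick way to see the small-workload point (a sketch assuming PyTorch and a CUDA-capable GPU; exact timings will vary by machine):

```python
import time
import torch

def avg_matmul_seconds(device: str, n: int, iters: int = 100) -> float:
    a = torch.randn(n, n, device=device)
    b = torch.randn(n, n, device=device)
    if device == "cuda":
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        a @ b
    if device == "cuda":
        torch.cuda.synchronize()  # wait until the GPU actually finishes
    return (time.perf_counter() - start) / iters

if torch.cuda.is_available():
    # Tiny matrices: launch overhead dominates, so the CPU often wins.
    print("8x8:      ", avg_matmul_seconds("cpu", 8), avg_matmul_seconds("cuda", 8))
    # Large matrices: parallel math dominates, so the GPU wins by a wide margin.
    print("4096x4096:", avg_matmul_seconds("cpu", 4096), avg_matmul_seconds("cuda", 4096))
```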
What Is CUDA?
CUDA (Compute Unified Device Architecture) is NVIDIA's programming model for running general-purpose computations on GPUs. Released in 2007, it transformed GPUs from graphics-only hardware into programmable parallel processors.
Before CUDA, running non-graphics code on a GPU required disguising computations as pixel-shading operations — a painful hack. CUDA gave developers a C-like language to write GPU code directly.
Look at the right panel: The software stack from your Python code down to GPU hardware. Hover each layer to see what it does. The green "this track" labels show which layers this track teaches.
The Software Stack
When you write model(input) in PyTorch, here's what actually happens:
1. PyTorch breaks your model into operations (matmul, softmax, layernorm...)
2. torch.compile / Triton (optional) fuses operations and generates optimized GPU code
3. CUDA libraries (cuBLAS, cuDNN) provide hand-tuned implementations of common operations
4. The CUDA driver turns compiled kernels (PTX, an assembly-like intermediate representation) into machine code and launches them
5. GPU hardware executes the kernels across thousands of cores
Most ML engineers work at level 1 — they write Python and let PyTorch handle everything below. But understanding the lower layers helps you reason about why things are fast or slow.
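At level 1 that looks like ordinary PyTorch; everything below it is triggered implicitly. A minimal sketch (the model and sizes are invented for illustration; torch.compile assumes PyTorch 2.x):

```python
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024)).to(device)
x = torch.randn(8, 1024, device=device)

# Level 1: one line of Python. Underneath, PyTorch dispatches a few kernels
# (the matmuls via cuBLAS, GELU as an elementwise kernel) through the CUDA driver.
y = model(x)

# Level 2 (optional): fuse operations and generate specialized GPU code.
compiled = torch.compile(model)
y2 = compiled(x)
```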
Why NVIDIA Dominates
NVIDIA's moat isn't the hardware — it's the ecosystem:
- CUDA — the programming model (this track)
- cuBLAS / cuDNN — optimized math and neural network libraries
- TensorRT — inference optimization engine
- NCCL — multi-GPU communication library
- Triton — Python-level kernel language (originally from OpenAI, now widely adopted)
Every major ML framework (PyTorch, TensorFlow, JAX) is built on this stack. Competing hardware (AMD ROCm, Intel oneAPI) exists but lacks a decade of ecosystem maturity.
What This Track Teaches
You don't need to write CUDA kernels. But you do need to understand:
- Why some operations are slow (they're memory-bound, not compute-bound)
- Why FlashAttention is fast (it minimizes HBM round-trips)
- Why batch size matters (it improves GPU utilization)
- Why quantization helps (lower precision = higher Tensor Core throughput)
- How to read annotated CUDA code (not write production kernels)
The next 8 modules build this understanding, starting with how GPUs actually schedule and execute work.