
Tensor Cores & Mixed Precision Explained

What are Tensor Cores?

Tensor Cores are specialized hardware units on NVIDIA GPUs that perform matrix multiply-accumulate (MMA) operations — computing a full 4×4 matrix result in a single clock cycle. Introduced with Volta (2017) and upgraded in every generation since, they deliver 8-32× higher throughput than regular CUDA cores for matrix math, which is why they dominate modern ML training and inference.

MMA means: multiply two small matrices and add the result to a running total — D = A × B + C — all in one clock cycle. The "accumulate" (+C) is key: it lets the hardware keep a running sum across tiles, exactly like the sum += tileA * tileB loop from Module 6's tiled matmul. Without it, you'd need a separate addition step after every multiply.
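To make the accumulate concrete, here is a minimal NumPy sketch (sizes and names are illustrative) of a matmul built from 4×4 tiles — each step performs exactly one D = A × B + C, with the += keeping the running sum across tiles:

import numpy as np

TILE = 4
N = 8  # tiny example: an 8×8 matmul assembled from 4×4 tiles
A = np.random.rand(N, N).astype(np.float32)
B = np.random.rand(N, N).astype(np.float32)
D = np.zeros((N, N), dtype=np.float32)

for i in range(0, N, TILE):
    for j in range(0, N, TILE):
        for k in range(0, N, TILE):
            # one "MMA": multiply two 4×4 tiles and accumulate into the output tile
            D[i:i+TILE, j:j+TILE] += A[i:i+TILE, k:k+TILE] @ B[k:k+TILE, j:j+TILE]

assert np.allclose(D, A @ B)  # the tiled, accumulated result matches the full matmul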

Beyond CUDA Cores

Every NVIDIA GPU ships with two kinds of compute units: CUDA cores and Tensor Cores. For most of GPU history, CUDA cores did everything. Since 2017, Tensor Cores have taken over the inner loop of deep learning.

The Calculator vs. the Spreadsheet

A CUDA core is a calculator — it does one multiply-add per clock. Give it a * b + c and it computes one result, then moves on.

A Tensor Core is a spreadsheet that fills a 4×4 block of results in one operation. Give it two 4×4 matrices A and B plus an accumulator C, and it computes D = A × B + C — all 16 output elements — in a single clock cycle.

Matrix multiplication is the dominant operation in deep learning: attention, feedforward layers, embedding lookups, and convolutions all reduce to matmul. Tensor Cores are hardware built for exactly this operation.

Diagram: CUDA Core vs Tensor Core — the same 4×4 multiply. A CUDA core produces 1 result per cycle (16 cycles for the full block); a Tensor Core produces all 16 results in one cycle — 16× the throughput for a 4×4 matrix multiply.

Throughput Across Generations

Each generation roughly doubles Tensor Core throughput. Internally, the number of threads cooperating on one MMA also grew: 8 threads (Volta) → 32 (Ampere) → 128 (Hopper) → 256 (Blackwell).

GPU  | CUDA Cores (FP32) | Tensor Cores (TF32) | Tensor Cores (FP16) | Tensor Cores (FP8)
V100 | 15.7 TFLOPS       | —                   | 125 TFLOPS          | —
A100 | 19.5 TFLOPS       | 156 TFLOPS          | 312 TFLOPS          | —
H100 | 51 TFLOPS         | 495 TFLOPS          | 990 TFLOPS          | 1,979 TFLOPS
B200 | ~60 TFLOPS        | ~1,100 TFLOPS       | ~2,250 TFLOPS       | ~4,500 TFLOPS

On the right panel, watch the CUDA core fill a 4×4 grid one cell at a time (16 cycles), while the Tensor Core fills all 16 cells at once. Select different GPU generations to see throughput numbers scale.

Precision Formats for Tensor Cores

Not every precision format works with Tensor Cores — and not every GPU supports every format. Choosing the right format is the key lever for trading accuracy against throughput.

Bit layout by format (exponent bits determine range; mantissa bits determine precision):

Format   | Exponent | Mantissa | Notes
FP32     | 8 bits   | 23 bits  | Full range, full precision
TF32     | 8 bits   | 10 bits  | TensorFloat-32 — FP32 range, FP16 precision, 8× throughput
BF16     | 8 bits   | 7 bits   | Same range as FP32, less precision
FP16     | 5 bits   | 10 bits  | Smaller exponent → can overflow
FP8 E4M3 | 4 bits   | 3 bits   | Forward pass — more precision, range ±448
FP8 E5M2 | 5 bits   | 2 bits   | Backward pass — wider range (±57,344), less precision
INT8     | —        | —        | 8-bit integer: 256 levels, no float encoding

TF32 = FP32 range + FP16 precision. FP8 E4M3 is used for the forward pass; FP8 E5M2 for the backward pass.

TF32 — NVIDIA's Hybrid

TF32 (TensorFloat-32) is NVIDIA's marketing name — it's actually 19 bits, not 32. TF32 truncates FP32's mantissa from 23 bits to 10 before multiplication. You get FP32's number range with FP16-level precision — and 8× throughput. Since Ampere, TF32 is the default mode for FP32 matmuls on Tensor Cores.

The key point: TF32 is a Tensor Core internal format. You write float in your code. NVIDIA's math libraries silently convert to TF32 during matmul and convert back. You get the speedup with no code changes.
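In PyTorch the switch is a library setting, not a dtype — a small sketch of the relevant knobs (defaults vary by PyTorch version and GPU):

import torch

torch.backends.cuda.matmul.allow_tf32 = True   # let FP32 matmuls run as TF32 on Tensor Cores
torch.backends.cudnn.allow_tf32 = True         # same for cuDNN convolutions
torch.set_float32_matmul_precision("high")     # newer knob: "highest" = true FP32, "high" = TF32 allowed

x = torch.randn(4096, 4096, device="cuda")     # you still write plain float32
y = torch.randn(4096, 4096, device="cuda")
z = x @ y                                      # executed as TF32 on Ampere-or-newer Tensor Cores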

BF16 — The Safe 16-bit Format

BF16 (Brain Floating Point) was created at Google Brain. It has the same 8-bit exponent as FP32, so it covers the same magnitude range (±3.4×10³⁸). The tradeoff: only 7 mantissa bits vs FP32's 23. BF16 rarely overflows or underflows, making it the safest 16-bit format for training.

The naming origin: "BF" stands for "Brain Float" — the format was developed for Google Brain's TPU research before becoming widely adopted across GPUs. Its name, so close to FP16, is a frequent source of confusion.

FP8 — Two Flavors for Two Jobs

FP8 comes in two flavors for different jobs. E4M3 (4 exponent + 3 mantissa) has more precision but narrower range (±448) — used for the forward pass. E5M2 (5 exponent + 2 mantissa) has wider range (±57,344) but less precision — used for the backward pass where gradient magnitudes span many orders.

The logic: during the forward pass, activations cluster in a predictable range and need precision. During the backward pass, gradients vary wildly in magnitude and need range more than precision.

On the right panel, select a format and see how 3.14159 gets represented. Watch precision decrease and rounding error grow as you move from FP32 → TF32 → FP16 → FP8.
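You can reproduce the same experiment in PyTorch by round-tripping a value through each storage format (TF32 is a matmul-internal format rather than a storage dtype, so it is omitted; the FP8 dtypes exist only in recent PyTorch builds):

import torch

x = torch.tensor(3.14159, dtype=torch.float32)

formats = [("FP32", torch.float32), ("BF16", torch.bfloat16), ("FP16", torch.float16),
           ("FP8 E4M3", getattr(torch, "float8_e4m3fn", None)),  # None on older PyTorch
           ("FP8 E5M2", getattr(torch, "float8_e5m2", None))]

for name, dtype in formats:
    if dtype is None:
        continue                                  # skip formats this build doesn't have
    approx = x.to(dtype).to(torch.float32)        # cast down, then back up, to expose the rounding
    print(f"{name}: {approx.item():.6f}  error = {abs(approx - x).item():.2e}")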

The Precision-Throughput Tradeoff

Lower precision = more operations per clock: FP32 (1×) → TF32 (8×) → FP16/BF16 (16×) → FP8 (32×). But lower precision = less accurate.

The engineering question: which layers can tolerate reduced precision?

What Can Run at Lower Precision?

Not every operation in a neural network has the same sensitivity to rounding error. Years of empirical research have produced reliable rules of thumb:

  • Most matmuls and convolutions: fine in FP16/BF16
  • Gradient accumulation and loss computation: need FP32
  • Attention scores: sensitive — BF16's wider range helps vs FP16
  • Embedding lookups: usually FP16-safe (discrete indices, not continuous)

BF16 Is the Safe Default

BF16 is the safe default. If you're unsure which precision to use, start with BF16. It has the same exponent range as FP32, so values almost never overflow or underflow. FP16 is slightly faster on some hardware but risks numerical instability. BF16 is becoming the standard replacement for FP32 in modern training.

The practical consequence: PyTorch's torch.autocast(device_type="cuda", dtype=torch.bfloat16) is the one-line change that unlocks Tensor Cores for your entire model. Most production training runs in BF16 today.
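A minimal sketch of that one-line change (the model and tensors here are placeholders):

import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(4096, 4096), nn.GELU(), nn.Linear(4096, 4096)).cuda()
x = torch.randn(32, 4096, device="cuda")
target = torch.randn(32, 4096, device="cuda")

# Inside the autocast region, matmul-heavy ops run in BF16 on Tensor Cores,
# while precision-sensitive ops (reductions, the loss) are kept in FP32.
with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    loss = nn.functional.mse_loss(model(x), target)

loss.backward()  # gradients flow back through the same mixed-precision graph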

The Roofline Perspective

On the roofline, raising the compute ceiling doesn't help memory-bound operations. FP8 helps matmul (compute-bound) dramatically but barely affects elementwise ops (memory-bound).

Concretely: a large matrix multiplication with dim=8192 sits well above the roofline ridge point under FP32 — it's compute-bound. Switch to FP8 and the compute ceiling rises 32×; the matmul speeds up several-fold, climbing until it runs into the memory roofline instead. But a layernorm or softmax is memory-bound at any precision — its time goes to loading and storing data, not arithmetic — so raising the compute ceiling barely moves it.
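A back-of-envelope arithmetic-intensity check (rough FLOP and byte counts, FP32 storage assumed) shows why the two cases behave so differently:

dim = 8192

# Square matmul: 2·N³ FLOPs vs 3·N² elements of traffic (read A and B, write C)
matmul_flops = 2 * dim ** 3
matmul_bytes = 3 * dim * dim * 4
print("matmul FLOPs/byte:   ", matmul_flops / matmul_bytes)        # ≈ 1365 → far above the ridge, compute-bound

# LayerNorm over the same matrix: a handful of FLOPs per element, one read + one write per element
layernorm_flops = 8 * dim * dim
layernorm_bytes = 2 * dim * dim * 4
print("layernorm FLOPs/byte:", layernorm_flops / layernorm_bytes)  # = 1 → memory-bound at any precision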

This is the key insight for optimization: profile first, lower precision where it helps, leave memory-bound ops alone.

On the right panel, select different precisions and watch the compute ceiling rise. Notice: the matmul dot shifts from compute-bound under FP32 to memory-bound under FP8 — the bottleneck flips!

Why Dimensions Must Be Multiples of 128

Tensor Cores operate on fixed-size tiles — 8×8, 16×16, or 16×8 depending on precision and GPU generation. If your matrix dimensions aren't multiples of these tile sizes, the hardware pads with zeros — wasted compute.

Tile Alignment

NVIDIA's libraries (cuBLAS, CUTLASS) tile at 128 or 256 granularity for maximum utilization. A 4000-dim embedding: 4000 mod 128 = 32 leftover elements in the last chunk, with 96 slots padded to zero. With dim=4096: 4096 mod 128 = 0, perfect fit.
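The padding rule is easy to check numerically — a small sketch with an assumed 128-wide tiling granularity:

import math

TILE = 128  # tiling granularity assumed for this example (cuBLAS/CUTLASS style)

def padded(dim, tile=TILE):
    return math.ceil(dim / tile) * tile

for dim in (4000, 4096):
    p = padded(dim)
    waste = (p - dim) / p
    print(f"dim={dim}: padded to {p}, wasted work ≈ {waste:.2%}")
# dim=4000: padded to 4096, wasted work ≈ 2.34%
# dim=4096: padded to 4096, wasted work ≈ 0.00%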

Tile Sizes by Precision

Precision    | Tensor Core Tile | Practical Alignment
FP16 / BF16  | 16×16            | Multiple of 128
TF32         | 16×8             | Multiple of 128
FP8          | 16×32            | Multiple of 128
INT8         | 16×32            | Multiple of 128

Why You See Powers of 2 Everywhere

This is why you see embedding dimensions like 4096, 8192, 12288 in real models — all multiples of 128. Not 4000 or 5000.

The same principle applies to:

  • Hidden dimensions — GPT-2: 768, 1024, 1280. LLaMA 2: 4096, 5120. All multiples of 128.
  • Attention heads — head count × head dimension = hidden dim. Both must align.
  • Vocabulary sizes — less critical (lookup, not matmul) but often padded for parallelism.
  • Batch sizes — not as strict, but multiples of 8 improve memory coalescing.

The Cost of Misalignment

The GPU doesn't error out on misaligned dimensions — it silently pads and wastes compute. A 4000-dim matmul runs in the same time as a 4096-dim matmul (same tile structure), but about 2.3% of the work produces zeros you throw away.

At 512 tokens × 4000 dim: the wasted FLOPs are small. But in a large transformer trained for weeks on thousands of GPUs, those silent inefficiencies compound. Model architects care about this.

On the right panel, compare dim=4000 (pink padding at the edge) with dim=4096 (entirely green, perfect fit). The pink cells are zeros the GPU multiplies but you never use.

Mixing Precisions for Speed and Stability

In practice, you don't pick one precision — you mix them. Training uses FP16/BF16 for speed with FP32 for stability. Inference uses INT8/FP8 for throughput.

Mixed Precision Training

The core idea: run the expensive matmuls in FP16/BF16 (fast Tensor Core path), but accumulate gradients and update weights in FP32 (stable, no precision loss). PyTorch's torch.autocast manages this automatically — you tell it which dtype to use and it selects FP16/BF16 for supported ops, FP32 for everything else.

The Gradient Underflow Problem

FP16 can represent numbers as small as ~6×10⁻⁸. But gradient values during training can be as small as 10⁻²⁰ — they underflow to zero in FP16.

When gradients underflow, weight updates stop. The model silently stops learning. You won't see an error — you'll just see loss stop decreasing.
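A two-line check makes the failure mode concrete (the value is illustrative, not a real gradient):

import torch

grad = torch.tensor(1e-9)         # smaller than FP16's tiniest representable value (~6×10⁻⁸)
print(grad.to(torch.float16))     # tensor(0., dtype=torch.float16) — silently flushed to zero
print(grad.to(torch.bfloat16))    # ≈1e-9 survives: BF16 shares FP32's 8-bit exponent range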

Diagram: FP16 gradient underflow — and how loss scaling fixes it. FP16 covers roughly 6×10⁻⁸ to 65,504; a tiny gradient (e.g., ~10⁻²⁰ in FP32) rounds to 0. Multiplying the loss by a scale factor (e.g., 1024) shifts gradients back toward FP16's representable range.

Loss Scaling: The Fix

Loss scaling multiplies the loss by a large factor (e.g., 1024) before the backward pass. By the chain rule, all gradients are scaled too — pushed into FP16's representable range. After converting back to FP32 for the weight update, divide by the scale factor.

The scale factor is tuned dynamically: if gradients overflow (producing inf or nan), halve the scale and skip the update. If training is stable for N consecutive steps, double the scale. This is dynamic loss scaling — PyTorch's GradScaler implements it.
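A minimal training-loop sketch using GradScaler (synthetic model and data; the GradScaler import path varies slightly across PyTorch versions):

import torch

model = torch.nn.Linear(4096, 4096).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()   # dynamic loss scaling — needed for FP16, usually not for BF16

for step in range(100):
    x = torch.randn(32, 4096, device="cuda")
    target = torch.randn(32, 4096, device="cuda")
    optimizer.zero_grad()
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = torch.nn.functional.mse_loss(model(x), target)
    scaler.scale(loss).backward()   # multiply the loss by the current scale, backprop scaled grads
    scaler.step(optimizer)          # unscale grads in FP32; skip the update if any are inf/nan
    scaler.update()                 # halve the scale after an overflow, grow it after stable steps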

BF16 Often Doesn't Need Loss Scaling

BF16 often doesn't need loss scaling. BF16 has the same exponent range as FP32, so gradients almost never underflow. If you train with BF16 mixed precision, you can often skip loss scaling entirely. FP16 always needs it.

This is one of the main reasons the ML community has shifted from FP16 to BF16 as the default mixed precision format — one fewer moving part.

Diagram: Mixed precision — training vs inference.

Training loop: master weights in FP32 → cast down → forward pass in FP16/BF16 → loss × scale → backward pass in FP16/BF16 → ÷ scale, cast up → weight update in FP32 → repeat.

Inference path: weights in INT8/FP8 + activations in FP16 → mixed-precision Tensor Core MMA → accumulator in FP32 → cast down → output in FP16.

The wmma API

When you need direct Tensor Core access (in a custom CUDA kernel), the wmma (warp matrix multiply-accumulate) API exposes the hardware:

#include <mma.h>                                   // wmma lives in the nvcuda namespace
using namespace nvcuda;

wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;    // FP16 input tile
wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> b_frag;    // FP16 input tile
wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;                 // FP32 accumulator

wmma::fill_fragment(c_frag, 0.0f);                 // zero the accumulator
wmma::load_matrix_sync(a_frag, A, stride);         // load a 16×16 FP16 tile of A
wmma::load_matrix_sync(b_frag, B, stride);         // load a 16×16 FP16 tile of B
wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);    // C = A × B + C: FP16 multiply, FP32 accumulate
wmma::store_matrix_sync(C, c_frag, stride, wmma::mem_row_major);             // write the FP32 tile

The fragments are FP16 inputs with an FP32 accumulator — this is the standard mixed precision pattern at the hardware level. In practice, you'd use cuBLAS or CUTLASS rather than wmma directly, but the API shows exactly what the hardware is doing.

Connection to Other Modules

The Quantization module in the LLM Internals track covers how INT8/FP8 weights are produced (GPTQ, AWQ, NF4). This module shows the hardware side — how Tensor Cores consume those quantized weights.

What's Next

Module 8 introduces Operator Fusion & FlashAttention — Tensor Cores give you raw throughput. Fusion ensures you're not wasting it on HBM round-trips between operations.

