
Tensor Cores & Mixed Precision Explained

What are Tensor Cores?

Tensor Cores are specialized hardware units on NVIDIA GPUs that perform matrix multiply-accumulate (MMA) operations — computing a full 4×4 matrix result in a single clock cycle. Introduced with Volta (2017) and upgraded in every generation since, they deliver 8-32× higher throughput than regular CUDA cores for matrix math, which is why they dominate modern ML training and inference.

MMA means: multiply two small matrices and add the result to a running total — D = A × B + C — all in one clock cycle. The "accumulate" (+C) is key: it lets the hardware keep a running sum across tiles, exactly like the sum += tileA * tileB loop from Module 6's tiled matmul. Without it, you'd need a separate addition step after every multiply.
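To make the accumulate concrete, here is a minimal NumPy sketch (sizes and names are illustrative) of a matmul built from 4×4 tiles — each step performs exactly one D = A × B + C, with the += keeping the running sum across tiles:

import numpy as np

TILE = 4
N = 8  # tiny example: an 8×8 matmul assembled from 4×4 tiles
A = np.random.rand(N, N).astype(np.float32)
B = np.random.rand(N, N).astype(np.float32)
D = np.zeros((N, N), dtype=np.float32)

for i in range(0, N, TILE):
    for j in range(0, N, TILE):
        for k in range(0, N, TILE):
            # one "MMA": multiply two 4×4 tiles and accumulate into the output tile
            D[i:i+TILE, j:j+TILE] += A[i:i+TILE, k:k+TILE] @ B[k:k+TILE, j:j+TILE]

assert np.allclose(D, A @ B)  # the tiled, accumulated result matches the full matmul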

Beyond CUDA Cores

Every NVIDIA GPU ships with two kinds of compute units: CUDA cores and Tensor Cores. For most of GPU history, CUDA cores did everything. Since 2017, Tensor Cores have taken over the inner loop of deep learning.

The Calculator vs. the Spreadsheet

A CUDA core is a calculator — it does one multiply-add per clock. Give it a * b + c and it computes one result, then moves on.

A Tensor Core is a spreadsheet that fills a 4×4 block of results in one operation. Give it two 4×4 matrices A and B plus an accumulator C, and it computes D = A × B + C — all 16 output elements — in a single clock cycle.

Matrix multiplication is the dominant operation in deep learning: attention, feedforward layers, embedding lookups, and convolutions all reduce to matmul. Tensor Cores are hardware built for exactly this operation.

Diagram: CUDA Core vs Tensor Core — the same 4×4 multiply. A CUDA core produces 1 result per cycle (16 cycles for the full block); a Tensor Core produces all 16 results in one cycle — 16× the throughput for a 4×4 matrix multiply.

Throughput Across Generations

Each generation roughly doubles Tensor Core throughput. Internally, the number of threads cooperating on one MMA also grew: 8 threads (Volta) → 32 (Ampere) → 128 (Hopper) → 256 (Blackwell).

GPU  | CUDA Cores (FP32) | Tensor Cores (TF32) | Tensor Cores (FP16) | Tensor Cores (FP8)
V100 | 15.7 TFLOPS       | —                   | 125 TFLOPS          | —
A100 | 19.5 TFLOPS       | 156 TFLOPS          | 312 TFLOPS          | —
H100 | 51 TFLOPS         | 495 TFLOPS          | 990 TFLOPS          | 1,979 TFLOPS
B200 | ~60 TFLOPS        | ~1,100 TFLOPS       | ~2,250 TFLOPS       | ~4,500 TFLOPS

On the right panel, watch the CUDA core fill a 4×4 grid one cell at a time (16 cycles), while the Tensor Core fills all 16 cells at once. Select different GPU generations to see throughput numbers scale.

Precision Formats for Tensor Cores

Not every precision format works with Tensor Cores — and not every GPU supports every format. Choosing the right format is the key lever for trading accuracy against throughput.

Bit layout by format (exponent bits determine range; mantissa bits determine precision):

Format   | Exponent | Mantissa | Notes
FP32     | 8 bits   | 23 bits  | Full range, full precision
TF32     | 8 bits   | 10 bits  | TensorFloat-32 — FP32 range, FP16 precision, 8× throughput
BF16     | 8 bits   | 7 bits   | Same range as FP32, less precision
FP16     | 5 bits   | 10 bits  | Smaller exponent → can overflow
FP8 E4M3 | 4 bits   | 3 bits   | Forward pass — more precision, range ±448
FP8 E5M2 | 5 bits   | 2 bits   | Backward pass — wider range (±57,344), less precision
INT8     | —        | —        | 8-bit integer: 256 levels, no float encoding

TF32 = FP32 range + FP16 precision. FP8 E4M3 is used for the forward pass; FP8 E5M2 for the backward pass.

TF32 — NVIDIA's Hybrid

TF32 (TensorFloat-32) is NVIDIA's marketing name — it's actually 19 bits, not 32. TF32 truncates FP32's mantissa from 23 bits to 10 before multiplication. You get FP32's number range with FP16-level precision — and 8× throughput. Since Ampere, TF32 is the default mode for FP32 matmuls on Tensor Cores.

The key point: TF32 is a Tensor Core internal format. You write float in your code. NVIDIA's math libraries silently convert to TF32 during matmul and convert back. You get the speedup with no code changes.
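In PyTorch the switch is a library setting, not a dtype — a small sketch of the relevant knobs (defaults vary by PyTorch version and GPU):

import torch

torch.backends.cuda.matmul.allow_tf32 = True   # let FP32 matmuls run as TF32 on Tensor Cores
torch.backends.cudnn.allow_tf32 = True         # same for cuDNN convolutions
torch.set_float32_matmul_precision("high")     # newer knob: "highest" = true FP32, "high" = TF32 allowed

x = torch.randn(4096, 4096, device="cuda")     # you still write plain float32
y = torch.randn(4096, 4096, device="cuda")
z = x @ y                                      # executed as TF32 on Ampere-or-newer Tensor Cores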

BF16 — The Safe 16-bit Format

BF16 (Brain Floating Point) was created at Google Brain. It has the same 8-bit exponent as FP32, so it covers the same magnitude range (±3.4×10³⁸). The tradeoff: only 7 mantissa bits vs FP32's 23. BF16 rarely overflows or underflows, making it the safest 16-bit format for training.

The naming origin: "BF" stands for "Brain Float" — the format was developed for Google Brain's TPU research before becoming widely adopted across GPUs. Its name, so close to FP16, is a frequent source of confusion.

FP8 — Two Flavors for Two Jobs

FP8 comes in two flavors for different jobs. E4M3 (4 exponent + 3 mantissa) has more precision but narrower range (±448) — used for the forward pass. E5M2 (5 exponent + 2 mantissa) has wider range (±57,344) but less precision — used for the backward pass where gradient magnitudes span many orders.

The logic: during the forward pass, activations cluster in a predictable range and need precision. During the backward pass, gradients vary wildly in magnitude and need range more than precision.

On the right panel, select a format and see how 3.14159 gets represented. Watch precision decrease and rounding error grow as you move from FP32 → TF32 → FP16 → FP8.
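You can reproduce the same experiment in PyTorch by round-tripping a value through each storage format (TF32 is a matmul-internal format rather than a storage dtype, so it is omitted; the FP8 dtypes exist only in recent PyTorch builds):

import torch

x = torch.tensor(3.14159, dtype=torch.float32)

formats = [("FP32", torch.float32), ("BF16", torch.bfloat16), ("FP16", torch.float16),
           ("FP8 E4M3", getattr(torch, "float8_e4m3fn", None)),  # None on older PyTorch
           ("FP8 E5M2", getattr(torch, "float8_e5m2", None))]

for name, dtype in formats:
    if dtype is None:
        continue                                  # skip formats this build doesn't have
    approx = x.to(dtype).to(torch.float32)        # cast down, then back up, to expose the rounding
    print(f"{name}: {approx.item():.6f}  error = {abs(approx - x).item():.2e}")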

The Precision-Throughput Tradeoff

Lower precision = more operations per clock: FP32 (1×) → TF32 (8×) → FP16/BF16 (16×) → FP8 (32×). But lower precision = less accurate.

The engineering question: which layers can tolerate reduced precision?

What Can Run at Lower Precision?

Not every operation in a neural network has the same sensitivity to rounding error. Years of empirical research have produced reliable rules of thumb:

  • Most matmuls and convolutions: fine in FP16/BF16
  • Gradient accumulation and loss computation: need FP32
  • Attention scores: sensitive — BF16's wider range helps vs FP16
  • Embedding lookups: usually FP16-safe (discrete indices, not continuous)

BF16 Is the Safe Default

BF16 is the safe default. If you're unsure which precision to use, start with BF16. It has the same exponent range as FP32, so values almost never overflow or underflow. FP16 is slightly faster on some hardware but risks numerical instability. BF16 is becoming the standard replacement for FP32 in modern training.

The practical consequence: PyTorch's torch.autocast(device_type="cuda", dtype=torch.bfloat16) is the one-line change that unlocks Tensor Cores for your entire model. Most production training runs in BF16 today.
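A minimal sketch of that one-line change (the model and tensors here are placeholders):

import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(4096, 4096), nn.GELU(), nn.Linear(4096, 4096)).cuda()
x = torch.randn(32, 4096, device="cuda")
target = torch.randn(32, 4096, device="cuda")

# Inside the autocast region, matmul-heavy ops run in BF16 on Tensor Cores,
# while precision-sensitive ops (reductions, the loss) are kept in FP32.
with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    loss = nn.functional.mse_loss(model(x), target)

loss.backward()  # gradients flow back through the same mixed-precision graph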

The Roofline Perspective

On the roofline, raising the compute ceiling doesn't help memory-bound operations. FP8 helps matmul (compute-bound) dramatically but barely affects elementwise ops (memory-bound).

Concretely: a large matrix multiplication with dim=8192 sits well above the roofline ridge point under FP32 — it's compute-bound. Switch to FP8 and the compute ceiling rises 32×; the matmul speeds up several-fold, climbing until it runs into the memory roofline instead. But a layernorm or softmax is memory-bound at any precision — its time goes to loading and storing data, not arithmetic — so raising the compute ceiling barely moves it.
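A back-of-envelope arithmetic-intensity check (rough FLOP and byte counts, FP32 storage assumed) shows why the two cases behave so differently:

dim = 8192

# Square matmul: 2·N³ FLOPs vs 3·N² elements of traffic (read A and B, write C)
matmul_flops = 2 * dim ** 3
matmul_bytes = 3 * dim * dim * 4
print("matmul FLOPs/byte:   ", matmul_flops / matmul_bytes)        # ≈ 1365 → far above the ridge, compute-bound

# LayerNorm over the same matrix: a handful of FLOPs per element, one read + one write per element
layernorm_flops = 8 * dim * dim
layernorm_bytes = 2 * dim * dim * 4
print("layernorm FLOPs/byte:", layernorm_flops / layernorm_bytes)  # = 1 → memory-bound at any precision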

This is the key insight for optimization: profile first, lower precision where it helps, leave memory-bound ops alone.

On the right panel, select different precisions and watch the compute ceiling rise. Notice: the matmul dot shifts from compute-bound under FP32 to memory-bound under FP8 — the bottleneck flips!

Why Dimensions Must Be Multiples of 128

Tensor Cores operate on fixed-size tiles — 8×8, 16×16, or 16×8 depending on precision and GPU generation. If your matrix dimensions aren't multiples of these tile sizes, the hardware pads with zeros — wasted compute.

Tile Alignment

NVIDIA's libraries (cuBLAS, CUTLASS) tile at 128 or 256 granularity for maximum utilization. A 4000-dim embedding: 4000 mod 128 = 32 leftover elements in the last chunk, with 96 slots padded to zero. With dim=4096: 4096 mod 128 = 0, perfect fit.
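The padding rule is easy to check numerically — a small sketch with an assumed 128-wide tiling granularity:

import math

TILE = 128  # tiling granularity assumed for this example (cuBLAS/CUTLASS style)

def padded(dim, tile=TILE):
    return math.ceil(dim / tile) * tile

for dim in (4000, 4096):
    p = padded(dim)
    waste = (p - dim) / p
    print(f"dim={dim}: padded to {p}, wasted work ≈ {waste:.2%}")
# dim=4000: padded to 4096, wasted work ≈ 2.34%
# dim=4096: padded to 4096, wasted work ≈ 0.00%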

Tile Sizes by Precision

Precision    | Tensor Core Tile | Practical Alignment
FP16 / BF16  | 16×16            | Multiple of 128
TF32         | 16×8             | Multiple of 128
FP8          | 16×32            | Multiple of 128
INT8         | 16×32            | Multiple of 128

Why You See Powers of 2 Everywhere

This is why you see embedding dimensions like 4096, 8192, 12288 in real models — all multiples of 128. Not 4000 or 5000.

The same principle applies to:

  • Hidden dimensions — GPT-2: 768, 1024, 1280. LLaMA 2: 4096, 5120. All multiples of 128.
  • Attention heads — head count × head dimension = hidden dim. Both must align.
  • Vocabulary sizes — less critical (lookup, not matmul) but often padded for parallelism.
  • Batch sizes — not as strict, but multiples of 8 improve memory coalescing.

The Cost of Misalignment

The GPU doesn't error out on misaligned dimensions — it silently pads and wastes compute. A 4000-dim matmul runs in the same time as a 4096-dim matmul (same tile structure), but about 2.3% of the work produces zeros you throw away.

At 512 tokens × 4000 dim: the wasted FLOPs are small. But in a large transformer trained for weeks on thousands of GPUs, those silent inefficiencies compound. Model architects care about this.

On the right panel, compare dim=4000 (pink padding at the edge) with dim=4096 (entirely green, perfect fit). The pink cells are zeros the GPU multiplies but you never use.

Mixing Precisions for Speed and Stability

In practice, you don't pick one precision — you mix them. Training uses FP16/BF16 for speed with FP32 for stability. Inference uses INT8/FP8 for throughput.

Mixed Precision Training

The core idea: run the expensive matmuls in FP16/BF16 (fast Tensor Core path), but accumulate gradients and update weights in FP32 (stable, no precision loss). PyTorch's torch.autocast manages this automatically — you tell it which dtype to use and it selects FP16/BF16 for supported ops, FP32 for everything else.

The Gradient Underflow Problem

FP16 can represent numbers as small as ~6×10⁻⁸. But gradient values during training can be as small as 10⁻²⁰ — they underflow to zero in FP16.

When gradients underflow, weight updates stop. The model silently stops learning. You won't see an error — you'll just see loss stop decreasing.
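A two-line check makes the failure mode concrete (the value is illustrative, not a real gradient):

import torch

grad = torch.tensor(1e-9)         # smaller than FP16's tiniest representable value (~6×10⁻⁸)
print(grad.to(torch.float16))     # tensor(0., dtype=torch.float16) — silently flushed to zero
print(grad.to(torch.bfloat16))    # ≈1e-9 survives: BF16 shares FP32's 8-bit exponent range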

Diagram: FP16 gradient underflow — and how loss scaling fixes it. FP16 covers roughly 6×10⁻⁸ to 65,504; a tiny gradient (e.g., ~10⁻²⁰ in FP32) rounds to 0. Multiplying the loss by a scale factor (e.g., 1024) shifts gradients back toward FP16's representable range.

Loss Scaling: The Fix

Loss scaling multiplies the loss by a large factor (e.g., 1024) before the backward pass. By the chain rule, all gradients are scaled too — pushed into FP16's representable range. After converting back to FP32 for the weight update, divide by the scale factor.

The scale factor is tuned dynamically: if gradients overflow (producing inf or nan), halve the scale and skip the update. If training is stable for N consecutive steps, double the scale. This is dynamic loss scaling — PyTorch's GradScaler implements it.
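A minimal training-loop sketch using GradScaler (synthetic model and data; the GradScaler import path varies slightly across PyTorch versions):

import torch

model = torch.nn.Linear(4096, 4096).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()   # dynamic loss scaling — needed for FP16, usually not for BF16

for step in range(100):
    x = torch.randn(32, 4096, device="cuda")
    target = torch.randn(32, 4096, device="cuda")
    optimizer.zero_grad()
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = torch.nn.functional.mse_loss(model(x), target)
    scaler.scale(loss).backward()   # multiply the loss by the current scale, backprop scaled grads
    scaler.step(optimizer)          # unscale grads in FP32; skip the update if any are inf/nan
    scaler.update()                 # halve the scale after an overflow, grow it after stable steps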

BF16 Often Doesn't Need Loss Scaling

BF16 often doesn't need loss scaling. BF16 has the same exponent range as FP32, so gradients almost never underflow. If you train with BF16 mixed precision, you can often skip loss scaling entirely. FP16 always needs it.

This is one of the main reasons the ML community has shifted from FP16 to BF16 as the default mixed precision format — one fewer moving part.

Diagram: Mixed precision — training vs inference.

Training loop: master weights in FP32 → cast down → forward pass in FP16/BF16 → loss × scale → backward pass in FP16/BF16 → ÷ scale, cast up → weight update in FP32 → repeat.

Inference path: weights in INT8/FP8 + activations in FP16 → mixed-precision Tensor Core MMA → accumulator in FP32 → cast down → output in FP16.

The wmma API

When you need direct Tensor Core access (in a custom CUDA kernel), the wmma (warp matrix multiply-accumulate) API exposes the hardware:

#include <mma.h>                                   // wmma lives in the nvcuda namespace
using namespace nvcuda;

wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;    // FP16 input tile
wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> b_frag;    // FP16 input tile
wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;                 // FP32 accumulator

wmma::fill_fragment(c_frag, 0.0f);                 // zero the accumulator
wmma::load_matrix_sync(a_frag, A, stride);         // load a 16×16 FP16 tile of A
wmma::load_matrix_sync(b_frag, B, stride);         // load a 16×16 FP16 tile of B
wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);    // C = A × B + C: FP16 multiply, FP32 accumulate
wmma::store_matrix_sync(C, c_frag, stride, wmma::mem_row_major);             // write the FP32 tile

The fragments are FP16 inputs with an FP32 accumulator — this is the standard mixed precision pattern at the hardware level. In practice, you'd use cuBLAS or CUTLASS rather than wmma directly, but the API shows exactly what the hardware is doing.

Connection to Other Modules

The Quantization module in the LLM Internals track covers how INT8/FP8 weights are produced (GPTQ, AWQ, NF4). This module shows the hardware side — how Tensor Cores consume those quantized weights.

What's Next

Module 8 introduces Operator Fusion & FlashAttention — Tensor Cores give you raw throughput. Fusion ensures you're not wasting it on HBM round-trips between operations.

