
Triton & torch.compile — Python GPU Programming

The CUDA Expertise Gap

What is Triton?

Triton is a Python-based language for writing GPU kernels — code that runs directly on the GPU at near-CUDA speed, but with far less manual effort. Created by Philippe Tillet and open-sourced by OpenAI in 2021, Triton lets you write high-performance GPU code without managing the low-level details that make CUDA difficult.

Think of it this way: CUDA is like driving a manual transmission race car — maximum control, but you manage the clutch, the gear shifts, and the racing line yourself. Triton is like a semi-automatic — you steer and decide the strategy, but the gearbox handles itself.

The Manual Burden of CUDA

Throughout this track, you've learned the concepts that CUDA programmers must manage by hand:

  • Thread indexing (Module 2) — computing blockIdx.x * blockDim.x + threadIdx.x for every memory access
  • Memory coalescing (Module 5) — ensuring adjacent threads read adjacent addresses, or paying a huge bandwidth penalty
  • Shared memory (Module 5) — declaring __shared__ buffers, calling __syncthreads() at exactly the right places, avoiding bank conflicts
  • Tiling (Module 6) — loading data into SRAM in chunks, managing tile boundaries, accumulating partial results
  • Synchronization — placing __syncthreads() barriers so threads don't read stale data

Each concept alone is manageable. But a real CUDA kernel — even something as simple as softmax — requires combining all of them simultaneously. That's where the expertise gap appears: knowing each concept is not the same as correctly orchestrating them together.

Block-Level vs Thread-Level Programming

Triton's key insight: work at the block level, not the thread level. To understand what this means, let's first recall how the GPU organizes work (from Module 2):

  • A thread is the smallest unit of execution on the GPU. Each thread runs the same program but on different data — like one worker on an assembly line handling one item.
  • A block (also called a thread block) is a group of threads that run together on the same SM (streaming multiprocessor). Threads within a block can share fast on-chip memory (SRAM) and synchronize with each other. A typical block has 128-1024 threads.
  • The GPU launches many blocks in a grid. Each block is assigned to an SM, and the SM runs all the threads inside that block.

So the hierarchy is: grid → blocks → threads. A grid might have 100 blocks, each block has 256 threads, giving 25,600 threads total — all running the same program on different pieces of data.

[Diagram] Your code creates a grid of six blocks (B0-B5), each with 256 threads. The GPU hardware assigns each block to an SM: SM0 runs B0 and B3, SM1 runs B1 and B4, SM2 runs B2 and B5. You create blocks — the GPU decides which SM runs each one.

Now here's the difference in how CUDA and Triton think about this structure:

CUDA (thread-level): You write a program for one thread. Each thread handles one element. You launch 1024 threads, and every thread computes its own index to figure out which element to touch:

// CUDA: you think about ONE element at a time
int i = blockIdx.x * blockDim.x + threadIdx.x;  // "which element am I?"
if (i < N) output[i] = input[i] + 1;            // process that one element

You're responsible for: computing the index correctly, making sure threads don't go out of bounds, ensuring adjacent threads read adjacent memory (coalescing), and coordinating if threads need to share results (__syncthreads()).

Triton (block-level): You write a program for a whole block of elements. Each program instance handles, say, 256 elements at once. You don't think about individual threads — you think about chunks of data:

# Triton: you think about a BLOCK of elements at a time
pid = tl.program_id(0)                              # "which block am I?"
offsets = pid * BLOCK + tl.arange(0, BLOCK)          # all 256 element positions
mask = offsets < N                                   # boundary check
data = tl.load(input_ptr + offsets, mask=mask)       # load all 256 at once
tl.store(output_ptr + offsets, data + 1, mask=mask)  # write all 256 at once

The compiler decides: how to split those 256 elements across threads within a block, how to coalesce the memory accesses, and whether to use shared memory. You just say "load this block, process it, store it."

The key difference: in CUDA, you are the thread — you compute one element and must coordinate with other threads. In Triton, you are the block — you describe what happens to a chunk of data, and the compiler handles the per-thread details.
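
To make this concrete, here is the fragment above as a complete, runnable kernel plus its host-side launch. This is a minimal sketch; the kernel name add_one_kernel, the vector length, and BLOCK=256 are illustrative choices:

import torch
import triton
import triton.language as tl

@triton.jit
def add_one_kernel(input_ptr, output_ptr, N, BLOCK: tl.constexpr):
    pid = tl.program_id(0)                               # which block am I?
    offsets = pid * BLOCK + tl.arange(0, BLOCK)          # this block's element positions
    mask = offsets < N                                   # boundary check for the last block
    data = tl.load(input_ptr + offsets, mask=mask)       # load the whole block at once
    tl.store(output_ptr + offsets, data + 1, mask=mask)  # write the whole block at once

x = torch.randn(1000, device="cuda")
out = torch.empty_like(x)
grid = (triton.cdiv(x.numel(), 256),)                    # enough blocks to cover every element
add_one_kernel[grid](x, out, x.numel(), BLOCK=256)

triton.cdiv rounds up, so the last block may extend past N; the mask turns those extra lanes into no-ops.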

This means:

  • No threadIdx.x — you use tl.program_id(0) to identify which block of work you're handling
  • No __syncthreads() — the compiler inserts synchronization where needed
  • No shared memory declarations — the compiler decides what goes in shared memory
  • No coalescing concerns — the compiler arranges memory access patterns automatically

The Abstraction Stack

Triton sits in the middle of a stack of tools, each trading ease for control:

GPU Abstraction Stack (easiest at the top, most control at the bottom):

  • Level 4: PyTorch Eager — write Python ops; runs immediately, one kernel per op
  • Level 3: torch.compile — annotate with @torch.compile; the compiler fuses & optimizes for you
  • Level 2: Triton — write tile-level kernels in Python; full control, no C++ needed
  • Level 1: CUDA C++ — write thread-level kernels; maximum control, maximum complexity

The rest of this module walks through each level: how Triton simplifies code (Step 2), what the compiler does behind the scenes (Step 3), how torch.compile goes even further (Step 4), and when to use which tool (Step 5).

On the right panel: Click each level of the abstraction ladder to see what you control vs. what's automated. Notice how the lower you go, the more modules from this track become relevant — those are the concepts you'd have to manage manually.

Triton by Example

Fused Softmax: The Canonical Example

The best way to understand Triton is to see it in action. We'll look at fused softmax — the same operation from Module 8 where we saw FlashAttention eliminate HBM round-trips.

Quick recap: softmax turns a row of raw scores into probabilities that sum to 1. It requires finding the max (for numerical stability), subtracting it, exponentiating, summing, and dividing. In Module 8, we saw that fusing these steps into one kernel keeps intermediates in SRAM instead of writing them to HBM between each step.

The Triton Kernel

Here's the complete fused softmax kernel in Triton (~15 lines of actual logic):

import triton
import triton.language as tl

@triton.jit
def softmax_kernel(input_ptr, output_ptr, stride, n_cols, BLOCK: tl.constexpr):
    row_idx = tl.program_id(0)              # which row am I?
    offsets = tl.arange(0, BLOCK)            # column offsets for this block
    mask = offsets < n_cols                   # boundary check

    # Load entire row into SRAM
    row = tl.load(input_ptr + row_idx * stride + offsets, mask=mask, other=-float('inf'))

    # Softmax: max → subtract → exp → sum → divide
    row_max = tl.max(row, axis=0)            # find max (compiler handles reduction)
    numerator = tl.exp(row - row_max)        # subtract-max trick + exponentiate
    denominator = tl.sum(numerator, axis=0)  # sum (compiler handles reduction)

    # Write result
    tl.store(output_ptr + row_idx * stride + offsets, numerator / denominator, mask=mask)

Let's unpack the key lines:

  • tl.program_id(0) — "which row am I processing?" This is the block-level equivalent of CUDA's blockIdx.x. No thread math needed.
  • tl.arange(0, BLOCK) — generates a range of column offsets [0, 1, 2, ..., BLOCK-1]. The compiler maps these to individual threads automatically.
  • mask = offsets < n_cols — handles rows that aren't exact multiples of BLOCK. Masked positions read -inf (which becomes 0 after softmax).
  • tl.load(...) and tl.store(...) — load and write data. The compiler generates coalesced memory accesses automatically.
  • tl.max(...) and tl.sum(...) — reductions across the block. The compiler generates shared memory + warp shuffle reductions + __syncthreads(). You write one line; CUDA would need ~10 lines for each.
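
To run the kernel, you launch one program instance per row. Here is a minimal host-side sketch, assuming the imports shown with the kernel above; the shapes are illustrative, and BLOCK must be a power of two because of tl.arange:

import torch

x = torch.randn(1823, 781, device="cuda")
out = torch.empty_like(x)
BLOCK = triton.next_power_of_2(x.shape[1])   # 781 -> 1024; masked lanes load -inf
grid = (x.shape[0],)                         # one program instance per row
softmax_kernel[grid](x, out, x.stride(0), x.shape[1], BLOCK=BLOCK)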

What Would This Look Like in CUDA?

The same fused softmax in CUDA requires ~40+ lines: pointer arithmetic, shared memory declarations, explicit __syncthreads(), warp shuffle reduction loops, and manual boundary checking. Every concept from Modules 2, 5, and 6 shows up.

Lines of Code: Same Kernel

  ~15 lines (Triton) vs. ~45 lines (CUDA C++)

Triton eliminates thread-level boilerplate — same GPU performance.

The line count isn't the whole story. Each CUDA line is also harder to get right — a misplaced __syncthreads() or a coalescing mistake silently degrades performance without any error.

On the right panel: Click "Show CUDA equivalent" to watch each Triton line expand into its CUDA equivalent. The line counter grows from 15 to 40+ as the hidden complexity reveals itself. Notice the automation badges at the bottom — each one is a concept from a previous module.

How Triton Works

What Happens Between Python and GPU?

You've seen that Triton code is much simpler than CUDA. But the GPU still runs the same low-level operations — threads, shared memory, coalesced loads. The difference is who writes that code: you, or the compiler.

When you run a Triton kernel, your Python code goes through a multi-stage compilation pipeline before the GPU executes anything. The details of each stage don't matter — you'll never read intermediate compiler output. What matters is what the compiler decides at each stage, because those decisions are exactly the concepts you learned in Modules 2-7.

The Compilation Stages (Brief Overview)

Your Triton Python code passes through 5 stages:

  1. Python → Triton IR — your decorated function is parsed into an intermediate representation
  2. Triton IR → Triton GPU IR — hardware-specific optimizations: memory layout, warp scheduling
  3. Triton GPU IR → LLVM IR — converted to LLVM's standard representation
  4. LLVM IR → PTX — NVIDIA's assembly language for GPUs
  5. PTX → GPU binary — final machine code the GPU actually executes

The result is cached on disk, so subsequent runs skip compilation entirely.

What the Compiler Decides

This is the important part. Each decision maps to a concept from a previous module:

Thread distribution (Module 2: Execution Model)

You write tl.program_id(0) and tl.arange(0, BLOCK) — block-level identifiers. The compiler turns these into specific thread assignments: which thread handles which element, how threads are grouped into warps, how warps are assigned to SMs.

Memory coalescing (Module 5: Memory Access Patterns)

You write tl.load(ptr + offsets). The compiler analyzes the access pattern and generates vectorized load instructions (LDG.128) that read 128 bits (16 bytes) per thread. It ensures adjacent threads read adjacent memory addresses — the coalescing pattern from Module 5 — without you thinking about it.

Shared memory + synchronization (Modules 5-6: Shared Memory, Tiling)

You write tl.max(row, axis=0) or tl.sum(...). The compiler:

  • Allocates shared memory for the reduction
  • Inserts __syncthreads() barriers at the right places
  • Generates warp shuffle instructions for the final reduction steps
  • Handles all the tree reduction logic from Module 6

In CUDA, each of these requires ~10 lines of careful code. In Triton, it's one line.

Tile size selection (Module 6: Tiling)

The @triton.autotune decorator defines a search space of configurations — different tile sizes, number of warps, pipeline stages. Triton benchmarks each configuration on your actual GPU and caches the fastest one. This is why Triton's first run on a new shape is slow (benchmarking), but subsequent runs are fast (cached result).
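
Here is a sketch of what declaring a search space looks like; the configs, the key, and the toy kernel below are illustrative, not Triton's built-in defaults:

import torch
import triton
import triton.language as tl

@triton.autotune(
    configs=[
        triton.Config({"BLOCK": 128}, num_warps=4),
        triton.Config({"BLOCK": 256}, num_warps=8),
        triton.Config({"BLOCK": 1024}, num_warps=8, num_stages=2),
    ],
    key=["n_cols"],   # re-benchmark whenever n_cols changes; the winner is cached per key
)
@triton.jit
def scale_kernel(input_ptr, output_ptr, n_cols, BLOCK: tl.constexpr):
    pid = tl.program_id(0)
    offsets = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offsets < n_cols
    tl.store(output_ptr + offsets, tl.load(input_ptr + offsets, mask=mask) * 2.0, mask=mask)

x = torch.randn(10000, device="cuda")
y = torch.empty_like(x)
grid = lambda meta: (triton.cdiv(x.numel(), meta["BLOCK"]),)  # grid depends on the winning BLOCK
scale_kernel[grid](x, y, x.numel())  # first call benchmarks every config; later calls reuse the winner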

Tensor core mapping (Module 7: Tensor Cores)

For matrix operations, tl.dot(a, b) maps directly to hardware MMA (matrix multiply-accumulate) instructions — the Tensor Core operations from Module 7. The compiler handles accumulator precision and tile dimensions automatically.
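
To see what this looks like in practice, here is a compact tiled-matmul sketch built around tl.dot. It assumes row-major inputs whose dimensions are exact multiples of the tile sizes (so boundary masks are omitted); the kernel name and tile sizes are illustrative:

import torch
import triton
import triton.language as tl

@triton.jit
def matmul_kernel(a_ptr, b_ptr, c_ptr, M, N, K,
                  BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr, BLOCK_K: tl.constexpr):
    pid_m = tl.program_id(0)                              # which tile-row of C?
    pid_n = tl.program_id(1)                              # which tile-column of C?
    rm = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)          # row indices of this C tile
    rn = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)          # column indices of this C tile
    rk = tl.arange(0, BLOCK_K)
    acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)  # fp32 accumulator (Module 7)
    for k in range(0, K, BLOCK_K):                        # march along the K dimension
        a = tl.load(a_ptr + rm[:, None] * K + (k + rk)[None, :])  # BLOCK_M x BLOCK_K tile of A
        b = tl.load(b_ptr + (k + rk)[:, None] * N + rn[None, :])  # BLOCK_K x BLOCK_N tile of B
        acc += tl.dot(a, b)                               # maps to tensor-core MMA instructions
    tl.store(c_ptr + rm[:, None] * N + rn[None, :], acc)

a = torch.randn(256, 256, device="cuda", dtype=torch.float16)
b = torch.randn(256, 256, device="cuda", dtype=torch.float16)
c = torch.empty(256, 256, device="cuda", dtype=torch.float32)
grid = (triton.cdiv(256, 64), triton.cdiv(256, 64))       # one program per output tile
matmul_kernel[grid](a, b, c, 256, 256, 256, BLOCK_M=64, BLOCK_N=64, BLOCK_K=32)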

On the right panel: Hover over the colored badges to see which lines of Triton code each compiler decision affects. Toggle to "Matmul" to see the tensor core mapping with tl.dot(). The module connection strip at the bottom shows how this step connects back to what you already know.

torch.compile

What is torch.compile?

Triton lets you write GPU kernels in Python — but you still need to write a kernel. You choose the algorithm, the tile sizes, and the memory access pattern. torch.compile goes one step further: it watches your regular PyTorch code run, figures out what can be optimized, and generates Triton kernels automatically.

Added in PyTorch 2.0, torch.compile is a single decorator that can speed up your model with no algorithm changes. Same code, same math — but under the hood, torch.compile traces the operations, fuses them where possible (using the fusion rules from Module 8), and generates optimized Triton kernels.
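
Minimal usage looks like this (the function and shapes below are illustrative); the first call compiles, and subsequent calls reuse the cached result:

import torch
import torch.nn.functional as F

def f(x, W, b):
    return F.relu(F.layer_norm(x @ W + b, (W.shape[1],)))

f_compiled = torch.compile(f)            # one line: trace, fuse, generate Triton kernels

x = torch.randn(32, 1024, device="cuda")
W = torch.randn(1024, 1024, device="cuda")
b = torch.randn(1024, device="cuda")
y = f_compiled(x, W, b)                  # slow first call (compilation), fast afterwards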

The Three-Stage Pipeline

Stage 1: TorchDynamo (Tracing)

Dynamo intercepts your Python code as it runs and records every tensor operation into a graph — an FX Graph. Think of it as a camera recording what your model does, operation by operation.

For y = relu(layernorm(x @ W + b)), Dynamo captures: matmul → add → layernorm → relu. These are the graph nodes.

Graph breaks: If your code has data-dependent control flow — like if x.sum() > 0: — the tracer can't record it as a static graph. This causes a "graph break" that forces separate compilation of each segment, reducing optimization opportunities.
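
For example, a branch on a tensor's value cannot be captured in a single static graph (a hypothetical sketch):

import torch

@torch.compile
def g(x):
    if x.sum() > 0:      # data-dependent branch: Dynamo breaks the graph here
        return x * 2
    return x - 1

g(torch.randn(8, device="cuda"))   # compiles the segments on either side of the break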

Stage 2: AOT Autograd + TorchInductor (Fusion)

Inductor looks at the traced graph and applies fusion rules. Remember from Module 8: elementwise operations fuse freely, reductions are barriers.

In our example: add (elementwise), layernorm (reduction + elementwise), and relu (elementwise) get fused into a single kernel. The matmul stays separate — cuBLAS already handles it optimally.

Result: 4 separate kernel launches → 2 (one cuBLAS matmul + one fused Triton kernel).

Stage 3: Codegen (Triton Kernel Generation)

For each fused group, Inductor generates a Triton kernel. The add + layernorm + relu fusion becomes a single Triton function that loads input once, applies all three operations, and writes the final result — no intermediate HBM round-trips.

When torch.compile Helps

  • Standard tensor operations that follow fusion rules (pointwise chains, reduction + normalize patterns)
  • Repeated execution — compilation cost is amortized over many forward passes
  • Many small operations — fusion reduces kernel launch overhead

When It Doesn't Help

  • First call — compilation takes seconds (sometimes minutes for large models). Subsequent calls are cached.
  • Data-dependent control flow — if tensor.item() > 0: causes graph breaks. The tracer can only record static operation patterns.
  • Already-optimized ops — cuBLAS matmul and FlashAttention are already near-optimal. torch.compile won't improve them.
  • Dynamic shapes — different input sizes trigger recompilation. Use dynamic=True to mitigate (see the sketch below).
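
A sketch of the dynamic-shapes opt-in (the model below is a placeholder):

import torch

model = torch.nn.Linear(1024, 1024).cuda()
model_c = torch.compile(model, dynamic=True)   # compile once with symbolic shapes

for bs in (8, 32, 128):                        # varying batch sizes reuse the same compiled code
    model_c(torch.randn(bs, 1024, device="cuda"))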

On the right panel: Step through the 3-phase pipeline. Watch Dynamo trace ops into graph nodes, Inductor group them into fused blocks (notice the fusion rules from Module 8), and Codegen produce a Triton kernel. The before/after strip shows the concrete improvement: 4 launches → 2.

When to Use What

The Right Tool for the Problem

You now understand GPU programming from silicon to Python. The question isn't which tool is best — it's which tool fits the problem.

Here are the four levels, from easiest to most effort:

PyTorch eager — maximum flexibility, zero optimization. Your starting point for prototyping, debugging, and correctness testing. Every tensor operation launches a separate CUDA kernel.

torch.compile — one decorator, automatic fusion and Triton codegen. The default first step for production. Covers most optimization opportunities without writing any GPU code.

Triton — custom Python GPU kernels. Write your own fused operations when torch.compile can't fuse your pattern, or when you need specific control over tiling and memory access. Still Python — no C++ needed.

CUDA — maximum control, maximum effort. Only justified when Triton's ~80% performance ceiling isn't enough for your specific workload. For memory-bound operations, Triton already matches CUDA; the gap only appears in compute-bound kernels.

The Decision Process

The flowchart is simple:

1. Is your model fast enough? If yes, stop. Optimization without a bottleneck is wasted effort. Use profiling tools (like torch.profiler) to identify where time is actually spent (see the profiling sketch after this list).

2. Have you tried torch.compile? If not, start here. One line of code, and it handles fusion automatically. This resolves the majority of cases.

3. Is the bottleneck a standard operation? Standard operations like matrix multiplication and attention already have optimized library implementations (cuBLAS, FlashAttention). No custom kernel needed — just make sure you're using the library version.

4. Is your custom operation memory-bound or compute-bound? This is the roofline question from Module 4 — the thread that ties this entire track together.

  • Memory-bound (low arithmetic intensity): Triton matches CUDA performance. The bottleneck is HBM bandwidth, not compute, so Triton's compiler generates equally efficient memory access patterns. Use Triton.

  • Compute-bound (high arithmetic intensity): Triton reaches ~80% of CUDA performance. The gap comes from CUDA's ability to fine-tune warp-level scheduling and register usage. Write Triton first — only drop to CUDA if that 20% gap matters for your specific use case.
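
For step 1, here is a minimal torch.profiler sketch (the model and input below are placeholders):

import torch
from torch.profiler import profile, ProfilerActivity

model = torch.nn.Sequential(torch.nn.Linear(1024, 1024), torch.nn.ReLU()).cuda()
x = torch.randn(64, 1024, device="cuda")

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    model(x)                                   # the region you want to measure

# Sort kernels by GPU time to see where the time actually goes
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))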

The Full Journey

This track has taken you from "why are GPUs fast?" to "which abstraction level should I use?" Here's how it all connects:

Module 1 (Why GPUs) explained that GPUs are fast because they have thousands of cores optimized for parallel work — the fundamental reason this entire stack exists.

Modules 2-3 (Execution Model, Memory Hierarchy) showed how the GPU organizes those cores (threads → warps → blocks → SMs) and where data lives (registers → shared memory/SRAM → HBM).

Module 4 (Roofline Model) gave you the universal diagnostic: is your kernel memory-bound or compute-bound? This question appears at every level of the abstraction stack.

Modules 5-6 (Memory Patterns, Tiling) taught the two most impactful optimizations: accessing memory in coalesced patterns and reusing data through tiling. These are exactly what Triton's compiler automates.

Module 7 (Tensor Cores) showed how specialized hardware accelerates matrix math — what tl.dot() maps to behind the scenes.

Module 8 (Operator Fusion) explained why fusing kernels matters — the principle that torch.compile applies automatically.

Module 9 (this module) connected all of it: Triton automates Modules 2-7, torch.compile automates Module 8's fusion, and the roofline model (Module 4) tells you when to go deeper.

On the right panel: Click through the decision tree to explore different optimization paths. Each recommendation highlights which modules are relevant. The bottom strip shows all 9 modules — watch which ones light up as you navigate different branches.

