
Triton & torch.compile — Python GPU Programming

The CUDA Expertise Gap

What is Triton?

Triton is a Python-based language for writing GPU kernels — code that runs directly on the GPU at near-CUDA speed, but with far less manual effort. Created by Philippe Tillet and open-sourced by OpenAI in 2021, Triton lets you write high-performance GPU code without managing the low-level details that make CUDA difficult.

Think of it this way: CUDA is like driving a manual transmission race car — maximum control, but you manage the clutch, the gear shifts, and the racing line yourself. Triton is like a semi-automatic — you steer and decide the strategy, but the gearbox handles itself.

The Manual Burden of CUDA

Throughout this track, you've learned the concepts that CUDA programmers must manage by hand:

  • Thread indexing (Module 2) — computing blockIdx.x * blockDim.x + threadIdx.x for every memory access
  • Memory coalescing (Module 5) — ensuring adjacent threads read adjacent addresses, or paying a huge bandwidth penalty
  • Shared memory (Module 5) — declaring __shared__ buffers, calling __syncthreads() at exactly the right places, avoiding bank conflicts
  • Tiling (Module 6) — loading data into SRAM in chunks, managing tile boundaries, accumulating partial results
  • Synchronization — placing __syncthreads() barriers so threads don't read stale data

Each concept alone is manageable. But a real CUDA kernel — even something as simple as softmax — requires combining all of them simultaneously. That's where the expertise gap appears: knowing each concept is not the same as correctly orchestrating them together.

Block-Level vs Thread-Level Programming

Triton's key insight: work at the block level, not the thread level. To understand what this means, let's first recall how the GPU organizes work (from Module 2):

  • A thread is the smallest unit of execution on the GPU. Each thread runs the same program but on different data — like one worker on an assembly line handling one item.
  • A block (also called a thread block) is a group of threads that run together on the same SM (streaming multiprocessor). Threads within a block can share fast on-chip memory (SRAM) and synchronize with each other. A typical block has 128-1024 threads.
  • The GPU launches many blocks in a grid. Each block is assigned to an SM, and the SM runs all the threads inside that block.

So the hierarchy is: grid → blocks → threads. A grid might have 100 blocks, each block has 256 threads, giving 25,600 threads total — all running the same program on different pieces of data.

[Diagram] Your code creates a grid of six blocks (B0-B5), each with 256 threads. The GPU hardware assigns each block to an SM: SM0 runs B0 and B3, SM1 runs B1 and B4, SM2 runs B2 and B5. You create blocks — the GPU decides which SM runs each one.

Now here's the difference in how CUDA and Triton think about this structure:

CUDA (thread-level): You write a program for one thread. Each thread handles one element. You launch 1024 threads, and every thread computes its own index to figure out which element to touch:

// CUDA: you think about ONE element at a time
int i = blockIdx.x * blockDim.x + threadIdx.x;  // "which element am I?"
if (i < N) output[i] = input[i] + 1;            // process that one element

You're responsible for: computing the index correctly, making sure threads don't go out of bounds, ensuring adjacent threads read adjacent memory (coalescing), and coordinating if threads need to share results (__syncthreads()).

Triton (block-level): You write a program for a whole block of elements. Each program instance handles, say, 256 elements at once. You don't think about individual threads — you think about chunks of data:

# Triton: you think about a BLOCK of elements at a time
pid = tl.program_id(0)                              # "which block am I?"
offsets = pid * BLOCK + tl.arange(0, BLOCK)          # all 256 element positions
mask = offsets < N                                   # boundary check
data = tl.load(input_ptr + offsets, mask=mask)       # load all 256 at once
tl.store(output_ptr + offsets, data + 1, mask=mask)  # write all 256 at once

The compiler decides: how to split those 256 elements across threads within a block, how to coalesce the memory accesses, and whether to use shared memory. You just say "load this block, process it, store it."

The key difference: in CUDA, you are the thread — you compute one element and must coordinate with other threads. In Triton, you are the block — you describe what happens to a chunk of data, and the compiler handles the per-thread details.
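
To make this concrete, here is the fragment above as a complete, runnable kernel plus its host-side launch. This is a minimal sketch; the kernel name add_one_kernel, the vector length, and BLOCK=256 are illustrative choices:

import torch
import triton
import triton.language as tl

@triton.jit
def add_one_kernel(input_ptr, output_ptr, N, BLOCK: tl.constexpr):
    pid = tl.program_id(0)                               # which block am I?
    offsets = pid * BLOCK + tl.arange(0, BLOCK)          # this block's element positions
    mask = offsets < N                                   # boundary check for the last block
    data = tl.load(input_ptr + offsets, mask=mask)       # load the whole block at once
    tl.store(output_ptr + offsets, data + 1, mask=mask)  # write the whole block at once

x = torch.randn(1000, device="cuda")
out = torch.empty_like(x)
grid = (triton.cdiv(x.numel(), 256),)                    # enough blocks to cover every element
add_one_kernel[grid](x, out, x.numel(), BLOCK=256)

triton.cdiv rounds up, so the last block may extend past N; the mask turns those extra lanes into no-ops.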

This means:

  • No threadIdx.x — you use tl.program_id(0) to identify which block of work you're handling
  • No __syncthreads() — the compiler inserts synchronization where needed
  • No shared memory declarations — the compiler decides what goes in shared memory
  • No coalescing concerns — the compiler arranges memory access patterns automatically

The Abstraction Stack

Triton sits in the middle of a stack of tools, each trading ease for control:

GPU Abstraction Stack (easiest at the top, most control at the bottom):

  • Level 4: PyTorch Eager — write Python ops; runs immediately, one kernel per op
  • Level 3: torch.compile — annotate with @torch.compile; the compiler fuses & optimizes for you
  • Level 2: Triton — write tile-level kernels in Python; full control, no C++ needed
  • Level 1: CUDA C++ — write thread-level kernels; maximum control, maximum complexity

The rest of this module walks through each level: how Triton simplifies code (Step 2), what the compiler does behind the scenes (Step 3), how torch.compile goes even further (Step 4), and when to use which tool (Step 5).

On the right panel: Click each level of the abstraction ladder to see what you control vs. what's automated. Notice how the lower you go, the more modules from this track become relevant — those are the concepts you'd have to manage manually.

Triton by Example

Fused Softmax: The Canonical Example

The best way to understand Triton is to see it in action. We'll look at fused softmax — the same operation from Module 8 where we saw FlashAttention eliminate HBM round-trips.

Quick recap: softmax turns a row of raw scores into probabilities that sum to 1. It requires finding the max (for numerical stability), subtracting it, exponentiating, summing, and dividing. In Module 8, we saw that fusing these steps into one kernel keeps intermediates in SRAM instead of writing them to HBM between each step.

The Triton Kernel

Here's the complete fused softmax kernel in Triton (~15 lines of actual logic):

import triton
import triton.language as tl

@triton.jit
def softmax_kernel(input_ptr, output_ptr, stride, n_cols, BLOCK: tl.constexpr):
    row_idx = tl.program_id(0)              # which row am I?
    offsets = tl.arange(0, BLOCK)            # column offsets for this block
    mask = offsets < n_cols                   # boundary check

    # Load entire row into SRAM
    row = tl.load(input_ptr + row_idx * stride + offsets, mask=mask, other=-float('inf'))

    # Softmax: max → subtract → exp → sum → divide
    row_max = tl.max(row, axis=0)            # find max (compiler handles reduction)
    numerator = tl.exp(row - row_max)        # subtract-max trick + exponentiate
    denominator = tl.sum(numerator, axis=0)  # sum (compiler handles reduction)

    # Write result
    tl.store(output_ptr + row_idx * stride + offsets, numerator / denominator, mask=mask)

Let's unpack the key lines:

  • tl.program_id(0) — "which row am I processing?" This is the block-level equivalent of CUDA's blockIdx.x. No thread math needed.
  • tl.arange(0, BLOCK) — generates a range of column offsets [0, 1, 2, ..., BLOCK-1]. The compiler maps these to individual threads automatically.
  • mask = offsets < n_cols — handles rows that aren't exact multiples of BLOCK. Masked positions read -inf (which becomes 0 after softmax).
  • tl.load(...) and tl.store(...) — load and write data. The compiler generates coalesced memory accesses automatically.
  • tl.max(...) and tl.sum(...) — reductions across the block. The compiler generates shared memory + warp shuffle reductions + __syncthreads(). You write one line; CUDA would need ~10 lines for each.
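
To run the kernel, you launch one program instance per row. Here is a minimal host-side sketch, assuming the imports shown with the kernel above; the shapes are illustrative, and BLOCK must be a power of two because of tl.arange:

import torch

x = torch.randn(1823, 781, device="cuda")
out = torch.empty_like(x)
BLOCK = triton.next_power_of_2(x.shape[1])   # 781 -> 1024; masked lanes load -inf
grid = (x.shape[0],)                         # one program instance per row
softmax_kernel[grid](x, out, x.stride(0), x.shape[1], BLOCK=BLOCK)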

What Would This Look Like in CUDA?

The same fused softmax in CUDA requires ~40+ lines: pointer arithmetic, shared memory declarations, explicit __syncthreads(), warp shuffle reduction loops, and manual boundary checking. Every concept from Modules 2, 5, and 6 shows up.

Lines of Code: Same Kernel

  ~15 lines (Triton) vs. ~45 lines (CUDA C++)

Triton eliminates thread-level boilerplate — same GPU performance.

The line count isn't the whole story. Each CUDA line is also harder to get right — a misplaced __syncthreads() or a coalescing mistake silently degrades performance without any error.

On the right panel: Click "Show CUDA equivalent" to watch each Triton line expand into its CUDA equivalent. The line counter grows from 15 to 40+ as the hidden complexity reveals itself. Notice the automation badges at the bottom — each one is a concept from a previous module.

How Triton Works

What Happens Between Python and GPU?

You've seen that Triton code is much simpler than CUDA. But the GPU still runs the same low-level operations — threads, shared memory, coalesced loads. The difference is who writes that code: you, or the compiler.

When you run a Triton kernel, your Python code goes through a multi-stage compilation pipeline before the GPU executes anything. The details of each stage don't matter — you'll never read intermediate compiler output. What matters is what the compiler decides at each stage, because those decisions are exactly the concepts you learned in Modules 2-7.

The Compilation Stages (Brief Overview)

Your Triton Python code passes through 5 stages:

  1. Python → Triton IR — your decorated function is parsed into an intermediate representation
  2. Triton IR → Triton GPU IR — hardware-specific optimizations: memory layout, warp scheduling
  3. Triton GPU IR → LLVM IR — converted to LLVM's standard representation
  4. LLVM IR → PTX — NVIDIA's assembly language for GPUs
  5. PTX → GPU binary — final machine code the GPU actually executes

The result is cached on disk, so subsequent runs skip compilation entirely.

What the Compiler Decides

This is the important part. Each decision maps to a concept from a previous module:

Thread distribution (Module 2: Execution Model)

You write tl.program_id(0) and tl.arange(0, BLOCK) — block-level identifiers. The compiler turns these into specific thread assignments: which thread handles which element, how threads are grouped into warps, how warps are assigned to SMs.

Memory coalescing (Module 5: Memory Access Patterns)

You write tl.load(ptr + offsets). The compiler analyzes the access pattern and generates vectorized load instructions (LDG.128) that read 128 bits (16 bytes) per thread. It ensures adjacent threads read adjacent memory addresses — the coalescing pattern from Module 5 — without you thinking about it.

Shared memory + synchronization (Modules 5-6: Shared Memory, Tiling)

You write tl.max(row, axis=0) or tl.sum(...). The compiler:

  • Allocates shared memory for the reduction
  • Inserts __syncthreads() barriers at the right places
  • Generates warp shuffle instructions for the final reduction steps
  • Handles all the tree reduction logic from Module 6

In CUDA, each of these requires ~10 lines of careful code. In Triton, it's one line.

Tile size selection (Module 6: Tiling)

The @triton.autotune decorator defines a search space of configurations — different tile sizes, number of warps, pipeline stages. Triton benchmarks each configuration on your actual GPU and caches the fastest one. This is why Triton's first run on a new shape is slow (benchmarking), but subsequent runs are fast (cached result).
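
Here is a sketch of what declaring a search space looks like; the configs, the key, and the toy kernel below are illustrative, not Triton's built-in defaults:

import torch
import triton
import triton.language as tl

@triton.autotune(
    configs=[
        triton.Config({"BLOCK": 128}, num_warps=4),
        triton.Config({"BLOCK": 256}, num_warps=8),
        triton.Config({"BLOCK": 1024}, num_warps=8, num_stages=2),
    ],
    key=["n_cols"],   # re-benchmark whenever n_cols changes; the winner is cached per key
)
@triton.jit
def scale_kernel(input_ptr, output_ptr, n_cols, BLOCK: tl.constexpr):
    pid = tl.program_id(0)
    offsets = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offsets < n_cols
    tl.store(output_ptr + offsets, tl.load(input_ptr + offsets, mask=mask) * 2.0, mask=mask)

x = torch.randn(10000, device="cuda")
y = torch.empty_like(x)
grid = lambda meta: (triton.cdiv(x.numel(), meta["BLOCK"]),)  # grid depends on the winning BLOCK
scale_kernel[grid](x, y, x.numel())  # first call benchmarks every config; later calls reuse the winner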

Tensor core mapping (Module 7: Tensor Cores)

For matrix operations, tl.dot(a, b) maps directly to hardware MMA (matrix multiply-accumulate) instructions — the Tensor Core operations from Module 7. The compiler handles accumulator precision and tile dimensions automatically.
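
To see what this looks like in practice, here is a compact tiled-matmul sketch built around tl.dot. It assumes row-major inputs whose dimensions are exact multiples of the tile sizes (so boundary masks are omitted); the kernel name and tile sizes are illustrative:

import torch
import triton
import triton.language as tl

@triton.jit
def matmul_kernel(a_ptr, b_ptr, c_ptr, M, N, K,
                  BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr, BLOCK_K: tl.constexpr):
    pid_m = tl.program_id(0)                              # which tile-row of C?
    pid_n = tl.program_id(1)                              # which tile-column of C?
    rm = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)          # row indices of this C tile
    rn = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)          # column indices of this C tile
    rk = tl.arange(0, BLOCK_K)
    acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)  # fp32 accumulator (Module 7)
    for k in range(0, K, BLOCK_K):                        # march along the K dimension
        a = tl.load(a_ptr + rm[:, None] * K + (k + rk)[None, :])  # BLOCK_M x BLOCK_K tile of A
        b = tl.load(b_ptr + (k + rk)[:, None] * N + rn[None, :])  # BLOCK_K x BLOCK_N tile of B
        acc += tl.dot(a, b)                               # maps to tensor-core MMA instructions
    tl.store(c_ptr + rm[:, None] * N + rn[None, :], acc)

a = torch.randn(256, 256, device="cuda", dtype=torch.float16)
b = torch.randn(256, 256, device="cuda", dtype=torch.float16)
c = torch.empty(256, 256, device="cuda", dtype=torch.float32)
grid = (triton.cdiv(256, 64), triton.cdiv(256, 64))       # one program per output tile
matmul_kernel[grid](a, b, c, 256, 256, 256, BLOCK_M=64, BLOCK_N=64, BLOCK_K=32)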

On the right panel: Hover over the colored badges to see which lines of Triton code each compiler decision affects. Toggle to "Matmul" to see the tensor core mapping with tl.dot(). The module connection strip at the bottom shows how this step connects back to what you already know.

torch.compile

What is torch.compile?

Triton lets you write GPU kernels in Python — but you still need to write a kernel. You choose the algorithm, the tile sizes, and the memory access pattern. torch.compile goes one step further: it watches your regular PyTorch code run, figures out what can be optimized, and generates Triton kernels automatically.

Added in PyTorch 2.0, torch.compile is a single decorator that can speed up your model with no algorithm changes. Same code, same math — but under the hood, torch.compile traces the operations, fuses them where possible (using the fusion rules from Module 8), and generates optimized Triton kernels.
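
Minimal usage looks like this (the function and shapes below are illustrative); the first call compiles, and subsequent calls reuse the cached result:

import torch
import torch.nn.functional as F

def f(x, W, b):
    return F.relu(F.layer_norm(x @ W + b, (W.shape[1],)))

f_compiled = torch.compile(f)            # one line: trace, fuse, generate Triton kernels

x = torch.randn(32, 1024, device="cuda")
W = torch.randn(1024, 1024, device="cuda")
b = torch.randn(1024, device="cuda")
y = f_compiled(x, W, b)                  # slow first call (compilation), fast afterwards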

The Three-Stage Pipeline

Stage 1: TorchDynamo (Tracing)

Dynamo intercepts your Python code as it runs and records every tensor operation into a graph — an FX Graph. Think of it as a camera recording what your model does, operation by operation.

For y = relu(layernorm(x @ W + b)), Dynamo captures: matmul → add → layernorm → relu. These are the graph nodes.

Graph breaks: If your code has data-dependent control flow — like if x.sum() > 0: — the tracer can't record it as a static graph. This causes a "graph break" that forces separate compilation of each segment, reducing optimization opportunities.
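
For example, a branch on a tensor's value cannot be captured in a single static graph (a hypothetical sketch):

import torch

@torch.compile
def g(x):
    if x.sum() > 0:      # data-dependent branch: Dynamo breaks the graph here
        return x * 2
    return x - 1

g(torch.randn(8, device="cuda"))   # compiles the segments on either side of the break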

Stage 2: AOT Autograd + TorchInductor (Fusion)

Inductor looks at the traced graph and applies fusion rules. Remember from Module 8: elementwise operations fuse freely, reductions are barriers.

In our example: add (elementwise), layernorm (reduction + elementwise), and relu (elementwise) get fused into a single kernel. The matmul stays separate — cuBLAS already handles it optimally.

Result: 4 separate kernel launches → 2 (one cuBLAS matmul + one fused Triton kernel).

Stage 3: Codegen (Triton Kernel Generation)

For each fused group, Inductor generates a Triton kernel. The add + layernorm + relu fusion becomes a single Triton function that loads input once, applies all three operations, and writes the final result — no intermediate HBM round-trips.

When torch.compile Helps

  • Standard tensor operations that follow fusion rules (pointwise chains, reduction + normalize patterns)
  • Repeated execution — compilation cost is amortized over many forward passes
  • Many small operations — fusion reduces kernel launch overhead

When It Doesn't Help

  • First call — compilation takes seconds (sometimes minutes for large models). Subsequent calls are cached.
  • Data-dependent control flow — if tensor.item() > 0: causes graph breaks. The tracer can only record static operation patterns.
  • Already-optimized ops — cuBLAS matmul and FlashAttention are already near-optimal. torch.compile won't improve them.
  • Dynamic shapes — different input sizes trigger recompilation. Use dynamic=True to mitigate (see the sketch below).
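
A sketch of the dynamic-shapes opt-in (the model below is a placeholder):

import torch

model = torch.nn.Linear(1024, 1024).cuda()
model_c = torch.compile(model, dynamic=True)   # compile once with symbolic shapes

for bs in (8, 32, 128):                        # varying batch sizes reuse the same compiled code
    model_c(torch.randn(bs, 1024, device="cuda"))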

On the right panel: Step through the 3-phase pipeline. Watch Dynamo trace ops into graph nodes, Inductor group them into fused blocks (notice the fusion rules from Module 8), and Codegen produce a Triton kernel. The before/after strip shows the concrete improvement: 4 launches → 2.

When to Use What

The Right Tool for the Problem

You now understand GPU programming from silicon to Python. The question isn't which tool is best — it's which tool fits the problem.

Here are the four levels, from easiest to most effort:

PyTorch eager — maximum flexibility, zero optimization. Your starting point for prototyping, debugging, and correctness testing. Every tensor operation launches a separate CUDA kernel.

torch.compile — one decorator, automatic fusion and Triton codegen. The default first step for production. Covers most optimization opportunities without writing any GPU code.

Triton — custom Python GPU kernels. Write your own fused operations when torch.compile can't fuse your pattern, or when you need specific control over tiling and memory access. Still Python — no C++ needed.

CUDA — maximum control, maximum effort. Only justified when Triton's ~80% performance ceiling isn't enough for your specific workload. For memory-bound operations, Triton already matches CUDA; the gap only appears in compute-bound kernels.

The Decision Process

The flowchart is simple:

1. Is your model fast enough? If yes, stop. Optimization without a bottleneck is wasted effort. Use profiling tools (like torch.profiler) to identify where time is actually spent (see the profiling sketch after this list).

2. Have you tried torch.compile? If not, start here. One line of code, and it handles fusion automatically. This resolves the majority of cases.

3. Is the bottleneck a standard operation? Standard operations like matrix multiplication and attention already have optimized library implementations (cuBLAS, FlashAttention). No custom kernel needed — just make sure you're using the library version.

4. Is your custom operation memory-bound or compute-bound? This is the roofline question from Module 4 — the thread that ties this entire track together.

  • Memory-bound (low arithmetic intensity): Triton matches CUDA performance. The bottleneck is HBM bandwidth, not compute, so Triton's compiler generates equally efficient memory access patterns. Use Triton.

  • Compute-bound (high arithmetic intensity): Triton reaches ~80% of CUDA performance. The gap comes from CUDA's ability to fine-tune warp-level scheduling and register usage. Write Triton first — only drop to CUDA if that 20% gap matters for your specific use case.
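
For step 1, here is a minimal torch.profiler sketch (the model and input below are placeholders):

import torch
from torch.profiler import profile, ProfilerActivity

model = torch.nn.Sequential(torch.nn.Linear(1024, 1024), torch.nn.ReLU()).cuda()
x = torch.randn(64, 1024, device="cuda")

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    model(x)                                   # the region you want to measure

# Sort kernels by GPU time to see where the time actually goes
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))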

The Full Journey

This track has taken you from "why are GPUs fast?" to "which abstraction level should I use?" Here's how it all connects:

Module 1 (Why GPUs) explained that GPUs are fast because they have thousands of cores optimized for parallel work — the fundamental reason this entire stack exists.

Modules 2-3 (Execution Model, Memory Hierarchy) showed how the GPU organizes those cores (threads → warps → blocks → SMs) and where data lives (registers → shared memory/SRAM → HBM).

Module 4 (Roofline Model) gave you the universal diagnostic: is your kernel memory-bound or compute-bound? This question appears at every level of the abstraction stack.

Modules 5-6 (Memory Patterns, Tiling) taught the two most impactful optimizations: accessing memory in coalesced patterns and reusing data through tiling. These are exactly what Triton's compiler automates.

Module 7 (Tensor Cores) showed how specialized hardware accelerates matrix math — what tl.dot() maps to behind the scenes.

Module 8 (Operator Fusion) explained why fusing kernels matters — the principle that torch.compile applies automatically.

Module 9 (this module) connected all of it: Triton automates Modules 2-7, torch.compile automates Module 8's fusion, and the roofline model (Module 4) tells you when to go deeper.

On the right panel: Click through the decision tree to explore different optimization paths. Each recommendation highlights which modules are relevant. The bottom strip shows all 9 modules — watch which ones light up as you navigate different branches.

