Tracks
6 tracks that take you from GPU hardware to running an agent fleet in production. Start anywhere — each track stands on its own.
How GPUs execute parallel workloads — from threads and warps to tensor cores and FlashAttention.
- Why GPUs & CUDA software stack
- Execution model (threads, warps, SMs)
- Memory hierarchy (registers → HBM)
- Roofline model
- Memory access patterns & coalescing
- Tiling & matrix multiply
- Tensor cores & mixed precision
- Operator fusion & FlashAttention
- Triton & torch.compile
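The roofline model from this track can be made concrete with a few lines of arithmetic: a kernel is memory-bound when its arithmetic intensity (FLOPs per byte moved) falls below the machine's compute-to-bandwidth ratio. A minimal sketch, using illustrative hardware numbers (100 TFLOP/s, 2 TB/s) that are not tied to any specific GPU:

```python
def arithmetic_intensity_matmul(m, n, k, bytes_per_elem=2):
    """FLOPs per byte for C[m,n] = A[m,k] @ B[k,n] in fp16.

    FLOPs: 2*m*n*k (one multiply + one add per inner-product term).
    Bytes: read A and B once, write C once (ignoring tile-level reuse).
    """
    flops = 2 * m * n * k
    bytes_moved = (m * k + k * n + m * n) * bytes_per_elem
    return flops / bytes_moved

def bound(intensity, peak_tflops, bandwidth_tbps):
    """Roofline verdict: compare kernel intensity to the machine balance point."""
    ridge = (peak_tflops * 1e12) / (bandwidth_tbps * 1e12)  # FLOPs per byte
    return "compute-bound" if intensity >= ridge else "memory-bound"

# Large square matmul: lots of data reuse, lands on the compute roof.
big = arithmetic_intensity_matmul(4096, 4096, 4096)
# Skinny matmul (one decode step of an LLM): almost no reuse.
skinny = arithmetic_intensity_matmul(1, 4096, 4096)
```

The same calculation explains why single-sequence LLM decoding is bandwidth-limited while training-sized matmuls saturate the tensor cores.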
From tokenization to PagedAttention — how large language models process text and generate output.
- Tokenization (BPE)
- Embeddings & positional encoding
- Self-attention (Q, K, V)
- Transformer block
- Text generation & sampling
- KV cache
- Quantization (GPTQ, AWQ, GGUF)
- Batching (static vs continuous)
- PagedAttention
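The BPE tokenization step at the top of this list fits in a few lines: repeatedly merge the most frequent adjacent symbol pair. A toy version over a made-up three-word corpus (not the actual GPT-2 merge table):

```python
from collections import Counter

def most_frequent_pair(tokens):
    """Count adjacent symbol pairs and return the most common one."""
    pairs = Counter(zip(tokens, tokens[1:]))
    return max(pairs, key=pairs.get)

def merge_pair(tokens, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

# Start from characters; '_' marks word boundaries, as in classic BPE demos.
tokens = list("low lower lowest".replace(" ", "_"))
for _ in range(3):  # three merge rounds: lo -> low -> _low (or similar order)
    tokens = merge_pair(tokens, most_frequent_pair(tokens))
```

After a few rounds the shared stem "low" becomes a single token, which is the whole trick: frequent substrings get short codes.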
How vLLM, SGLang, and TensorRT-LLM actually serve LLMs — scheduler, memory, and serving-engine internals.
- Inference engine internals (vLLM)
- Speculative decoding
- Prefill/decode disaggregation
- Serving metrics & SLOs (TTFT, TPOT, P99)
- CUDA Graphs
- Multi-LoRA serving
- Prefix caching & RadixAttention
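The static-vs-continuous batching distinction this track covers can be shown with a toy decode simulation: under continuous batching a finished sequence frees its slot immediately and a waiting request joins mid-flight, instead of the whole batch draining first. A simplified sketch with made-up per-request decode lengths:

```python
from collections import deque

def continuous_batching(request_lengths, max_batch=2):
    """Simulate decode steps; each step every running request emits one token.

    Finished requests free their slot at once, so waiting requests join
    mid-flight. Returns total decode steps to serve everything.
    """
    waiting = deque(request_lengths)   # tokens still to generate per request
    running, steps = [], 0
    while waiting or running:
        # Admit new requests into any free slots before each step.
        while waiting and len(running) < max_batch:
            running.append(waiting.popleft())
        steps += 1
        running = [r - 1 for r in running if r - 1 > 0]
    return steps

def static_batching(request_lengths, max_batch=2):
    """Static batching: each batch runs for as long as its longest request."""
    steps = 0
    for i in range(0, len(request_lengths), max_batch):
        steps += max(request_lengths[i:i + max_batch])
    return steps
```

With requests of length 5, 1, and 3 and two slots, the continuous scheduler finishes in 5 steps while the static one needs 8; the gap widens as request lengths get more skewed.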
Foundations of AI agents — the loop, tools, workflows, retrieval, context engineering, planning, evals, and security — through visual interactive simulations.
- The agent loop & state
- Tool use & function calling (incl. MCP, Skills)
- Workflow patterns + subagent topology
- Retrieval & RAG
- Context engineering
- Planning & reflection
- Evals & diagnostics
- Security & the lethal trifecta
- Capstone: three designs, same task
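The agent loop at the heart of this track is small: call the model, execute whatever tool it requests, feed the observation back, and stop on a final answer. A minimal sketch with a stubbed model and one hypothetical tool; a real harness would call an LLM API where `model` is invoked:

```python
def run_agent(model, tools, task, max_turns=8):
    """Basic agent loop: model -> tool -> observation -> model, until done."""
    state = [{"role": "user", "content": task}]
    for _ in range(max_turns):
        action = model(state)                  # decide the next step
        if action["type"] == "final":
            return action["content"]
        result = tools[action["tool"]](**action["args"])   # execute the tool
        state.append({"role": "tool", "content": result})  # feed back observation
    raise RuntimeError("max turns exceeded")

# Stub model: requests the calculator once, then answers with its result.
def stub_model(state):
    if state[-1]["role"] == "tool":
        return {"type": "final", "content": state[-1]["content"]}
    return {"type": "tool", "tool": "add", "args": {"a": 2, "b": 3}}

answer = run_agent(stub_model, {"add": lambda a, b: str(a + b)}, "what is 2+3?")
```

Everything else in the track (retrieval, planning, guardrails, evals) hangs off this loop: they shape what goes into `state` and what counts as a valid `action`.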
Production agent engineering — durable harnesses, observability, layered guardrails, deployment, incident response, and running an agent fleet against an SLO.
- Production harness architecture
- Observability for agents
- Layered guardrails
- Cost & latency engineering
- Production evals & shadow mode
- Deployment & rollout
- Incident handling
- Agent teams (multi-agent orchestration)
- Capstone: reliability operations
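Layered guardrails from this list compose naturally as a pipeline of independent checks, each able to block an output before it ships. A toy sketch with two hypothetical layers (a cheap pattern filter and a length budget); production layers would include model-based classifiers:

```python
def make_guardrail_pipeline(layers):
    """Run each (name, check) layer in order; the first failure blocks the output."""
    def guard(text):
        for name, check in layers:
            if not check(text):
                return {"allowed": False, "blocked_by": name}
        return {"allowed": True, "blocked_by": None}
    return guard

# Hypothetical layers, cheapest first so expensive checks run last.
layers = [
    ("secret_filter", lambda t: "API_KEY" not in t),
    ("length_budget", lambda t: len(t) <= 200),
]
guard = make_guardrail_pipeline(layers)
```

Recording `blocked_by` is the observability hook: per-layer block rates tell you which guardrail is doing the work and which is dead weight.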
How to actually train large models across many GPUs — data, tensor, and pipeline parallelism, ZeRO/FSDP, NCCL collectives, and interconnect economics.
- Data parallelism & AllReduce
- Tensor parallelism
- Pipeline parallelism (1F1B)
- ZeRO / FSDP
- 3D parallelism
- NCCL collectives
- NVLink vs InfiniBand
- Gradient checkpointing
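Data parallelism hinges on AllReduce: after each step every worker must hold the sum of all workers' gradients. A pure-Python simulation of a ring AllReduce (reduce-scatter followed by all-gather) over toy gradient vectors; in real training NCCL does this over NVLink or InfiniBand:

```python
def ring_allreduce(grads):
    """Sum gradients across n workers via a ring.

    grads: list of n equal-length lists, one vector per worker, split into
    n chunks (here chunk c is just element c). Each worker sends only one
    chunk per step, so total traffic is 2*(n-1) chunks per worker instead
    of n-1 full vectors.
    """
    n = len(grads)
    buf = [list(g) for g in grads]  # buf[w][c]: worker w's copy of chunk c
    # Reduce-scatter: step s, worker w sends chunk (w - s) % n to worker w+1,
    # which accumulates it. Snapshot sends first to model simultaneous comms.
    for s in range(n - 1):
        sent = [(w, (w - s) % n, buf[w][(w - s) % n]) for w in range(n)]
        for w, c, val in sent:
            buf[(w + 1) % n][c] += val
    # Now worker w holds the fully reduced chunk (w + 1) % n.
    # All-gather: circulate the reduced chunks so every worker has all of them.
    for s in range(n - 1):
        sent = [(w, (w + 1 - s) % n, buf[w][(w + 1 - s) % n]) for w in range(n)]
        for w, c, val in sent:
            buf[(w + 1) % n][c] = val
    return buf
```

With three workers holding `[1,1,1]`, `[2,2,2]`, `[3,3,3]`, every worker ends with `[6,6,6]`; the bandwidth-optimality of this ring schedule is why NCCL uses it (among other topologies) for large messages.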