Tracks

5 tracks that take you from GPU hardware to running an agent fleet in production, plus a 6th on distributed training coming soon. Start anywhere — each track stands on its own.

GPU
Track 1
GPU & CUDA
9 modules

How GPUs execute parallel workloads — from threads and warps to tensor cores and FlashAttention. A roofline sketch follows the module list.

  • Why GPUs & CUDA software stack
  • Execution model (threads, warps, SMs)
  • Memory hierarchy (registers → HBM)
  • Roofline model
  • Memory access patterns & coalescing
  • Tiling & matrix multiply
  • Tensor cores & mixed precision
  • Operator fusion & FlashAttention
  • Triton & torch.compile
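
A back-of-the-envelope roofline check, in the spirit of the roofline and tiling modules above. The peak numbers are NVIDIA's published A100-80GB figures (~312 TFLOP/s dense FP16 tensor-core compute, ~2.0 TB/s HBM bandwidth); the GEMM shapes are arbitrary examples.

```python
# Roofline sketch: is a given GEMM compute-bound or memory-bound?
PEAK_FLOPS = 312e12  # FLOP/s, A100-80GB dense FP16 tensor cores (spec sheet)
PEAK_BW = 2.0e12     # bytes/s, A100-80GB HBM2e bandwidth (spec sheet)

def gemm_intensity(m: int, k: int, n: int, bytes_per_elem: int = 2) -> float:
    """Arithmetic intensity (FLOP/byte) of C[m,n] = A[m,k] @ B[k,n] in FP16."""
    flops = 2 * m * k * n                               # one multiply + one add per MAC
    traffic = bytes_per_elem * (m * k + k * n + m * n)  # read A and B, write C once
    return flops / traffic

ridge = PEAK_FLOPS / PEAK_BW  # intensity where the two roofs meet (~156 FLOP/byte)

for shape in [(4096, 4096, 4096), (1, 4096, 4096)]:  # big GEMM vs. GEMV-like decode
    ai = gemm_intensity(*shape)
    verdict = "compute-bound" if ai > ridge else "memory-bound"
    print(f"{shape}: {ai:6.1f} FLOP/byte vs ridge {ridge:.1f} -> {verdict}")
```

The second shape is the matrix-vector case you get when decoding a single request, which is why it lands on the memory-bound side of the ridge.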
Start track →
LLM
Track 2
LLM Internals
9 modules

From tokenization to PagedAttention — how large language models process text and generate output. A minimal attention sketch follows the module list.

  • Tokenization (BPE)
  • Embeddings & positional encoding
  • Self-attention (Q, K, V)
  • Transformer block
  • Text generation & sampling
  • KV cache
  • Quantization (GPTQ, AWQ, GGUF)
  • Batching (static vs continuous)
  • PagedAttention
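
A minimal single-head self-attention in NumPy: just the softmax(QKᵀ/√d)·V core from the self-attention module. Real implementations add causal masking, multiple heads, batching, and the KV cache; the shapes and weights here are arbitrary.

```python
import numpy as np

def self_attention(x, wq, wk, wv):
    """Single-head scaled dot-product attention: softmax(Q K^T / sqrt(d)) V.

    x: (seq_len, d_model); wq/wk/wv: (d_model, d_head) projections.
    """
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(k.shape[-1])         # (seq, seq) similarities
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # each row sums to 1
    return weights @ v                              # weighted mix of values

rng = np.random.default_rng(0)
seq, d_model, d_head = 8, 32, 16
x = rng.normal(size=(seq, d_model))
wq, wk, wv = (0.1 * rng.normal(size=(d_model, d_head)) for _ in range(3))
print(self_attention(x, wq, wk, wv).shape)  # (8, 16)
```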
Start track →
API
Track 3
LLM Serving
7 modules

How vLLM, SGLang, and TensorRT-LLM actually serve LLMs — scheduler, memory, and serving-engine internals. A metrics sketch follows the module list.

  • Inference engine internals (vLLM)
  • Speculative decoding
  • Prefill/decode disaggregation
  • Serving metrics & SLOs (TTFT, TPOT, P99)
  • CUDA Graphs
  • Multi-LoRA serving
  • Prefix caching & RadixAttention
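
A sketch of how the serving-metrics module's numbers fall out of raw token timestamps. The definitions (TTFT as arrival-to-first-token, TPOT as mean inter-token gap, nearest-rank P99) are standard; the timestamps below are made up for illustration.

```python
import math
import statistics

def request_metrics(t_arrival: float, token_times: list[float]) -> dict:
    """TTFT = first token minus arrival; TPOT = mean gap between later tokens."""
    ttft = token_times[0] - t_arrival
    gaps = [b - a for a, b in zip(token_times, token_times[1:])]
    return {"ttft": ttft, "tpot": statistics.mean(gaps) if gaps else 0.0}

def p99(values: list[float]) -> float:
    """Nearest-rank 99th percentile, the usual tail-latency SLO statistic."""
    ranked = sorted(values)
    return ranked[math.ceil(0.99 * len(ranked)) - 1]

# One illustrative request: arrives at t=0.0 s, then streams four tokens.
m = request_metrics(0.0, [0.25, 0.29, 0.33, 0.37])
print(f"TTFT = {m['ttft'] * 1000:.0f} ms, TPOT = {m['tpot'] * 1000:.0f} ms")
print(f"P99 TTFT = {p99([0.21, 0.25, 0.30, 0.24, 0.95]):.2f} s")  # tail request dominates
```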
Start track →
AGT
Track 4
AI Agents
9 modules

Foundations of AI agents — the loop, tools, workflows, retrieval, context engineering, planning, evals, and security — through visual interactive simulations. A bare-bones agent loop follows the module list.

  • The agent loop & state
  • Tool use & function calling (incl. MCP, Skills)
  • Workflow patterns + subagent topology
  • Retrieval & RAG
  • Context engineering
  • Planning & reflection
  • Evals & diagnostics
  • Security & the lethal trifecta
  • Capstone: three designs, same task
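
The agent loop from the first module, reduced to a runnable sketch. `call_model` and the `TOOLS` registry are hypothetical stand-ins; a real loop would speak an actual LLM API's tool-calling protocol and carry much richer state.

```python
TOOLS = {
    "add": lambda a, b: a + b,  # toy tool standing in for search, code exec, etc.
}

def call_model(messages: list[dict]) -> dict:
    """Hypothetical model stub: asks for one tool call, then answers
    once a tool result appears in the transcript."""
    if any(m["role"] == "tool" for m in messages):
        return {"type": "answer", "text": "2 + 3 = 5"}
    return {"type": "tool_call", "name": "add", "args": {"a": 2, "b": 3}}

def run_agent(task: str, max_steps: int = 5) -> str:
    messages = [{"role": "user", "content": task}]  # the loop's only state
    for _ in range(max_steps):                      # hard cap so the loop terminates
        action = call_model(messages)
        if action["type"] == "answer":
            return action["text"]
        result = TOOLS[action["name"]](**action["args"])  # execute the requested tool
        messages.append({"role": "tool", "content": str(result)})
    return "gave up: step budget exhausted"

print(run_agent("What is 2 + 3?"))  # -> 2 + 3 = 5
```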
Start track →
ENG
Track 5
Agent Engineering
9 modules

Production agent engineering — durable harnesses, observability, layered guardrails, deployment, incident response, and running an agent fleet against an SLO. A guardrails sketch follows the module list.

  • Production harness architecture
  • Observability for agents
  • Layered guardrails
  • Cost & latency engineering
  • Production evals & shadow mode
  • Deployment & rollout
  • Incident handling
  • Agent teams (multi-agent orchestration)
  • Capstone: reliability operations
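
A sketch of the layered-guardrails idea: independent checks wrapped around a model call, any of which can block, with the output re-checked for defense in depth. The guards and the `call_model` stub are hypothetical placeholders for real scanners and a real client.

```python
def no_secrets(text: str) -> bool:
    return "API_KEY" not in text  # crude stand-in for a real secret scanner

def within_budget(text: str) -> bool:
    return len(text) < 4_000      # crude stand-in for a cost/latency guard

def call_model(prompt: str) -> str:
    return f"echo: {prompt}"      # hypothetical model client stub

INPUT_GUARDS = [no_secrets, within_budget]
OUTPUT_GUARDS = [no_secrets]      # re-check the output: defense in depth

def guarded_call(prompt: str) -> str:
    for guard in INPUT_GUARDS:
        if not guard(prompt):
            return f"blocked on input by {guard.__name__}"
    output = call_model(prompt)
    for guard in OUTPUT_GUARDS:
        if not guard(output):
            return f"blocked on output by {guard.__name__}"
    return output

print(guarded_call("summarize this doc"))      # passes all layers
print(guarded_call("my API_KEY is elided"))    # blocked at the first layer
```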
Start track →
DT
Track 6
Distributed Training
Coming soon

How to actually train large models across many GPUs — data, tensor, and pipeline parallelism, ZeRO/FSDP, NCCL collectives, and interconnect economics. An AllReduce sketch follows the module list.

  • Data parallelism & AllReduce
  • Tensor parallelism
  • Pipeline parallelism (1F1B)
  • ZeRO / FSDP
  • 3D parallelism
  • NCCL collectives
  • NVLink vs InfiniBand
  • Gradient checkpointing
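
Data parallelism in one move: every worker computes gradients on its own data shard, then an AllReduce leaves each replica holding the average so all take the identical optimizer step. A pure-Python simulation of that collective (no NCCL, no real GPUs):

```python
def all_reduce_mean(per_worker_grads: list[list[float]]) -> list[list[float]]:
    """Simulate AllReduce-mean: every worker ends with the same averaged gradient."""
    n = len(per_worker_grads)
    summed = [sum(vals) for vals in zip(*per_worker_grads)]  # elementwise sum across workers
    mean = [s / n for s in summed]
    return [list(mean) for _ in range(n)]                    # one identical copy per worker

grads = [
    [0.2, -1.0, 0.5],  # worker 0's gradient on its shard
    [0.4, -0.6, 0.1],  # worker 1
    [0.0, -0.8, 0.3],  # worker 2
]
print([round(g, 6) for g in all_reduce_mean(grads)[0]])  # [0.2, -0.8, 0.3] on every worker
```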