Tracks
6 tracks that take you from GPU hardware to running an agent fleet in production. Start anywhere — each track stands on its own.
How GPUs execute parallel workloads — from threads and warps to tensor cores and FlashAttention.
- Why GPUs & CUDA software stack
- Execution model (threads, warps, SMs)
- Memory hierarchy (registers → HBM)
- Roofline model
- Memory access patterns & coalescing
- Tiling & matrix multiply
- Tensor cores & mixed precision
- Operator fusion & FlashAttention
- Triton & torch.compile
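The roofline model from this track can be made concrete with a few lines of arithmetic: a kernel is memory-bound when its arithmetic intensity (FLOPs per byte moved) falls below the machine's compute-to-bandwidth ratio. A minimal sketch, using illustrative hardware numbers (100 TFLOP/s, 2 TB/s) that are not tied to any specific GPU:

```python
def arithmetic_intensity_matmul(m, n, k, bytes_per_elem=2):
    """FLOPs per byte for C[m,n] = A[m,k] @ B[k,n] in fp16.

    FLOPs: 2*m*n*k (one multiply + one add per inner-product term).
    Bytes: read A and B once, write C once (ignoring tile-level reuse).
    """
    flops = 2 * m * n * k
    bytes_moved = (m * k + k * n + m * n) * bytes_per_elem
    return flops / bytes_moved

def bound(intensity, peak_tflops, bandwidth_tbps):
    """Roofline verdict: compare kernel intensity to the machine balance point."""
    ridge = (peak_tflops * 1e12) / (bandwidth_tbps * 1e12)  # FLOPs per byte
    return "compute-bound" if intensity >= ridge else "memory-bound"

# Large square matmul: lots of data reuse, lands on the compute roof.
big = arithmetic_intensity_matmul(4096, 4096, 4096)
# Skinny matmul (one decode step of an LLM): almost no reuse.
skinny = arithmetic_intensity_matmul(1, 4096, 4096)
```

The same calculation explains why single-sequence LLM decoding is bandwidth-limited while training-sized matmuls saturate the tensor cores.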
From tokenization to PagedAttention — how large language models process text and generate output.
- Tokenization (BPE)
- Embeddings & positional encoding
- Self-attention (Q, K, V)
- Transformer block
- Text generation & sampling
- KV cache
- Quantization (GPTQ, AWQ, GGUF)
- Batching (static vs continuous)
- PagedAttention
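The BPE tokenization step at the top of this list fits in a few lines: repeatedly merge the most frequent adjacent symbol pair. A toy version over a made-up three-word corpus (not the actual GPT-2 merge table):

```python
from collections import Counter

def most_frequent_pair(tokens):
    """Count adjacent symbol pairs and return the most common one."""
    pairs = Counter(zip(tokens, tokens[1:]))
    return max(pairs, key=pairs.get)

def merge_pair(tokens, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

# Start from characters; '_' marks word boundaries, as in classic BPE demos.
tokens = list("low lower lowest".replace(" ", "_"))
for _ in range(3):  # three merge rounds: lo -> low -> _low (or similar order)
    tokens = merge_pair(tokens, most_frequent_pair(tokens))
```

After a few rounds the shared stem "low" becomes a single token, which is the whole trick: frequent substrings get short codes.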
How vLLM, SGLang, and TensorRT-LLM actually serve LLMs — scheduler, memory, and serving-engine internals.
- Inference engine internals (vLLM)
- Speculative decoding
- Prefill/decode disaggregation
- Serving metrics & SLOs (TTFT, TPOT, P99)
- CUDA Graphs
- Multi-LoRA serving
- Prefix caching & RadixAttention
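The static-vs-continuous batching distinction this track covers can be shown with a toy decode simulation: under continuous batching a finished sequence frees its slot immediately and a waiting request joins mid-flight, instead of the whole batch draining first. A simplified sketch with made-up per-request decode lengths:

```python
from collections import deque

def continuous_batching(request_lengths, max_batch=2):
    """Simulate decode steps; each step every running request emits one token.

    Finished requests free their slot at once, so waiting requests join
    mid-flight. Returns total decode steps to serve everything.
    """
    waiting = deque(request_lengths)   # tokens still to generate per request
    running, steps = [], 0
    while waiting or running:
        # Admit new requests into any free slots before each step.
        while waiting and len(running) < max_batch:
            running.append(waiting.popleft())
        steps += 1
        running = [r - 1 for r in running if r - 1 > 0]
    return steps

def static_batching(request_lengths, max_batch=2):
    """Static batching: each batch runs for as long as its longest request."""
    steps = 0
    for i in range(0, len(request_lengths), max_batch):
        steps += max(request_lengths[i:i + max_batch])
    return steps
```

With requests of length 5, 1, and 3 and two slots, the continuous scheduler finishes in 5 steps while the static one needs 8; the gap widens as request lengths get more skewed.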
Foundations of AI agents — the loop, tools, workflows, retrieval, context engineering, planning, evals, and security — through visual interactive simulations.
- The agent loop & state
- Tool use & function calling (incl. MCP, Skills)
- Workflow patterns + subagent topology
- Retrieval & RAG
- Context engineering
- Planning & reflection
- Evals & diagnostics
- Security & the lethal trifecta
- Capstone: three designs, same task
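The agent loop at the heart of this track is small: call the model, execute whatever tool it requests, feed the observation back, and stop on a final answer. A minimal sketch with a stubbed model and one hypothetical tool; a real harness would call an LLM API where `model` is invoked:

```python
def run_agent(model, tools, task, max_turns=8):
    """Basic agent loop: model -> tool -> observation -> model, until done."""
    state = [{"role": "user", "content": task}]
    for _ in range(max_turns):
        action = model(state)                  # decide the next step
        if action["type"] == "final":
            return action["content"]
        result = tools[action["tool"]](**action["args"])   # execute the tool
        state.append({"role": "tool", "content": result})  # feed back observation
    raise RuntimeError("max turns exceeded")

# Stub model: requests the calculator once, then answers with its result.
def stub_model(state):
    if state[-1]["role"] == "tool":
        return {"type": "final", "content": state[-1]["content"]}
    return {"type": "tool", "tool": "add", "args": {"a": 2, "b": 3}}

answer = run_agent(stub_model, {"add": lambda a, b: str(a + b)}, "what is 2+3?")
```

Everything else in the track (retrieval, planning, guardrails, evals) hangs off this loop: they shape what goes into `state` and what counts as a valid `action`.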
Production agent engineering — durable harnesses, observability, layered guardrails, deployment, incident response, and running an agent fleet against an SLO.
- Production harness architecture
- Observability for agents
- Layered guardrails
- Cost & latency engineering
- Production evals & shadow mode
- Deployment & rollout
- Incident handling
- Agent teams (multi-agent orchestration)
- Capstone: reliability operations
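Layered guardrails from this list compose naturally as a pipeline of independent checks, each able to block an output before it ships. A toy sketch with two hypothetical layers (a cheap pattern filter and a length budget); production layers would include model-based classifiers:

```python
def make_guardrail_pipeline(layers):
    """Run each (name, check) layer in order; the first failure blocks the output."""
    def guard(text):
        for name, check in layers:
            if not check(text):
                return {"allowed": False, "blocked_by": name}
        return {"allowed": True, "blocked_by": None}
    return guard

# Hypothetical layers, cheapest first so expensive checks run last.
layers = [
    ("secret_filter", lambda t: "API_KEY" not in t),
    ("length_budget", lambda t: len(t) <= 200),
]
guard = make_guardrail_pipeline(layers)
```

Recording `blocked_by` is the observability hook: per-layer block rates tell you which guardrail is doing the work and which is dead weight.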
How to actually train large models across many GPUs — data, tensor, and pipeline parallelism, ZeRO/FSDP, NCCL collectives, and interconnect economics.
- Data parallelism & AllReduce
- Tensor parallelism
- Pipeline parallelism (1F1B)
- ZeRO / FSDP
- 3D parallelism
- NCCL collectives
- NVLink vs InfiniBand
- Gradient checkpointing
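Data parallelism hinges on AllReduce: after each step every worker must hold the sum of all workers' gradients. A pure-Python simulation of a ring AllReduce (reduce-scatter followed by all-gather) over toy gradient vectors; in real training NCCL does this over NVLink or InfiniBand:

```python
def ring_allreduce(grads):
    """Sum gradients across n workers via a ring.

    grads: list of n equal-length lists, one vector per worker, split into
    n chunks (here chunk c is just element c). Each worker sends only one
    chunk per step, so total traffic is 2*(n-1) chunks per worker instead
    of n-1 full vectors.
    """
    n = len(grads)
    buf = [list(g) for g in grads]  # buf[w][c]: worker w's copy of chunk c
    # Reduce-scatter: step s, worker w sends chunk (w - s) % n to worker w+1,
    # which accumulates it. Snapshot sends first to model simultaneous comms.
    for s in range(n - 1):
        sent = [(w, (w - s) % n, buf[w][(w - s) % n]) for w in range(n)]
        for w, c, val in sent:
            buf[(w + 1) % n][c] += val
    # Now worker w holds the fully reduced chunk (w + 1) % n.
    # All-gather: circulate the reduced chunks so every worker has all of them.
    for s in range(n - 1):
        sent = [(w, (w + 1 - s) % n, buf[w][(w + 1 - s) % n]) for w in range(n)]
        for w, c, val in sent:
            buf[(w + 1) % n][c] = val
    return buf
```

With three workers holding `[1,1,1]`, `[2,2,2]`, `[3,3,3]`, every worker ends with `[6,6,6]`; the bandwidth-optimality of this ring schedule is why NCCL uses it (among other topologies) for large messages.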