Understand AI systems by seeing them work
Interactive visual simulations from GPU architecture to running an agent fleet in production. No GPU required. Free.
The Full AI Stack
5 tracks that take you from GPU hardware to running an agent fleet in production. Start anywhere — each track stands on its own.
AI Explained
Plain-language takes on trending AI concepts, with live simulations.
SOP paper — Hardware-aware per-layer PTQ at FP6 — What does it mean?
SOP picks a different codebook per layer using activation weights — and at FP6, that beats vanilla FP8 reconstruction error using 1.5 fewer bits per weight.
PPOW paper — window-level RL for speculative drafters — What does it mean?
PPOW trains speculative-decoding drafters with window-level RL: three rewards adapt window size to KL divergence, lifting acceptance to 6.29–6.52 and end-to-end speedup to 3.4–4.4×.
MCP SEP-2663 lands Tasks extension — async task handles for long-running tool calls — What does it mean?
SEP-2663 lets an MCP server return a Task handle from tools/call; the client then drives it with tasks/get, tasks/update, and tasks/cancel — no blocked connections.
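As a rough illustration of the handle-then-poll flow described above, here is a minimal in-memory sketch. It is not the MCP SDK or the SEP-2663 wire format: `ToyTaskServer`, `run_tool`, the `taskId` field, and the status strings `working`, `completed`, and `cancelled` are all hypothetical names chosen for this example. Only the method names `tools/call`, `tasks/get`, and `tasks/cancel` come from the text; `tasks/update` is left out because its semantics are not described here.

```python
import itertools


class ToyTaskServer:
    """In-memory stand-in for a server (hypothetical, not a real MCP implementation)."""

    def __init__(self, polls_needed=3):
        self._ids = itertools.count(1)
        self._tasks = {}
        self._polls_needed = polls_needed  # assumed: task finishes after this many polls

    def handle(self, method, params):
        if method == "tools/call":
            # Return a task handle immediately instead of blocking on the result.
            task_id = f"task-{next(self._ids)}"
            self._tasks[task_id] = {"status": "working", "polls": 0, "result": None}
            return {"taskId": task_id, "status": "working"}

        task = self._tasks[params["taskId"]]
        if method == "tasks/get":
            # Simulate background progress: each poll moves the task forward.
            if task["status"] == "working":
                task["polls"] += 1
                if task["polls"] >= self._polls_needed:
                    task["status"] = "completed"
                    task["result"] = "done"
        elif method == "tasks/cancel":
            if task["status"] == "working":
                task["status"] = "cancelled"
        else:
            raise ValueError(f"unsupported method: {method}")
        return {
            "taskId": params["taskId"],
            "status": task["status"],
            "result": task["result"],
        }


def run_tool(server, name, max_polls=10):
    """Start a tool call, poll its task handle, and cancel if it takes too long."""
    handle = server.handle("tools/call", {"name": name})
    for _ in range(max_polls):
        state = server.handle("tasks/get", {"taskId": handle["taskId"]})
        if state["status"] != "working":
            return state
    return server.handle("tasks/cancel", {"taskId": handle["taskId"]})


if __name__ == "__main__":
    print(run_tool(ToyTaskServer(polls_needed=3), "slow_report"))
    print(run_tool(ToyTaskServer(polls_needed=50), "slow_report", max_polls=2))
```

The point of the sketch is that every request returns right away, so the client decides how often to poll and when to give up, rather than holding one connection open for the whole tool call.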
GPU & CUDA
9 interactive modules, from the GPU execution model to Triton & torch.compile. All free.
Why GPUs?
CPU vs GPU design philosophy, throughput vs latency, and the CUDA software stack.
Execution Model
Threads, warps, blocks, grids, and SMs — how GPUs schedule parallel work.
Memory Hierarchy
Registers, shared memory, L2, HBM, PCIe, and NVLink — where data lives.
Roofline Model
Compute-bound vs memory-bound — the universal performance mental model.
Memory Access Patterns
Coalesced access, bank conflicts, and why memory layout matters.
Tiling & Matrix Multiply
The fundamental GPU optimization — data reuse via shared memory tiling.
Tensor Cores & Mixed Precision
Hardware matrix multiply, FP16/BF16/FP8/INT8, and why dims must align.
Operator Fusion & FlashAttention
Fused kernels, IO-aware design, and why FlashAttention is fast.
Triton & torch.compile
Python-level GPU programming and the abstraction stack.
LLM Internals
9 interactive modules, from tokenization to PagedAttention. All free.
Tokenization
How text becomes tokens — BPE, subword splitting, byte-level encoding.
Embeddings
Tokens to vectors — embedding lookup, positional encoding, cosine similarity.
Self-Attention
Q, K, V — the attention mechanism that powers transformers.
Transformer Block
Attention + FFN + LayerNorm + residual — one block at a time.
Text Generation
Autoregressive decoding — temperature, top-k, top-p sampling.
KV Cache
Why KV caching is essential — prefill vs decode, GQA.
Quantization
Shrink LLMs — FP32/FP16/INT8/INT4, GPTQ, AWQ, QLoRA, GGUF.
Batching
Static vs continuous batching, and the memory-throughput tradeoff.
Paged Attention
How vLLM solves KV cache fragmentation — block tables, prefix sharing.
LLM Serving
7 interactive modules, from inference-engine internals to prefix caching. All free.
Inference Engine
vLLM scheduler, memory manager, model executor — how a serving engine processes requests end to end.
Speculative Decoding
Draft model → parallel verification — generating multiple tokens per forward pass.
Prefill/Decode Disaggregation
Separate GPU pools for prefill and decode, chunked prefill, and why disaggregation helps.
Serving Metrics & SLOs
TTFT, TPOT, throughput, goodput, P99 — measuring and reasoning about inference quality.
CUDA Graphs
Eliminating kernel launch overhead for decode — capturing and replaying GPU work.
Multi-LoRA Serving
Dynamic adapter loading per request — SGMV kernels, unified paging, rank-vs-KV tradeoff.
Prefix Caching & RadixAttention
Cross-request KV reuse — SGLang's radix tree vs vLLM's block-hash chain, eviction, production pricing.
AI Agents
9 interactive modules. Free, browser-based.
The Agent Loop & State
Loop + state + LLM-OS + workflows-vs-agents decision + harness anatomy.
Tool Use & Function Calling
Schemas, ACI + anti-patterns, structured outputs, MCP two ways, Skills, attack-surface preview.
Workflow Patterns
When NOT to use an agent + the five named patterns + subagent topology.
Retrieval & RAG
Embeddings as coordinates, retrieve-then-generate, chunking, ANN trade, RAG failure modes.
Context Engineering
Four context failure modes, four fixes, tool-loading strategy, subagents as context isolation.
Planning & Reflection
Reasoning budget, ReAct + think tool, Reflexion + verifiers, when to retry, when to stop.
Evals & Diagnostics
Compounding errors, error analysis first, golden cases, pass^k, the four eval failure modes.
Security & the Lethal Trifecta
Private data + untrusted content + exfiltration vector. Data-flow graphs, structural defenses, capability scoping.
Capstone — Three Designs
RAG-only vs workflow vs autonomous agent on the same customer-order task. Compared end to end.
Agent Engineering
9 interactive modules. Production-ready agent engineering that picks up where AI Agents Foundations ends. Free, browser-based.
Production Harness Architecture
Idempotency, checkpoints, retry policy, durable execution platforms — surviving crashes, deploys, and network failures.
Observability for Agents
Span-per-tick traces, what to log, replay, golden metrics vs vanity, alerting on stochastic systems.
Layered Guardrails
Defense-in-depth: input filters, output filters, policy enforcement, fail-safe vs fail-open.
Cost & Latency Engineering
Where the tokens go, prompt caching, result caching, parallel tool calls, batching at the agent layer.
Production Evals & Shadow Mode
Online vs offline evals, shadow traffic, A/B harness, drift detection, eval-driven rollout.
Deployment & Rollout
Prompts as code, canary rollouts, rolling releases, version pinning, rollback discipline.
Incident Handling
The first 15 minutes, trace replay in anger, root-cause discipline, postmortems, drills.
Agent Teams
Multi-agent orchestration — when teams beat a single agent, supervisor/worker, voting, handoffs, coordination cost.
Capstone — Reliability Operations
SLOs, error budgets, on-call rotations, runbooks. From Foundations to running an agent fleet.