LLM Serving
7 interactive modules from inference-engine internals to prefix caching. All free.
Module 1: Inference Engine
vLLM scheduler, memory manager, model executor — how a serving engine processes requests end to end.
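A taste of what the module builds: a toy continuous-batching loop. Names like ToyEngine and Request are illustrative, not vLLM's actual classes, and the memory manager is reduced to a comment.

```python
# Toy continuous-batching engine: illustrative names only, not vLLM's classes.
from collections import deque
from dataclasses import dataclass, field

@dataclass
class Request:
    prompt_tokens: list
    max_new_tokens: int
    output_tokens: list = field(default_factory=list)

class ToyEngine:
    def __init__(self, max_batch_size=8):
        self.waiting = deque()   # requests not yet scheduled
        self.running = []        # requests in the current batch
        self.max_batch_size = max_batch_size

    def add_request(self, req):
        self.waiting.append(req)

    def step(self):
        # Scheduler: admit waiting requests while the batch has room.
        # (A real engine also asks its memory manager for free KV blocks.)
        while self.waiting and len(self.running) < self.max_batch_size:
            self.running.append(self.waiting.popleft())
        # Model executor: one forward pass yields one token per running request.
        for req in self.running:
            req.output_tokens.append(self._fake_forward(req))
        # Retire finished requests, freeing their (imaginary) KV blocks.
        self.running = [r for r in self.running
                        if len(r.output_tokens) < r.max_new_tokens]

    def _fake_forward(self, req):
        # Stand-in for a real forward pass plus sampling.
        return len(req.prompt_tokens) + len(req.output_tokens)

engine = ToyEngine()
engine.add_request(Request(prompt_tokens=[1, 2, 3], max_new_tokens=4))
while engine.running or engine.waiting:
    engine.step()
```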
Module 2: Speculative Decoding
Draft model → parallel verification — generating multiple tokens per target-model forward pass.
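The core loop in miniature, as a greedy sketch: draft_model and target_model are stand-in callables, and verification here calls the target once per position, whereas a real engine scores all draft positions in one batched forward pass.

```python
# Greedy draft-and-verify loop (toy): the draft proposes k tokens, the target
# accepts the longest matching prefix and supplies the first mismatching token.

def speculative_step(target_model, draft_model, tokens, k=4):
    # 1. Draft model proposes k tokens autoregressively (cheap).
    draft, ctx = [], list(tokens)
    for _ in range(k):
        t = draft_model(ctx)
        draft.append(t)
        ctx.append(t)
    # 2. Target verifies; a real engine does this in ONE batched forward pass.
    accepted, ctx = [], list(tokens)
    for t in draft:
        target_t = target_model(ctx)
        if target_t != t:
            accepted.append(target_t)  # reject the rest, keep the target's token
            break
        accepted.append(t)
        ctx.append(t)
    else:
        # All k accepted; the same pass also yields one bonus token.
        accepted.append(target_model(ctx))
    return tokens + accepted

# Example: a "target" that counts up and a "draft" that errs on multiples of 4.
target = lambda ctx: ctx[-1] + 1
draft = lambda ctx: ctx[-1] + 1 if ctx[-1] % 4 else ctx[-1] + 2
seq = [0]
for _ in range(3):
    seq = speculative_step(target, draft, seq)
print(seq)  # grows by several tokens per step despite occasional rejections
```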
Module 3: Prefill/Decode Disaggregation
Separate GPU pools for prefill and decode, chunked prefill, and why disaggregation helps.
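A sketch of the chunked-prefill half of the idea, with illustrative names (chunked_schedule is not a real vLLM or SGLang API): long prompts are split into bounded chunks so pending decode steps are never stalled behind an entire prefill.

```python
# Toy chunked-prefill schedule: each engine step mixes one bounded prompt
# chunk with the batch's pending decode tokens, so per-step latency stays
# bounded instead of spiking on long prompts.

def chunked_schedule(prompt_lens, chunk_size=512, decode_reqs=3):
    """Yield (prefill_chunk, decode_token_count) per engine step."""
    chunks = []
    for rid, n in enumerate(prompt_lens):
        for start in range(0, n, chunk_size):
            chunks.append((rid, start, min(start + chunk_size, n)))
    for chunk in chunks:
        yield chunk, decode_reqs

for prefill, decodes in chunked_schedule([1300, 200], chunk_size=512):
    rid, start, end = prefill
    print(f"step: prefill req{rid}[{start}:{end}] + {decodes} decode tokens")
```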
Module 4: Serving Metrics & SLOs
TTFT, TPOT, throughput, goodput, P99 — measuring and reasoning about inference quality.
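Once you log per-token timestamps, the core metrics are simple arithmetic. A minimal sketch with made-up data and a naive percentile:

```python
# TTFT, TPOT, and P99 from per-request token timestamps (seconds).
import statistics

def ttft(arrival, token_times):
    # Time to first token: first emission minus request arrival.
    return token_times[0] - arrival

def tpot(token_times):
    # Time per output token, averaged over the decode phase only.
    if len(token_times) < 2:
        return 0.0
    return (token_times[-1] - token_times[0]) / (len(token_times) - 1)

def p99(values):
    # Naive percentile: fine for a sketch, not for a metrics pipeline.
    ordered = sorted(values)
    return ordered[min(len(ordered) - 1, int(0.99 * len(ordered)))]

requests = [  # (arrival_time, [per-token emit times])
    (0.00, [0.12, 0.15, 0.19, 0.22]),
    (0.05, [0.40, 0.46, 0.52]),
]
ttfts = [ttft(a, t) for a, t in requests]
print("mean TTFT:", round(statistics.mean(ttfts), 3), "P99 TTFT:", p99(ttfts))
print("TPOTs:", [round(tpot(t), 3) for _, t in requests])
```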
Module 5: CUDA Graphs
Eliminating kernel launch overhead for decode — capturing and replaying GPU work.
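The capture-and-replay pattern, using PyTorch's torch.cuda.CUDAGraph API. A simplified sketch: it assumes an NVIDIA GPU and that tensor shapes stay fixed between replays.

```python
import torch

device = "cuda"  # requires an NVIDIA GPU
model = torch.nn.Linear(1024, 1024).to(device).eval()
static_in = torch.randn(8, 1024, device=device)  # fixed-shape input buffer

# Warm-up on a side stream so lazy initialization doesn't pollute the capture.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s), torch.no_grad():
    for _ in range(3):
        model(static_in)
torch.cuda.current_stream().wait_stream(s)

# Capture: the kernels of one decode step are recorded, not executed.
g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g), torch.no_grad():
    static_out = model(static_in)

# Decode steps: refill the static buffer in place, then replay the graph.
for _ in range(4):
    static_in.copy_(torch.randn(8, 1024, device=device))
    g.replay()  # static_out now holds this step's result
print(static_out.shape)
```

The payoff is one graph launch per decode step instead of one launch per kernel, which is where decode's per-token launch overhead goes.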
Module 6: Multi-LoRA Serving
Dynamic adapter loading per request — SGMV kernels, unified paging, rank-vs-KV tradeoff.
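What serving many adapters in one batch means computationally, as a NumPy sketch: every request gathers its own low-rank (A, B) pair. Shapes and names are illustrative; a real SGMV kernel fuses the gather and the matmuls on the GPU.

```python
# Per-request LoRA in one batch: y_i = W x_i + B[id_i] (A[id_i] x_i).
import numpy as np

d, r, n_adapters, batch = 16, 4, 3, 5
W = np.random.randn(d, d)              # shared base weight
A = np.random.randn(n_adapters, r, d)  # per-adapter down-projection, rank r
B = np.random.randn(n_adapters, d, r)  # per-adapter up-projection

x = np.random.randn(batch, d)          # one token per request
adapter_ids = np.array([0, 2, 1, 0, 2])  # request -> adapter mapping

base = x @ W.T
# Gathered low-rank update: each batch row uses its own adapter's A and B.
lora = np.einsum("bdr,brk,bk->bd", B[adapter_ids], A[adapter_ids], x)
y = base + lora
print(y.shape)  # (5, 16)
```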
Module 7: Prefix Caching & RadixAttention
Cross-request KV reuse — SGLang's radix tree vs vLLM's block-hash chain, eviction, production pricing.
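A toy version of the block-hash-chain side (block size and helpers here are illustrative): each block's key hashes its parent's key together with its tokens, so requests sharing a prefix resolve to the same chain of cached blocks.

```python
# Toy block-hash chain: shared prompt prefixes map to shared KV blocks.
import hashlib

BLOCK = 4  # tokens per KV block

def block_hashes(tokens):
    hashes, parent = [], b""
    # Only full blocks are hashable; the chain makes each key prefix-dependent.
    for i in range(0, len(tokens) - len(tokens) % BLOCK, BLOCK):
        chunk = str(tokens[i:i + BLOCK]).encode("utf8")
        parent = hashlib.sha256(parent + chunk).digest()
        hashes.append(parent)
    return hashes

cache = {}  # block hash -> KV block (ref-counted and evictable in reality)

def allocate(tokens):
    hits = 0
    for h in block_hashes(tokens):
        if h in cache:
            hits += 1            # cache hit: no prefill needed for this block
        else:
            cache[h] = object()  # stand-in for real KV storage
    return hits

system_prompt = list(range(12))
allocate(system_prompt + [100, 101, 102, 103])
print("blocks reused:", allocate(system_prompt + [200, 201, 202, 203]))  # 3
```

A real cache also ref-counts blocks and evicts cold ones, which is where this design and SGLang's radix tree diverge.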