# Learn AI Visually

> Interactive visual simulations teaching LLM internals — from tokenization to PagedAttention. Free, browser-based, no GPU required.

## About

Learn AI Visually is an interactive learning platform that teaches how large language models work through browser-based visual simulations. Each module pairs step-by-step explanations with interactive visualizations. Full prose content (FAQ answers + learning objectives for every module) is available at [llms-full.txt](https://learnaivisually.com/llms-full.txt).

## Tracks

### LLM Internals (9 modules, free)

- [Tokenization](https://learnaivisually.com/tracks/llm-internals/tokenization): Watch BPE build a vocabulary merge by merge on real text. Interactive simulator for byte pair encoding, subword splitting, and token-to-ID mapping.
- [Embeddings](https://learnaivisually.com/tracks/llm-internals/embeddings): Drag tokens through an embedding lookup and watch them cluster in vector space. Interactive cosine similarity and positional encoding visualizer.
- [Self-Attention](https://learnaivisually.com/tracks/llm-internals/attention): Interact with a live Q/K/V heatmap as attention scores compute in real time. See multi-head attention, causal masking, and softmax weighting step by step.
- [Transformer Block](https://learnaivisually.com/tracks/llm-internals/transformer-block): Step through a transformer block live — LayerNorm, attention, residual, FFN — with activations shown at each stage. Pre-norm vs post-norm side-by-side.
- [Text Generation](https://learnaivisually.com/tracks/llm-internals/generation): Tweak temperature, top-k, and top-p on a live sampling simulator and watch the probability distribution flatten or sharpen token by token (a minimal sampling sketch follows this list).
- [KV Cache](https://learnaivisually.com/tracks/llm-internals/kv-cache): Toggle KV caching on a running decoder and watch redundant recomputation collapse. Visual prefill vs decode, memory math (worked through below), and grouped-query attention.
- [Quantization](https://learnaivisually.com/tracks/llm-internals/quantization): Drag a precision slider from FP32 down to INT4 and watch weights quantize in real time. Visual GPTQ, AWQ, NF4, QLoRA, and GGUF naming conventions.
- [Batching](https://learnaivisually.com/tracks/llm-internals/batching): Run static and continuous batching side-by-side on a live GPU timeline. Watch padding waste, slot occupancy, and continuous admission in real time.
- [PagedAttention](https://learnaivisually.com/tracks/llm-internals/paged-attention): Watch vLLM page the KV cache into virtual blocks and reuse prefixes across requests live. Interactive block table, copy-on-write, and prefix sharing.
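To make the Text Generation knobs concrete, here is a minimal NumPy sketch of temperature, top-k, and top-p sampling. The logit values and four-token vocabulary are invented for illustration; this is not code from the module itself.

```python
import numpy as np

def sample_next_token(logits, temperature=1.0, top_k=None, top_p=None, rng=None):
    """Temperature + top-k + top-p (nucleus) sampling over one logit vector."""
    rng = rng or np.random.default_rng()
    # Temperature < 1 sharpens the distribution; > 1 flattens it.
    logits = np.asarray(logits, dtype=np.float64) / temperature
    probs = np.exp(logits - logits.max())  # numerically stable softmax
    probs /= probs.sum()
    if top_k is not None:
        # Drop everything outside the k most likely tokens.
        cutoff = np.sort(probs)[-top_k]
        probs = np.where(probs >= cutoff, probs, 0.0)
    if top_p is not None:
        # Keep the smallest set of tokens whose cumulative mass reaches top_p.
        order = np.argsort(probs)[::-1]
        cum = np.cumsum(probs[order])
        keep = order[: np.searchsorted(cum, top_p) + 1]
        masked = np.zeros_like(probs)
        masked[keep] = probs[keep]
        probs = masked
    probs /= probs.sum()  # renormalize the surviving tokens
    return rng.choice(len(probs), p=probs)

toy_logits = [2.0, 1.0, 0.5, -1.0]  # hypothetical 4-token vocabulary
print(sample_next_token(toy_logits, temperature=0.7, top_k=3, top_p=0.9))
```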
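The KV Cache module's memory math is a single product. This back-of-the-envelope sketch uses an assumed 7B-class model shape (32 layers, 32 KV heads, head dimension 128, fp16), not numbers taken from the module.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_value=2):
    # K and V each store n_layers * n_kv_heads * head_dim values per token;
    # fp16/bf16 means 2 bytes per value. Grouped-query attention shrinks
    # n_kv_heads, which is exactly how it cuts this total.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value

per_sequence = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128, seq_len=4096)
print(per_sequence / 2**30, "GiB")  # -> 2.0 GiB for one 4096-token sequence
```

At roughly 0.5 MiB of cache per token under these assumptions, a handful of long sequences exhausts HBM — the pressure the Batching and PagedAttention modules pick up.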
### GPU & CUDA (9 modules, free)

- [Why GPUs?](https://learnaivisually.com/tracks/gpu-cuda/why-gpus): Race a CPU and GPU side-by-side on the same matmul. Visual latency-vs-throughput architecture, SIMD parallelism, and the CUDA software stack.
- [Execution Model](https://learnaivisually.com/tracks/gpu-cuda/execution-model): Watch 32-thread warps march in lockstep across an SM, with divergence and block scheduling visualized. Interactive thread hierarchy and SIMT execution.
- [Memory Hierarchy](https://learnaivisually.com/tracks/gpu-cuda/memory-hierarchy): Move data between registers, shared memory, L2, and HBM on a live bandwidth diagram. Visual latency, capacity, and NVLink vs PCIe tradeoffs.
- [Roofline Model](https://learnaivisually.com/tracks/gpu-cuda/roofline-model): Plot any ML op on an interactive roofline and see compute vs memory limits. Watch larger batch sizes slide ops from memory-bound to compute-bound (a worked example follows this list).
- [Memory Access Patterns](https://learnaivisually.com/tracks/gpu-cuda/memory-access-patterns): Drag thread access patterns and watch 128-byte transactions coalesce or waste bandwidth. Interactive bank conflicts, strided access, and the padding trick.
- [Tiling & Matrix Multiply](https://learnaivisually.com/tracks/gpu-cuda/tiling-matmul): Run naive vs tiled matmul side-by-side on a live shared-memory diagram. Watch data reuse, tree reduction, and the roofline shift toward compute-bound.
- [Tensor Cores](https://learnaivisually.com/tracks/gpu-cuda/tensor-cores): Fire a Tensor Core MMA and watch 4×4 tiles multiply-accumulate in one clock. Visual FP32/TF32/BF16/FP16/FP8 throughput and mixed-precision loss scaling.
- [Operator Fusion](https://learnaivisually.com/tracks/gpu-cuda/operator-fusion): Watch FlashAttention keep Q/K/V tiles in SRAM while naive attention trips to HBM. Interactive kernel fusion, online softmax, and IO complexity.
- [Triton & torch.compile](https://learnaivisually.com/tracks/gpu-cuda/triton-torch-compile): Compare a CUDA kernel and its Triton rewrite side-by-side. Visual torch.compile pipeline, block-level abstractions, and the CUDA vs Triton tradeoff.
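The roofline itself is one `min()`: attainable throughput is the lesser of the compute roof and operational intensity times memory bandwidth. The peak numbers below are assumed, roughly A100-class values, used only to make the arithmetic concrete.

```python
def attainable_tflops(intensity_flop_per_byte, peak_tflops=312.0, peak_bw_tbs=2.0):
    """Roofline model: min(compute roof, intensity * memory bandwidth).
    Peaks are assumed A100-class figures (dense BF16 tensor-core TFLOP/s,
    HBM TB/s), not measured numbers."""
    return min(peak_tflops, intensity_flop_per_byte * peak_bw_tbs)

# Decode-time matrix-vector product, fp16 weights: ~2*N*N FLOPs over ~2*N*N
# weight bytes -> intensity ~1 FLOP/byte, deeply memory-bound.
print(attainable_tflops(1.0))    # -> 2.0 TFLOP/s, far below the 312 roof

# Batching amortizes each weight read across B rows, raising intensity
# roughly linearly in B until the ridge point at 312 / 2 = 156 FLOP/byte.
print(attainable_tflops(200.0))  # -> 312.0 TFLOP/s, compute-bound
```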
### LLM Serving (7 modules, free)

- [Inference Engine Internals](https://learnaivisually.com/tracks/llm-serving/inference-engine): Watch vLLM's scheduler admit, page, and preempt requests on a live GPU timeline. Interactive continuous batching, KV blocks, and prefill vs decode.
- [Speculative Decoding](https://learnaivisually.com/tracks/llm-serving/speculative-decoding): Run a draft model proposing tokens and a target model verifying them in parallel, live. Visual rejection sampling, acceptance rates, and 2-3x latency wins (the arithmetic is sketched after this list).
- [Prefill/Decode Disaggregation](https://learnaivisually.com/tracks/llm-serving/prefill-decode-disaggregation): See prefill and decode fight on one GPU, then split across pools on a live timeline. Interactive chunked prefill, KV transfer, and 2-7x throughput gains.
- [Serving Metrics & SLOs](https://learnaivisually.com/tracks/llm-serving/serving-metrics): Watch TTFT, TPOT, and P99 move live on a saturation curve as you raise load. Interactive goodput vs throughput, SLO targets, and capacity planning.
- [CUDA Graphs](https://learnaivisually.com/tracks/llm-serving/cuda-graphs): See 300+ per-token kernel launches collapse into one graph replay on a live decode timeline. Interactive capture, padding buckets, and production tradeoffs.
- [Multi-LoRA Serving](https://learnaivisually.com/tracks/llm-serving/multi-lora): Batch requests using different LoRA adapters and watch SGMV group them live. Visual 3-tier paging (GPU/CPU/disk) and the rank-vs-KV cache tradeoff.
- [Prefix Caching](https://learnaivisually.com/tracks/llm-serving/prefix-caching): Fire repeat system prompts and watch the KV prefix hit on a live radix tree. Interactive SGLang vs vLLM APC, eviction safety, and cache hit pricing.
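Where the 2-3x comes from: under the standard analysis of speculative sampling, if each draft token is accepted independently with probability α and the draft proposes γ tokens per verify step, the expected tokens emitted per target-model pass is (1 − α^(γ+1)) / (1 − α). The α and γ values below are illustrative, not figures from the module.

```python
def expected_tokens_per_step(alpha: float, gamma: int) -> float:
    """Expected tokens emitted per target-model verify step, assuming
    i.i.d. acceptance with probability alpha over a gamma-token draft
    (the target itself contributes one token on rejection)."""
    return (1 - alpha ** (gamma + 1)) / (1 - alpha)

# Illustrative: 80% acceptance, 4-token drafts -> ~3.36 tokens per pass,
# which (ignoring the draft model's own cost) is the rough source of the
# 2-3x decode-latency wins the module demonstrates.
print(expected_tokens_per_step(alpha=0.8, gamma=4))  # ~3.36
```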
### AI Agents (9 modules, free)

- [Agent Loop & State](https://learnaivisually.com/tracks/ai-agents/agent-loop-state): Visual introduction to the agent loop: gather → act → observe → repeat (a minimal skeleton follows this list). See state mutate per tick, the LLM-OS framing, and the workflow-vs-agent decision rule.
- [Tool Use](https://learnaivisually.com/tracks/ai-agents/tool-use): How agents use tools: schemas, the agent-computer interface, structured outputs, MCP, and Skills. Visual examples of good vs bad tool design.
- [Workflow Patterns](https://learnaivisually.com/tracks/ai-agents/workflow-patterns): The five named workflow patterns: chaining, routing, parallelization, orchestrator-workers, and evaluator-optimizer. When NOT to use an agent.
- [Retrieval & RAG](https://learnaivisually.com/tracks/ai-agents/retrieval-rag): Retrieval-augmented generation explained visually. Embeddings as coordinates, chunking strategies, the recall-vs-speed tradeoff, and named RAG failure modes.
- [Context Engineering](https://learnaivisually.com/tracks/ai-agents/context-engineering): Context as the agent's most expensive resource. The four context failure modes (poisoning, distraction, confusion, conflict) and the four fixes.
- [Planning & Reflection](https://learnaivisually.com/tracks/ai-agents/planning-reflection): When agents should plan, retry, pause, or stop. Reasoning budget, ReAct, Reflexion, and termination logic — each tied to a 'when' decision.
- [Evals & Diagnostics](https://learnaivisually.com/tracks/ai-agents/evals-diagnostics): Error analysis first, evals second. Compounding errors, the transition failure matrix, golden cases, and the four named eval failure modes.
- [Security & the Lethal Trifecta](https://learnaivisually.com/tracks/ai-agents/security-trifecta): The dominant 2026 safety frame for agents. Private data + untrusted content + exfiltration vector = breach. Structural defenses, capability scoping, and exfiltration via tool calls.
- [Capstone — Three Designs](https://learnaivisually.com/tracks/ai-agents/capstone-three-designs): RAG-only vs deterministic workflow vs autonomous agent on the same customer-order task. Trace comparison, failure modes, trifecta exposure, and the decision rule.
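The gather → act → observe → repeat cycle fits in a few lines. Every name in this skeleton (`call_llm`, the `tools` registry, the message shape) is a hypothetical stand-in, not an API the modules prescribe.

```python
def run_agent(task, call_llm, tools, max_ticks=10):
    """Minimal agent loop. `call_llm(state) -> action` and `tools`
    (name -> callable) are hypothetical stand-ins for a real model
    client and tool registry."""
    state = [{"role": "user", "content": task}]            # gather
    for _ in range(max_ticks):
        action = call_llm(state)                           # act
        state.append({"role": "assistant", "content": action})
        if action["type"] == "final":                      # termination logic
            return action["content"]
        result = tools[action["tool"]](**action["args"])   # observe
        state.append({"role": "tool", "content": result})  # state mutates per tick
    return "stopped: tick budget exhausted"                # hard stop beats a runaway loop
```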
### Agent Engineering (9 modules, free)

- [Production Harness Architecture](https://learnaivisually.com/tracks/agent-engineering/harness-architecture): How an agent harness survives process kills, deploys, and network failures. Idempotency, checkpoints, retry policy, and durable execution platforms compared.
- [Observability for Agents](https://learnaivisually.com/tracks/agent-engineering/observability): Treat each agent tick as a span. What to log, what to alert on, and how to replay a failing trace. Vanity metrics vs metrics that matter.
- [Layered Guardrails](https://learnaivisually.com/tracks/agent-engineering/guardrails): Input filters, output filters, policy enforcement, and the fail-safe-vs-fail-open decision. Defense-in-depth that does not collapse to a single LLM judge.
- [Cost & Latency Engineering](https://learnaivisually.com/tracks/agent-engineering/cost-latency): Where the tokens and seconds actually go in an agent. Prompt and result caching, parallelizing tool calls, and batching at the agent layer.
- [Production Evals](https://learnaivisually.com/tracks/agent-engineering/production-evals): Online vs offline evals, shadow mode, A/B harness, drift detection, and eval-driven rollout. How to ship a prompt change without breaking production.
- [Deployment & Rollout](https://learnaivisually.com/tracks/agent-engineering/deployment-rollout): Treating prompts and tool schemas as code. Canary, rolling release, version pinning, and the rollback discipline that keeps a fleet healthy.
- [Incident Handling](https://learnaivisually.com/tracks/agent-engineering/incident-handling): What to do in the first 15 minutes of an agent incident. Trace replay, root-cause discipline, postmortem pattern, and the drills that build the muscle.
- [Agent Teams](https://learnaivisually.com/tracks/agent-engineering/agent-teams): When teams beat a single agent. Supervisor/worker, parallel agents with voting, handoffs, and the coordination tax you pay for the win.
- [Reliability Operations](https://learnaivisually.com/tracks/agent-engineering/reliability-ops): SLOs, error budgets, on-call rotations, and runbooks that survive contact with production. The bridge from Foundations to running an agent fleet.

## AI Explained

Trend-driven concept pages that use AI news as a hook to teach the underlying concepts with interactive simulations.

### Topics

- [LLM](https://learnaivisually.com/ai-explained?topic=llm): Trending LLM concepts — model releases, inference internals, training methods.
- [GPU](https://learnaivisually.com/ai-explained?topic=gpu): Trending GPU and CUDA topics — kernels, memory, parallelism.
- [Agent](https://learnaivisually.com/ai-explained?topic=agent): Trending AI agent topics — tool use, planning, evaluation.
- [AI Companies](https://learnaivisually.com/ai-explained?topic=ai-companies): Trending AI company news — releases, fundraising, joint ventures.

### Articles

- [/ai-explained](https://learnaivisually.com/ai-explained): Index of all AI Explained pages.
- [vLLM v0.20 — FlashAttention 4 packing](https://learnaivisually.com/ai-explained/vllm-v0-20-fa4-packing): vLLM v0.20 ships FlashAttention 4 with packed variable-length attention — one fused kernel handles a whole batch of mixed-length sequences with no padding waste.
- [vLLM v0.20 — TurboQuant 2-bit KV cache](https://learnaivisually.com/ai-explained/vllm-v0-20-turboquant-kv): vLLM v0.20 ships TurboQuant — a 2-bit KV cache with per-block asymmetric scales — cutting KV cache memory roughly 4× without crushing outlier values.
- [DeepSeek V4 — long-context cost](https://learnaivisually.com/ai-explained/deepseek-v4-long-context-cost): V4-Pro and V4-Flash drop both per-token FLOPs and KV cache to ~7-27% of V3.2 at 1M context — the paired drop lets the same cluster serve ~10-14× more concurrent users.
- [RLVR — reinforcement learning with verifiable rewards](https://learnaivisually.com/ai-explained/copd-rlvr): RLVR is post-training where a deterministic verifier — unit tests, equality checks, proof assistant — replaces the learned reward model in the CoPD loop.
- [CoPD — co-evolving policy distillation](https://learnaivisually.com/ai-explained/copd-co-evolving-policy-distillation): CoPD trains N specialist LLMs in parallel as mutual on-policy teachers, fixing inter-capability divergence in mixed RLVR and behavioural-gap loss in frozen-OPD.
- [GLM-5V-Turbo — native multimodal vs vision-bolted designs](https://learnaivisually.com/ai-explained/glm-5v-native-multimodal): GLM-5V-Turbo trains text + vision + tool data jointly from step 1, vs the LLaVA-style default that bolts a frozen ViT onto a text-only LLM. Why it matters for agentic tool use.
- [IBM Granite 4.1 — 8B dense vs 32B MoE](https://learnaivisually.com/ai-explained/ibm-granite-4-1-dense-vs-moe): IBM Granite 4.1 8B dense matches the prior Granite 4.0 32B-A9B MoE on tool calling and instruction following — same per-token bandwidth, one quarter the HBM footprint.
- [Nemotron 3 Nano Omni — 30B-A3B multimodal MoE](https://learnaivisually.com/ai-explained/nvidia-nemotron-3-multimodal-moe): NVIDIA Nemotron 3 Nano Omni routes text, image, audio, and video through one shared expert pool — 30B in HBM, ~3B active per token, ~9× higher decode throughput.
- [AsyncFC — symbolic futures in the decode stream](https://learnaivisually.com/ai-explained/asyncfc-symbolic-futures): AsyncFC inserts a typed symbolic future placeholder when an LLM emits a tool call, so the decoder keeps generating while tool execution runs in parallel.
- [MCP SEP-2663 — async task handles](https://learnaivisually.com/ai-explained/mcp-sep-2663-async-task-handles): MCP SEP-2663 makes tools/call polymorphic: a server returns a Task handle that the client drives with tasks/get, tasks/update, and tasks/cancel — no blocking.
- [PPOW — window-level RL for speculative drafters](https://learnaivisually.com/ai-explained/ppow-window-level-rl-drafters): PPOW trains speculative drafters with window-level RL: three rewards adapt windows by KL, reaching 6.29-6.52 acceptance length with reported 3.4-4.4x speedups.
- [SOP — hardware-aware per-layer PTQ at FP6](https://learnaivisually.com/ai-explained/sop-ptq-fp6-beats-fp8): SOP searches per-layer codebooks with activation weighting; at FP6 it beats fixed FP8 reconstruction across six open model families using 1.5 fewer bits.
- [Grep vs vector retrieval for agentic search](https://learnaivisually.com/ai-explained/grep-vs-vector-agentic-retrieval): Empirical study on 116 LongMemEval questions: literal grep generally beats vector retrieval inside agents; harness design dominates the algorithm choice.
- [FutureSim — harness-level agent eval vs single-shot QA](https://learnaivisually.com/ai-explained/futuresim-harness-level-eval): Max Planck's FutureSim replays 3 months of news article-by-article and grades agents end-to-end: the best agent scores ~25%; many fall below the no-prediction baseline.
- [CDD — Context-Driven Decomposition for RAG knowledge conflict](https://learnaivisually.com/ai-explained/cdd-context-driven-decomposition): Standard RAG hits 15% under misconception injection; Context-Driven Decomposition extracts retrieval + parametric claims, resolves the conflict, and reaches 71.3%.

## Upcoming Modules

- Distributed Training (planned): data / tensor / pipeline parallelism, ZeRO / FSDP, NCCL collectives, NVLink / InfiniBand, gradient checkpointing.