Is Learn AI Visually free?

Yes. Every track and module is free, and no account is required to browse.

Do I need a GPU to use it?

No. Everything runs in the browser as an interactive simulation — no local GPU, no install.

Who is it for?

Engineers, students, and researchers who want to understand how LLMs and GPUs actually work by seeing them execute rather than reading prose.

What tracks are available today?

Five tracks are live: GPU & CUDA (9 modules), LLM Internals (9 modules), LLM Serving (7 modules), AI Agents (9 modules), and Agent Engineering (9 modules). Distributed Training is upcoming.

Understand AI systems by seeing them work

Interactive visual simulations from GPU architecture to running an agent fleet in production. No GPU required. Free.

Start Learning →

Live Preview: Self-Attention

The cat sat on the mat

       The    cat    sat    on     the    mat
The    0.35   0.07   0.05   0.10   0.35   0.08
cat    0.06   0.26   0.47   0.06   0.07   0.08
sat    0.05   0.52   0.21   0.06   0.05   0.11
on     0.07   0.05   0.08   0.20   0.08   0.52
the    0.33   0.07   0.05   0.10   0.38   0.07
mat    0.04   0.07   0.10   0.42   0.05   0.32

Hover over tokens to explore attention patterns
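
Each row of weights in the preview sums to 1, which is what a softmax over attention scores produces. The following is a minimal sketch of how such a row-normalized matrix can be computed, assuming standard scaled dot-product attention. The 2-dimensional query and key vectors are made-up illustrative values, and the page does not state how its own preview numbers were generated.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(queries, keys):
    """Scaled dot-product attention weights: softmax(q . k / sqrt(d)) per query row."""
    d = len(keys[0])
    weights = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights.append(softmax(scores))
    return weights

# Hypothetical vectors for three tokens (illustrative only, not from a real model).
tokens = ["The", "cat", "sat"]
queries = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

weights = attention_weights(queries, keys)
for token, row in zip(tokens, weights):
    print(token, [round(w, 2) for w in row])
```

Each printed row is one token's attention distribution over all tokens, read the same way as a row in the preview.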

The Full AI Stack

5 tracks that take you from GPU hardware to running an agent fleet in production. Start anywhere — each track stands on its own.

GPU

Track 1 · 9 modules

GPU & CUDA

How GPUs execute parallel workloads — from threads and warps to tensor cores and FlashAttention.

LLM

Track 2 · 9 modules

LLM Internals

From tokenization to PagedAttention — how large language models process text and generate output.

API

Track 3 · 7 modules

LLM Serving

How vLLM, SGLang, and TensorRT-LLM actually serve LLMs — scheduler, memory, and serving-engine internals.

AGT

Track 4 · 9 modules

AI Agents

Foundations of AI agents — the loop, tools, workflows, retrieval, context engineering, planning, evals, and security — through visual interactive simulations.

ENG

Track 5 · 9 modules

Agent Engineering

Production agent engineering — durable harnesses, observability, layered guardrails, deployment, incident response, and running an agent fleet against an SLO.

AI Explained

Plain-language takes on trending AI concepts, with live simulations.

View all →

LLM · 2026-05-16

SOP paper — Hardware-aware per-layer PTQ at FP6 — What does it mean?

SOP picks a different codebook per layer using activation weights — and at FP6, that beats vanilla FP8 reconstruction error using 1.5 fewer bits per weight.

LLM · 2026-05-16

PPOW paper — window-level RL for speculative drafters — What does it mean?

PPOW trains speculative-decoding drafters with WINDOW-level RL — three rewards adapt window size to KL divergence, lifting acceptance to 6.29–6.52 and end-to-end speedup to 3.4–4.4×.

Agent · 2026-05-16

MCP SEP-2663 lands Tasks extension — async task handles for long-running tool calls — What does it mean?

SEP-2663 lets an MCP server return a Task handle from tools/call; the client then drives it with tasks/get, tasks/update, and tasks/cancel — no blocked connections.

GPU & CUDA

9 interactive modules, from the GPU execution model to Triton & torch.compile. All free.

View all →

Module 1

Why GPUs?

CPU vs GPU design philosophy, throughput vs latency, and the CUDA software stack.

Module 2

Execution Model

Threads, warps, blocks, grids, and SMs — how GPUs schedule parallel work.

Module 3

Memory Hierarchy

Registers, shared memory, L2, HBM, PCIe, and NVLink — where data lives.

Module 4

Roofline Model

Compute-bound vs memory-bound — the universal performance mental model.

Module 5

Memory Access Patterns

Coalesced access, bank conflicts, and why memory layout matters.

Module 6

Tiling & Matrix Multiply

The fundamental GPU optimization — data reuse via shared memory tiling.

Module 7

Tensor Cores & Mixed Precision

Hardware matrix multiply, FP16/BF16/FP8/INT8, and why dims must align.

Module 8

Operator Fusion & FlashAttention

Fused kernels, IO-aware design, and why FlashAttention is fast.

Module 9

Triton & torch.compile

Python-level GPU programming and the abstraction stack.

LLM Internals

9 interactive modules, from tokenization to PagedAttention. All free.

View all →

Module 1

Tokenization

How text becomes tokens — BPE, subword splitting, byte-level encoding.

Module 2

Embeddings

Tokens to vectors — embedding lookup, positional encoding, cosine similarity.

Module 3

Self-Attention

Q, K, V — the attention mechanism that powers transformers.

Module 4

Transformer Block

Attention + FFN + LayerNorm + residual — one block at a time.

Module 5

Text Generation

Autoregressive decoding — temperature, top-k, top-p sampling.

Module 6

KV Cache

Why KV caching is essential — prefill vs decode, GQA.

Module 7

Quantization

Shrink LLMs — FP32/FP16/INT8/INT4, GPTQ, AWQ, QLoRA, GGUF.

Module 8

Batching

Static vs continuous batching, and the memory-throughput tradeoff.

Module 9

Paged Attention

How vLLM solves KV cache fragmentation — block tables, prefix sharing.

LLM Serving

7 interactive modules, from inference-engine internals to prefix caching. All free.

View all →

Module 1

Inference Engine

vLLM scheduler, memory manager, model executor — how a serving engine processes requests end to end.

Module 2

Speculative Decoding

Draft model → parallel verification — generating multiple tokens per forward pass.

Module 3

Prefill/Decode Disaggregation

Separate GPU pools for prefill and decode, chunked prefill, and why disaggregation helps.

Module 4

Serving Metrics & SLOs

TTFT, TPOT, throughput, goodput, P99 — measuring and reasoning about inference quality.

Module 5

CUDA Graphs

Eliminating kernel launch overhead for decode — capturing and replaying GPU work.

Module 6

Multi-LoRA Serving

Dynamic adapter loading per request — SGMV kernels, unified paging, rank-vs-KV tradeoff.

Module 7

Prefix Caching & RadixAttention

Cross-request KV reuse — SGLang's radix tree vs vLLM's block-hash chain, eviction, production pricing.

AI Agents

9 interactive modules. Free, browser-based.

View all →

Module 1

The Agent Loop & State

Loop + state + LLM-OS + workflows-vs-agents decision + harness anatomy.

Module 2

Tool Use & Function Calling

Schemas, ACI + anti-patterns, structured outputs, MCP two ways, Skills, attack-surface preview.

Module 3

Workflow Patterns

When NOT to use an agent + the five named patterns + subagent topology.

Module 4

Retrieval & RAG

Embeddings as coordinates, retrieve-then-generate, chunking, ANN trade, RAG failure modes.

Module 5

Context Engineering

Four context failure modes, four fixes, tool-loading strategy, subagents as context isolation.

Module 6

Planning & Reflection

Reasoning budget, ReAct + think tool, Reflexion + verifiers, when to retry, when to stop.

Module 7

Evals & Diagnostics

Compounding errors, error analysis first, golden cases, pass^k, the four eval failure modes.

Module 8

Security & the Lethal Trifecta

Private data + untrusted content + exfiltration vector. Data-flow graphs, structural defenses, capability scoping.

Module 9

Capstone — Three Designs

RAG-only vs workflow vs autonomous agent on the same customer-order task. Compared end to end.

Agent Engineering

9 interactive modules. Production-ready agent engineering that picks up where AI Agents Foundations ends. Free, browser-based.

View all →

Module 1

Production Harness Architecture

Idempotency, checkpoints, retry policy, durable execution platforms — surviving crashes, deploys, and network failures.

Module 2

Observability for Agents

Span-per-tick traces, what to log, replay, golden metrics vs vanity, alerting on stochastic systems.

Module 3

Layered Guardrails

Defense-in-depth: input filters, output filters, policy enforcement, fail-safe vs fail-open.

Module 4

Cost & Latency Engineering

Where the tokens go, prompt caching, result caching, parallel tool calls, batching at the agent layer.

Module 5

Production Evals & Shadow Mode

Online vs offline evals, shadow traffic, A/B harness, drift detection, eval-driven rollout.

Module 6

Deployment & Rollout

Prompts as code, canary rollouts, rolling releases, version pinning, rollback discipline.

Module 7

Incident Handling

The first 15 minutes, trace replay in anger, root-cause discipline, postmortems, drills.

Module 8

Agent Teams

Multi-agent orchestration — when teams beat a single agent, supervisor/worker, voting, handoffs, coordination cost.

Module 9

Capstone — Reliability Operations

SLOs, error budgets, on-call rotations, runbooks. From Foundations to running an agent fleet.