AI Explained

Plain explanations of trending AI concepts, with live visualizations.

LLM 2026-05-17

HuggingFace blog — Async continuous batching — What does it mean?

Sync continuous batching stalls the GPU while Python composes the next batch — overlapping CPU prep with GPU compute lifts GPU-active time from 76% to 99% in HuggingFace's report (~22% speedup).
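The effect of overlapping can be seen in a toy timeline model. This is only a sketch under stated assumptions: the per-step costs (24 ms of CPU prep, 76 ms of GPU compute) are hypothetical numbers chosen so the synchronous case lands on 76% GPU-active time, and `gpu_active_fraction` is an illustrative helper, not anything from the HuggingFace code.

```python
def gpu_active_fraction(n_steps, cpu_prep_ms, gpu_step_ms, overlap):
    """Fraction of wall-clock time the GPU spends computing (toy model)."""
    gpu_busy = n_steps * gpu_step_ms
    if overlap:
        # Async: only the first batch is prepared while the GPU idles.
        # Afterwards, prep for step i+1 runs while the GPU executes step i,
        # so each later step is paced by the slower of the two stages.
        wall = (cpu_prep_ms
                + (n_steps - 1) * max(cpu_prep_ms, gpu_step_ms)
                + gpu_step_ms)
    else:
        # Sync: the GPU waits for every batch to be composed.
        wall = n_steps * (cpu_prep_ms + gpu_step_ms)
    return gpu_busy / wall


# Hypothetical costs: 24 ms CPU prep and 76 ms GPU compute per step.
sync_frac = gpu_active_fraction(100, 24, 76, overlap=False)
async_frac = gpu_active_fraction(100, 24, 76, overlap=True)
print(f"sync:  {sync_frac:.1%}")
print(f"async: {async_frac:.1%}")
```

In this model the asynchronous schedule only pays the prep cost once, so GPU-active time approaches 100% as long as prep is no slower than a GPU step.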

Agent 2026-05-17

FutureSim benchmark — Harness-level agent eval vs single-shot QA — What does it mean?

FutureSim replays 3 months of real-world news article by article and asks agents to forecast events; the best frontier agent reaches only 25%, and many post a Brier skill score worse than not predicting at all.
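For readers unfamiliar with the metric, here is a minimal sketch of a Brier skill score for binary events. It assumes the "not predicting" baseline is a constant 0.5 forecast; the benchmark's actual reference forecast may differ, and the sample forecasts below are made up for illustration.

```python
def brier_score(probs, outcomes):
    """Mean squared error between forecast probabilities and 0/1 outcomes."""
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)


def brier_skill_score(probs, outcomes, reference_prob=0.5):
    """1 - BS / BS_ref; negative means worse than the reference forecast.

    reference_prob=0.5 is an assumed stand-in for "not predicting at all".
    """
    bs = brier_score(probs, outcomes)
    bs_ref = brier_score([reference_prob] * len(outcomes), outcomes)
    return 1.0 - bs / bs_ref


outcomes = [1, 0, 0, 1]                # what actually happened
overconfident = [0.1, 0.9, 0.2, 0.8]   # confidently wrong on two events
calibrated = [0.8, 0.2, 0.3, 0.7]      # leans the right way on all four

print(brier_skill_score(overconfident, outcomes))  # negative: worse than baseline
print(brier_skill_score(calibrated, outcomes))     # positive: beats baseline
```

A score of 0 matches the reference and 1 is a perfect forecast, so a confident agent that is often wrong can end up below zero.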

LLM 2026-05-17

Compute Where It Counts — Per-token compute controller — What does it mean?

An ICML'26 paper bolts a lightweight policy network onto a frozen LLM that picks a per-token efficiency action — attention sparsity, MLP pruning, or activation bit-width — so easy tokens get cheap actions and hard tokens get full compute.

Agent 2026-05-17

CDD paper — Context-Driven Decomposition for RAG knowledge conflict — What does it mean?

When retrieved context disagrees with the model's parametric knowledge, standard RAG drops to ~15% accuracy under misconception injection; CDD extracts each claim, runs an explicit conflict-resolution sub-prompt, and reaches 71.3% on temporal-shift cases.

LLM 2026-05-16

SOP paper — Hardware-aware per-layer PTQ at FP6 — What does it mean?

SOP picks a different codebook per layer using activation weights — and at FP6, that beats vanilla FP8 reconstruction error using 1.5 fewer bits per weight.

LLM 2026-05-16

PPOW paper — window-level RL for speculative drafters — What does it mean?

PPOW trains speculative-decoding drafters with window-level RL: three rewards adapt window size to KL divergence, lifting acceptance length to 6.29–6.52 and end-to-end speedup to 3.4–4.4×.

Agent 2026-05-16

MCP SEP-2663 lands Tasks extension — async task handles for long-running tool calls — What does it mean?

SEP-2663 lets an MCP server return a Task handle from tools/call; the client then drives it with tasks/get, tasks/update, and tasks/cancel — no blocked connections.

Agent 2026-05-16

Is Grep All You Need? — Grep vs vector retrieval for agentic search — What does it mean?

An empirical study on 116 LongMemEval questions finds literal grep generally beats vector retrieval inside agent harnesses — and that harness design and tool-calling style dominate the retrieval algorithm choice.

Agent 2026-05-16

AsyncFC paper — Symbolic futures in the decode stream — What does it mean?

AsyncFC inserts a typed placeholder when the model emits a tool call, so decoding keeps flowing while the tool runs — no retraining required.

LLM 2026-05-06

vLLM v0.20 — TurboQuant 2-bit KV cache — What does it mean?

vLLM v0.20 ships TurboQuant — a 2-bit KV cache with per-block scales, cutting KV memory ~4× without crushing outliers.
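To illustrate why per-block scales help with outliers, here is a generic min/max block quantizer. It is not TurboQuant's actual algorithm: the asymmetric scale-plus-offset scheme, the block size of 4, and all function names are assumptions made for this sketch.

```python
def quantize_block(values, bits=2):
    """Map one block to integer codes in [0, 2**bits - 1] with its own scale."""
    levels = (1 << bits) - 1             # 3 steps between 4 codes at 2 bits
    lo, hi = min(values), max(values)
    scale = (hi - lo) / levels or 1.0    # avoid a zero scale on constant blocks
    codes = [round((v - lo) / scale) for v in values]
    return codes, scale, lo


def dequantize_block(codes, scale, lo):
    """Reconstruct approximate values from codes and per-block metadata."""
    return [lo + c * scale for c in codes]


def quantize(values, block_size=4, bits=2):
    """Quantize a flat list block by block; each block keeps its own scale."""
    return [quantize_block(values[i:i + block_size], bits)
            for i in range(0, len(values), block_size)]


# The second block contains an outlier (8.0); the first block does not.
kv = [0.10, 0.20, 0.15, 0.12, 8.0, 0.10, 0.20, 0.15]
for codes, scale, lo in quantize(kv):
    print(codes, [round(x, 3) for x in dequantize_block(codes, scale, lo)])
```

Because each block carries its own scale, the outlier only coarsens the grid of the block it lives in; the first block is still reconstructed to within half of its own, much smaller, step.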

LLM 2026-05-06

vLLM v0.20 — FlashAttention 4 packing — What does it mean?

vLLM v0.20 ships FlashAttention 4 with packed variable-length attention — one fused kernel for a whole batch, no padding waste.
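The packing idea can be sketched as follows: concatenate all sequences into one flat buffer and record cumulative offsets so a kernel can find each sequence's boundaries. The name `cu_seqlens` is borrowed from existing variable-length attention interfaces; whether vLLM v0.20 uses exactly this layout is an assumption, and both helpers are hypothetical.

```python
from itertools import accumulate


def pack_sequences(seqs):
    """Flatten variable-length sequences and return cumulative offsets.

    Sequence i occupies tokens[cu_seqlens[i]:cu_seqlens[i + 1]].
    """
    tokens = [t for s in seqs for t in s]
    cu_seqlens = [0] + list(accumulate(len(s) for s in seqs))
    return tokens, cu_seqlens


def padding_waste(seqs):
    """Fraction of slots that would be padding in a rectangular batch."""
    longest = max(len(s) for s in seqs)
    padded_slots = longest * len(seqs)
    real_tokens = sum(len(s) for s in seqs)
    return 1 - real_tokens / padded_slots


batch = [[1, 2, 3], [4], [5, 6, 7, 8, 9, 10]]   # lengths 3, 1, 6
tokens, cu_seqlens = pack_sequences(batch)
print(cu_seqlens)                    # [0, 3, 4, 10]
print(f"{padding_waste(batch):.0%} of a padded batch would be wasted")
```

With the offsets in hand, a single kernel launch can process the 10 real tokens instead of the 18 slots a padded 3×6 batch would occupy.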

LLM 2026-05-06

NVIDIA Nemotron 3 Nano Omni — 30B-A3B multimodal MoE — What does it mean?

Nemotron 3 Nano Omni routes text, image, audio, and video through one shared expert pool: 30B parameters in HBM, ~3B active per token, ~9× throughput.