AI Explained
Plain explanations of trending AI concepts, with live visualizations.
SOP paper — Hardware-aware per-layer PTQ at FP6 — What does it mean?
SOP picks a different codebook per layer by minimizing activation-weighted reconstruction error — and at FP6, that beats vanilla FP8 reconstruction error with 2 fewer bits per weight.
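SOP's actual codebook search isn't public here; below is a minimal numpy sketch of the core idea under stated assumptions — score each candidate codebook by activation-weighted reconstruction error ‖(W − W_q)X‖² on calibration activations, and keep the winner per layer. The two candidate codebooks are illustrative, not from the paper.

```python
import numpy as np

def act_weighted_error(W, X, W_q):
    # Activation-weighted reconstruction error: ||(W - W_q) X||_F^2.
    return np.linalg.norm((W - W_q) @ X) ** 2

def quantize_to_codebook(W, codebook):
    # Round every weight to its nearest codebook entry.
    idx = np.abs(W[..., None] - codebook).argmin(axis=-1)
    return codebook[idx]

def pick_codebook(W, X, candidates):
    # Per-layer selection: keep whichever codebook minimizes the
    # activation-weighted error for THIS layer's weights and activations.
    errs = [act_weighted_error(W, X, quantize_to_codebook(W, cb))
            for cb in candidates]
    return candidates[int(np.argmin(errs))]

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 16))                       # one layer's weights
X = rng.normal(size=(16, 32))                      # calibration activations
uniform = np.linspace(-3, 3, 64)                   # 6-bit grid: 2**6 levels
nonuniform = np.tanh(np.linspace(-3, 3, 64)) * 3   # denser near zero
best = pick_codebook(W, X, [uniform, nonuniform])
```

The point of weighting by X: a weight that multiplies large activations hurts more when rounded, so two layers with identical weight histograms can still prefer different codebooks.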
PPOW paper — window-level RL for speculative drafters — What does it mean?
PPOW trains speculative-decoding drafters with WINDOW-level RL — three rewards adapt window size to KL divergence, lifting acceptance length to 6.29–6.52 tokens and end-to-end speedup to 3.4–4.4×.
MCP SEP-2663 lands Tasks extension — async task handles for long-running tool calls — What does it mean?
SEP-2663 lets an MCP server return a Task handle from tools/call; the client then drives it with tasks/get, tasks/update, and tasks/cancel — no blocked connections.
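A minimal sketch of the client side of that flow. The method names (tools/call, tasks/get, tasks/cancel) come from the summary above; the JSON-RPC 2.0 framing is standard MCP, but the field names (taskId, run_big_job) are illustrative placeholders, not the SEP's actual schema.

```python
import itertools

def jsonrpc(method, params, _ids=itertools.count(1)):
    # Build a JSON-RPC 2.0 request; MCP messages use this framing.
    return {"jsonrpc": "2.0", "id": next(_ids),
            "method": method, "params": params}

# 1. The tool call itself: instead of blocking until the job finishes,
#    the server's reply would carry a task handle (shape illustrative).
call = jsonrpc("tools/call", {"name": "run_big_job", "arguments": {"n": 10}})
task_id = "t-123"  # pretend the server returned this handle

# 2. The client drives the task: poll for status/result, or cancel.
#    The connection is free between polls -- nothing sits blocked.
poll = jsonrpc("tasks/get", {"taskId": task_id})
cancel = jsonrpc("tasks/cancel", {"taskId": task_id})
```

The win is structural: a tool that runs for minutes no longer pins a request/response exchange open for minutes.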
Is Grep All You Need? — Grep vs vector retrieval for agentic search — What does it mean?
An empirical study on 116 LongMemEval questions finds literal grep generally beats vector retrieval inside agent harnesses — and that harness design and tool-calling style dominate the retrieval algorithm choice.
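For concreteness, "literal grep" in an agent harness amounts to a regex line scan over the memory store — no embeddings, no index. A toy version (document contents are made up for illustration):

```python
import re

def grep(docs, pattern):
    # Literal/regex scan, like an agent's grep tool: return
    # (doc_id, line_number, line) for every matching line.
    hits = []
    for doc_id, text in docs.items():
        for i, line in enumerate(text.splitlines(), 1):
            if re.search(pattern, line):
                hits.append((doc_id, i, line))
    return hits

docs = {
    "notes.md": "Met Alice on 2024-03-01.\nDiscussed the vector index.",
    "log.txt": "Alice approved the budget.",
}
matches = grep(docs, r"Alice")  # exact-string recall, zero index to maintain
```

The study's claim is that inside a harness — where the agent can iterate, refine patterns, and chain tool calls — this exactness often matters more than semantic similarity does.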
AsyncFC paper — Symbolic futures in the decode stream — What does it mean?
AsyncFC inserts a typed placeholder when the model emits a tool call, so decoding keeps flowing while the tool runs — no retraining required.
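A minimal asyncio sketch of the placeholder-future idea, under stated assumptions — the token format (`<CALL:...>`) and the tool are made up; the point is that the decode loop keeps emitting tokens while the tool runs, and the placeholder is resolved afterward.

```python
import asyncio

async def slow_tool(x):
    # Stand-in for a tool with real wall-clock latency (e.g. a web search).
    await asyncio.sleep(0.01)
    return x * 2

async def decode_with_future():
    out = []
    for tok in ["The", "answer", "is", "<CALL:slow_tool(21)>", "."]:
        if tok.startswith("<CALL:"):
            # Emit a typed placeholder and launch the tool concurrently;
            # decoding continues instead of blocking on the result.
            fut = asyncio.create_task(slow_tool(21))
            out.append(("FUTURE", fut))
        else:
            out.append(tok)
    # Resolve placeholders once their tools have finished.
    return [await t[1] if isinstance(t, tuple) else t for t in out]

resolved = asyncio.run(decode_with_future())
```

Because the placeholder is inserted at decode time rather than learned, the base model needs no retraining — matching the claim above.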
vLLM v0.20 — TurboQuant 2-bit KV cache — What does it mean?
vLLM v0.20 ships TurboQuant — a 2-bit KV cache with per-block scales, cutting KV memory ~4× without crushing outliers.
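TurboQuant's exact scheme isn't spelled out above; here is a generic sketch of what "2-bit with per-block scales" means — four symmetric levels per value, one fp scale per small block so a single outlier only inflates its own block's scale, not the whole tensor's. Block size 8 is an assumption.

```python
import numpy as np

def quant2bit(kv, block=8):
    # Symmetric 2-bit quantization: codes {-2,-1,0,1} decode to levels
    # {-1.5,-0.5,0.5,1.5} * scale, with one scale per block of `block` values.
    x = kv.reshape(-1, block)
    scale = np.abs(x).max(axis=1, keepdims=True) / 1.5 + 1e-8
    q = np.clip(np.round(x / scale - 0.5), -2, 1)
    return q.astype(np.int8), scale

def dequant2bit(q, scale, shape):
    return ((q + 0.5) * scale).reshape(shape)

rng = np.random.default_rng(0)
kv = rng.normal(size=(4, 16)).astype(np.float32)
q, s = quant2bit(kv)
kv_hat = dequant2bit(q, s, kv.shape)
err = np.abs(kv - kv_hat).max()  # bounded by half a quantization step
```

2 bits per value plus a scale per 8 values is ~4 bits effective vs 16-bit KV — the ~4× memory cut; the per-block scale is what keeps outlier-heavy blocks from wrecking everything else.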
vLLM v0.20 — FlashAttention 4 packing — What does it mean?
vLLM v0.20 ships FlashAttention 4 with packed variable-length attention — one fused kernel for a whole batch, no padding waste.
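The data-layout half of "packed variable-length attention" is easy to show: concatenate sequences into one flat buffer and track boundaries with cumulative offsets (FlashAttention calls these `cu_seqlens`), so the kernel sees one contiguous batch with zero padding. A toy version of just the packing:

```python
import numpy as np

def pack(seqs):
    # Flatten variable-length sequences into one buffer; cu_seqlens[i]
    # is where sequence i starts, cu_seqlens[i+1] where it ends.
    flat = np.concatenate(seqs)
    cu_seqlens = np.cumsum([0] + [len(s) for s in seqs])
    return flat, cu_seqlens

def unpack(flat, cu_seqlens):
    return [flat[a:b] for a, b in zip(cu_seqlens[:-1], cu_seqlens[1:])]

seqs = [np.arange(3), np.arange(5), np.arange(2)]
flat, cu = pack(seqs)
# 10 real tokens; a padded batch would burn 3 * 5 = 15 slots.
```

The fused kernel then reads `cu` to keep each sequence's attention confined to its own slice — that's the "no padding waste" part of the claim.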
NVIDIA Nemotron 3 Nano Omni — 30B-A3B multimodal MoE — What does it mean?
Nemotron 3 routes text, image, audio, and video through ONE shared expert pool — 30B in HBM, ~3B active per token, ~9× throughput.
IBM Granite 4.1 — 8B dense matches the prior 32B MoE — What does it mean?
Granite 4.1 8B dense matches the prior 32B-A9B MoE on tool calling — and fits in roughly one quarter of the GPU memory.
GLM-5V-Turbo — native multimodal vs text-first vision-bolted designs — What does it mean?
GLM-5V-Turbo trains text + vision + tool data jointly from step 1, vs the LLaVA-style default that bolts a frozen ViT onto a text-only LLM.
DeepSeek V4-Pro and V4-Flash — long-context cost cut to a fraction — What does it mean?
V4-Pro and V4-Flash drop both per-token FLOPs and KV cache to roughly 7–27% of V3.2 at 1M context — same cluster, ~10–14× more concurrent users.
CoPD paper — Reinforcement Learning with Verifiable Rewards (RLVR) — What does it mean?
RLVR is post-training where a tiny verifier — unit tests, equality checks, proof assistant — replaces the learned reward model. Reward = 0 or 1, depending on whether the answer checks out.
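The "tiny verifier" really is tiny. A sketch for the unit-test case (function names and tests are made-up examples): run the candidate against the checks, return 1 only if everything passes — no learned reward model anywhere.

```python
def verifier_reward(candidate_fn, tests):
    # RLVR-style binary reward: 1 iff every verifiable check passes.
    # Any crash or wrong output scores 0 -- the verifier never "grades".
    try:
        return int(all(candidate_fn(*args) == expected
                       for args, expected in tests))
    except Exception:
        return 0

# Task: square a number. Two rollouts from the policy being trained:
tests = [((2,), 4), ((3,), 9)]
good = lambda n: n * n   # passes both checks
bad = lambda n: n + n    # happens to pass n=2, fails n=3

r_good = verifier_reward(good, tests)
r_bad = verifier_reward(bad, tests)
```

Note `bad` gets 0 even though it nails one test — the all-or-nothing reward is what makes the signal unhackable, at the cost of being sparse.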