AI Explained
Plain explanations of trending AI concepts, with live visualizations.
SOP paper — Hardware-aware per-layer PTQ at FP6 — What does it mean?
SOP picks a different codebook per layer by minimizing activation-weighted reconstruction error — and at FP6, that beats vanilla FP8 reconstruction error with 2 fewer bits per weight.
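SOP's actual codebook search isn't public here; below is a minimal numpy sketch of the core idea under stated assumptions — score each candidate codebook by activation-weighted reconstruction error ‖(W − W_q)X‖² on calibration activations, and keep the winner per layer. The two candidate codebooks are illustrative, not from the paper.

```python
import numpy as np

def act_weighted_error(W, X, W_q):
    # Activation-weighted reconstruction error: ||(W - W_q) X||_F^2.
    return np.linalg.norm((W - W_q) @ X) ** 2

def quantize_to_codebook(W, codebook):
    # Round every weight to its nearest codebook entry.
    idx = np.abs(W[..., None] - codebook).argmin(axis=-1)
    return codebook[idx]

def pick_codebook(W, X, candidates):
    # Per-layer selection: keep whichever codebook minimizes the
    # activation-weighted error for THIS layer's weights and activations.
    errs = [act_weighted_error(W, X, quantize_to_codebook(W, cb))
            for cb in candidates]
    return candidates[int(np.argmin(errs))]

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 16))                       # one layer's weights
X = rng.normal(size=(16, 32))                      # calibration activations
uniform = np.linspace(-3, 3, 64)                   # 6-bit grid: 2**6 levels
nonuniform = np.tanh(np.linspace(-3, 3, 64)) * 3   # denser near zero
best = pick_codebook(W, X, [uniform, nonuniform])
```

The point of weighting by X: a weight that multiplies large activations hurts more when rounded, so two layers with identical weight histograms can still prefer different codebooks.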
PPOW paper — window-level RL for speculative drafters — What does it mean?
PPOW trains speculative-decoding drafters with WINDOW-level RL — three rewards adapt window size to KL divergence, lifting acceptance length to 6.29–6.52 tokens and end-to-end speedup to 3.4–4.4×.
MCP SEP-2663 lands Tasks extension — async task handles for long-running tool calls — What does it mean?
SEP-2663 lets an MCP server return a Task handle from tools/call; the client then drives it with tasks/get, tasks/update, and tasks/cancel — no blocked connections.
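A minimal sketch of the client side of that flow. The method names (tools/call, tasks/get, tasks/cancel) come from the summary above; the JSON-RPC 2.0 framing is standard MCP, but the field names (taskId, run_big_job) are illustrative placeholders, not the SEP's actual schema.

```python
import itertools

def jsonrpc(method, params, _ids=itertools.count(1)):
    # Build a JSON-RPC 2.0 request; MCP messages use this framing.
    return {"jsonrpc": "2.0", "id": next(_ids),
            "method": method, "params": params}

# 1. The tool call itself: instead of blocking until the job finishes,
#    the server's reply would carry a task handle (shape illustrative).
call = jsonrpc("tools/call", {"name": "run_big_job", "arguments": {"n": 10}})
task_id = "t-123"  # pretend the server returned this handle

# 2. The client drives the task: poll for status/result, or cancel.
#    The connection is free between polls -- nothing sits blocked.
poll = jsonrpc("tasks/get", {"taskId": task_id})
cancel = jsonrpc("tasks/cancel", {"taskId": task_id})
```

The win is structural: a tool that runs for minutes no longer pins a request/response exchange open for minutes.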
Is Grep All You Need? — Grep vs vector retrieval for agentic search — What does it mean?
An empirical study on 116 LongMemEval questions finds literal grep generally beats vector retrieval inside agent harnesses — and that harness design and tool-calling style dominate the retrieval algorithm choice.
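For concreteness, "literal grep" in an agent harness amounts to a regex line scan over the memory store — no embeddings, no index. A toy version (document contents are made up for illustration):

```python
import re

def grep(docs, pattern):
    # Literal/regex scan, like an agent's grep tool: return
    # (doc_id, line_number, line) for every matching line.
    hits = []
    for doc_id, text in docs.items():
        for i, line in enumerate(text.splitlines(), 1):
            if re.search(pattern, line):
                hits.append((doc_id, i, line))
    return hits

docs = {
    "notes.md": "Met Alice on 2024-03-01.\nDiscussed the vector index.",
    "log.txt": "Alice approved the budget.",
}
matches = grep(docs, r"Alice")  # exact-string recall, zero index to maintain
```

The study's claim is that inside a harness — where the agent can iterate, refine patterns, and chain tool calls — this exactness often matters more than semantic similarity does.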
AsyncFC paper — Symbolic futures in the decode stream — What does it mean?
AsyncFC inserts a typed placeholder when the model emits a tool call, so decoding keeps flowing while the tool runs — no retraining required.
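A minimal asyncio sketch of the placeholder-future idea, under stated assumptions — the token format (`<CALL:...>`) and the tool are made up; the point is that the decode loop keeps emitting tokens while the tool runs, and the placeholder is resolved afterward.

```python
import asyncio

async def slow_tool(x):
    # Stand-in for a tool with real wall-clock latency (e.g. a web search).
    await asyncio.sleep(0.01)
    return x * 2

async def decode_with_future():
    out = []
    for tok in ["The", "answer", "is", "<CALL:slow_tool(21)>", "."]:
        if tok.startswith("<CALL:"):
            # Emit a typed placeholder and launch the tool concurrently;
            # decoding continues instead of blocking on the result.
            fut = asyncio.create_task(slow_tool(21))
            out.append(("FUTURE", fut))
        else:
            out.append(tok)
    # Resolve placeholders once their tools have finished.
    return [await t[1] if isinstance(t, tuple) else t for t in out]

resolved = asyncio.run(decode_with_future())
```

Because the placeholder is inserted at decode time rather than learned, the base model needs no retraining — matching the claim above.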
vLLM v0.20 — TurboQuant 2-bit KV cache — What does it mean?
vLLM v0.20 ships TurboQuant — a 2-bit KV cache with per-block scales, cutting KV memory ~4× without crushing outliers.
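TurboQuant's exact scheme isn't spelled out above; here is a generic sketch of what "2-bit with per-block scales" means — four symmetric levels per value, one fp scale per small block so a single outlier only inflates its own block's scale, not the whole tensor's. Block size 8 is an assumption.

```python
import numpy as np

def quant2bit(kv, block=8):
    # Symmetric 2-bit quantization: codes {-2,-1,0,1} decode to levels
    # {-1.5,-0.5,0.5,1.5} * scale, with one scale per block of `block` values.
    x = kv.reshape(-1, block)
    scale = np.abs(x).max(axis=1, keepdims=True) / 1.5 + 1e-8
    q = np.clip(np.round(x / scale - 0.5), -2, 1)
    return q.astype(np.int8), scale

def dequant2bit(q, scale, shape):
    return ((q + 0.5) * scale).reshape(shape)

rng = np.random.default_rng(0)
kv = rng.normal(size=(4, 16)).astype(np.float32)
q, s = quant2bit(kv)
kv_hat = dequant2bit(q, s, kv.shape)
err = np.abs(kv - kv_hat).max()  # bounded by half a quantization step
```

2 bits per value plus a scale per 8 values is ~4 bits effective vs 16-bit KV — the ~4× memory cut; the per-block scale is what keeps outlier-heavy blocks from wrecking everything else.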
vLLM v0.20 — FlashAttention 4 packing — What does it mean?
vLLM v0.20 ships FlashAttention 4 with packed variable-length attention — one fused kernel for a whole batch, no padding waste.
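The data-layout half of "packed variable-length attention" is easy to show: concatenate sequences into one flat buffer and track boundaries with cumulative offsets (FlashAttention calls these `cu_seqlens`), so the kernel sees one contiguous batch with zero padding. A toy version of just the packing:

```python
import numpy as np

def pack(seqs):
    # Flatten variable-length sequences into one buffer; cu_seqlens[i]
    # is where sequence i starts, cu_seqlens[i+1] where it ends.
    flat = np.concatenate(seqs)
    cu_seqlens = np.cumsum([0] + [len(s) for s in seqs])
    return flat, cu_seqlens

def unpack(flat, cu_seqlens):
    return [flat[a:b] for a, b in zip(cu_seqlens[:-1], cu_seqlens[1:])]

seqs = [np.arange(3), np.arange(5), np.arange(2)]
flat, cu = pack(seqs)
# 10 real tokens; a padded batch would burn 3 * 5 = 15 slots.
```

The fused kernel then reads `cu` to keep each sequence's attention confined to its own slice — that's the "no padding waste" part of the claim.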
NVIDIA Nemotron 3 Nano Omni — 30B-A3B multimodal MoE — What does it mean?
Nemotron 3 routes text, image, audio, and video through ONE shared expert pool — 30B in HBM, ~3B active per token, ~9× throughput.
IBM Granite 4.1 — 8B dense matches the prior 32B MoE — What does it mean?
Granite 4.1 8B dense matches the prior 32B-A9B MoE on tool calling — and fits in roughly one quarter of the GPU memory.
GLM-5V-Turbo — native multimodal vs text-first vision-bolted designs — What does it mean?
GLM-5V-Turbo trains text + vision + tool data jointly from step 1, vs the LLaVA-style default that bolts a frozen ViT onto a text-only LLM.
DeepSeek V4-Pro and V4-Flash — long-context cost cut to a fraction — What does it mean?
V4-Pro and V4-Flash drop both per-token FLOPs and KV cache to roughly 7–27% of V3.2 at 1M context — same cluster, ~10–14× more concurrent users.
CoPD paper — Reinforcement Learning with Verifiable Rewards (RLVR) — What does it mean?
RLVR is post-training where a tiny verifier — unit tests, equality checks, proof assistant — replaces the learned reward model. Reward = 0 or 1, depending on whether the answer checks out.
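The "tiny verifier" really is tiny. A sketch for the unit-test case (function names and tests are made-up examples): run the candidate against the checks, return 1 only if everything passes — no learned reward model anywhere.

```python
def verifier_reward(candidate_fn, tests):
    # RLVR-style binary reward: 1 iff every verifiable check passes.
    # Any crash or wrong output scores 0 -- the verifier never "grades".
    try:
        return int(all(candidate_fn(*args) == expected
                       for args, expected in tests))
    except Exception:
        return 0

# Task: square a number. Two rollouts from the policy being trained:
tests = [((2,), 4), ((3,), 9)]
good = lambda n: n * n   # passes both checks
bad = lambda n: n + n    # happens to pass n=2, fails n=3

r_good = verifier_reward(good, tests)
r_bad = verifier_reward(bad, tests)
```

Note `bad` gets 0 even though it nails one test — the all-or-nothing reward is what makes the signal unhackable, at the cost of being sparse.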