AI Explained

Plain explanations of trending AI concepts, with live visualizations.

LLM

CacheWeaver reorders RAG evidence for prefix-cache reuse — Prefix-cache-aware evidence reordering — What does it mean?

CacheWeaver reorders the retrieved chunks in a RAG prompt so the serving engine can reuse its KV prefix cache, cutting median time to first token (TTFT) by 20–33% with no measured quality loss.

Agent

Agent leaderboards mislead under distribution shift (IBM) — Predictive validity — What does it mean?

IBM argues that agent leaderboards rank models by a single aggregate score that fails under distribution shift, and that predictive validity should be measured instead.

LLM

LoopCoder-v2: two loops of a shared block beat deeper looping — Weight-tied block looping — What does it mean?

Running one shared transformer block twice lifts a 7B model from 43.0 to 64.4 on SWE-bench, but three or more loops regress: the gain is non-monotonic, with a sweet spot at two loops.

LLM

ConSA learns where to put full vs sliding-window attention per head — Controllable attention sparsity — What does it mean?

ConSA learns, under a fixed budget, which attention heads get full attention and which get a cheap sliding window — instead of hand-coded rules.

GPU

NVIDIA Blackwell sweeps MLPerf Training 6.0 — Strong scaling — What does it mean?

Strong scaling asks whether doubling the GPU count really halves training time. In MLPerf Training 6.0, 8,192 Blackwell GPUs trained DeepSeek-V3 671B to target in 2.02 min.

LLM

AnchorKV makes KV-cache compression safety-aware — Refusal anchor in the KV cache — What does it mean?

AnchorKV gives KV-cache eviction a safety-aware penalty, so compressing the cache to save memory no longer quietly erodes a model's alignment.

LLM

AMD ATOM + ATOMesh — Prefill/decode disaggregation on ROCm — What does it mean?

AMD's new ROCm serving stack splits LLM inference into a compute-bound prefill phase and a memory-bound decode phase, running each on its own pool of GPUs.

LLM

Variable-width transformers cut FLOPs 22% — Hourglass layer width — What does it mean?

A transformer with wide outer layers and a narrow middle (an hourglass) cuts FLOPs by 22% and KV-cache size by 15% at matched loss.

LLM

Ternary Mamba compresses an SSM 3.6× with 4 GPU-hours of QAT — Ternary quantization-aware training — What does it mean?

Ternary Mamba forces every weight to -1, 0, or +1 and trains on that grid, shrinking a Mamba SSM 3.6× in 4 GPU-hours.

LLM

SoftMoE replaces top-k expert routing — Differentiable soft top-k routing — What does it mean?

SoftMoE swaps a mixture-of-experts' hard top-k router for a differentiable soft top-k, so the router learns straight from the task loss.

Agent

PreAct compiles agent runs into replayable programs — Compiled trajectory replay — What does it mean?

PreAct compiles an agent's successful run into a replayable program it re-runs with no per-step model call — 8.5–13× faster on repeats.

LLM

SubQ 1.1 Small hits a 12M-token context — Subquadratic sparse attention — What does it mean?

SubQ 1.1 trades dense n² attention for a near-linear sparse form, reaching a 12M-token context and running 56× faster than FlashAttention-2 at 1M tokens.