AI Explained

Plain explanations of trending AI concepts, with live visualizations.

LLM

Distillation in LLM pre-training — Non-monotonic teacher strength — What does it mean?

Sweeping teacher and student sizes in pre-training distillation shows that a teacher's helpfulness peaks at small, undertrained models: stronger teachers can hurt, and the gains land out-of-domain, not in-domain.

LLM

Complete-muE paper — Two-bridge muTransfer for MoE — What does it mean?

Complete-muE extends muTransfer to MoE via two bridges — active-width scaling (dense → activated width) and activated-expert scaling (across MoE shapes) — so one dense sweep transfers near-optimally to any expert count and top-k.

LLM

Cursor Composer 2.5 — Targeted textual feedback RL — What does it mean?

Cursor Composer 2.5 ships targeted textual feedback RL — a constructed short hint at a specific span in a long agent rollout becomes a teacher distribution, and on-policy distillation KL replaces the diffuse end-of-rollout scalar with a localized credit signal.

LLM

VPO paper — Vector-reward advantage vs GRPO scalar collapse — What does it mean?

VPO replaces GRPO's scalar advantage estimator with a vector-valued one — so the post-trained policy keeps a diverse solution distribution that pays off at pass@k and best@k as the search budget grows.
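
For readers unfamiliar with the metric, pass@k is commonly estimated from n samples per problem, c of them correct, as 1 - C(n-c, k) / C(n, k). The sketch below is not taken from the VPO paper. It uses made-up sample counts for two hypothetical policies to show why a more diverse solution distribution can lose at k=1 yet win as k grows.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples of which c are correct."""
    if k > n:
        raise ValueError("k must not exceed n")
    if n - c < k:
        # Fewer than k incorrect samples: every size-k subset contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def mean_pass_at_k(correct_counts, n, k):
    """Average pass@k over problems, given the correct-sample count per problem."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)

# Hypothetical counts out of 16 samples on two problems.
collapsed = [16, 0]   # always solves problem 1, never solves problem 2
diverse = [4, 4]      # solves each problem in a quarter of its samples

for k in (1, 8):
    print(k, mean_pass_at_k(collapsed, 16, k), mean_pass_at_k(diverse, 16, k))
```

In this toy setup the collapsed policy stays at 0.5 for every k, while the diverse one starts at 0.25 for k=1 and exceeds 0.96 at k=8.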

LLM

Gated DeltaNet-2 paper — Decoupled channel-wise erase/write gates — What does it mean?

Gated DeltaNet-2 replaces the single scalar gate of prior linear-attention models with two independent per-channel vectors — one for erasing old state, one for writing new state — so each dimension of the recurrent memory can decay at its own rate.
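
As a rough illustration of what decoupled per-channel gates buy, here is a simplified linear-attention state update. It is not the paper's update rule: it omits the delta-rule correction, and the function name `gated_step`, the shapes, and the choice of the key dimension as the "channel" axis are all assumptions made for this sketch.

```python
import numpy as np

def gated_step(state, key, value, erase, write):
    """One recurrent update of a (d_k, d_v) memory (simplified, hypothetical form).

    erase, write: shape (d_k,), values in [0, 1], one gate entry per channel (row).
    """
    kept = erase[:, None] * state            # each row decays at its own rate
    written = np.outer(write * key, value)   # each row admits new content at its own rate
    return kept + written

state = np.ones((3, 2))
key = np.array([1.0, 1.0, 1.0])
value = np.array([0.5, -0.5])

# Decoupled per-channel gates: row 0 is kept, row 1 half-decays, row 2 is overwritten.
erase = np.array([1.0, 0.5, 0.0])
write = np.array([0.0, 0.0, 1.0])
per_channel = gated_step(state, key, value, erase, write)

# A single scalar gate forces every row to decay and write at the same rate.
scalar = gated_step(state, key, value, np.full(3, 0.5), np.full(3, 0.5))

print(per_channel)
print(scalar)
```

With the vector gates the three rows end up in three different regimes after one step; with the scalar gate all rows are identical.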

LLM

ACC paper — Tool-output unmasking — What does it mean?

ACC reformats agent trajectories as long-context QA pairs by unmasking tool outputs. Qwen3-30B-A3B gains +18.1 on MRCR, matching Qwen3-235B-A22B despite being 8× smaller.

LLM

RELEX paper — Rank-1 RLVR weight-trajectory extrapolation — What does it mean?

RELEX exploits the empirical finding that RLVR fine-tuning weight trajectories are near rank-1: fitting a line through 15% of the training steps and extrapolating matches full RLVR quality at a fraction of the compute.
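
The fit-and-extrapolate idea can be sketched on synthetic data. This is a toy under strong assumptions, not the paper's procedure: the trajectory is constructed to be a straight line plus small noise, and `fit_and_extrapolate`, `synthetic_weights`, and all sizes are hypothetical names and values chosen for the example.

```python
import numpy as np

def fit_and_extrapolate(steps, checkpoints, target_step):
    """Fit w(t) ~ intercept + slope * t per coordinate, then evaluate at target_step.

    steps: shape (m,) step indices; checkpoints: shape (m, d) flattened weights.
    """
    # np.polyfit accepts a 2-D y and fits each column independently.
    slope, intercept = np.polyfit(steps, checkpoints, deg=1)
    return intercept + slope * target_step

rng = np.random.default_rng(0)
total_steps, dim = 1000, 64
w0 = rng.normal(size=dim)
direction = rng.normal(size=dim)

def synthetic_weights(t):
    """Assumed ground truth: weights move along a single direction."""
    return w0 + (t / total_steps) * direction

# Observe only the first 15% of steps, with small noise on every checkpoint.
observed_steps = np.arange(int(0.15 * total_steps))
observed = np.stack([synthetic_weights(t) for t in observed_steps])
observed = observed + 1e-3 * rng.normal(size=observed.shape)

# How much of the observed movement lies along one direction?
singular_values = np.linalg.svd(observed - observed[0], compute_uv=False)
rank1_energy = singular_values[0] ** 2 / np.sum(singular_values ** 2)

predicted = fit_and_extrapolate(observed_steps, observed, total_steps)
target = synthetic_weights(total_steps)
relative_error = np.linalg.norm(predicted - target) / np.linalg.norm(target - w0)
print(rank1_energy, relative_error)
```

The extrapolation only works here because the synthetic trajectory is linear by construction; on real training runs the quality of the fit is an empirical question.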

LLM

OScaR paper — Token Norm Imbalance — What does it mean?

Token Norm Imbalance — a few tokens carry outsized KV norms along the sequence axis. Channel rotation can't flatten them; OScaR can.
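
Why channel rotation cannot help follows from a basic property of orthogonal matrices: they preserve the L2 norm of every row they are applied to. The sketch below checks this on random data with two artificially inflated tokens. It does not implement OScaR, and the shapes and outlier positions are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, channels = 128, 64

# Synthetic key cache with two hypothetical outlier tokens.
keys = rng.normal(size=(seq_len, channels))
keys[[0, 17]] *= 50.0

# Random orthogonal matrix acting on the channel axis (Q factor of a QR decomposition).
rotation, _ = np.linalg.qr(rng.normal(size=(channels, channels)))
rotated = keys @ rotation

norms_before = np.linalg.norm(keys, axis=1)
norms_after = np.linalg.norm(rotated, axis=1)

# The per-token norms, and therefore the imbalance along the sequence axis, are unchanged.
imbalance_before = norms_before.max() / np.median(norms_before)
imbalance_after = norms_after.max() / np.median(norms_after)
print(imbalance_before, imbalance_after)
```

A rotation can spread a token's energy across channels, but the outlier tokens remain outliers relative to the rest of the sequence.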

LLM

MSSP paper — Scale-stable parameterization beyond muP — What does it mean?

MSSP applies Dynamical Mean Field Theory to MoE training and derives a parameterization that — unlike muP — keeps the optimal learning rate stable as both model width and expert count scale.

LLM

Mix-Quant paper — NVFP4 prefill + BF16 decode — What does it mean?

Mix-Quant quantizes only the prefill phase to NVFP4 and keeps decode in BF16 — up to 3× prefill speedup with task performance largely preserved, because the two phases sit on opposite sides of the roofline.
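
The roofline argument can be made concrete with a back-of-the-envelope arithmetic-intensity calculation for a single weight matmul. The accelerator numbers below are hypothetical round figures, not a real device spec, and the byte accounting (weights read once, activations read and written once) is a simplifying assumption.

```python
def arithmetic_intensity(tokens, d_in, d_out, weight_bytes, act_bytes=2.0):
    """FLOPs per byte moved for a (tokens x d_in) @ (d_in x d_out) matmul."""
    flops = 2 * tokens * d_in * d_out
    bytes_moved = (
        d_in * d_out * weight_bytes              # weights read once
        + tokens * (d_in + d_out) * act_bytes    # activations in and out
    )
    return flops / bytes_moved

# Hypothetical accelerator: 1000 TFLOP/s peak compute, 3 TB/s memory bandwidth.
RIDGE = 1000e12 / 3e12  # FLOPs per byte at which the roofline bends

def bound(intensity, ridge=RIDGE):
    return "compute-bound" if intensity > ridge else "memory-bound"

# Prefill processes the whole prompt at once; decode processes one token per step.
prefill = arithmetic_intensity(tokens=4096, d_in=4096, d_out=4096, weight_bytes=2.0)
decode = arithmetic_intensity(tokens=1, d_in=4096, d_out=4096, weight_bytes=2.0)

print(f"prefill: {prefill:.1f} FLOPs/byte, {bound(prefill)}")
print(f"decode:  {decode:.2f} FLOPs/byte, {bound(decode)}")
```

Under these assumptions prefill lands well above the ridge point, where faster low-precision math helps directly, while single-token decode sits far below it.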

LLM

PSD paper — Parallel speculative decoding for diffusion LLMs — What does it mean?

Parallel speculative decoding yields up to 5.5× as many tokens per forward pass on diffusion LLMs by attacking the spatial and temporal axes at once.

LLM

Attention Once Is All You Need — Persistent KV cache across queries — What does it mean?

AOIAYN persists the KV cache across queries in a streaming session and advances it as data arrives — prefill leaves the critical path and per-query latency stays constant in context length.
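
A toy accounting model shows where the saving comes from. It only counts tokens that must be encoded at query time, a rough proxy for prefill work that ignores the cost of attending over the cache, and the class and function names are hypothetical rather than anything from the paper.

```python
class PersistentSession:
    """Toy stand-in for a session that keeps its KV cache between queries."""

    def __init__(self):
        self.cache = []  # one entry per already-encoded context token

    def ingest(self, tokens):
        """Encode newly arrived data off the query path and append it to the cache."""
        tokens = list(tokens)
        self.cache.extend(tokens)
        return len(tokens)

    def query(self, query_tokens):
        """Only the query tokens are encoded; the context is read from the cache."""
        return len(query_tokens)

def stateless_query(context_tokens, query_tokens):
    """Baseline: re-encode the full context on every query."""
    return len(context_tokens) + len(query_tokens)

session = PersistentSession()
question = ["what", "changed", "?"]
persistent_costs, stateless_costs = [], []
for chunk_size in (1_000, 10_000, 100_000):
    session.ingest(range(chunk_size))
    persistent_costs.append(session.query(question))
    stateless_costs.append(stateless_query(session.cache, question))

print(persistent_costs)  # [3, 3, 3]
print(stateless_costs)   # [1003, 11003, 111003]
```

The query-time count stays flat as the context grows because the encoding work was paid when the data arrived.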