AI Explained
Plain explanations of trending AI concepts, with live visualizations.
Distillation in LLM pre-training — Non-monotonic teacher strength — What does it mean?
Sweeping teacher and student sizes in pre-training distillation shows that a teacher's helpfulness peaks at small, undertrained models: stronger teachers can hurt, and the gains land out-of-domain, not in-domain.
Complete-muE paper — Two-bridge muTransfer for MoE — What does it mean?
Complete-muE extends muTransfer to MoE via two bridges — active-width scaling (dense → activated width) and activated-expert scaling (across MoE shapes) — so one dense sweep transfers near-optimally to any expert count and top-k.
Cursor Composer 2.5 — Targeted textual feedback RL — What does it mean?
Cursor Composer 2.5 ships targeted textual feedback RL — a constructed short hint at a specific span in a long agent rollout becomes a teacher distribution, and on-policy distillation KL replaces the diffuse end-of-rollout scalar with a localized credit signal.
VPO paper — Vector-reward advantage vs GRPO scalar collapse — What does it mean?
VPO replaces GRPO's scalar advantage estimator with a vector-valued one — so the post-trained policy keeps a diverse solution distribution that pays off at pass@k and best@k as the search budget grows.
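For reference, the scalar estimator that VPO is said to replace can be sketched as follows. This is a minimal illustration of GRPO's commonly described group-normalized advantage only; VPO's vector-valued estimator is not reproduced here because the summary above does not specify it. The function name, the use of the population standard deviation, and the `eps` value are assumptions made for this sketch.

```python
import statistics


def grpo_scalar_advantages(rewards, eps=1e-6):
    """Group-normalized scalar advantage: (r_i - mean) / (std + eps).

    `rewards` holds one scalar reward per sampled completion for a single
    prompt. Population std and `eps` are illustrative choices.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Four sampled completions for one prompt, two of them correct.
advantages = grpo_scalar_advantages([1.0, 0.0, 0.0, 1.0])

# When every completion gets the same reward, each advantage is zero,
# so the group contributes no learning signal.
flat = grpo_scalar_advantages([1.0, 1.0, 1.0, 1.0])
```

Every completion receives a single number, so two correct but very different solutions are treated identically by this estimator.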
Gated DeltaNet-2 paper — Decoupled channel-wise erase/write gates — What does it mean?
Gated DeltaNet-2 replaces the single scalar gate of prior linear-attention models with two independent per-channel vectors — one for erasing old state, one for writing new state — so each dimension of the recurrent memory can decay at its own rate.
ACC paper — Tool-output unmasking — What does it mean?
ACC reformats agent trajectories as long-context QA pairs by unmasking tool outputs: Qwen3-30B-A3B gains +18.1 on MRCR, matching Qwen3-235B-A22B despite being about 8× smaller.
RELEX paper — Rank-1 RLVR weight-trajectory extrapolation — What does it mean?
RELEX exploits the empirical finding that RLVR fine-tuning weight trajectories are near rank-1: it fits a line through 15% of training steps, extrapolates, and matches full RLVR quality at a fraction of the compute.
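The fit-and-extrapolate idea can be illustrated with a toy example. This is a simplified sketch, not the RELEX procedure itself: it assumes weights are stored as flat lists, fits an ordinary least-squares line per coordinate against the step index, and uses a synthetic trajectory that moves along a single fixed direction. The function names, the 100-step horizon, and the checkpoint schedule are all hypothetical.

```python
def fit_line(steps, values):
    """Ordinary least-squares fit: values ~ slope * step + intercept."""
    n = len(steps)
    mean_s = sum(steps) / n
    mean_v = sum(values) / n
    var_s = sum((s - mean_s) ** 2 for s in steps)
    cov = sum((s - mean_s) * (v - mean_v) for s, v in zip(steps, values))
    slope = cov / var_s
    return slope, mean_v - slope * mean_s


def extrapolate_weights(steps, checkpoints, target_step):
    """checkpoints[i] is the flat weight list observed at steps[i]."""
    dim = len(checkpoints[0])
    predicted = []
    for j in range(dim):
        slope, intercept = fit_line(steps, [c[j] for c in checkpoints])
        predicted.append(slope * target_step + intercept)
    return predicted


# Synthetic trajectory: the weights drift along one fixed direction.
w0 = [1.0, -2.0, 0.5]
direction = [0.01, 0.02, -0.03]
steps = [0, 5, 10, 15]  # checkpoints within 15% of 100 hypothetical steps
checkpoints = [[w + s * d for w, d in zip(w0, direction)] for s in steps]

# Predict the weights at step 100 without training past step 15.
predicted = extrapolate_weights(steps, checkpoints, target_step=100)
```

On a trajectory that really is a straight line, the extrapolation recovers the final weights exactly; how well this holds for real RLVR runs is the empirical claim summarized above.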
OScaR paper — Token Norm Imbalance — What does it mean?
Token Norm Imbalance — a few tokens carry outsized KV norms along the sequence axis. Channel rotation can't flatten them; OScaR can.
MSSP paper — Scale-stable parameterization beyond muP — What does it mean?
MSSP applies Dynamical Mean Field Theory to MoE training and derives a parameterization that — unlike muP — keeps the optimal learning rate stable as both model width and expert count scale.
Mix-Quant paper — NVFP4 prefill + BF16 decode — What does it mean?
Mix-Quant quantizes only the prefill phase to NVFP4 and keeps decode in BF16 — up to 3× prefill speedup with task performance largely preserved, because the two phases sit on opposite sides of the roofline.
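The roofline argument can be checked with back-of-the-envelope arithmetic. The sketch below models a single square weight matrix applied to a batch of tokens and compares its arithmetic intensity (FLOPs per byte moved) with a hardware ridge point. The peak throughput and bandwidth figures are hypothetical placeholders, not measurements of any specific accelerator, and the byte accounting is deliberately simplified.

```python
def arithmetic_intensity(tokens, d_model, bytes_per_weight=2.0, bytes_per_act=2.0):
    """FLOPs per byte for a (tokens x d_model) @ (d_model x d_model) matmul."""
    flops = 2 * tokens * d_model * d_model
    weight_bytes = d_model * d_model * bytes_per_weight
    # Read the input activations once and write the output once.
    act_bytes = 2 * tokens * d_model * bytes_per_act
    return flops / (weight_bytes + act_bytes)


def bound(tokens, d_model, peak_flops, mem_bw):
    """Return which side of the roofline ridge point the matmul sits on."""
    ridge = peak_flops / mem_bw
    if arithmetic_intensity(tokens, d_model) >= ridge:
        return "compute"
    return "memory"


PEAK_FLOPS = 1.0e15  # hypothetical peak FLOP/s
MEM_BW = 3.0e12      # hypothetical memory bandwidth in bytes/s

prefill = bound(4096, 4096, PEAK_FLOPS, MEM_BW)  # many tokens per weight load
decode = bound(1, 4096, PEAK_FLOPS, MEM_BW)      # one token per weight load
```

Under these assumed numbers, prefill lands on the compute-bound side and single-token decode on the memory-bound side, which is the split the summary above refers to.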
PSD paper — Parallel speculative decoding for diffusion LLMs — What does it mean?
Parallel speculative decoding gets up to 5.5× tokens per forward pass on diffusion LLMs by attacking spatial and temporal axes at once.
Attention Once Is All You Need — Persistent KV cache across queries — What does it mean?
AOIAYN persists the KV cache across queries in a streaming session and advances it as data arrives — prefill leaves the critical path and per-query latency stays constant in context length.