AI Explained

Plain explanations of trending AI concepts, with live visualizations.

GPU

LongLive-2.0 — NVFP4 W4A4 across training and inference — What does it mean?

NVIDIA's LongLive-2.0 runs training AND inference of a 5B long-video model in NVFP4 — W4A4 matmul plus a 4-bit KV cache — for 2.15× training and 1.84× inference speedup.

LLM

Spec-decode latency paper — Load-dependent latency model — What does it mean?

The paper decomposes speculative-decoding latency into load-independent and load-dependent parts via Little's Law; the speedup shrinks as the server saturates.
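
A toy sketch of why a load-dependent term can erode the gain. Little's Law says the mean number of in-flight requests is L = λ·W (arrival rate times mean latency). The linear latency model W = base + per_inflight·L and all constants below are illustrative assumptions, not the paper's model: speculative decoding is assumed to halve the load-independent part while adding 50% to the per-request load cost.

```python
def mean_latency(arrival_rate, base, per_inflight):
    """Solve W = base + per_inflight * L, with L = arrival_rate * W (Little's Law)."""
    load = per_inflight * arrival_rate
    if load >= 1.0:
        raise ValueError("saturated: latency grows without bound in this toy model")
    return base / (1.0 - load)


def spec_decode_speedup(arrival_rate, base=1.0, per_inflight=0.01,
                        base_factor=0.5, load_factor=1.5):
    """Ratio of plain latency to speculative latency (all factors hypothetical)."""
    plain = mean_latency(arrival_rate, base, per_inflight)
    spec = mean_latency(arrival_rate, base * base_factor, per_inflight * load_factor)
    return plain / spec


# Speedup at increasing request rates: it starts near 2x and falls as load rises.
for rate in (1, 20, 40, 60):
    print(rate, round(spec_decode_speedup(rate), 2))
```

Under these assumed numbers the ratio drops below 1 at high request rates, meaning the speculative path is slower than the plain one.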

LLM

RoPE provably fails at long context — Position and token discrimination limits — What does it mean?

Formal proof that RoPE's attention scores converge to random along BOTH the position axis and the token-identity axis as context grows — and the RoPE base parameter only trades one collapse for the other.

Agent

RecMem paper — Subconscious + recurrence-triggered agent memory — What does it mean?

RecMem encodes every agent interaction into a cheap embedding-only 'subconscious' vector store and only invokes the LLM to consolidate clusters whose density crosses a recurrence threshold — reportedly up to 87% fewer memory-construction tokens.

Agent

MSR delegation study — Cascading fidelity loss over 20 iterations — What does it mean?

Microsoft Research's delegation stress test runs 20 rounds of LLM-to-LLM document editing with constrained in-loop verification — strong frontier models lose 19–34% artifact fidelity by iteration 20.

Agent

MCP SEP-2577 — Three deprecations and a one-year migration window — What does it mean?

SEP-2577 deprecates three early MCP features — Roots, Sampling, Logging — and introduces a Deprecated lifecycle that keeps them functional for one year, then Removed.

Agent

Tool router paper — Contextual-bandit tool routing — What does it mean?

Reframes the choice between equivalent tool providers (two search APIs, two code executors, …) as a contextual bandit problem — the router learns answer quality per service cycle, not just lowest latency.
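
A minimal sketch of the contextual-bandit framing, using plain epsilon-greedy as a stand-in; the paper's actual algorithm may differ. The class name `ToolRouter`, the provider names, and the context labels are hypothetical. The reward is assumed to be an answer-quality score in [0, 1].

```python
import random
from collections import defaultdict


class ToolRouter:
    """Epsilon-greedy router keeping a running mean reward per (context, provider)."""

    def __init__(self, providers, epsilon=0.1, seed=0):
        self.providers = list(providers)
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        self.counts = defaultdict(int)
        self.means = defaultdict(float)

    def best(self, context):
        # Provider with the highest observed mean quality for this context.
        return max(self.providers, key=lambda p: self.means[(context, p)])

    def choose(self, context):
        # Explore with probability epsilon, otherwise exploit the current best.
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.providers)
        return self.best(context)

    def update(self, context, provider, reward):
        # Incremental mean update after observing the answer quality.
        key = (context, provider)
        self.counts[key] += 1
        self.means[key] += (reward - self.means[key]) / self.counts[key]


router = ToolRouter(["search_a", "search_b"])
router.update("news", "search_a", 0.2)
router.update("news", "search_b", 0.9)
router.update("code", "search_a", 0.8)
router.update("code", "search_b", 0.1)
print(router.best("news"), router.best("code"))
```

Because the estimates are keyed by context, the same two providers can each win on a different kind of query, which a latency-only rule would not capture.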

LLM

TIM paper — Training-Inference Mismatch in RL — What does it mean?

Zhong et al. introduce a controlled diagnostic, VeXact, that isolates rollout/policy numerical drift from every other RL instability — and show that drift alone, on the same nominal weights, is enough to collapse training.

LLM

SP-KV paper — Utility predictor for the KV cache — What does it mean?

Meta FAIR's SP-KV learns to write only high-value KV pairs to the cache, giving a 3–10× smaller footprint with little to no drop in validation loss or task performance.

LLM

Quantization-conditioned attack paper — Outlier injection across AWQ/GPTQ/GGUF — What does it mean?

A quantization-conditioned attack hides one outlier in a weight block — AWQ / GPTQ / GGUF I-quants then stretch their per-block scale to fit it, collapsing most other weights toward zero. FP16 benign, INT4 malicious.
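
The scale-stretching effect can be reproduced with a simplified quantizer. This sketch uses plain symmetric absmax round-to-nearest INT4 per block, which is an assumption made for illustration; AWQ, GPTQ, and GGUF I-quants are more elaborate, and the weight values below are invented.

```python
def quantize_block(weights, bits=4):
    """Symmetric absmax quantization of one block, returned in dequantized form."""
    qmax = 2 ** (bits - 1) - 1                      # 7 for INT4
    scale = max(abs(w) for w in weights) / qmax     # one scale shared by the block
    if scale == 0:
        return list(weights)
    # Round to the nearest integer level, clip to [-8, 7], then rescale.
    return [max(-qmax - 1, min(qmax, round(w / scale))) * scale for w in weights]


clean = [0.03, -0.05, 0.02, 0.04, -0.01, 0.05, -0.03, 0.02]
poisoned = clean[:-1] + [8.0]                       # one injected outlier

clean_q = quantize_block(clean)
poisoned_q = quantize_block(poisoned)

zeros_clean = sum(1 for w in clean_q if w == 0)
zeros_poisoned = sum(1 for w in poisoned_q if w == 0)
print(zeros_clean, zeros_poisoned)
```

With the outlier present, the shared scale grows to about 8/7, so every ordinary weight in the block rounds to level 0 while the outlier itself survives.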

LLM

PreFT applies LoRA only to prefill — Prefill-only LoRA adapters — What does it mean?

Stanford's PreFT runs the LoRA adapter during prefill, then drops it before decode begins — the adapter's behavioural signal lives inside the KV cache it shaped, so decode runs the bare base model and serves 1.9× the requests on 512 concurrent adapters.

LLM

TFGN paper — Subspace-preserving updates for continual pre-training — What does it mean?

TFGN continually pre-trains an 8B LLM without replay buffers or task IDs by structuring each update to live in a subspace orthogonal to prior-domain knowledge — backward transfer −0.007, JS perplexity −26.8% from Python-only training.