AI Explained

Plain explanations of trending AI concepts, with live visualizations.

LLM

ZEDA paper — Zero-output expert self-distillation — What does it mean?

ZEDA injects parameter-free zero-output experts into a finished MoE and uses two-stage self-distillation to teach the router to skip ~50% of expert FLOPs at marginal accuracy loss.

LLM

SGLang v0.5.12 — TokenSpeed MLA backend — What does it mean?

SGLang v0.5.12 ships TokenSpeed MLA, a Blackwell attention backend for Multi-head Latent Attention. MLA caches one shared low-rank K/V latent instead of per-head K/V, and the backend's TMA bulk-store path is reported to speed up the cache-write kernel by up to ~12×.
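To make the cache saving concrete, here is a rough sketch of the per-token arithmetic. The sizes (32 heads, head dimension 128, latent width 512) are hypothetical and not taken from SGLang or any particular model, and real MLA implementations may also cache a small positional component that this sketch leaves out.

```python
def per_head_kv_elems(n_heads, head_dim):
    # Standard multi-head attention stores one K and one V vector per head, per token.
    return 2 * n_heads * head_dim


def shared_latent_elems(latent_dim):
    # A latent-style cache stores one compressed vector per token, shared by all heads.
    return latent_dim


# Hypothetical sizes, chosen only for illustration.
n_heads, head_dim, latent_dim = 32, 128, 512

full = per_head_kv_elems(n_heads, head_dim)
latent = shared_latent_elems(latent_dim)
ratio = full / latent

print(f"per-head K/V: {full} elements per token per layer")
print(f"shared latent: {latent} elements per token per layer")
print(f"reduction: {ratio:.0f}x")
```

Under these assumed sizes the latent cache holds 16× fewer elements per token, which is why the cache-write kernel has far less data to move.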

LLM

Spec-decode latency paper — Load-dependent latency model — What does it mean?

The paper decomposes spec-decode latency into load-independent and load-dependent parts via Little's Law, and shows that speculative decoding's wins shrink as the server saturates.
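For readers who have not met Little's Law: it states that the mean number of requests in the system equals the arrival rate times the mean time each request spends there (L = λW). The sketch below shows only the law itself, with made-up numbers. How the paper plugs it into its latency model is not covered by this summary, and the function names are illustrative.

```python
def in_flight(arrival_rate_per_s, mean_latency_s):
    """Little's Law: L = lambda * W (mean number of requests in the system)."""
    return arrival_rate_per_s * mean_latency_s


def mean_latency(in_flight_requests, arrival_rate_per_s):
    """Little's Law rearranged: W = L / lambda."""
    return in_flight_requests / arrival_rate_per_s


# Made-up load points: the same 0.5 s mean latency at two arrival rates.
light = in_flight(arrival_rate_per_s=4, mean_latency_s=0.5)
heavy = in_flight(arrival_rate_per_s=20, mean_latency_s=0.5)
print(light, heavy)
```

The point for serving is that higher arrival rates mean more requests in flight at once, so any latency term that grows with the number of concurrent requests grows with load.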

LLM

RoPE provably fails at long context — Position and token discrimination limits — What does it mean?

A formal proof that, as context grows, RoPE's attention scores become no better than random at discriminating along both the position axis and the token-identity axis, and that the RoPE base parameter only trades one collapse for the other.
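As background, this is a minimal sketch of the standard RoPE construction the result is about: each pair of dimensions is rotated by an angle proportional to the token position, with frequencies set by the base parameter. It does not reproduce the paper's proof. It only shows where the base enters and that the score depends on relative position alone. The vectors are arbitrary examples.

```python
import math


def rope(vec, pos, base=10000.0):
    """Rotate each consecutive pair of dimensions by pos * base**(-i/d)."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)  # lower-index pairs rotate faster
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out


def score(q, k, m, n, base=10000.0):
    """Attention logit between a query at position m and a key at position n."""
    return sum(a * b for a, b in zip(rope(q, m, base), rope(k, n, base)))


q = [1.0, 0.5, -0.3, 2.0]
k = [0.2, -1.0, 0.7, 0.4]

# Same offset (3) at different absolute positions gives the same score.
print(score(q, k, 5, 2), score(q, k, 103, 100))
```

Changing `base` rescales every rotation frequency at once, which is the single knob the summary says cannot fix both failure modes together.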

LLM

TIM paper — Training-Inference Mismatch in RL — What does it mean?

Zhong et al. introduce a controlled diagnostic, VeXact, that isolates rollout/policy numerical drift from every other RL instability — and show that drift alone, on the same nominal weights, is enough to collapse training.

LLM

SP-KV paper — Utility predictor for the KV cache — What does it mean?

Meta FAIR's SP-KV learns to write only high-value KV pairs to cache — 3-10× smaller footprint with little-to-no validation-loss or task-performance drop.

LLM

Quantization-conditioned attack paper — Outlier injection across AWQ/GPTQ/GGUF — What does it mean?

A quantization-conditioned attack hides one outlier in a weight block — AWQ / GPTQ / GGUF I-quants then stretch their per-block scale to fit it, collapsing most other weights toward zero. FP16 benign, INT4 malicious.
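The scale-stretching effect is easy to see in a toy example. The sketch below uses plain symmetric absmax INT4 quantization with one scale per block as a stand-in. AWQ, GPTQ, and GGUF I-quants are more elaborate, and the weight values and the outlier size are invented for illustration.

```python
def quantize_block_int4(block):
    """Symmetric absmax quantization: one scale per block, integer codes in [-7, 7]."""
    scale = max(abs(w) for w in block) / 7.0
    if scale == 0.0:
        return [0] * len(block), 0.0
    codes = [max(-7, min(7, round(w / scale))) for w in block]
    return codes, scale


def dequantize(codes, scale):
    return [c * scale for c in codes]


# Invented weights: a small, well-behaved block and the same block with one outlier.
benign = [0.02, -0.03, 0.01, 0.04, -0.02, 0.03, -0.01, 0.02]
poisoned = benign[:-1] + [8.0]

codes_benign, scale_benign = quantize_block_int4(benign)
codes_poisoned, scale_poisoned = quantize_block_int4(poisoned)

print("benign codes:  ", codes_benign)
print("poisoned codes:", codes_poisoned)
print("poisoned block after dequantization:", dequantize(codes_poisoned, scale_poisoned))
```

In the poisoned block the scale is set by the outlier, so every ordinary weight rounds to code 0 and only the outlier survives, while the full-precision weights differ from the benign block in a single entry.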

LLM

PreFT applies LoRA only to prefill — Prefill-only LoRA adapters — What does it mean?

Stanford's PreFT runs the LoRA adapter during prefill, then drops it before decode begins — the adapter's behavioural signal lives inside the KV cache it shaped, so decode runs the bare base model and serves 1.9× the requests on 512 concurrent adapters.

LLM

TFGN paper — Subspace-preserving updates for continual pre-training — What does it mean?

TFGN continually pre-trains an 8B LLM without replay buffers or task IDs by structuring each update to live in a subspace orthogonal to prior-domain knowledge. Reported backward transfer is −0.007, and JS perplexity drops 26.8% after Python-only training.

LLM

HuggingFace blog — Async continuous batching — What does it mean?

Sync continuous batching stalls the GPU while Python composes the next batch — overlapping CPU prep with GPU compute lifts GPU-active time from 76% to 99% in HuggingFace's report (~22% speedup).

LLM

Compute Where It Counts — Per-token compute controller — What does it mean?

An ICML'26 paper bolts a lightweight policy network onto a frozen LLM that picks a per-token efficiency action — attention sparsity, MLP pruning, or activation bit-width — so easy tokens get cheap actions and hard tokens get full compute.

LLM

SOP paper — Hardware-aware per-layer PTQ at FP6 — What does it mean?

SOP picks a different codebook per layer using activation weights. At FP6, that beats vanilla FP8 on reconstruction error while using 1.5 fewer bits per weight.