AI Explained
Plain explanations of trending AI concepts, with live visualizations.
WASH attack washes out LLM text watermarks — Watermark removal by model-averaging — What does it mean?
Averaging the output distributions of 3–5 independent LLMs cancels each one's text watermark — detection z-scores fall from 5–300 to below 2.
PEFT scaling paper — Persistent personal adapters at million-scale — What does it mean?
Reframes a LoRA adapter from a cost-cutting trick into persistent per-user state — a million personal adapters served over one frozen base.
LongTraceRL — Rubric reward (entity-level process supervision) — What does it mean?
LongTraceRL scores each reasoning hop, not just the final answer — dense process reward, gated to correct rollouts so it cannot be gamed.
dMoE cuts diffusion-LLM MoE memory ~80% — block-level expert routing — What does it mean?
dMoE pools a diffusion block's per-token expert choices into one block-level decision — ~70→15 experts loaded, ~80% less memory.
Parallax — Local-linear attention vs FlashAttention 2/3 — What does it mean?
Parallax upgrades softmax attention from a flat local average to a slope-aware fit — sharper, and compute-bound past FlashAttention 2/3.
Claude Opus 4.8 — Cache-preserving mid-task system messages — What does it mean?
Opus 4.8 can inject a system message mid-conversation without busting the prompt cache — so the cached prefix is reused, not recomputed.
MarginGate — Margin-gated verification for batch-invariant decoding — What does it mean?
Why temperature-0 BF16 decoding can emit different tokens depending on how requests are batched, and how MarginGate restores determinism by re-checking only the risky steps.
Google's Gemini Omni — Modality unification in a shared token space — What does it mean?
Gemini Omni turns text, image, audio, and video into tokens in one shared space, so a single model can read — and generate — any modality.
Parametric Memory Law links LoRA capacity to verbatim recall — The p > 0.5 recall threshold — What does it mean?
A finetune memorizes a token verbatim once its greedy probability crosses 0.5 — and a power law says how much LoRA capacity that takes.
MobileMoE — DRAM-aware MoE scaling for sub-3GB devices — What does it mean?
MobileMoE introduces a joint memory + compute scaling law for on-device MoE LMs; the S/M/L models fit under 3 GB at INT4 and reportedly achieve 1.8–3.8× faster prefill and 2.2–3.4× faster decode than dense baselines on Galaxy S25 and iPhone 16 Pro.
Gemini 3.5 Flash — Agent-first model design — What does it mean?
Some LLMs are trained from day one to live inside an agent loop — calling tools, recovering from errors — instead of being chat models with tool-calling bolted on later.
ThriftAttention paper — Importance-aware FP16/FP4 mixed-precision attention — What does it mean?
ThriftAttention runs ~5% of QK attention blocks (those ranked highest by a cheap importance score) in FP16 and the rest in FP4, recovering 89.1% of the FP4→FP16 long-context quality gap.
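
To make the ThriftAttention summary concrete, here is a minimal sketch of the selection step only: given one importance score per QK block, mark the top ~5% for FP16 and everything else for FP4. The function name `assign_block_precision`, the `fp16_fraction` parameter, and the rule of always keeping at least one FP16 block are illustrative assumptions, not details taken from the paper. How the importance score is computed is not specified here, so the scores are simply passed in, and no actual quantization is performed.

```python
def assign_block_precision(scores, fp16_fraction=0.05):
    """Label each attention block 'fp16' or 'fp4' from its importance score.

    Hypothetical helper: `scores` holds one importance value per QK block,
    produced by whatever cheap scoring rule is in use (not modeled here).
    """
    if not 0.0 <= fp16_fraction <= 1.0:
        raise ValueError("fp16_fraction must be between 0 and 1")
    n = len(scores)
    if n == 0:
        return []
    # Assumption: keep at least one block in FP16 whenever the fraction is nonzero.
    k = max(1, round(fp16_fraction * n)) if fp16_fraction > 0 else 0
    # Indices of the k highest-scoring blocks (ties keep their original order).
    ranked = sorted(range(n), key=lambda i: scores[i], reverse=True)
    high_precision = set(ranked[:k])
    return ["fp16" if i in high_precision else "fp4" for i in range(n)]


if __name__ == "__main__":
    demo_scores = [0.1] * 20
    demo_scores[7] = 0.9  # one block stands out
    print(assign_block_precision(demo_scores))
```

With 20 blocks and the default fraction, only the single highest-scoring block is labeled for FP16; a real kernel would then dispatch each block to the matching precision path.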
