AI Explained

Plain explanations of trending AI concepts, with live visualizations.

LLM · 2026-06-02

WASH attack washes out LLM text watermarks — Watermark removal by model-averaging — What does it mean?

Averaging the output distributions of 3–5 independent LLMs cancels each one's text watermark — detection z-scores fall from 5–300 to below 2.
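To see why averaging dilutes a watermark, here is a toy calculation, not the WASH paper's method. It assumes a green-list style watermark (each model boosts the logits of its own secret "green" half of the vocabulary by `delta`), a uniform base distribution, and green lists that are independent across models. All names and default values are illustrative assumptions, and the toy is not tuned to reproduce the reported z-scores.

```python
import math

def green_prob_single(gamma: float, delta: float) -> float:
    """Chance that one watermarked model emits a token from its own green list,
    assuming a uniform base distribution and a logit boost `delta` on green tokens."""
    boosted = gamma * math.exp(delta)
    return boosted / (boosted + (1.0 - gamma))

def expected_z(k: int, gamma: float = 0.5, delta: float = 2.0,
               n_tokens: int = 200) -> float:
    """Expected detection z-score against model 0's green list when text is
    sampled from the average of k models' output distributions."""
    p_own = green_prob_single(gamma, delta)
    # Model 0 supplies 1/k of the averaged distribution. The other k-1 models
    # use independent green lists, so they land on model 0's list only at
    # the chance rate gamma.
    green_frac = (p_own + (k - 1) * gamma) / k
    return (green_frac - gamma) * math.sqrt(n_tokens) / math.sqrt(gamma * (1.0 - gamma))

for k in (1, 3, 5):
    print(k, round(expected_z(k), 2))
```

Under these assumptions the excess green-token rate, and with it the expected z-score, shrinks in proportion to 1/k: each extra model in the average waters down every individual watermark.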

LLM · 2026-06-02

PEFT scaling paper — Persistent personal adapters at million-scale — What does it mean?

Reframes a LoRA adapter from a cost-cutting trick into persistent per-user state — a million personal adapters served over one frozen base.
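A minimal sketch of "many adapters over one frozen base", using the standard LoRA form y = Wx + B(Ax). The `AdapterServer` class and its method names are hypothetical, made up for illustration; the paper's serving system is not described in this summary.

```python
from typing import Dict, List, Tuple

Matrix = List[List[float]]
Vector = List[float]

def matvec(m: Matrix, v: Vector) -> Vector:
    """Plain matrix-vector product."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

class AdapterServer:
    """One shared frozen weight matrix plus a small low-rank pair (A, B) per user."""

    def __init__(self, base_weight: Matrix) -> None:
        self.base_weight = base_weight  # shared by all users, never updated
        self.adapters: Dict[str, Tuple[Matrix, Matrix]] = {}

    def register(self, user_id: str, a: Matrix, b: Matrix) -> None:
        """Store a user's adapter: A is (rank x d_in), B is (d_out x rank)."""
        self.adapters[user_id] = (a, b)

    def forward(self, user_id: str, x: Vector) -> Vector:
        y = matvec(self.base_weight, x)
        if user_id in self.adapters:
            a, b = self.adapters[user_id]
            delta = matvec(b, matvec(a, x))  # low-rank, per-user correction
            y = [yi + di for yi, di in zip(y, delta)]
        return y

server = AdapterServer([[1.0, 0.0], [0.0, 1.0]])
server.register("alice", a=[[1.0, 1.0]], b=[[0.5], [0.0]])
print(server.forward("alice", [2.0, 3.0]))    # base output plus alice's correction
print(server.forward("unknown", [2.0, 3.0]))  # base output only
```

The per-user state is just the two small matrices, which is what makes storing one adapter per person plausible while the large base weights are loaded once.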

LLM · 2026-06-02

LongTraceRL — Rubric reward (entity-level process supervision) — What does it mean?

LongTraceRL scores each reasoning hop, not just the final answer — dense process reward, gated to correct rollouts so it cannot be gamed.
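The gating idea can be sketched in a few lines. This is an illustration under assumptions, not LongTraceRL's actual reward: the function name, the mean-of-hops aggregation, and the `process_weight` value are all hypothetical.

```python
from typing import Sequence

def gated_reward(hop_scores: Sequence[float], final_correct: bool,
                 process_weight: float = 0.5) -> float:
    """Outcome reward plus a dense per-hop bonus that is paid only when the
    final answer is correct."""
    if not final_correct:
        # Gate closed: good-looking intermediate hops earn nothing on their own,
        # so the policy cannot farm process reward while answering wrongly.
        return 0.0
    process = sum(hop_scores) / len(hop_scores) if hop_scores else 0.0
    return 1.0 + process_weight * process

print(gated_reward([1.0, 1.0, 1.0], final_correct=False))
print(gated_reward([1.0, 0.0], final_correct=True))
```

Among correct rollouts, the one with better-scored hops still receives the higher reward, which is where the dense signal comes from.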

LLM · 2026-06-02

dMoE cuts diffusion-LLM MoE memory ~80% — block-level expert routing — What does it mean?

dMoE pools a diffusion block's per-token expert choices into one block-level decision — ~70→15 experts loaded, ~80% less memory.
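A small sketch of block-level routing. Summing router probabilities across the block and keeping the top `budget` experts is an assumed pooling rule chosen for illustration; the summary does not say how dMoE pools, and the function name is hypothetical.

```python
from typing import List

def block_level_experts(router_probs: List[List[float]], budget: int) -> List[int]:
    """router_probs[t][e] is the router probability of expert e for token t.
    Returns the expert ids that the whole block will share."""
    n_experts = len(router_probs[0])
    # Pool the per-token preferences into a single score per expert.
    pooled = [sum(tok[e] for tok in router_probs) for e in range(n_experts)]
    ranked = sorted(range(n_experts), key=lambda e: pooled[e], reverse=True)
    return sorted(ranked[:budget])

block = [
    [0.7, 0.2, 0.1, 0.0],  # token 0 alone would pick experts 0 and 1
    [0.1, 0.6, 0.0, 0.3],  # token 1 alone would pick experts 1 and 3
    [0.6, 0.3, 0.1, 0.0],  # token 2 alone would pick experts 0 and 1
]
print(block_level_experts(block, budget=2))
```

With per-token top-2 routing this block would touch three distinct experts; the pooled decision loads two. The same effect at scale is what lets far fewer experts sit in memory per block.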

LLM · 2026-05-30

Parallax — Local-linear attention vs FlashAttention 2/3 — What does it mean?

Parallax upgrades softmax attention from a flat local average to a slope-aware fit — sharper, and compute-bound past FlashAttention 2/3.

LLM · 2026-05-30

Claude Opus 4.8 — Cache-preserving mid-task system messages — What does it mean?

Opus 4.8 can inject a system message mid-conversation without busting the prompt cache — so the cached prefix is reused, not recomputed.

LLM · 2026-05-30

MarginGate — Margin-gated verification for batch-invariant decoding — What does it mean?

Why temperature-0 BF16 decoding can emit different tokens for the same prompt depending on how it is batched, and how MarginGate restores determinism by re-checking only the risky steps.
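One way to read "risky steps": a tiny numerical difference in the logits can only flip a greedy choice when the top two logits are nearly tied. The sketch below flags such steps by their top-1 minus top-2 margin. The function name and threshold are hypothetical, and how MarginGate actually re-checks a flagged step is not covered here.

```python
from typing import List

def risky_steps(step_logits: List[List[float]], margin_threshold: float) -> List[int]:
    """Return the decode steps whose top-1/top-2 logit margin is below the
    threshold, i.e. the only steps where small rounding noise could change
    the argmax. Each step needs at least two logits."""
    flagged = []
    for step, logits in enumerate(step_logits):
        top1, top2 = sorted(logits, reverse=True)[:2]
        if top1 - top2 < margin_threshold:
            flagged.append(step)
    return flagged

logits_per_step = [
    [5.0, 1.0, 0.0],    # clear winner: safe
    [2.01, 2.0, 0.5],   # near tie: risky
    [3.0, 2.9, 2.95],   # near tie: risky
]
print(risky_steps(logits_per_step, margin_threshold=0.1))
```

Steps with a wide margin can keep the fast batched result as-is, so any extra verification cost is paid only on the flagged minority.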

LLM · 2026-05-30

Google's Gemini Omni — Modality unification in a shared token space — What does it mean?

Gemini Omni turns text, image, audio, and video into tokens in one shared space, so a single model can read — and generate — any modality.

LLM · 2026-05-29

Parametric Memory Law links LoRA capacity to verbatim recall — The p > 0.5 recall threshold — What does it mean?

A finetune memorizes a token verbatim once the model's probability for it crosses 0.5, the point past which no other token can outrank it and greedy decoding must emit it, and a power law says how much LoRA capacity that takes.

LLM · 2026-05-29

MobileMoE — DRAM-aware MoE scaling for sub-3GB devices — What does it mean?

MobileMoE introduces a joint memory + compute scaling law for on-device MoE LMs; its S/M/L variants fit under 3 GB at INT4 and reportedly run prefill 1.8–3.8× faster and decode 2.2–3.4× faster than dense baselines on Galaxy S25 and iPhone 16 Pro.

LLM · 2026-05-29

Gemini 3.5 Flash — Agent-first model design — What does it mean?

Some LLMs are trained from day one to live inside an agent loop — calling tools, recovering from errors — instead of being chat models with tool-calling bolted on later.

LLM · 2026-05-26

ThriftAttention paper — Importance-aware FP16/FP4 mixed-precision attention — What does it mean?

ThriftAttention runs ~5% of QK attention blocks (the top by a cheap importance score) in FP16 and the rest in FP4, recovering 89.1% of the FP4→FP16 long-context quality gap.
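The selection step can be sketched as a top-fraction pick over per-block scores. How ThriftAttention computes its importance score is not stated in this summary, so the scores here are simply an input; the function name and the rounding rule are assumptions.

```python
from typing import List

def assign_precision(importance: List[float], fp16_fraction: float = 0.05) -> List[str]:
    """Label each QK attention block 'fp16' or 'fp4': the highest-scoring
    fraction of blocks keeps FP16, everything else drops to FP4."""
    n_blocks = len(importance)
    n_fp16 = max(1, round(fp16_fraction * n_blocks))  # always keep at least one
    ranked = sorted(range(n_blocks), key=lambda i: importance[i], reverse=True)
    keep = set(ranked[:n_fp16])
    return ["fp16" if i in keep else "fp4" for i in range(n_blocks)]

scores = [0.1] * 40
scores[7], scores[20] = 10.0, 9.0  # two blocks that matter far more than the rest
plan = assign_precision(scores)
print(plan.count("fp16"), plan.count("fp4"))
```

The design bet is that attention quality is dominated by a few blocks, so spending full precision only there recovers most of the gap at close to FP4 cost.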