AI Explained

Plain explanations of trending AI concepts, with live visualizations.

LLM

Tangram speeds multi-turn serving up to 2.6× — Per-head KV cache budgets — What does it mean?

Tangram sizes each attention head's KV cache to what that head actually keeps instead of applying one uniform budget, speeding up multi-turn serving by up to 2.6×.
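
To make the per-head idea concrete, here is a minimal sketch. It is not Tangram's actual allocation rule: the attention weights are made up, and the rule (keep the fewest tokens that cover 95% of a head's attention mass) and the names `tokens_to_keep`, `per_head_budgets` and `coverage` are assumptions for illustration.

```python
def tokens_to_keep(attn_weights, coverage=0.95):
    """Smallest number of cached tokens whose attention mass reaches `coverage`."""
    total = sum(attn_weights)
    kept_mass = 0.0
    # Take tokens from most to least attended until enough mass is covered.
    for count, w in enumerate(sorted(attn_weights, reverse=True), start=1):
        kept_mass += w
        if kept_mass >= coverage * total:
            return count
    return len(attn_weights)


def per_head_budgets(heads, coverage=0.95):
    """One budget per head instead of a single uniform budget."""
    return [tokens_to_keep(h, coverage) for h in heads]


# Hypothetical attention mass over 8 cached tokens for two heads.
peaky_head = [0.90, 0.06, 0.01, 0.01, 0.01, 0.005, 0.003, 0.002]
flat_head = [0.125] * 8

budgets = per_head_budgets([peaky_head, flat_head])
# A uniform budget has to be large enough for the neediest head.
uniform_total = len(budgets) * max(budgets)
print(budgets, sum(budgets), uniform_total)
```

In this toy case the peaky head needs far fewer entries than the flat one, so per-head sizing stores fewer tokens in total than a uniform budget sized for the flat head.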

LLM

Google ships Gemma 4 QAT checkpoints — Quantization-Aware Training — What does it mean?

Gemma 4 ships QAT 4-bit checkpoints: training on the low-bit grid dodges the accuracy cliff that naive post-training rounding hits.
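
The mechanism behind QAT can be sketched in a few lines. This is a generic toy, not Gemma's training recipe: the weight is snapped to a signed 4-bit grid in the forward pass ("fake quantization"), while the gradient is passed straight through to the full-precision weight (the straight-through estimator). The scale, target, learning rate and step count are arbitrary assumptions.

```python
def fake_quant(w, scale, bits=4):
    """Snap w to a signed `bits`-bit grid, then map it back to a float."""
    qmin, qmax = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1  # -8..7 for 4 bits
    q = max(qmin, min(qmax, round(w / scale)))
    return q * scale


# Toy problem: fit one weight so that fake_quant(w) * x is close to y.
x, y = 1.0, 0.33
scale, lr = 0.1, 0.1
w = 0.0  # full-precision "latent" weight kept during training

for _ in range(10):
    pred = fake_quant(w, scale) * x   # forward pass sees the quantized weight
    grad = 2.0 * (pred - y) * x       # straight-through: treat d(fake_quant)/dw as 1
    w -= lr * grad                    # update the latent full-precision weight

print(w, fake_quant(w, scale))
```

The model only ever "sees" weights that already sit on the 4-bit grid, so the loss it minimizes is the loss of the quantized model.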

LLM

MatMul-only matrix inversion makes quantized Gated DeltaNet 5x faster — Truncated-Neumann triangular inverse — What does it mean?

Gated DeltaNet's chunk solve hides a sequential matrix inverse; a truncated Neumann series turns it into parallel MatMuls, roughly 5x faster.
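
The identity behind this trick is small enough to check by hand. As a sketch, assume the matrix to invert has the form I - N with N strictly lower triangular; then N is nilpotent and (I - N)^-1 = I + N + N^2 + ..., a sum that needs only matrix products. The helper names and the 3×3 matrix below are made up for illustration, and the paper's exact truncation scheme is not reproduced here.

```python
def matmul(a, b):
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def identity(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def neumann_inverse(n_mat, terms):
    """Approximate (I - N)^-1 by I + N + ... + N^(terms-1), using only MatMuls."""
    size = len(n_mat)
    result = identity(size)
    power = identity(size)
    for _ in range(terms - 1):
        power = matmul(power, n_mat)
        result = add(result, power)
    return result


# Strictly lower-triangular example, so N^3 = 0 for this 3x3 case.
N = [[0.0, 0.0, 0.0],
     [0.5, 0.0, 0.0],
     [0.25, 0.5, 0.0]]
I_minus_N = [[(1.0 if i == j else 0.0) - N[i][j] for j in range(3)]
             for i in range(3)]

exact = matmul(I_minus_N, neumann_inverse(N, terms=3))      # all terms kept
truncated = matmul(I_minus_N, neumann_inverse(N, terms=2))  # one term dropped
print(exact)
print(truncated)
```

With all three terms the product is exactly the identity. Dropping the last term leaves a residual in the bottom-left entry, which is the accuracy-for-speed trade a truncated series makes.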

Agent

AutoLab benchmarks frontier agents on long-horizon R&D tasks — Iterative experiment-loop evaluation — What does it mean?

AutoLab grades agents on the propose → run → measure → refine loop; across 17 models, sustained iteration — not the first answer — predicted success.

Agent

Token Budgets paper — Affine-typed budget ownership — What does it mean?

Token Budgets models an agent's token cap as a use-at-most-once resource — so a budget overrun fails to compile instead of overspending at runtime.

Agent

TELBench localizes where deep-research agents go wrong — Span-level error localization — What does it mean?

TELBench + DRIFT pinpoint which step of a deep-research agent's trajectory made the answer unreliable, with span-level error localization gains of up to 30 pp.

Agent

StreamMA — Streaming inter-agent reasoning — What does it mean?

StreamMA streams each reasoning step between agents, so agent B starts on agent A's reliable early steps rather than the risky last one, for +7.3 pp average accuracy.

GPU

NVIDIA RTX Spark superchip — Unified CPU–GPU memory — What does it mean?

RTX Spark wires a Grace CPU and a Blackwell GPU to one 128 GB pool over NVLink-C2C, so the GPU skips the PCIe host–device copy.

LLM

Microsoft MAI-Code-1-Flash — Adaptive solution-length control — What does it mean?

MAI-Code-1-Flash scales the number of reasoning tokens it spends to each task's difficulty: short chains for easy tasks, long ones only for hard tasks.

LLM

KVarN squeezes the KV cache to 2 bits — Hadamard rotation — What does it mean?

KVarN rotates outliers out of the KV cache so 2-bit quantization fits every channel; it is calibration-free, with no error compounding across decode steps.
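
Why a Hadamard rotation helps can be seen on a four-channel toy vector. The sketch below is a generic illustration, not KVarN's pipeline: it builds a Hadamard matrix with the Sylvester construction, applies the orthonormal transform, and shows that one large outlier gets spread across all channels. The input vector is made up.

```python
import math


def hadamard(n):
    """Sylvester construction of an n x n Hadamard matrix; n must be a power of two."""
    h = [[1.0]]
    while len(h) < n:
        h = [row + row for row in h] + [row + [-v for v in row] for row in h]
    return h


def rotate(x):
    """Apply the orthonormal Hadamard transform H / sqrt(n) to x."""
    n = len(x)
    h = hadamard(n)
    return [sum(h[i][j] * x[j] for j in range(n)) / math.sqrt(n)
            for i in range(n)]


x = [8.0, 0.5, -0.5, 0.0]   # one outlier channel dominates the range
y = rotate(x)               # same energy, spread over every channel

print(y)
print(max(abs(v) for v in x), max(abs(v) for v in y))
```

After the rotation the largest magnitude is smaller and the values are much closer together, so a coarse grid with only four levels wastes less of its range on a single outlier. Because the transform is orthonormal, it preserves the vector's norm and can be undone exactly.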

Agent

Crafter paper — Multi-agent refinement harness with a directive critic — What does it mean?

Crafter's directive critic emits per-dimension fixes + typed edits, not a scalar score — a harness that lifts figures 33.73 → 50.34.

LLM

WASH attack washes out LLM text watermarks — Watermark removal by model-averaging — What does it mean?

Averaging the output distributions of 3–5 independent LLMs cancels each one's text watermark — detection z-scores fall from 5–300 to below 2.
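
A toy version of the averaging effect, under loud assumptions: each model is taken to use a "green list" style watermark that upweights its own subset of tokens, the vocabulary has six tokens, and the three green lists are chosen to partition the vocabulary so the cancellation is exact. Real watermarks and real vocabularies only approximate this, and the z-score figures above are not reproduced here.

```python
def watermarked(base, green, boost=2.0):
    """Upweight this model's green tokens, then renormalize to a distribution."""
    weights = [p * (boost if i in green else 1.0) for i, p in enumerate(base)]
    total = sum(weights)
    return [w / total for w in weights]


def average(dists):
    """Token-wise mean of several next-token distributions."""
    return [sum(col) / len(dists) for col in zip(*dists)]


def max_bias(dist, base):
    """Largest per-token deviation from the unwatermarked distribution."""
    return max(abs(p - q) for p, q in zip(dist, base))


base = [1 / 6] * 6                        # unwatermarked next-token distribution
green_lists = [{0, 1}, {2, 3}, {4, 5}]    # contrived: lists partition the vocabulary
models = [watermarked(base, g) for g in green_lists]
mixed = average(models)

print(max_bias(models[0], base))   # single model: visible tilt toward green tokens
print(max_bias(mixed, base))       # averaged: tilt cancels in this toy setup
```

Each model alone tilts its distribution toward its own green tokens, which is the signal a detector looks for. In the averaged distribution those tilts offset one another, so text sampled from it carries little of any single model's signature.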