AI Explained
Plain explanations of trending AI concepts, with live visualizations.
Tangram speeds multi-turn serving up to 2.6× — Per-head KV cache budgets — What does it mean?
Tangram sizes each attention head's KV cache to what it actually keeps, not one uniform budget — lifting multi-turn serving up to 2.6×.
Google ships Gemma 4 QAT checkpoints — Quantization-Aware Training — What does it mean?
Gemma 4 ships QAT 4-bit checkpoints: training on the low-bit grid dodges the accuracy cliff that naive post-training rounding hits.
MatMul-only matrix inversion makes quantized Gated DeltaNet 5x faster — Truncated-Neumann triangular inverse — What does it mean?
Gated DeltaNet's chunk solve hides a sequential matrix inverse — a truncated Neumann series turns it into parallel MatMuls, ~5x faster.
AutoLab benchmarks frontier agents on long-horizon R&D tasks — Iterative experiment-loop evaluation — What does it mean?
AutoLab grades agents on the propose → run → measure → refine loop; across 17 models, sustained iteration — not the first answer — predicted success.
Token Budgets paper — Affine-typed budget ownership — What does it mean?
Token Budgets models an agent's token cap as a use-at-most-once resource — so a budget overrun fails to compile instead of overspending at runtime.
TELBench localizes where deep-research agents go wrong — Span-level error localization — What does it mean?
TELBench + DRIFT pinpoint which step of a deep-research agent's trajectory made the answer unreliable — span-level error localization, up to +30 pp.
StreamMA — Streaming inter-agent reasoning — What does it mean?
StreamMA streams each reasoning step between agents — B starts on A's reliable early steps, not the risky last one. +7.3pp avg accuracy.
NVIDIA RTX Spark superchip — Unified CPU–GPU memory — What does it mean?
RTX Spark wires a Grace CPU and a Blackwell GPU to one 128GB pool over NVLink-C2C, so the GPU skips the PCIe host–device copy.
Microsoft MAI-Code-1-Flash — Adaptive solution-length control — What does it mean?
MAI-Code-1-Flash scales how many reasoning tokens it spends to each task's difficulty — short chains for easy tasks, long only for hard ones.
KVarN squeezes the KV cache to 2 bits — Hadamard rotation — What does it mean?
KVarN rotates outliers out of the KV cache so 2-bit quantization fits every channel — calibration-free, no error compounding across decode.
Crafter paper — Multi-agent refinement harness with a directive critic — What does it mean?
Crafter's directive critic emits per-dimension fixes + typed edits, not a scalar score — a harness that lifts figures 33.73 → 50.34.
WASH attack washes out LLM text watermarks — Watermark removal by model-averaging — What does it mean?
Averaging the output distributions of 3–5 independent LLMs cancels each one's text watermark — detection z-scores fall from 5–300 to below 2.
