AI Explained
Plain explanations of trending AI concepts, with live visualizations.
Google releases Gemma 4 12B — Encoder-free multimodal projection — What does it mean?
Gemma 4 12B drops the separate vision and audio encoders — image patches and audio go straight into the token stream, no ViT.
FlashMemory cuts DeepSeek-V4's KV cache to 13.5% — Lookahead Sparse Attention — What does it mean?
FlashMemory's Lookahead Sparse Attention trains a small indexer to keep only the KV-cache chunks a token will use — shrinking the physical cache to 13.5%.
Chiaroscuro Attention cuts attention FLOPs 62% — Spectral-entropy token routing — What does it mean?
Chiaroscuro scores each token's spectral entropy and sends most tokens through a cheap frequency-domain mixer, paying full attention only for the few.
SigmaScale learns its SVD scaling matrices — Learned scaling for truncated-SVD compression — What does it mean?
SigmaScale learns two scaling vectors under an activation-aware loss so truncated-SVD throws away less — shrinking LLM weights by rank, not bits.
MiniMax M3 ships open-weight 1M context — MiniMax Sparse Attention (MSA) — What does it mean?
MiniMax M3 (open-weight, 1M context) runs on MiniMax Sparse Attention — block-sparse KV gather that cuts per-token compute ~20× at one million tokens.
EmbedFilter — Unembedding matrix as a feature lens — What does it mean?
LLMs make blurry text embeddings because a subspace in their unembedding matrix injects frequent words — EmbedFilter projects it out in one linear transform.
Code2LoRA gives code models per-repo knowledge — Hypernetwork-generated LoRA adapters — What does it mean?
Code2LoRA's hypernetwork stamps out a repo-specific LoRA adapter in one pass — repo knowledge with zero prompt tokens at inference.
Tangram speeds multi-turn serving up to 2.6× — Per-head KV cache budgets — What does it mean?
Tangram sizes each attention head's KV cache to what it actually keeps, not one uniform budget — lifting multi-turn serving up to 2.6×.
Google ships Gemma 4 QAT checkpoints — Quantization-Aware Training — What does it mean?
Gemma 4 ships QAT 4-bit checkpoints: training on the low-bit grid dodges the accuracy cliff that naive post-training rounding hits.
MatMul-only matrix inversion makes quantized Gated DeltaNet 5x faster — Truncated-Neumann triangular inverse — What does it mean?
Gated DeltaNet's chunk solve hides a sequential matrix inverse — a truncated Neumann series turns it into parallel MatMuls, ~5x faster.
Microsoft MAI-Code-1-Flash — Adaptive solution-length control — What does it mean?
MAI-Code-1-Flash scales how many reasoning tokens it spends to each task's difficulty — short chains for easy tasks, long only for hard ones.
KVarN squeezes the KV cache to 2 bits — Hadamard rotation — What does it mean?
KVarN rotates outliers out of the KV cache so 2-bit quantization fits every channel — calibration-free, no error compounding across decode.







