AI Explained

Plain explanations of trending AI concepts, with live visualizations.

LLM

Google releases Gemma 4 12B — Encoder-free multimodal projection — What does it mean?

Gemma 4 12B drops the separate vision and audio encoders — image patches and audio go straight into the token stream, no ViT.

LLM

FlashMemory cuts DeepSeek-V4's KV cache to 13.5% — Lookahead Sparse Attention — What does it mean?

FlashMemory's Lookahead Sparse Attention trains a small indexer to keep only the KV-cache chunks a token will use, shrinking the physical cache to 13.5% of its original size.

LLM

Chiaroscuro Attention cuts attention FLOPs 62% — Spectral-entropy token routing — What does it mean?

Chiaroscuro scores each token's spectral entropy and sends most tokens through a cheap frequency-domain mixer, reserving full attention for the few tokens that need it.

LLM

SigmaScale learns its SVD scaling matrices — Learned scaling for truncated-SVD compression — What does it mean?

SigmaScale learns two scaling vectors under an activation-aware loss so truncated-SVD throws away less — shrinking LLM weights by rank, not bits.

LLM

MiniMax M3 ships open-weight 1M context — MiniMax Sparse Attention (MSA) — What does it mean?

MiniMax M3 (open-weight, 1M context) runs on MiniMax Sparse Attention — block-sparse KV gather that cuts per-token compute ~20× at one million tokens.

LLM

EmbedFilter — Unembedding matrix as a feature lens — What does it mean?

LLMs make blurry text embeddings because a subspace in their unembedding matrix injects frequent words — EmbedFilter projects it out in one linear transform.

LLM

Code2LoRA gives code models per-repo knowledge — Hypernetwork-generated LoRA adapters — What does it mean?

Code2LoRA's hypernetwork stamps out a repo-specific LoRA adapter in one pass — repo knowledge with zero prompt tokens at inference.

LLM

Tangram speeds multi-turn serving up to 2.6× — Per-head KV cache budgets — What does it mean?

Tangram sizes each attention head's KV cache to what it actually keeps, not one uniform budget — lifting multi-turn serving up to 2.6×.

LLM

Google ships Gemma 4 QAT checkpoints — Quantization-Aware Training — What does it mean?

Gemma 4 ships QAT 4-bit checkpoints: training on the low-bit grid dodges the accuracy cliff that naive post-training rounding hits.

LLM

MatMul-only matrix inversion makes quantized Gated DeltaNet 5× faster — Truncated-Neumann triangular inverse — What does it mean?

Gated DeltaNet's chunk solve hides a sequential matrix inverse; a truncated Neumann series turns it into parallel MatMuls, about 5× faster.
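As a rough illustration of the Neumann-series idea (a toy sketch, not the article's kernel), the snippet below inverts a matrix of the form I + N, where N is strictly lower triangular, using only matrix multiplies. The matrix values and the helper names `matmul`, `neumann_inverse`, and `max_error` are made up for this example. Because N is nilpotent, the series (I + N)^-1 = I - N + N^2 - ... is exact once it has as many terms as the matrix has rows, and truncating earlier gives an approximation.

```python
def matmul(a, b):
    """Plain dense matrix product for lists of lists."""
    inner, cols = len(b), len(b[0])
    return [[sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for row in a]


def identity(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


def neumann_inverse(strict_lower, terms):
    """Approximate (I + N)^-1 as the sum of (-N)^k for k = 0 .. terms-1."""
    n = len(strict_lower)
    neg = [[-x for x in row] for row in strict_lower]
    result = identity(n)
    power = identity(n)
    for _ in range(1, terms):
        power = matmul(power, neg)  # next power of -N, one MatMul per term
        result = [[r + p for r, p in zip(r_row, p_row)]
                  for r_row, p_row in zip(result, power)]
    return result


# Toy strictly lower-triangular matrix (illustrative values only).
N = [[0.0,   0.0,  0.0, 0.0],
     [0.5,   0.0,  0.0, 0.0],
     [0.25,  0.5,  0.0, 0.0],
     [0.125, 0.25, 0.5, 0.0]]
I_PLUS_N = [[N[i][j] + (1.0 if i == j else 0.0) for j in range(4)]
            for i in range(4)]


def max_error(terms):
    """Largest entry of |(I + N) @ approx_inverse - I|."""
    prod = matmul(I_PLUS_N, neumann_inverse(N, terms))
    eye = identity(4)
    return max(abs(prod[i][j] - eye[i][j]) for i in range(4) for j in range(4))


print(max_error(2))  # truncated early: nonzero residual
print(max_error(4))  # full series for a 4x4 nilpotent N: residual vanishes
```

A row-by-row forward substitution would compute the same inverse, but each row depends on the previous ones. The series trades that dependency chain for a handful of independent matrix products, which is the property that maps well to parallel hardware.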

LLM

Microsoft MAI-Code-1-Flash — Adaptive solution-length control — What does it mean?

MAI-Code-1-Flash scales how many reasoning tokens it spends to each task's difficulty — short chains for easy tasks, long only for hard ones.

LLM

KVarN squeezes the KV cache to 2 bits — Hadamard rotation — What does it mean?

KVarN rotates outliers out of the KV cache so 2-bit quantization fits every channel — calibration-free, no error compounding across decode.
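To make the rotation idea concrete, here is a minimal sketch assuming a 4-element vector with one outlier channel and a toy symmetric 2-bit quantizer. The vector, the quantizer, and the helper names (`hadamard`, `rotate`, `quantize_2bit`, `l2_error`) are illustrative assumptions, not KVarN's actual scheme. An orthonormal Hadamard rotation spreads the outlier's energy across all channels, so the four quantization levels cover the values more evenly.

```python
import math


def hadamard(n):
    """Orthonormal Hadamard matrix (Sylvester construction); n is a power of two."""
    h = [[1.0]]
    while len(h) < n:
        h = [row + row for row in h] + [row + [-x for x in row] for row in h]
    scale = 1.0 / math.sqrt(n)
    return [[x * scale for x in row] for row in h]


def rotate(h, v):
    """Matrix-vector product h @ v."""
    return [sum(a * b for a, b in zip(row, v)) for row in h]


def quantize_2bit(v):
    """Toy symmetric quantizer: snap each value to one of 4 evenly spaced levels."""
    m = max(abs(x) for x in v)
    if m == 0.0:
        return [0.0 for _ in v]
    levels = [m * (2 * k - 3) / 3 for k in range(4)]  # -m, -m/3, m/3, m
    return [min(levels, key=lambda lv: abs(lv - x)) for x in v]


def l2_error(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


H = hadamard(4)
x = [8.0, 1.0, -1.0, 0.5]  # one outlier channel dominates the range

# Quantize directly: the outlier sets the scale, small channels are crushed.
plain_err = l2_error(x, quantize_2bit(x))

# Rotate, quantize, rotate back. The Sylvester Hadamard matrix is symmetric
# and orthonormal, so it is its own inverse.
rotated = rotate(H, x)
restored = rotate(H, quantize_2bit(rotated))
rot_err = l2_error(x, restored)

print(max(abs(v) for v in x), max(abs(v) for v in rotated))
print(plain_err, rot_err)
```

Because the rotation is orthonormal, the quantization error has the same norm before and after rotating back, so any gain comes purely from the flatter value distribution in the rotated space.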