AI Explained
Plain explanations of trending AI concepts, with live visualizations.
Attention Amnesia: CoT fine-tuning wrecks long-range recall in hybrid LLMs — Training-free QK-Restore — What does it mean?
Fine-tuning a hybrid LLM to reason silently breaks long-range recall; QK-Restore rolls back just the Q/K projections to recover it — no retraining.
Latent Context LMs compress prompts 16x — Encoder-decoder prompt compression — What does it mean?
Latent Context LMs use a small encoder to squeeze a long prompt into a 16x shorter sequence of latent embeddings a decoder reads as tokens.
Google releases Gemma 4 12B — Encoder-free multimodal projection — What does it mean?
Gemma 4 12B drops the separate vision and audio encoders — image patches and audio go straight into the token stream, no ViT.
FlashMemory cuts DeepSeek-V4's KV cache to 13.5% — Lookahead Sparse Attention — What does it mean?
FlashMemory's Lookahead Sparse Attention trains a small indexer to keep only the KV-cache chunks a token will use — shrinking the physical cache to 13.5%.
Chiaroscuro Attention cuts attention FLOPs 62% — Spectral-entropy token routing — What does it mean?
Chiaroscuro scores each token's spectral entropy and sends most tokens through a cheap frequency-domain mixer, paying full attention only for the few.
SigmaScale learns its SVD scaling matrices — Learned scaling for truncated-SVD compression — What does it mean?
SigmaScale learns two scaling vectors under an activation-aware loss so truncated-SVD throws away less — shrinking LLM weights by rank, not bits.
MiniMax M3 ships open-weight 1M context — MiniMax Sparse Attention (MSA) — What does it mean?
MiniMax M3 (open-weight, 1M context) runs on MiniMax Sparse Attention — block-sparse KV gather that cuts per-token compute ~20× at one million tokens.
EmbedFilter — Unembedding matrix as a feature lens — What does it mean?
LLMs make blurry text embeddings because a subspace in their unembedding matrix injects frequent words — EmbedFilter projects it out in one linear transform.
Self-evolving agents collapse over iterations — Continual experience internalization — What does it mean?
Self-evolving agents can degrade as they learn from their own runs. Three design choices decide whether they keep improving or collapse.
MLEvolve: self-evolving agents beat AlphaEvolve — Progressive Monte Carlo Graph Search — What does it mean?
Progressive Monte Carlo Graph Search lets MLEvolve share discoveries across branches — SOTA on MLE-Bench in half the usual budget.
Code2LoRA gives code models per-repo knowledge — Hypernetwork-generated LoRA adapters — What does it mean?
Code2LoRA's hypernetwork stamps out a repo-specific LoRA adapter in one pass — repo knowledge with zero prompt tokens at inference.
AdaPlanBench tests agent planning under incremental constraints — Adaptive replanning under hidden constraints — What does it mean?
AdaPlanBench hides each task's rules until a plan violates one — the best of 10 LLMs clears just 67.75%, and user constraints are harder than world constraints.











