AI Explained

Plain explanations of trending AI concepts, with live visualizations.

S-Agent turns a VLM into a planner that directs tools to build one shared 3-D model of a scene, letting an 8B agent rival GPT-5.4.

Multi-LCB ports each LiveCodeBench task into 12 languages with one judge, so a score drop measures pure cross-language generalization.

GateMem benchmarks shared agent memory on utility, access control, and reliable forgetting at once — and finds no method passes all three.

ContextRL rewards a model for picking which of two near-identical contexts supports the answer — sharpening fine-grained evidence grounding.

FAPO has Claude Code diagnose where an LLM pipeline fails, then make scoped prompt or chain edits — beating GEPA on 15 of 18 benchmarks.

AtomMem distills an agent's long history into atomic facts, files them by event and time, and links them in an associative graph for retrieval.

LedgerAgent tracks an agent's task state in a separate ledger and checks domain policy against it before any irreversible tool call.
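As a rough illustration of the idea in this teaser, the sketch below keeps task state in a ledger object and consults a policy before an irreversible tool runs. It is not LedgerAgent's actual code: `Ledger`, `refund_policy`, `IRREVERSIBLE`, `call_tool`, and the refund scenario are all hypothetical names and assumptions chosen for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Ledger:
    """Explicit task state kept outside the model's context (hypothetical fields)."""
    facts: dict = field(default_factory=dict)


def refund_policy(ledger: Ledger, args: dict) -> bool:
    # Hypothetical domain rule: refund only for a verified user,
    # and never more than the recorded order total.
    verified = ledger.facts.get("user_verified") is True
    within_total = args["amount"] <= ledger.facts.get("order_total", 0)
    return verified and within_total


# Tools considered irreversible, each mapped to the policy that gates it.
IRREVERSIBLE = {"issue_refund": refund_policy}


def call_tool(name: str, args: dict, ledger: Ledger, tools: dict) -> dict:
    """Check the ledger against policy before running an irreversible tool."""
    policy = IRREVERSIBLE.get(name)
    if policy is not None and not policy(ledger, args):
        return {"status": "blocked", "reason": f"policy check failed for {name}"}
    return {"status": "ok", "result": tools[name](**args)}


# Usage: the same call is blocked until the ledger records verification.
tools = {"issue_refund": lambda amount: f"refunded {amount}"}
ledger = Ledger()
ledger.facts["order_total"] = 50
blocked = call_tool("issue_refund", {"amount": 20}, ledger, tools)
ledger.facts["user_verified"] = True
allowed = call_tool("issue_refund", {"amount": 20}, ledger, tools)
```

The design point is that the gate reads the ledger rather than the conversation, so the decision does not depend on what the model happens to remember.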

IBM: agent leaderboards rank models by one aggregate score that fails under distribution shift — measure predictive validity instead.

PreAct compiles an agent's successful run into a replayable program it re-runs with no per-step model call — 8.5–13× faster on repeats.
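To make the replay idea concrete, here is a minimal record-and-replay sketch. It is a plain trace replay, not PreAct's compiler, and it assumes the repeated task needs exactly the same tool calls. `record_run`, `replay`, `fake_model`, and the toy tools are hypothetical.

```python
def record_run(choose_action, tools, task, max_steps=10):
    """Run with a model in the loop, logging each (tool, args) it picks."""
    trace, obs = [], task
    for _ in range(max_steps):
        action = choose_action(obs)  # stands in for one model call per step
        if action is None:           # the model signals that it is done
            break
        name, args = action
        obs = tools[name](**args)
        trace.append((name, args))
    return trace, obs


def replay(trace, tools):
    """Re-run the logged tool calls in order, with no model call at any step."""
    obs = None
    for name, args in trace:
        obs = tools[name](**args)
    return obs


# Usage: a scripted stand-in for the model, with a counter for its calls.
calls = {"model": 0}


def fake_model(obs):
    calls["model"] += 1
    script = {"start": ("fetch", {"key": "a"}), "A": ("shout", {"text": "A"})}
    return script.get(obs)


tools = {"fetch": lambda key: key.upper(), "shout": lambda text: text + "!"}
trace, first_result = record_run(fake_model, tools, "start")
calls_after_recording = calls["model"]
second_result = replay(trace, tools)
```

In this toy setup the replay reproduces the first result while the model-call counter stays where it was after recording.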

FastContext trains a separate read-only explorer subagent that finds code and returns citations, cutting a coding agent's tokens up to 60%.

A latent failure is a plan that runs to the end without erroring and still silently fails the goal — SIMMER finds up to 56% of LLM plans hide one.
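The definition above can be shown with a toy example: a plan whose every step runs cleanly but whose final state misses the goal. This only illustrates the definition and is not how SIMMER detects such plans. The `classify` helper, the mug scenario, and the step names are hypothetical.

```python
def classify(plan, state, goal):
    """Label a plan run as an overt failure, a latent failure, or a success."""
    try:
        for step in plan:
            step(state)
    except Exception:
        return "overt failure"  # the run itself raised an error
    # The run finished cleanly; only a goal check can expose a latent failure.
    return "success" if goal(state) else "latent failure"


# Hypothetical goal: the mug ends up on the shelf.
def mug_on_shelf(state):
    return state["mug"] == "shelf"


def pick_up_mug(state):
    state["mug"] = "hand"


def put_on_table(state):
    state["mug"] = "table"


def put_on_shelf(state):
    state["mug"] = "shelf"


def drop_missing_item(state):
    # Reading a key that does not exist raises KeyError.
    state["plate"] = state["tray"]


bad_plan = [pick_up_mug, put_on_table]   # runs fine, wrong outcome
good_plan = [pick_up_mug, put_on_shelf]
broken_plan = [drop_missing_item]        # errors out visibly

bad_label = classify(bad_plan, {"mug": "counter"}, mug_on_shelf)
good_label = classify(good_plan, {"mug": "counter"}, mug_on_shelf)
broken_label = classify(broken_plan, {"mug": "counter"}, mug_on_shelf)
```

Only `bad_plan` is the interesting case: nothing raises, so error-based monitoring would report it as a successful run.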

CacheRL replaces live tool execution during RL rollouts with a three-tier fuzzy cache — 92% process accuracy vs GPT-5's 94% at ~100× less compute.