AI Explained
Plain explanations of trending AI concepts, with live visualizations.
HarnessBridge — Learned agent harness vs hand-engineered — What does it mean?
HarnessBridge replaces the hand-built agent harness with a learnable module — two projections that distill state and vet each action.
EvoArena + EvoMem — Patch-based agent memory — What does it mean?
EvoMem keeps agent memory as a changelog of structured patches — what changed and when — so the agent can reason about how its world evolved.
NVIDIA Blackwell leads AgentPerf, the first agentic-AI infra benchmark — Trajectory-replay benchmarking — What does it mean?
AgentPerf grades serving systems by replaying real multi-step agent runs, not single prompts — and Blackwell's GB300 leads on agents per megawatt.
WeaveBench: best computer-use agent clears just 41% — Trajectory-aware vs outcome-only grading — What does it mean?
WeaveBench finds the best computer-use agent clears 41.2% — and a trajectory-aware judge shows outcome-only grading flatters the rest.
SpatialClaw lifts agent spatial reasoning to 59.9% — Code-as-action vs structured tool-calls — What does it mean?
SpatialClaw makes a VLM agent's actions executable Python cells on a stateful kernel — observe-then-act beats rigid tool-calls, +11.2 pts to 59.9%.
A survey of agent-environment engineering — Symbolic vs neural environment synthesis — What does it mean?
A survey reframes building an agent's training world as engineering — and its sharpest split is hand-coded vs model-generated environments.
Workflow-GYM scores computer-use agents at ~30% on pro tasks — End-to-end GUI workflow completion — What does it mean?
Workflow-GYM drops computer-use agents into real pro software and grades the whole multi-stage job end to end; state-of-the-art agents clear only ~30%.
Role-Agent paper — One LLM as agent and environment — What does it mean?
Role-Agent trains an agent by making one LLM play both the agent and the world it acts in — no external environment, no separate reward model.
SearchSwarm hits SOTA on BrowseComp with a 30B agent — Distilling delegation into the weights — What does it mean?
SearchSwarm bakes task decomposition and subagent delegation into a 30B model's weights via SFT — not prompts — and tops BrowseComp.
Anthropic's Claude Fable 5 & Mythos 5 — Safety-routing fallback classifiers — What does it mean?
Fable 5 ships a frontier model to everyone by routing under 5% of sensitive requests to a more conservative model instead of weakening it.
Self-evolving agents collapse over iterations — Continual experience internalization — What does it mean?
Self-evolving agents can degrade as they learn from their own runs. Three design choices decide whether they keep improving or collapse.
MLEvolve: self-evolving agents beat AlphaEvolve — Progressive Monte Carlo Graph Search — What does it mean?
Progressive Monte Carlo Graph Search lets MLEvolve share discoveries across branches — SOTA on MLE-Bench in half the usual budget.