AI Explained

Plain explanations of trending AI concepts, with live visualizations.

Agent

HarnessBridge — Learned agent harness vs hand-engineered — What does it mean?

HarnessBridge replaces the hand-built agent harness with a learnable module — two projections that distill state and vet each action.

Agent

EvoArena + EvoMem — Patch-based agent memory — What does it mean?

EvoMem keeps agent memory as a changelog of structured patches — what changed and when — so the agent can reason about how its world evolved.

Agent

NVIDIA Blackwell leads AgentPerf, the first agentic-AI infra benchmark — Trajectory-replay benchmarking — What does it mean?

AgentPerf grades serving systems by replaying real multi-step agent runs, not single prompts — and Blackwell's GB300 leads on agents per megawatt.

Agent

WeaveBench: best computer-use agent clears just 41% — Trajectory-aware vs outcome-only grading — What does it mean?

WeaveBench finds the best computer-use agent clears 41.2% — and a trajectory-aware judge shows outcome-only grading flatters the rest.

Agent

SpatialClaw lifts agent spatial reasoning to 59.9% — Code-as-action vs structured tool-calls — What does it mean?

SpatialClaw turns a VLM agent's actions into executable Python cells on a stateful kernel. Observe-then-act beats rigid tool-calls, with a gain of 11.2 points to 59.9%.

Agent

A survey of agent-environment engineering — Symbolic vs neural environment synthesis — What does it mean?

A survey reframes building an agent's training world as engineering — and its sharpest split is hand-coded vs model-generated environments.

Agent

Workflow-GYM scores computer-use agents at ~30% on pro tasks — End-to-end GUI workflow completion — What does it mean?

Workflow-GYM drops computer-use agents into real pro software and grades the whole multi-stage job end to end — SOTA clears only ~30%.

Agent

Role-Agent paper — One LLM as agent and environment — What does it mean?

Role-Agent trains an agent by making one LLM play both the agent and the world it acts in — no external environment, no separate reward model.

Agent

SearchSwarm hits SOTA on BrowseComp with a 30B agent — Distilling delegation into the weights — What does it mean?

SearchSwarm bakes task decomposition and subagent delegation into a 30B model's weights via SFT — not prompts — and tops BrowseComp.

Agent

Anthropic's Claude Fable 5 & Mythos 5 — Safety-routing fallback classifiers — What does it mean?

Fable 5 ships a frontier model to everyone by routing under 5% of sensitive requests to a more conservative model instead of weakening the frontier model itself.

Agent

Self-evolving agents collapse over iterations — Continual experience internalization — What does it mean?

Self-evolving agents can degrade as they learn from their own runs. Three design choices decide whether they keep improving or collapse.

Agent

MLEvolve: self-evolving agents beat AlphaEvolve — Progressive Monte Carlo Graph Search — What does it mean?

Progressive Monte Carlo Graph Search lets MLEvolve share discoveries across branches, reaching SOTA on MLE-Bench with half the usual budget.