AI Explained

Plain explanations of trending AI concepts, with live visualizations.

Agent

AdaPlanBench tests agent planning under incremental constraints — Adaptive replanning under hidden constraints — What does it mean?

AdaPlanBench hides each task's rules until a plan violates one — the best of 10 LLMs clears just 67.75%, and user constraints are harder than world constraints.

Agent

AutoLab benchmarks frontier agents on long-horizon R&D tasks — Iterative experiment-loop evaluation — What does it mean?

AutoLab grades agents on the propose → run → measure → refine loop; across 17 models, sustained iteration — not the first answer — predicted success.

Agent

Token Budgets paper — Affine-typed budget ownership — What does it mean?

Token Budgets models an agent's token cap as a use-at-most-once resource — so a budget overrun fails to compile instead of overspending at runtime.
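
To make "use-at-most-once" concrete, here is a minimal sketch. Python has no affine types, so this is only a runtime analogue: the reuse and the overrun are caught when the code runs, not at compile time as the summary above describes. The `TokenBudget` and `BudgetError` names are hypothetical and do not come from the paper.

```python
class BudgetError(Exception):
    """Raised when a budget is reused or overspent."""


class TokenBudget:
    """Hypothetical token cap that may be consumed at most once."""

    def __init__(self, tokens):
        if tokens < 0:
            raise ValueError("tokens must be non-negative")
        self._tokens = tokens
        self._consumed = False

    @property
    def remaining(self):
        return self._tokens

    def spend(self, cost):
        """Consume this budget and return a new one holding the remainder."""
        if cost < 0:
            raise ValueError("cost must be non-negative")
        if self._consumed:
            # Affine rule: an already-spent budget cannot be used again.
            raise BudgetError("budget already consumed")
        if cost > self._tokens:
            raise BudgetError(f"overrun: need {cost}, have {self._tokens}")
        self._consumed = True
        return TokenBudget(self._tokens - cost)


def run_step(budget, cost):
    """Assumed agent step: takes ownership of a budget, hands back the rest."""
    return budget.spend(cost)
```

Each call hands back a new budget, so the only way to keep spending is to thread the returned value forward. A language with affine or ownership types can reject the reuse statically.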

Agent

TELBench localizes where deep-research agents go wrong — Span-level error localization — What does it mean?

TELBench + DRIFT pinpoint which step of a deep-research agent's trajectory made the answer unreliable — span-level error localization, up to +30 pp.

Agent

StreamMA — Streaming inter-agent reasoning — What does it mean?

StreamMA streams each reasoning step between agents, so B starts on A's reliable early steps, not the risky last one. +7.3 pp average accuracy.

Agent

Crafter paper — Multi-agent refinement harness with a directive critic — What does it mean?

Crafter's directive critic emits per-dimension fixes + typed edits, not a scalar score — a harness that lifts figures 33.73 → 50.34.

Agent

Harness-1 — State-externalizing search harness — What does it mean?

Harness-1 is a 20B search agent that keeps working memory in an external harness, not a growing transcript — so context stays flat as the search deepens.
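
A minimal sketch of the contrast, assuming a hypothetical `SearchHarness` that stores every note outside the prompt and shows the model only a fixed-size window. The class, its fields, and the window size are illustrative and are not Harness-1's actual design.

```python
class SearchHarness:
    """Hypothetical external store for a search agent's working memory."""

    def __init__(self, max_notes=3):
        self.max_notes = max_notes
        self.notes = []  # full history lives here, outside the prompt

    def record(self, note):
        self.notes.append(note)

    def context(self, question):
        # Only the most recent notes are placed back into the prompt.
        recent = self.notes[-self.max_notes:]
        return "\n".join([f"Question: {question}", *recent])


def transcript_context(question, notes):
    """Baseline: every past step is appended to the prompt."""
    return "\n".join([f"Question: {question}", *notes])


def context_sizes(question, steps):
    """Return (harness_sizes, transcript_sizes) in characters per step."""
    harness = SearchHarness()
    harness_sizes, transcript_sizes = [], []
    for i in range(1, steps + 1):
        harness.record(f"step {i:02d}: looked at source {i:02d}")
        harness_sizes.append(len(harness.context(question)))
        transcript_sizes.append(len(transcript_context(question, harness.notes)))
    return harness_sizes, transcript_sizes
```

Once the window is full, the harness prompt stops growing while the transcript prompt keeps growing with every step. A real harness would choose what to surface more carefully than "the last few notes".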

Agent

GrepSeek trains a search agent to use shell commands — GRPO-trained shell-command search — What does it mean?

GrepSeek trains an agent to search a raw corpus with shell commands, first through Tutor/Planner distillation and then with GRPO, for index-free agentic retrieval.

Agent

COLLEAGUE.SKILL — Capability vs behavior skill tracks — What does it mean?

COLLEAGUE.SKILL turns one expert trace into a versioned skill package: a capability track (what to do) plus a behavior track (how to do it).

Agent

Agent-harness scaling law: feedback quality predicts success, not raw compute — Effective Feedback Compute (EFC) — What does it mean?

Effective Feedback Compute (EFC) predicts agent-harness success from feedback quality rather than raw compute, and it tracks outcomes far more tightly than spend does.

Agent

Claude Opus 4.8 — Parallel-subagent dynamic workflows — What does it mean?

Opus 4.8 'dynamic workflows' let Claude Code run parallel subagents, so wall-clock is set by the slowest subtask, not the sum.
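
The "slowest subtask, not the sum" point can be illustrated with a small `asyncio` sketch. The subagents here are stand-ins that only sleep, and the function names are hypothetical; this is not Claude Code's API.

```python
import asyncio
import time


async def run_subagent(name, seconds):
    """Stand-in for a subagent; sleeping simulates its work."""
    await asyncio.sleep(seconds)
    return name


async def run_parallel(tasks):
    """Run all subagents concurrently and return elapsed wall-clock seconds."""
    start = time.perf_counter()
    await asyncio.gather(*(run_subagent(n, s) for n, s in tasks.items()))
    return time.perf_counter() - start


async def run_sequential(tasks):
    """Run subagents one after another and return elapsed wall-clock seconds."""
    start = time.perf_counter()
    for name, seconds in tasks.items():
        await run_subagent(name, seconds)
    return time.perf_counter() - start


if __name__ == "__main__":
    demo = {"search": 0.05, "tests": 0.10, "docs": 0.20}
    print("parallel:  ", round(asyncio.run(run_parallel(demo)), 2))
    print("sequential:", round(asyncio.run(run_sequential(demo)), 2))
```

With these durations the parallel run takes roughly as long as the 0.20 s task, while the sequential run takes roughly the 0.35 s total.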

Agent

OmniRetrieval — Source-native query dispatch — What does it mean?

OmniRetrieval routes each query to text, tables, or graphs natively — so JOINs and graph edges survive instead of collapsing into one flat vector index.