AI Explained
Plain explanations of trending AI concepts, with live visualizations.
AdaPlanBench tests agent planning under incremental constraints — Adaptive replanning under hidden constraints — What does it mean?
AdaPlanBench hides each task's rules until a plan violates one — the best of 10 LLMs clears just 67.75%, and user constraints are harder than world constraints.
AutoLab benchmarks frontier agents on long-horizon R&D tasks — Iterative experiment-loop evaluation — What does it mean?
AutoLab grades agents on the propose → run → measure → refine loop; across 17 models, sustained iteration — not the first answer — predicted success.
Token Budgets paper — Affine-typed budget ownership — What does it mean?
Token Budgets models an agent's token cap as a use-at-most-once resource — so a budget overrun fails to compile instead of overspending at runtime.
TELBench localizes where deep-research agents go wrong — Span-level error localization — What does it mean?
TELBench + DRIFT pinpoint which step of a deep-research agent's trajectory made the answer unreliable — span-level error localization, up to +30 pp.
StreamMA — Streaming inter-agent reasoning — What does it mean?
StreamMA streams each reasoning step between agents, so agent B starts on agent A's reliable early steps, not the risky last one. Average accuracy gain: +7.3 pp.
Crafter paper — Multi-agent refinement harness with a directive critic — What does it mean?
Crafter's directive critic emits per-dimension fixes + typed edits, not a scalar score — a harness that lifts figures 33.73 → 50.34.
Harness-1 — State-externalizing search harness — What does it mean?
Harness-1 is a 20B search agent that keeps working memory in an external harness, not a growing transcript — so context stays flat as the search deepens.
GrepSeek trains a search agent to use shell commands — GRPO-trained shell-command search — What does it mean?
GrepSeek trains an agent to search a raw corpus with shell commands, first via Tutor/Planner distillation and then GRPO, for index-free agentic retrieval.
COLLEAGUE.SKILL — Capability vs behavior skill tracks — What does it mean?
COLLEAGUE.SKILL turns one expert trace into a versioned skill package: a capability track (what to do) plus a behavior track (how to do it).
Agent-harness scaling law: feedback quality predicts success, not raw compute — Effective Feedback Compute (EFC) — What does it mean?
Effective Feedback Compute (EFC) predicts agent-harness success from feedback quality rather than raw compute, and does so far more tightly than spend does.
Claude Opus 4.8 — Parallel-subagent dynamic workflows — What does it mean?
Opus 4.8 'dynamic workflows' let Claude Code run parallel subagents, so wall-clock time is set by the slowest subtask, not the sum of all subtasks.
OmniRetrieval — Source-native query dispatch — What does it mean?
OmniRetrieval routes each query to text, tables, or graphs natively — so JOINs and graph edges survive instead of collapsing into one flat vector index.
