AI Explained

Plain explanations of trending AI concepts, with live visualizations.

Agent2026-06-30

OSWorld2.0 benchmark: best computer-use agent finishes just 20.6% of tasks — Long-horizon computer-use failure modes — What does it mean?

OSWorld2.0 runs computer-use agents through 108 long real-world tasks. The best finishes just 20.6%, undone by four failure modes.

Agent2026-06-30

Agents-A1 matches trillion-param agents at 35B — Scaling the horizon, not the parameters — What does it mean?

Agents-A1, a 35B agent, matches trillion-param systems by training on long ~45K-token task runs — scaling the horizon, not the parameter count.

Agent2026-06-30

Agents struggle to know when to stop — Agentic abstention — What does it mean?

Agentic abstention is knowing the right moment for an agent to stop acting under uncertainty — a skill agents mis-time, and one CONVOLVE adds without retraining.

Agent2026-06-29

Ornith-1.0 ships open MIT-licensed coding models — Self-scaffolding RL — What does it mean?

Ornith-1.0's coding models learn to write their own RL training scaffold per task, then solve against it — with safety pushed outside the model.

Agent2026-06-27

NatureBench: coding agents beat Nature-paper SOTA on just 17.8% of tasks — Discovery vs reproduction agent benchmarking — What does it mean?

NatureBench scores coding agents on beating published SOTA from 90 Nature-paper tasks, not reproducing it — the best agent wins on only 17.8%.

Agent2026-06-27

AOHP runs agents as OS actors on Android: +21% tasks, -52% tokens — Agents as first-class OS actors — What does it mean?

AOHP makes AI agents privileged OS-level actors on Android — acting across apps through machine-friendly interfaces — lifting task completion 21% and cutting tokens 52%.

Agent2026-06-26

The Verification Horizon: verifying coding agents is harder than coding — Co-evolving verifiers — What does it mean?

For coding agents, checking a solution is now harder than writing one — a fixed reward saturates and gets gamed, so the verifier must co-evolve.

Agent2026-06-26

Qwen-AgentWorld trains a language model as a world model for RL agents — World model as a decoupled RL simulator — What does it mean?

Qwen-AgentWorld trains a language model to predict an environment's next state, so it can stand in as a fast, parallel simulator for training RL agents.

Agent2026-06-26

OPID extracts step- and episode-level skills to guide agentic RL training — On-Policy Skill Distillation — What does it mean?

OPID mines skills from an agent's own completed runs — episode strategy + step-level moves — to densify sparse agentic-RL rewards.

Agent2026-06-26

OpenThoughts-Agent open-sources a 100K-example agent training recipe — Task-source diversity in agent SFT — What does it mean?

OpenThoughts-Agent's open 100K-example agent-SFT recipe shows task-source diversity, not raw data volume, is what makes agents broadly capable.

Agent2026-06-26

Routing, voting, and mixture-of-agents hit a shared ceiling across 67 models — Co-failure ceiling — What does it mean?

When every model misses the same question, no router, vote, or mixture-of-agents can recover it — and that shared-failure rate is ~2.5× higher than correlation predicts.

Agent2026-06-24

Agentic CLEAR grades agents at three zoom levels (IBM) — System/trace/node eval granularity — What does it mean?

Agentic CLEAR reads traces your agent already logs and writes an LLM-judge diagnosis at three zoom levels: node, trace, and system.