The news. On June 28, 2026, researchers released OSWorld2.0, a benchmark of 108 long-horizon computer-use workflows spanning everyday and professional tasks. Each task runs in a real operating system and is scored by execution-based checks on the final state, not by trajectory matching. A human needs a median of ~1.6 hours per task, and a reference agent run averages ~318 tool calls. The strongest configuration tested (Claude Opus 4.8 with maximum thinking and batched tool calls) completes only 20.6% of tasks, and the authors name four dominant failure modes.
Picture the line cook on a Friday-night rush. Any single dish they can cook flawlessly; that part is never the problem. The trouble is the service: hundreds of tickets over four hours, each one depending on rules set earlier and information arriving late. By midnight the cook has forgotten that table 4 flagged a nut allergy on the first ticket, missed the "no onions" change a server called over the pass, plated "the usual" without checking what the diner actually meant, and slid a plate out without tasting it. None of those are cooking-skill failures. They are failures to sustain four small disciplines a few hundred times in a row, and they are exactly what OSWorld2.0 catches in a computer-use agent.
This is why OSWorld2.0 grades the way it does. Instead of scoring the agent's clicks against a "correct" path, it runs the agent in a live OS and checks the final state of the machine: is the document actually formatted, the record actually filed? Grading the outcome rather than the trajectory is what lets the benchmark surface the four failure modes the authors name. Identifying them is the same discipline you'd practice as error analysis, reading real traces and clustering what went wrong before writing any eval (OSWorld2.0, arXiv 2606.29537):
| Failure mode | What it looks like on the screen | The discipline it breaks | When in the task it bites |
|---|---|---|---|
| Loses track of constraints | Saves over the original file the task said to preserve | Holding the rules live in working memory | Late — a constraint set at step 1 has fallen out of context |
| Misses mid-task information | Acts on a stale value after a dialog or new file changed it | Pausing to read what the environment now shows | Whenever the world changes after the plan was made |
| Guesses instead of asking | Picks one reading of an ambiguous request and commits | Seeking the missing fact instead of inventing one | At any fork the task under-specifies |
| Skips verification | Declares done without confirming the change actually landed | Re-checking its own work before finishing | At the end — and on every irreversible step |
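To make "checks the final state, not the trajectory" concrete, here is a minimal sketch of an execution-based check. It is not the benchmark's actual checker; the task ("write a cleaned copy of a report and leave the original untouched"), the file names, and the function names are all assumptions made for illustration.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256(path: Path) -> str:
    """Hash a file's bytes so a later check can tell whether it was modified."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def check_final_state(workdir: Path, original_hash: str) -> dict[str, bool]:
    """Grade only what is on disk at the end, never the path taken to get there."""
    original = workdir / "report.txt"        # hypothetical file the task says to preserve
    cleaned = workdir / "report_clean.txt"   # hypothetical file the task asks for
    return {
        "original_preserved": original.exists() and sha256(original) == original_hash,
        "output_exists": cleaned.exists(),
        "output_correct": cleaned.exists() and cleaned.read_text() == "hello world\n",
    }


def run_demo(overwrite_original: bool) -> dict[str, bool]:
    """Simulate one agent run in a scratch directory, then grade the end state."""
    with tempfile.TemporaryDirectory() as tmp:
        workdir = Path(tmp)
        original = workdir / "report.txt"
        original.write_text("  hello   world \n")
        original_hash = sha256(original)

        # Stand-in for the agent: it either writes a new file as instructed,
        # or saves over the original (the "loses track of constraints" failure).
        target = original if overwrite_original else workdir / "report_clean.txt"
        target.write_text("hello world\n")

        return check_final_state(workdir, original_hash)
```

In this sketch, `run_demo(overwrite_original=True)` fails the check even though the agent produced perfectly cleaned text, because the constraint on the original file was violated. A trajectory matcher comparing clicks would have no direct way to notice that.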
What 20.6% actually measures
Here is the trap that makes a long horizon so brutal, and it is not the "a few hard stages multiply" story. In OSWorld2.0 no single action is especially hard; the danger is the sheer number of chances to slip across the whole run. Hold the reference figure fixed at ~318 tool calls and suppose the agent avoids all four failure modes on any single action 99.5% of the time (an illustrative rate, back-solved from the reported score, and assuming slips are independent and never recovered). Sustaining that across the whole task gives 0.995³¹⁸ ≈ 0.20, almost exactly the 20.6% OSWorld2.0 reports. Loosen the per-action discipline to 99% and run success collapses to 0.99³¹⁸ ≈ 0.04; tighten it to a near-flawless 99.9% and you finally reach 0.999³¹⁸ ≈ 0.73. The gap between today's agents and reliable computer use is not a smarter click. It is a fivefold cut in per-action slips (0.5% to 0.1%) just to finish roughly three runs in four, and a tenfold cut (to 0.05%) to reach 0.9995³¹⁸ ≈ 0.85, sustained for hundreds of steps. This is the long-horizon cousin of compounding errors: the same multiplication, but driven by horizon length rather than per-stage difficulty.
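The arithmetic above can be reproduced in a few lines. This is a sketch of the same illustrative model, not anything from the paper: it assumes every action slips independently with the same probability and that no slip is ever recovered. The function names are made up for this example.

```python
def run_success(per_action_rate: float, n_actions: int) -> float:
    """Probability that all n actions are slip-free, assuming independent slips."""
    return per_action_rate ** n_actions


def required_rate(target_success: float, n_actions: int) -> float:
    """Back-solve the per-action rate needed to hit a whole-run success target."""
    return target_success ** (1.0 / n_actions)


N_CALLS = 318  # average tool calls in a reference run, as quoted in the article

# Sweep the per-action rates discussed in the text.
for rate in (0.99, 0.995, 0.999, 0.9995):
    print(f"{rate:.4f} per action -> {run_success(rate, N_CALLS):.2f} per run")

# The per-action rate that the reported 20.6% would imply under this model.
implied = required_rate(0.206, N_CALLS)
print(f"implied per-action rate for 20.6%: {implied:.4f}")
```

Running the sweep shows how steep the curve is near 1.0: small changes in the third decimal place of the per-action rate move whole-run success by tens of percentage points.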
That reframes what would actually move the number. Three of the four failure modes are cheap to fix in principle and expensive to ignore: an agent that pauses to read the screen after each action catches mid-task changes, an agent that asks instead of guessing removes a whole class of wrong forks, and an agent that re-checks its own work converts silent end-state errors into caught ones. None of these requires a bigger model; each requires spending a few extra actions on discipline the agent currently skips to look fast. They are also precisely what you'd watch for in shadow mode before trusting a computer-use agent with a real workflow.
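The three cheap disciplines can be sketched against a toy environment. Everything here is hypothetical: `ToyDesktop`, its methods, the ledger task, and the `ask` callback are invented for illustration and do not correspond to OSWorld2.0's harness or to any real agent API. The toy is rigged so that the form value changes mid-task and the first save silently fails to land.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class ToyDesktop:
    """Hypothetical stand-in for the OS: a form value, two ledgers, a flaky save."""
    form_total: int = 100
    ledgers: dict = field(default_factory=lambda: {"2025": None, "2026": None})
    actions: int = 0   # every call below counts as one tool call
    writes: int = 0

    def read_form(self) -> int:
        self.actions += 1
        return self.form_total

    def open_ledger_app(self) -> None:
        self.actions += 1
        self.form_total = 120  # a dialog revises the total mid-task

    def write_ledger(self, year: str, value: int) -> None:
        self.actions += 1
        self.writes += 1
        if self.writes > 1:  # the first save silently fails to land
            self.ledgers[year] = value

    def read_ledger(self, year: str):
        self.actions += 1
        return self.ledgers[year]


def hasty_agent(env: ToyDesktop) -> None:
    """Reads once at plan time, guesses the year, and never checks the result."""
    total = env.read_form()
    env.open_ledger_app()
    env.write_ledger("2025", total)


def disciplined_agent(env: ToyDesktop, ask: Callable[[str], str]) -> bool:
    """Spends extra actions on asking, re-reading, and verifying."""
    env.open_ledger_app()
    year = ask("Which ledger year?")         # ask instead of guessing
    total = env.read_form()                  # re-read after the environment changed
    for _ in range(3):                       # bounded retries
        env.write_ledger(year, total)
        if env.read_ledger(year) == total:   # verify the change actually landed
            return True
    return False
```

In this toy, the hasty agent finishes in three actions and leaves both ledgers empty, while the disciplined agent takes six and ends with the revised total in the ledger the user named. The extra actions are the cost the paragraph above describes.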
Related explainers
- Workflow-GYM — End-to-end GUI workflow completion — the sibling computer-use benchmark, framed as per-stage competence compounding across a workflow
- WeaveBench — Trajectory-aware vs outcome-only grading — the opposite grading choice: judge the path, not just the end state
- SIMMER — Latent failures in planning — failures that hide inside a plan before execution even starts
- FutureSim — Harness-level agent eval vs single-shot QA — why long-horizon evaluation surfaces what a one-shot question can't