The news. On June 28, 2026, researchers released OSWorld2.0, a benchmark of 108 long-horizon computer-use workflows spanning everyday and professional tasks. Each task runs in a real operating system and is scored by execution-based checks on the final state, not by trajectory matching. A human needs a median of ~1.6 hours per task, while a reference agent run averages ~318 tool calls. The strongest configuration tested (Claude Opus 4.8 with maximum thinking and batched tool calls) completes only 20.6% of tasks, and the authors name four dominant failure modes.

Picture the line cook on a Friday-night rush. Any single dish, they can cook flawlessly — that part is never the problem. The trouble is the service: hundreds of tickets over four hours, each one depending on rules set earlier and information arriving late. By midnight the cook has forgotten that table 4 flagged a nut allergy on the first ticket, missed the "no onions" change a server called over the pass, plated "the usual" without checking what the diner actually meant, and slid a plate out without tasting it. None of those are cooking-skill failures. They are the failures of sustaining four small disciplines a few hundred times in a row — and they are exactly what OSWorld2.0 catches in a computer-use agent.

This is why OSWorld2.0 grades the way it does. Instead of scoring the agent's clicks against a "correct" path, it runs the agent in a live OS and checks the final state of the machine: is the document actually formatted, the record actually filed? Grading the outcome rather than the trajectory is what exposes the four failure modes the authors name (OSWorld2.0, arXiv 2606.29537). Finding them takes the same discipline as error analysis: reading real traces and clustering what went wrong before writing any eval. The four modes:

| Failure mode | What it looks like on the screen | The discipline it breaks | When in the task it bites |
| --- | --- | --- | --- |
| Loses track of constraints | Saves over the original file the task said to preserve | Holding the rules live in working memory | Late: a constraint set at step 1 has fallen out of context |
| Misses mid-task information | Acts on a stale value after a dialog or new file changed it | Pausing to read what the environment now shows | Whenever the world changes after the plan was made |
| Guesses instead of asking | Picks one reading of an ambiguous request and commits | Seeking the missing fact instead of inventing one | At any fork the task under-specifies |
| Skips verification | Declares done without confirming the change actually landed | Re-checking its own work before finishing | At the end, and on every irreversible step |
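
To make outcome-based grading concrete, here is a minimal sketch of a final-state check for the first failure mode above (saving over a file the task said to preserve). It is not OSWorld2.0's actual checker: the file names, the `check_final_state` helper, and the "APPROVED" marker are all hypothetical, chosen only for illustration.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the hex SHA-256 digest of a file's bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def check_final_state(original: Path, original_digest: str,
                      output: Path, required_text: str) -> dict:
    """Grade the end state only; how the agent got there is never inspected."""
    return {
        # Task constraint: the original file must be left untouched.
        "original_preserved": original.exists()
        and sha256_of(original) == original_digest,
        # The requested result must actually exist on disk...
        "output_exists": output.exists(),
        # ...and contain the change the task asked for.
        "output_correct": output.exists() and required_text in output.read_text(),
    }


def demo():
    """Run a careful agent and a careless one against the same hypothetical task."""
    results = []
    for overwrite_original in (False, True):
        with tempfile.TemporaryDirectory() as tmp:
            original = Path(tmp) / "report.txt"
            output = Path(tmp) / "report_final.txt"
            original.write_text("draft")
            digest = sha256_of(original)  # snapshot taken before the agent runs

            # Stand-in for the agent's run; only its side effects matter.
            output.write_text("draft\nAPPROVED")
            if overwrite_original:
                original.write_text("draft\nAPPROVED")

            results.append(check_final_state(original, digest, output, "APPROVED"))
    return results


good, bad = demo()
print(good)
print(bad)
```

The careless run produces a correct-looking output file and still fails, because the check also looks at the file the task said to leave alone. A trajectory matcher comparing clicks against a reference path would have no direct way to express that constraint.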

What 20.6% actually measures

Here is what makes a long horizon so brutal, and it is not the "a few hard stages multiply" story. In OSWorld2.0 no single action is hard; the danger is the sheer number of chances to slip across the whole run. Hold the reference figure fixed at ~318 tool calls, treat each action as an independent chance to slip, and suppose the agent avoids all four failure modes on any single action 99.5% of the time (illustrative: a per-action rate back-solved from the headline score, not a measured one). Sustaining that across the whole task gives 0.995³¹⁸ ≈ 0.20, almost exactly the 20.6% OSWorld2.0 reports. Loosen the per-action discipline to 99% and success collapses to 0.99³¹⁸ ≈ 0.04; tighten it to a near-flawless 99.9% and you still reach only 0.999³¹⁸ ≈ 0.73. The gap between today's agents and reliable computer use is not a smarter click. It is a fivefold cut in per-action slips (from 0.5% to 0.1%) just to reach roughly 73%, sustained for hundreds of steps. This is the long-horizon cousin of compounding errors: the same multiplication, but driven by horizon length rather than per-stage difficulty.
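
The arithmetic above is easy to reproduce. The sketch below assumes, as the paragraph does, that every action is an independent chance to slip and that a single slip fails the task; the function names are illustrative, and the per-action rates are the article's illustrative values, not measurements.

```python
import math

REFERENCE_TOOL_CALLS = 318  # average reference-run length quoted in the article


def task_success(per_action_rate: float, n_actions: int = REFERENCE_TOOL_CALLS) -> float:
    """Probability of zero slips across n independent actions."""
    return per_action_rate ** n_actions


def required_per_action_rate(target: float, n_actions: int = REFERENCE_TOOL_CALLS) -> float:
    """Invert the formula: the per-action rate needed to hit a target task success."""
    return math.exp(math.log(target) / n_actions)


for rate in (0.99, 0.995, 0.999):
    print(f"{rate:.3f} per action -> {task_success(rate):.2f} per task")

# Back-solve the per-action rate implied by the reported 20.6% score.
print(f"rate implied by 20.6%: {required_per_action_rate(0.206):.4f}")
```

Running the inverse on 20.6% returns a per-action rate of about 0.9950, which is where the illustrative 99.5% figure comes from.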

That reframes what would actually move the number. Three of the four failure modes are cheap to fix in principle and expensive to ignore: an agent that pauses to read the screen after each action catches mid-task changes, an agent that asks instead of guessing removes a whole class of wrong forks, and an agent that re-checks its own work converts silent end-state errors into caught ones. None of those require a bigger model — they require spending a few extra actions on discipline the agent currently skips to look fast. They are also precisely what you'd watch for in shadow mode before trusting a computer-use agent with a real workflow.
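
As a toy illustration of the verification point, the sketch below runs the same actions with and without a re-check after each step. Everything in it is hypothetical: the dictionary "environment", the `make_flaky_apply` helper that silently drops writes, and the retry limit are assumptions made for the example, not part of OSWorld2.0 or of any agent framework.

```python
from typing import Callable, Dict, List, Tuple

State = Dict[str, str]
Action = Tuple[str, str]  # (key to set, value to set it to)


def make_flaky_apply(drop_first_n: int) -> Callable[[State, Action], None]:
    """Hypothetical environment: silently ignores the first `drop_first_n` writes."""
    remaining = {"drops": drop_first_n}

    def apply(state: State, action: Action) -> None:
        if remaining["drops"] > 0:
            remaining["drops"] -= 1
            return  # the write is lost, and nothing signals the failure
        key, value = action
        state[key] = value

    return apply


def run(actions: List[Action], apply: Callable[[State, Action], None],
        verify: bool, max_retries: int = 2) -> State:
    """Apply each action; optionally re-read the state and retry if it did not land."""
    state: State = {}
    for action in actions:
        apply(state, action)
        if not verify:
            continue
        key, value = action
        retries = 0
        # Spend extra actions confirming the change before moving on.
        while state.get(key) != value and retries < max_retries:
            apply(state, action)
            retries += 1
    return state


ACTIONS = [("title", "Q3 report"), ("status", "filed")]
fast = run(ACTIONS, make_flaky_apply(1), verify=False)
careful = run(ACTIONS, make_flaky_apply(1), verify=True)
print(fast)     # the dropped first write is never noticed
print(careful)  # the re-check catches it and retries
```

The careful run costs one extra action and ends in the intended state; the fast run finishes sooner with a silent end-state error, which is exactly what an execution-based check would mark as a failure.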

