What is the OSWorld2.0 benchmark?

OSWorld2.0 is a benchmark of 108 long-horizon computer-use workflows that run in real operating-system environments. Tasks span everyday and professional software, take a human a median of about 1.6 hours each, and are scored by execution-based checks on the final state of the machine rather than by matching the agent's trajectory. The strongest configuration tested — Claude Opus 4.8 with maximum thinking and batched tool calls — completes only 20.6% of the tasks, and the authors name four dominant failure modes: losing track of constraints, missing information that arrives mid-task, guessing instead of asking, and skipping verification of its own work.

Why do the best computer-use agents only finish about 20.6% of tasks?

Because the tasks are long. A reference run uses roughly 318 tool calls, and an agent has to avoid all four failure modes on essentially every one of them. Even near-perfect per-action discipline compounds away over that many steps: an illustrative 99.5% chance of not slipping per action gives 0.995^318 ≈ 0.20, almost exactly the reported 20.6%. The bottleneck is not skill on any single click — it is sustaining discipline across hundreds of dependent actions, which is why long-horizon completion collapses far below short-task accuracy.

How is OSWorld2.0 different from other computer-use benchmarks like Workflow-GYM?

Both grade long, multi-step computer-use tasks by their end state, but they emphasize different lenses. Workflow-GYM frames the difficulty as per-stage competence compounding across a structured workflow. OSWorld2.0 focuses on the cognitive disciplines an agent must sustain over a long horizon in a live OS, and its contribution is a named taxonomy of four failure modes — losing constraints, missing mid-task information, guessing instead of asking, and skipping self-verification — observed through execution-based scoring rather than trajectory matching.

OSWorld2.0 benchmark: best computer-use agent finishes just 20.6% of tasks — Long-horizon computer-use failure modes

TL;DR

What is it: The OSWorld2.0 benchmark drops computer-use agents — models that drive a real desktop with mouse and keyboard — into 108 long, real-world workflows and scores whether each one actually reaches the right end state. The concept it exposes: the long-horizon failure modes that wreck agents over hundreds of steps.
Why it’s needed: Single demos make these agents look ready; OSWorld2.0 measures the part that decides production value: can the agent stay correct across a task that takes a person a median of ~1.6 hours and an agent roughly 318 tool calls? The answer is a reality check: the very best agent finishes only 20.6%.
vs previous: Short, single-step benchmarks grade one isolated click, so the same agent looks competent. OSWorld2.0 grades the final state of the computer over a long horizon, which surfaces four failures a single click never shows: losing constraints, missing mid-task information, guessing instead of asking, and skipping self-verification.

Jargon

Computer-use agent: An agent that drives ordinary software the way a person does — reading the screen and acting through mouse and keyboard — instead of calling a clean text API. The interface is the GUI, so it has to perceive, click, type, and recover from whatever the app shows.
OSWorld2.0: A benchmark of 108 long-horizon computer-use tasks run in real operating-system environments. A human takes a median of ~1.6 hours per task; the strongest agent tested completes only 20.6%.
Long horizon: A task that spans hundreds of dependent actions rather than one. A reference run uses roughly 318 tool calls — every one a fresh chance to slip.
Execution-based scoring: Grading by inspecting the actual final state of the machine — was the file really saved, the setting really changed — instead of matching the agent's step-by-step trajectory against a reference. It rewards a correct outcome no matter the path.
Constraint tracking: Holding the task's stated rules — the format, the deadline, the "don't overwrite" — live in working memory across the whole run. The first failure mode is letting an early constraint fall out of context.
Self-verification: The agent checking its own work before declaring done — re-reading the result, confirming the change landed. Skipping it is the failure mode that lets a confident-looking run end wrong.

The news. On June 28, 2026, researchers released OSWorld2.0, a benchmark of 108 long-horizon computer-use workflows spanning everyday and professional tasks. Each task runs in a real operating system and is scored by execution-based checks on the final state, not by trajectory matching. A human needs a median of ~1.6 hours per task, while a reference agent run averages ~318 tool calls. The strongest configuration tested (Claude Opus 4.8 with maximum thinking and batched tool calls) completes only 20.6%, and the authors name four dominant failure modes.

Picture the line cook on a Friday-night rush. Any single dish, they can cook flawlessly — that part is never the problem. The trouble is the service: hundreds of tickets over four hours, each one depending on rules set earlier and information arriving late. By midnight the cook has forgotten that table 4 flagged a nut allergy on the first ticket, missed the "no onions" change a server called over the pass, plated "the usual" without checking what the diner actually meant, and slid a plate out without tasting it. None of those are cooking-skill failures. They are the failures of sustaining four small disciplines a few hundred times in a row — and they are exactly what OSWorld2.0 catches in a computer-use agent.

This is why OSWorld2.0 grades the way it does. Instead of scoring the agent's clicks against a "correct" path, it runs the agent in a live OS and checks the final state of the machine: is the document actually formatted, the record actually filed? Grading the outcome rather than the trajectory is what lets the benchmark see the four failure modes the authors name. Finding them is the same discipline as error analysis, where you read real traces and cluster what went wrong before writing any eval (OSWorld2.0, arXiv 2606.29537):

Failure mode	What it looks like on the screen	The discipline it breaks	When in the task it bites
Loses track of constraints	Saves over the original file the task said to preserve	Holding the rules live in working memory	Late — a constraint set at step 1 has fallen out of context
Misses mid-task information	Acts on a stale value after a dialog or new file changed it	Pausing to read what the environment now shows	Whenever the world changes after the plan was made
Guesses instead of asking	Picks one reading of an ambiguous request and commits	Seeking the missing fact instead of inventing one	At any fork the task under-specifies
Skips verification	Declares done without confirming the change actually landed	Re-checking its own work before finishing	At the end — and on every irreversible step
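
To make the grading idea concrete, here is a minimal sketch of an execution-based check, written under stated assumptions: the file names, the "Q3" content test, and the `simulate` stand-in are all hypothetical and are not taken from the OSWorld2.0 harness. It only illustrates the principle that the checker looks at what is on disk at the end and ignores the path the agent took.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256(path: Path) -> str:
    return hashlib.sha256(path.read_bytes()).hexdigest()


def check_final_state(workdir: Path, original_hash: str) -> dict:
    """Inspect only what is on disk at the end; the agent's path is ignored."""
    original = workdir / "report.txt"      # hypothetical file the task says to preserve
    revised = workdir / "report_v2.txt"    # hypothetical file the task asks for
    return {
        "original_preserved": original.exists() and sha256(original) == original_hash,
        "revision_saved": revised.exists() and "Q3" in revised.read_text(),
    }


def simulate(overwrite_original: bool) -> dict:
    """Stand-in for an agent run: writes files directly instead of driving a GUI."""
    with tempfile.TemporaryDirectory() as tmp:
        workdir = Path(tmp)
        (workdir / "report.txt").write_text("Q2 summary")
        baseline = sha256(workdir / "report.txt")
        target = "report.txt" if overwrite_original else "report_v2.txt"
        (workdir / target).write_text("Q3 summary")
        return check_final_state(workdir, baseline)


careful = simulate(overwrite_original=False)
careless = simulate(overwrite_original=True)
print(all(careful.values()), careful)
print(all(careless.values()), careless)
```

The careless run reproduces the first row of the table: it saves over the file it was told to preserve, so both checks fail even though every individual write succeeded.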

What 20.6% actually measures

Here is the trap that makes a long horizon so brutal, and it is not the "a few hard stages multiply" story. In OSWorld2.0 the problem isn't that any single action is hard; it's the sheer number of chances to slip across the whole run. Hold the reference figure fixed at ~318 tool calls and suppose the agent avoids all four failure modes on any single action 99.5% of the time (an illustrative rate, back-solved from the headline score, that treats slips as independent). Sustaining that across the whole task means 0.995³¹⁸ ≈ 0.20, almost exactly the 20.6% OSWorld2.0 reports. Loosen the per-action discipline to 99% and it collapses to 0.99³¹⁸ ≈ 0.04; tighten it to a near-flawless 99.9% and you finally reach 0.999³¹⁸ ≈ 0.73. The gap between today's agents and reliable computer use is not a smarter click; it is a fivefold or greater cut in per-action slips (from 0.5% down to 0.1% or lower), sustained for hundreds of steps. This is the long-horizon cousin of compounding errors: the same multiplication, but driven by horizon length rather than per-stage difficulty.
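
The arithmetic is easy to reproduce. The sketch below assumes, as the illustration does, that every action is an independent chance to slip and that a single slip fails the task; the function names are ours, and only the 318-call and 20.6% figures come from the benchmark.

```python
TOOL_CALLS = 318  # reference run length reported for OSWorld2.0


def task_success(per_action_rate: float, steps: int = TOOL_CALLS) -> float:
    """Chance of finishing with zero slips, assuming independent actions."""
    return per_action_rate ** steps


def required_rate(target_success: float, steps: int = TOOL_CALLS) -> float:
    """Per-action rate needed to reach a target task-level success rate."""
    return target_success ** (1.0 / steps)


for rate in (0.99, 0.995, 0.999):
    print(f"{rate:.3f} per action -> {task_success(rate):.2f} per task")

print(f"back-solved rate for 20.6%: {required_rate(0.206):.4f}")
```

This prints 0.04, 0.20, and 0.73 for the three rates, and a back-solved per-action rate of 0.9950 for the reported 20.6%. Real slips are unlikely to be independent, so treat the model as intuition for the scale of the problem rather than a fit to the data.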

That reframes what would actually move the number. Three of the four failure modes are cheap to fix in principle and expensive to ignore: an agent that pauses to read the screen after each action catches mid-task changes, an agent that asks instead of guessing removes a whole class of wrong forks, and an agent that re-checks its own work converts silent end-state errors into caught ones. None of those require a bigger model — they require spending a few extra actions on discipline the agent currently skips to look fast. They are also precisely what you'd watch for in shadow mode before trusting a computer-use agent with a real workflow.
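
One way to picture that trade is a loop that spends an extra read after every action. The sketch below is hypothetical throughout: `Step`, `run_with_verification`, and the dict-based environment are illustrative names, not part of any agent framework or of OSWorld2.0. Each step carries a postcondition that is read back from the environment, and a step that never verifies stops the run for escalation instead of letting the agent continue on a guess.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Step:
    name: str
    act: Callable[[dict], None]      # performs the action on the environment
    landed: Callable[[dict], bool]   # postcondition read back from the environment


def run_with_verification(env: dict, steps: list[Step], max_retries: int = 1) -> list[str]:
    log = []
    for step in steps:
        for _ in range(max_retries + 1):
            step.act(env)
            if step.landed(env):     # re-read state instead of trusting the action
                log.append(f"{step.name}: verified")
                break
        else:
            # No attempt verified: stop and ask rather than continue on a guess.
            log.append(f"{step.name}: escalate")
            break
    return log


def flaky_save(env: dict) -> None:
    """Toy action that only takes effect on the second call."""
    env["save_calls"] = env.get("save_calls", 0) + 1
    if env["save_calls"] >= 2:
        env["saved"] = True


steps = [
    Step("set_title", lambda e: e.update(title="Q3"), lambda e: e.get("title") == "Q3"),
    Step("save", flaky_save, lambda e: e.get("saved", False)),
]

with_retry = run_with_verification({}, steps, max_retries=1)
no_retry = run_with_verification({}, steps, max_retries=0)
print(with_retry)
print(no_retry)
```

With one retry the flaky save is caught and repeated; with none, the run stops and escalates instead of declaring done. Either way the silent failure becomes a visible one, at the cost of one extra read per action.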


Related explainers

Workflow-GYM — End-to-end GUI workflow completion — the sibling computer-use benchmark, framed as per-stage competence compounding across a workflow
WeaveBench — Trajectory-aware vs outcome-only grading — the opposite grading choice: judge the path, not just the end state
SIMMER — Latent failures in planning — failures that hide inside a plan before execution even starts
FutureSim — Harness-level agent eval vs single-shot QA — why long-horizon evaluation surfaces what a one-shot question can't

