OpenComputer — Verifier-grounded benchmark synthesis
The news. On May 19, 2026, researchers from Yale NLP, UPenn, and UNC posted OpenComputer, a framework that treats environment construction as a constrained synthesis problem and releases a benchmark of 1,000 verifier-grounded tasks across 33 desktop applications. The reported headline numbers are GPT-5.4 at 68.3% success, Claude-Sonnet-4.6 at 64.4%, and Kimi-K2.6 at 58.8%. Two open-source computer-use agents, GUI-OWL-1.5-8B and EvoCUA-8B, reportedly drop from 52.3% and 46.1% on OSWorld to 5.7% and 10.9% on OpenComputer respectively, the kind of collapse that usually signals the older benchmark was easier than its scores suggested.
Picture a country trying to standardise its driver's license road test. The cheap option is to put a human examiner in the passenger seat and have them eyeball whether the student "drove well." It's quick to set up and impossible to defend in court. The expensive option is to wire every test car with sensors — steering angle, lane position, brake force, stop-line crossings — and write a small program that decides "pass" from those signals. Before the test goes nationwide, the examiners run a calibration day where they take a known-good driver and a known-bad driver around the course and tune the sensor thresholds until the program agrees with their judgment. From that point on, the road test grades itself. OpenComputer is the same bet for computer-use agents — build the sensors first, calibrate them, then write the maneuvers to fit.
The reason the bet matters technically is that desktop-agent rewards are otherwise extraordinarily slippery. A screenshot judge can be persuaded by a partially-correct UI state; an LLM grader can be prompt-injected by the very document the agent edits; a hand-rolled checker that compares the file path against a string can be defeated by a one-character change. Every one of these failure modes shows up as inflated scores on benchmarks where the verifier is the weak link. The paper's response is to make the verifier the load-bearing artifact and treat the task as derivative — if a task can't be grounded in the verifier stack, it doesn't enter the benchmark.
The synthesis pipeline: verifier generation, verifier-aware task synthesis, triple storage
OpenComputer's framework has several integrated components — the headline ones reported in the abstract are app-specific state verifiers, a verification layer the paper says self-evolves on calibration runs, a task-generation pipeline, and an evaluation harness. The three-stage synthesis loop below is the part that produces the released benchmark; it sits underneath the verification and harness layers, which keep grading runs after the benchmark ships.
Stage one is verifier generation. For each desktop application (Microsoft Office, Chrome, VS Code, GIMP, and 29 others), the pipeline builds executable checkers that read inspectable state. A "rename a file to invoice_2024" check reads the directory listing; a "set the slide background to blue" check parses the .pptx XML; an "add this contact to the address book" check queries the SQLite database the app uses. Each verifier is then refined on a small calibration set of ~15 easy-to-medium tasks: a strong agent runs them, and the verifier's false positives and false negatives are corrected until the calibration runs grade cleanly.
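A minimal sketch of what two such state-grounded checkers and the calibration comparison could look like. This is not the paper's code: the function names, the `contacts(name)` table schema, and the human-label dictionary are all assumptions made for illustration.

```python
import sqlite3
from contextlib import closing
from pathlib import Path


def check_file_renamed(directory: Path, expected_stem: str) -> bool:
    """Pass only if a file with the expected name (any extension) is in the listing."""
    return any(p.is_file() and p.stem == expected_stem for p in directory.iterdir())


def check_contact_exists(db_path: Path, name: str) -> bool:
    """Pass only if the app's database has a matching row.

    Assumes a hypothetical schema: a table `contacts` with a `name` column.
    """
    with closing(sqlite3.connect(db_path)) as conn:
        row = conn.execute(
            "SELECT 1 FROM contacts WHERE name = ? LIMIT 1", (name,)
        ).fetchone()
    return row is not None


def calibration_errors(
    verdicts: dict[str, bool], labels: dict[str, bool]
) -> dict[str, list[str]]:
    """List calibration tasks where the verifier disagrees with a trusted label."""
    return {
        "false_positives": sorted(t for t, v in verdicts.items() if v and not labels[t]),
        "false_negatives": sorted(t for t, v in verdicts.items() if not v and labels[t]),
    }
```

Under these assumptions, a verifier would be considered calibrated once both lists come back empty on the calibration set.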
Stage two is verifier-aware task synthesis. A generator proposes realistic multi-step user goals, explicitly decoupled from verifier endpoints so the model doesn't just emit easy tasks the verifier already grades. The proposed goals are filtered for complexity and for groundability: a goal is accepted only if its success can be expressed in terms of the verifier stack, with new endpoints added when needed. To stay reproducible, the paper notes, the benchmark deliberately excludes any task that needs visual judgment.
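The filtering step can be sketched as follows. Everything here is hypothetical: the `ProposedGoal` fields, the `min_steps` complexity threshold, and the idea of an `addable` set standing in for "endpoints that could be implemented on demand" are illustrative assumptions, not the paper's interface.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProposedGoal:
    instruction: str
    steps: int                        # rough count of user-level steps
    required_checks: frozenset[str]   # state predicates needed to grade success
    needs_visual_judgment: bool = False


def filter_goals(goals, endpoints: set[str], addable: set[str], min_steps: int = 3):
    """Keep goals that are complex enough and gradable from inspectable state.

    `endpoints` is updated in place: checks that are missing but addable are
    registered, mirroring "new endpoints added when needed".
    """
    accepted = []
    for goal in goals:
        if goal.needs_visual_judgment or goal.steps < min_steps:
            continue
        missing = goal.required_checks - endpoints
        if missing - addable:         # at least one check cannot be grounded
            continue
        endpoints |= missing
        accepted.append(goal)
    return accepted
```

The point of the sketch is the ordering: goals are proposed first and grounded second, so the verifier stack grows to fit realistic goals instead of the goals shrinking to fit the existing checks.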
Stage three stores each accepted task as a triple: ⟨instruction, sandbox-init, executable success criteria⟩. The instruction is what the agent sees; the sandbox-init is the deterministic starting state of the desktop; the success criteria are the verifier checks the task is graded against — emitting a deterministic pass/fail or an auditable partial-credit score grounded in real application state. The triple format is what makes the benchmark composable: adding a new task means writing one more triple, not extending a grader.
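One way to picture the triple is as a small record whose third field is a tuple of executable checks. The sketch below is an assumption about shape only: the class name, the dictionary-based state snapshot, and the equal-weight partial-credit rule are invented for illustration and are not taken from the paper.

```python
from collections.abc import Callable, Mapping
from dataclasses import dataclass

State = Mapping[str, object]           # hypothetical snapshot of inspectable state
Check = Callable[[State], bool]


@dataclass(frozen=True)
class TaskTriple:
    instruction: str                     # what the agent sees
    sandbox_init: Mapping[str, object]   # deterministic starting state
    success_criteria: tuple[Check, ...]  # executable checks over the final state

    def grade(self, final_state: State) -> tuple[bool, float]:
        """Return (pass/fail, fraction of checks satisfied)."""
        results = [check(final_state) for check in self.success_criteria]
        passed = bool(results) and all(results)
        score = sum(results) / len(results) if results else 0.0
        return passed, score


task = TaskTriple(
    instruction="Set the slide background to blue.",
    sandbox_init={"background": "white", "slides": 1},
    success_criteria=(
        lambda s: s.get("background") == "blue",
        lambda s: s.get("slides") == 1,
    ),
)
print(task.grade({"background": "blue", "slides": 1}))   # (True, 1.0)
print(task.grade({"background": "white", "slides": 1}))  # (False, 0.5)
```

Adding a task in this shape means constructing one more `TaskTriple`; the `grade` method itself never changes.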
Why open-source agents collapse on OpenComputer
The headline drama is the cross-benchmark generalization gap. GUI-OWL-1.5-8B reportedly drops from 52.3% on OSWorld to 5.7% on OpenComputer; EvoCUA-8B falls from 46.1% to 10.9%. That's a roughly 4–9× collapse, larger than what a uniformly-harder benchmark would explain on its own. The likeliest read is that earlier benchmarks left enough wiggle room (through screenshot judges, vibes-based partial credit, or under-specified hand-rolled checkers) that an agent could pattern-match its way to credit on tasks it didn't really finish. A verifier grounded in inspectable state grants credit only where state matches the success criteria, and the gap reveals the residual.
| Agent | OSWorld score | OpenComputer score | Drop |
|---|---|---|---|
| GPT-5.4 (closed) | not directly comparable (OpenComputer §6) | 68.3% (§6) | — |
| Claude-Sonnet-4.6 (closed) | not directly comparable (OpenComputer §6) | 64.4% (§6) | — |
| GUI-OWL-1.5-8B (open) | ~52.3%, reportedly (§6, reproduced from authors) | 5.7% (§6) | ~9× |
| EvoCUA-8B (open) | ~46.1%, reportedly (§6, reproduced from authors) | 10.9% (§6) | ~4× |
The two open-source agents collapse 4–9× while the frontier models still post nontrivial scores on the strict eval, which is suggestive: the gap looks less like "OpenComputer is uniformly harder for everyone" and more like "open-source agents memorise the OSWorld eval surface more than they generalise." Even the frontier scores aren't a victory lap, since the best reported score of 68.3% still means failing end-to-end on roughly a third of tasks, but the relative drop is what's diagnostic.
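The drop factors in the table follow directly from the reported scores. A short check of the arithmetic, using only the numbers quoted above (the `drop_factor` helper is just a name chosen here):

```python
# (OSWorld %, OpenComputer %) as quoted in the table above
scores = {
    "GUI-OWL-1.5-8B": (52.3, 5.7),
    "EvoCUA-8B": (46.1, 10.9),
}


def drop_factor(before: float, after: float) -> float:
    """How many times lower the new score is than the old one."""
    return before / after


for agent, (osworld, opencomputer) in scores.items():
    factor = drop_factor(osworld, opencomputer)
    points = osworld - opencomputer
    print(f"{agent}: {factor:.1f}x lower, {points:.1f} points lost")
# GUI-OWL-1.5-8B: 9.2x lower, 46.6 points lost
# EvoCUA-8B: 4.2x lower, 35.2 points lost
```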
A worked example with named numbers
Consider one task triple. (The per-step decomposition is illustrative; the headline scores come straight from the paper's §6.) The instruction reads "Save the open report as invoice_2024 in /reports." The sandbox-init places a PDF at /tmp/report.pdf and ensures the /reports directory exists but is empty. The verifier runs three state-grounded checks: (1) a file named invoice_2024.pdf now exists in /reports; (2) its byte content matches the PDF originally at /tmp/report.pdf; (3) no extraneous files were created. A correct agent passes all three; a hallucinating agent that saved to /Users/Desktop/invoice.pdf fails check (1) outright. For the open-source agent that scored ~52.3% on OSWorld, the OSWorld verifier might have accepted a "close enough" path on a similar task; OpenComputer's state-grounded checks don't. Multiply that single-task discipline across 1,000 tasks, and the average reportedly drops from ~52.3% to 5.7%: most of the agent's earlier credit came from checks of a kind that OpenComputer's verifier does not extend to off-target outcomes.
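The three checks can be sketched against a throwaway directory tree. This is an illustration of the idea, not the benchmark's verifier: the function names are made up, the sandbox is a temp directory standing in for /tmp and /reports, and "no extraneous files" is narrowed to "nothing else inside the reports directory" as a simplifying assumption.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256(path: Path) -> str:
    return hashlib.sha256(path.read_bytes()).hexdigest()


def verify_save_as(reports: Path, original_digest: str) -> dict[str, bool]:
    """Run the three state-grounded checks from the worked example."""
    target = reports / "invoice_2024.pdf"
    exists = target.is_file()
    extras = [p for p in reports.iterdir() if p != target]
    return {
        "file_exists": exists,
        "content_matches": exists and sha256(target) == original_digest,
        "no_extra_files": not extras,   # assumption: only /reports is inspected
    }


def run_demo() -> tuple[dict[str, bool], dict[str, bool]]:
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        source = root / "tmp" / "report.pdf"
        reports = root / "reports"
        source.parent.mkdir()
        reports.mkdir()
        source.write_bytes(b"%PDF-1.4 placeholder bytes")
        digest = sha256(source)

        # Off-target agent: wrong name, wrong directory.
        shutil.copy(source, root / "invoice.pdf")
        off_target = verify_save_as(reports, digest)

        # Correct agent: right name, right directory, identical bytes.
        shutil.copy(source, reports / "invoice_2024.pdf")
        correct = verify_save_as(reports, digest)
    return off_target, correct


off_target, correct = run_demo()
print(off_target)  # file_exists and content_matches are False
print(correct)     # all three checks are True
```

A checker that only compared a reported path string against an expected string could be satisfied without the file ever landing on disk; reading the directory and hashing the bytes closes that gap.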
The converse applies to the frontier models. Their actions more often land on the right path, the right file, the right metadata, so a stricter verifier costs them less of their score. But even GPT-5.4 reaches only 68.3%, meaning that under the strict harness roughly a third of its tasks are not graded as successes. "Resilient relative to open-source" is not the same as "approaching saturation."
What this changes for agent evaluation pipelines
For teams running their own desktop-agent evals, the load-bearing decision is which layer of the agent stack the verifier reads. Anything graded over the rendered UI — screenshots, accessibility trees, OCR — inherits the slipperiness OpenComputer exposes. Anything graded over inspectable state — files, configs, database rows — admits the kind of state-grounded verifier that turns benchmark numbers back into evidence. The same logic applies to production evals at scale: if the production grader is an LLM judge over outputs, the A/B harness inherits the judge's noise floor, and the team needs to either invest in deterministic post-hoc checks or accept that small wins won't be statistically distinguishable.
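The noise-floor point can be made concrete with a toy model. The sketch below assumes the judge flips each verdict independently with a fixed probability, which is a simplification chosen for illustration (real judge errors are rarely symmetric or independent); the pass rates and sample size are made-up inputs, not measurements.

```python
import math


def observed_rate(true_rate: float, flip_prob: float) -> float:
    """Pass rate reported by a judge that flips each verdict with prob. flip_prob."""
    return true_rate * (1 - flip_prob) + (1 - true_rate) * flip_prob


def z_score(rate_a: float, rate_b: float, n_per_arm: int) -> float:
    """Two-proportion z statistic for equal-sized, independent arms."""
    pooled = (rate_a + rate_b) / 2
    se = math.sqrt(2 * pooled * (1 - pooled) / n_per_arm)
    return (rate_b - rate_a) / se


true_a, true_b, n = 0.60, 0.63, 1000   # hypothetical A/B arms
for flip in (0.0, 0.10, 0.20):
    a, b = observed_rate(true_a, flip), observed_rate(true_b, flip)
    print(f"flip={flip:.2f}  observed gap={b - a:.3f}  z={z_score(a, b, n):.2f}")
```

Under this model the observed gap between arms shrinks by a factor of (1 - 2 × flip_prob), so a noisier judge needs proportionally more runs to separate the same true improvement from chance.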
The trade-off is that verifier-first pipelines are expensive to bootstrap. Each of OpenComputer's 33 applications required custom executable checkers; the calibration runs require a strong agent in the loop; the verifier-aware task synthesizer is itself a generator that must be tuned. The paper's bet is that this upfront cost amortises across thousands of graded runs, which it almost certainly does for a benchmark that aspires to last several model generations. For a one-off internal eval, screenshot judges may still be the right call — just don't compare those numbers against OpenComputer's.
Related explainers
- EnvFactory — Synthetic envs for tool-use agent training — the training-side counterpart: EnvFactory builds verified environments for training, OpenComputer builds verifier-grounded tasks for evaluation
- RLVR — Reinforcement Learning with Verifiable Rewards — the upstream training paradigm that depends on deterministic verifiers like the ones OpenComputer formalises
- FutureSim — harness-level eval — a complementary eval pattern that grades multi-step traces; OpenComputer grades final state, FutureSim grades the path that got there
- MSR delegation — cascading fidelity loss — what happens when long-horizon agent loops have no in-loop verifier to anchor each step