The news. On June 9, 2026, ByteDance Seed released Workflow-GYM, a benchmark for long-horizon computer-use agents. It drops agents into specialized professional software and asks them to complete extended, economically valuable workflows end to end through the GUI. State-of-the-art models reach a success rate only slightly above 30%, and the paper catalogs a recurring set of failure modes: stage omission, error propagation, objective drift, and weak understanding of professional software.

Picture the temp. It's their first morning, the accounting suite is one they've never opened, and the task is "process this batch of invoices." A demo would stop at the impressive part: look, the temp found the right menu and clicked Save. But the actual job is a chain — open the invoice, match it to a purchase order, enter the figures, route it for approval, then file and mark it paid. The job is done only when the last stage lands correctly, and every stage depends on the one before it.

That chain is the whole point of Workflow-GYM. A single-step benchmark hands the agent one isolated action and grades it against a fixed answer — useful, cheap, reproducible, but blind to anything that only goes wrong across steps. Workflow-GYM instead runs the agent through the entire multi-stage workflow in real software and scores only the end state: is the invoice actually filed and paid, or not? Because it grades the finished artifact, it can finally see the failures that compound over a long horizon.

And those failures have names. The paper's taxonomy (Workflow-GYM, arXiv 2606.11042) maps almost one-for-one onto the temp going wrong:

| Failure mode | What it looks like in the GUI | Caught by single-step grading? | Caught by end-to-end completion? |
| --- | --- | --- | --- |
| Stage omission | Skips a required stage: exports the report but never routes it for approval | No: there is no later step to check the skip against | Yes: the end state is missing the approval, so the run fails |
| Error propagation | A wrong figure in stage two flows into the totals in stages three through five | No: each step is locally plausible | Yes: the final artifact is wrong even though every click "worked" |
| Objective drift | Forgets it was processing invoice #1 and starts reconciling a different account | No: the lone step still completes | Yes: the end state answers the wrong question |
| Weak software understanding | Misreads the app's menus, modes, or affordances: hunts for a feature that isn't there | Partly: a wrong click is visible | Yes: the misunderstanding stalls the whole workflow |
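To make the contrast in the table concrete, here is a minimal toy sketch of the two grading styles. It is not Workflow-GYM's harness: the `Step` and `EndState` records, the stage names, the invoice ID, and the dollar figures are all hypothetical, chosen only to mirror the invoice example. It shows two runs in which every attempted step "worked" and yet the end state is wrong.

```python
from dataclasses import dataclass


@dataclass
class Step:
    stage: str  # hypothetical stage label
    ok: bool    # True if this single action completed without a visible error


@dataclass
class EndState:
    invoice_id: str
    total: float
    approved: bool
    paid: bool


def step_level_score(steps: list[Step]) -> float:
    """Fraction of attempted steps that completed; skipped stages are invisible here."""
    return sum(s.ok for s in steps) / len(steps)


def end_to_end_pass(actual: EndState, expected: EndState) -> bool:
    """Pass only if the finished artifact matches the expected end state exactly."""
    return actual == expected


# Hypothetical target state for the invoice in the running example.
expected = EndState(invoice_id="INV-1", total=1205.0, approved=True, paid=True)

# Stage omission: the approval stage never ran, yet every attempted step "worked".
omission_steps = [Step(s, True) for s in ("open", "match_po", "enter_figures", "file_and_pay")]
omission_state = EndState("INV-1", 1205.0, approved=False, paid=True)

# Error propagation: all five stages ran, but a mistyped figure reached the total.
all_stages = ("open", "match_po", "enter_figures", "route_approval", "file_and_pay")
propagation_steps = [Step(s, True) for s in all_stages]
propagation_state = EndState("INV-1", 1250.0, approved=True, paid=True)

for name, steps, state in [
    ("stage omission", omission_steps, omission_state),
    ("error propagation", propagation_steps, propagation_state),
]:
    print(name, step_level_score(steps), end_to_end_pass(state, expected))
```

Both runs earn a perfect step-level score and both fail the end-state check, which is the gap the table describes. A real harness would inspect application state rather than compare a hand-built record.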

Why "~30%" is the punchline

Here is why end-to-end scoring is so much harsher than it sounds. Suppose an agent nails each individual stage 80% of the time (an illustrative figure), which is genuinely competent on any single click. A five-stage workflow completes only if every stage lands, so if stage failures are independent the success rates multiply: 0.80 × 0.80 × 0.80 × 0.80 × 0.80 ≈ 0.33. The same agent that looks 80% good step-by-step is only 33% good end-to-end, which is why an end-to-end score collapses far below per-step competence, into the range Workflow-GYM reports. Lengthen the workflow and the gap only widens: at ten stages, 0.80¹⁰ ≈ 0.11. Per-step competence multiplies into end-to-end fragility. This is the same compounding-error math that makes long agent rollouts so brittle, and Workflow-GYM grades computer-use agents against it directly.
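The arithmetic above can be checked with a few lines of Python. This is a back-of-the-envelope sketch, not a model of the benchmark: it assumes every stage succeeds independently with the same probability, and the function names are made up for illustration. The second helper inverts the calculation to ask how reliable each stage must be to hit a target end-to-end rate.

```python
def end_to_end_success(per_stage: float, stages: int) -> float:
    """Chance that every stage lands, assuming independent, equally reliable stages."""
    return per_stage ** stages


def required_per_stage(target: float, stages: int) -> float:
    """Per-stage success rate needed to reach a target end-to-end rate."""
    return target ** (1.0 / stages)


# The illustrative 80% per-stage agent from the text, at several workflow lengths.
for n in (1, 5, 10):
    rate = end_to_end_success(0.80, n)
    print(f"{n:>2} stages at 80% per stage -> {rate:.2f} end to end")

# How good must each stage be for a five-stage workflow to finish 90% of the time?
print(f"needed per stage: {required_per_stage(0.90, 5):.3f}")
```

Under these assumptions, a five-stage workflow that should finish 90% of the time needs roughly 98% reliability at every stage, which is a far higher bar than the 80% that looks competent on a single click.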

That reframes what the score is even measuring. A single-step benchmark rewards a model for being a good clicker; Workflow-GYM rewards an agent for being a reliable worker. The four named failure modes are simply the ways the worker falls apart over a long shift — and they are precisely the things you'd watch for in shadow mode before trusting an agent with a real professional workflow, and the ones you'd want a production eval to flag long before a customer does.

