What is the Workflow-GYM benchmark?

Workflow-GYM, released by ByteDance Seed, is a benchmark for long-horizon computer-use agents. It drops agents into real professional software environments and asks them to complete extended, multi-stage workflows end to end through the GUI, scoring only whether the final state is correct. The headline result is that state-of-the-art models succeed on only slightly above 30% of the workflows, and the paper catalogs recurring failure modes: stage omission, error propagation, objective drift, and weak understanding of professional software.

How is end-to-end workflow completion different from a single-step benchmark?

A single-step benchmark hands the agent one isolated GUI action — click a button, fill a field — and grades it against a fixed answer. It is cheap and reproducible but cannot see anything that only goes wrong across multiple dependent steps. End-to-end completion runs the agent through the whole multi-stage workflow and grades only the end state, which surfaces long-horizon failures — skipping a stage, propagating an early error, or drifting from the original goal — that single-step grading is structurally blind to.

Why do frontier agents only reach about 30% on Workflow-GYM?

Because per-step failure rates compound. If the stages succeed or fail independently, an agent that completes each individual stage 80% of the time finishes a five-stage workflow end to end only about 33% of the time (0.8^5 ≈ 0.33), since every stage has to land for the final artifact to be correct. Add the named failure modes (stage omission, error propagation, objective drift) and a model that looks reliable click-by-click becomes fragile over a full job. Workflow-GYM grades computer-use agents directly against that compounding.

Workflow-GYM scores computer-use agents at ~30% on pro tasks — End-to-end GUI workflow completion

TL;DR

What is it: The Workflow-GYM benchmark from ByteDance Seed drops computer-use agents into real professional software and asks them to finish long, economically valuable jobs end to end through the GUI — and it grades only one thing: did the whole multi-stage workflow reach the correct final state.
Why it’s needed: Most agent demos show off a single slick action — open a menu, click a button, fill a field. Real office work is a chain of dependent stages, and the score that matters in production is whether the finished artifact is correct, not whether any one step looked right.
vs previous: A single-step benchmark grades one isolated GUI action against a fixed answer, so it can't see failures that only appear over a long horizon. Workflow-GYM grades the end state of the entire workflow, which exposes stage omission, error propagation, and objective drift — and drops state-of-the-art models to only ~30%.

Jargon

Computer-use agent: An agent that operates ordinary software the way a person does — reading the screen and driving it with mouse and keyboard — instead of calling a clean text API. The interface is the GUI, so the agent has to perceive, click, type, and recover from whatever the app actually shows.
Workflow-GYM: The benchmark introduced by ByteDance Seed. It instruments real professional software environments and scores whether an agent can autonomously drive a multi-stage workflow to a correct end state. Headline result: state-of-the-art models succeed only slightly above 30%.
End-to-end completion: Grading the final state of the whole job, not the individual steps. A run only counts if every required stage happens, in order, with correct hand-offs — so a workflow that is 90% right but botches the last stage scores the same as one that never started.
Stage omission: A long-horizon failure mode where the agent skips a required stage — e.g. exporting a report without the approval step. Single-step grading can't catch it because there is no later step to check against.
Error propagation: An early mistake that corrupts everything downstream: a wrong value entered in stage two quietly poisons stages three, four, and five. The agent often looks confident the whole way because each step is locally plausible.
Objective drift: The agent loses the original goal mid-run — it starts processing invoice #1, gets distracted by a dialog, and ends up reconciling a different account. The longer the workflow, the more room there is to drift.

The news. On June 9, 2026, ByteDance Seed released Workflow-GYM, a benchmark for long-horizon computer-use agents. It drops agents into specialized professional software and asks them to complete extended, economically valuable workflows end to end through the GUI. State-of-the-art models succeed only slightly above 30%, and the paper catalogs a recurring set of failure modes: stage omission, error propagation, objective drift, and weak understanding of professional software.

Picture a temp. It's their first morning, the accounting suite is one they've never opened, and the task is "process this batch of invoices." A demo would stop at the impressive part: look, the temp found the right menu and clicked Save. But the actual job is a chain: open the invoice, match it to a purchase order, enter the figures, route it for approval, then file and mark it paid. The job is done only when the last stage lands correctly, and every stage depends on the one before it.

That chain is the whole point of Workflow-GYM. A single-step benchmark hands the agent one isolated action and grades it against a fixed answer — useful, cheap, reproducible, but blind to anything that only goes wrong across steps. Workflow-GYM instead runs the agent through the entire multi-stage workflow in real software and scores only the end state: is the invoice actually filed and paid, or not? Because it grades the finished artifact, it can finally see the failures that compound over a long horizon.

And those failures have names. The paper's taxonomy (Workflow-GYM, arXiv 2606.11042) maps almost one-for-one onto the temp going wrong:

Failure mode	What it looks like in the GUI	Caught by single-step grading?	Caught by end-to-end completion?
Stage omission	Skips a required stage — exports the report but never routes it for approval	No — there is no later step to check the skip against	Yes — the end state is missing the approval, so the run fails
Error propagation	A wrong figure in stage two flows into the totals in stages three–five	No — each step is locally plausible	Yes — the final artifact is wrong even though every click "worked"
Objective drift	Forgets it was processing invoice #1 and starts reconciling a different account	No — the lone step still completes	Yes — the end state answers the wrong question
Weak software understanding	Misreads the app's menus, modes, or affordances — hunts for a feature that isn't there	Partly — a wrong click is visible	Yes — the misunderstanding stalls the whole workflow
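To make the first three rows of the table concrete, here is a minimal Python sketch of the two grading styles applied to the invoice example. Everything in it is an illustrative assumption: the stage names, the `Step` record, and the `per_step_score` and `end_state_correct` functions are hypothetical and are not taken from the Workflow-GYM paper or its harness. The fourth row, weak software understanding, is not modeled.

```python
from typing import NamedTuple

# Hypothetical stages for the invoice example; not the benchmark's actual schema.
REQUIRED_STAGES = ("open", "match_po", "enter_figures", "approve", "file_paid")
TARGET_INVOICE = "INV-1"
TARGET_AMOUNT = 120.0


class Step(NamedTuple):
    stage: str
    invoice_id: str = TARGET_INVOICE
    amount: float = TARGET_AMOUNT
    action_ok: bool = True  # the individual GUI action succeeded in isolation


def per_step_score(steps):
    """Single-step view: fraction of actions that succeeded on their own."""
    return sum(s.action_ok for s in steps) / len(steps)


def end_state_correct(steps):
    """End-to-end view: every required stage, in order, on the right invoice and amount."""
    if tuple(s.stage for s in steps) != REQUIRED_STAGES:
        return False  # a stage was skipped, repeated, or reordered
    if any(s.invoice_id != TARGET_INVOICE for s in steps):
        return False  # the run drifted to a different invoice
    return all(s.amount == TARGET_AMOUNT for s in steps)  # a wrong figure carried forward


clean = [Step(stage) for stage in REQUIRED_STAGES]
runs = {
    "clean": clean,
    # Skips the approval stage entirely.
    "stage_omission": [s for s in clean if s.stage != "approve"],
    # A wrong amount entered at "enter_figures" flows into every later stage.
    "error_propagation": [
        s._replace(amount=210.0) if i >= 2 else s for i, s in enumerate(clean)
    ],
    # Switches to a different invoice from the approval stage onward.
    "objective_drift": [
        s._replace(invoice_id="INV-2") if i >= 3 else s for i, s in enumerate(clean)
    ],
}

for name, steps in runs.items():
    verdict = "pass" if end_state_correct(steps) else "fail"
    print(f"{name:18s} per-step={per_step_score(steps):.2f} end-state={verdict}")
```

In this toy setup every run scores perfectly on the per-step measure, because each individual action succeeded, yet only the clean run passes the end-state check.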

Why "~30%" is the punchline

Here is why end-to-end scoring is so much harsher than it sounds. As an illustration, suppose an agent nails each individual stage 80% of the time, which is genuinely competent on any single click, and that stages succeed or fail independently. A five-stage workflow only completes if every stage lands: 0.80 × 0.80 × 0.80 × 0.80 × 0.80 ≈ 0.33. The same agent that looks 80% good step-by-step is only 33% good end-to-end, which is why an end-to-end score collapses far below per-step competence, into the range Workflow-GYM reports. Lengthen the workflow and the gap only widens: at ten stages, 0.80¹⁰ ≈ 0.11. Per-step competence multiplies into end-to-end fragility. This is the same compounding-error math that makes long agent rollouts so brittle, and Workflow-GYM grades computer-use agents against it directly.
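The arithmetic above can be reproduced in a few lines of Python. This is a sketch under the same illustrative assumption as the text: every stage has the same success rate and stages succeed or fail independently. The function names are hypothetical, and the 80% figure is an example, not a measured per-stage rate from the benchmark.

```python
def end_to_end_rate(per_stage_rate: float, n_stages: int) -> float:
    """Probability that all n_stages succeed, assuming independent, equal-rate stages."""
    return per_stage_rate ** n_stages


def required_per_stage_rate(target_rate: float, n_stages: int) -> float:
    """Per-stage rate needed to hit target_rate end to end (inverse of the above)."""
    return target_rate ** (1.0 / n_stages)


# How the illustrative 80% per-stage agent fares as workflows get longer.
for n in (1, 5, 10, 20):
    print(f"{n:2d} stages: {end_to_end_rate(0.80, n):.3f}")

# Per-stage reliability needed for a 90% end-to-end rate on five stages.
print(f"needed per-stage rate: {required_per_stage_rate(0.90, 5):.3f}")
```

Running the inverse shows the other side of the same math: under these assumptions, reaching 90% end to end on a five-stage workflow would take roughly 98% reliability on every stage.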

That reframes what the score is even measuring. A single-step benchmark rewards a model for being a good clicker; Workflow-GYM rewards an agent for being a reliable worker. The four named failure modes are simply the ways the worker falls apart over a long shift. They are precisely what you'd watch for in shadow mode before trusting an agent with a real professional workflow, and what you'd want a production eval to flag long before a customer does.


Related explainers

FutureSim — Harness-level agent eval vs single-shot QA — the same "score the whole run, not one answer" idea, applied to news forecasting instead of GUI work
AutoLab — Iterative experiment-loop evaluation — long-horizon agent eval for R&D tasks rather than office software
PushBench — Quantitative Goal Persistence (QGP) — isolates the "objective drift" failure mode as its own metric
AdaPlanBench — Adaptive replanning under hidden constraints — what an agent has to do well to survive a multi-stage workflow

