EnvFactory — Synthetic envs for tool-use agent training
The news. On May 19, 2026, researchers from HKUST and Huawei posted EnvFactory, a two-stage pipeline that autonomously synthesizes 85 stateful tool environments. It explores real online resources, recursively resolves logical dependencies among the tools, builds executable interfaces over a stateful database, and verifies each environment for correctness. A topology-aware sampler then produces 2,575 multi-turn trajectories, and a calibrated-refinement step rewrites them into natural human-like queries. Training Qwen3 backbones with an SFT cold start on 1,622 trajectories followed by GRPO RL on 953 trajectories lifts BFCL v3 multi-turn by up to 15 percentage points and MCP-Atlas by up to 8.6, with roughly 5× fewer environments than the concurrent baselines EnvScaler and AWM.
Picture a culinary school choosing how to train its cooks. The obvious option is to rent time at twenty-five real restaurants and have students rotate through them — lots of variety, but every kitchen is set up slightly differently, half of them are missing one critical appliance, and nobody can guarantee that the recipes practiced there actually finish on time. The less obvious option is to build five test kitchens from scratch, each one fully stocked, fully working, and verified before a single student walks in. Every recipe practiced in those kitchens is one the kitchen genuinely supports, and every cooking session ends with a measurable result — the dish either came out or it didn't. The paper's claim is that for tool-use training, the five-test-kitchen recipe wins.
The technical reason matches the metaphor. A tool-use agent learns by rolling out inside an environment and getting graded on whether the call sequence achieves the user's goal. If the environment isn't stateful — if "opening a file" doesn't actually change what the next "list directory" call returns — then the rollout produces no signal about whether the agent's choices composed correctly. And if the environment isn't verified — if its declared tools don't actually run, or if their dependencies aren't enforced — then the reward function grades on noise. The paper's contribution is to treat the environment itself as a first-class artifact to construct, verify, and version.
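To make "stateful" and "verified" concrete, here is a minimal sketch of a toy environment in which one tool call changes what the next call returns, plus a binary verifier over the final state. The `FileEnv` class, its tool names, and the `verify` function are hypothetical illustrations, not EnvFactory's actual interfaces.

```python
class FileEnv:
    """Toy stateful environment: tool calls mutate shared in-memory state."""

    def __init__(self):
        self.files = set()

    def create_file(self, name: str) -> dict:
        # Enforce a simple precondition: a file cannot be created twice.
        if name in self.files:
            return {"ok": False, "error": "exists"}
        self.files.add(name)
        return {"ok": True}

    def list_directory(self) -> dict:
        # The result depends on every earlier create_file call.
        return {"ok": True, "files": sorted(self.files)}


def verify(env: FileEnv, expected_files: set) -> bool:
    """Binary reward: does the final state match the goal state?"""
    return env.files == expected_files


env = FileEnv()
before = env.list_directory()["files"]
env.create_file("notes.txt")
after = env.list_directory()["files"]
reward = verify(env, {"notes.txt"})
duplicate = env.create_file("notes.txt")
```

Because `list_directory` reflects the earlier `create_file`, a rollout that skips or misorders calls ends in a different final state, and the verifier can tell the difference.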
Two stages: environment construction, then trajectory synthesis
Stage one is environment construction. EnvFactory's pipeline starts from a real online resource (say, a documentation site for a SaaS API), autonomously explores it to enumerate the tools the resource exposes, then recursively resolves the logical dependencies between those tools — every tool's input schema is matched against every other tool's possible outputs, building a directed graph that encodes which calls can follow which. The graph plus an executable backing database becomes the environment. Each environment is then verified by running a small calibration set through it and confirming the verifier accepts what it should and rejects what it shouldn't.
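The schema-matching step can be sketched as follows. This is a simplified illustration that assumes each tool declares named input and output fields and that a shared field name implies a dependency; the tool names and the `build_dependency_graph` helper are hypothetical, and the paper's recursive resolution is likely richer than plain name matching.

```python
from collections import defaultdict

# Hypothetical tool specs: the fields each tool requires and produces.
TOOLS = {
    "search_user": {"inputs": {"email"}, "outputs": {"user_id"}},
    "list_orders": {"inputs": {"user_id"}, "outputs": {"order_id"}},
    "get_order": {"inputs": {"order_id"}, "outputs": {"order_status"}},
    "refund_order": {"inputs": {"order_id", "user_id"}, "outputs": {"refund_id"}},
}


def build_dependency_graph(tools):
    """Add an edge producer -> consumer when a producer output feeds a consumer input."""
    graph = defaultdict(set)
    for producer, p_spec in tools.items():
        for consumer, c_spec in tools.items():
            if producer != consumer and p_spec["outputs"] & c_spec["inputs"]:
                graph[producer].add(consumer)
    # Return a plain dict with sorted successor lists for every tool.
    return {name: sorted(graph[name]) for name in tools}


GRAPH = build_dependency_graph(TOOLS)
for tool, successors in GRAPH.items():
    print(tool, "->", successors)
```

In this toy setup, `search_user` has edges to both `list_orders` and `refund_order` because each consumes `user_id`, while `get_order` and `refund_order` are leaves.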
Stage two is trajectory synthesis. A topology-aware sampler walks the dependency graph from a root tool to a goal state, producing a sequence of calls that is guaranteed to be executable inside the environment. The sampler emits an over-specified instruction list — the exact path it took, with literal arguments — and then a calibrated-refinement step rewrites that into a natural query a real user might type. The pairing of "noisy human-like request" with "verified ground-truth call sequence" is exactly the training signal both SFT and RLVR need: SFT learns to imitate the call sequence on the noisy query, and the verifier grades RL rollouts against the same call sequence.
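A dependency-respecting walk of this kind can be sketched in a few lines. The version below is an assumption-laden toy, not the paper's sampler: it tracks which fields are available so far and only calls tools whose inputs are already satisfied, which is what makes the resulting sequence executable by construction. All tool names and helper functions are hypothetical.

```python
import random

# Hypothetical tools: required input fields and produced output fields.
TOOLS = {
    "search_user": {"inputs": {"email"}, "outputs": {"user_id"}},
    "list_orders": {"inputs": {"user_id"}, "outputs": {"order_id"}},
    "get_order": {"inputs": {"order_id"}, "outputs": {"order_status"}},
    "refund_order": {"inputs": {"order_id", "user_id"}, "outputs": {"refund_id"}},
}


def sample_trajectory(tools, seed_fields, max_steps, rng):
    """Random walk that only calls a tool once all of its inputs are available."""
    available = set(seed_fields)
    called = []
    for _ in range(max_steps):
        candidates = sorted(
            name for name, spec in tools.items()
            if name not in called and spec["inputs"] <= available
        )
        if not candidates:
            break
        choice = rng.choice(candidates)
        called.append(choice)
        available |= tools[choice]["outputs"]
    return called


def is_executable(tools, seed_fields, trajectory):
    """Replay check: every call's inputs must exist before the call runs."""
    available = set(seed_fields)
    for name in trajectory:
        if not tools[name]["inputs"] <= available:
            return False
        available |= tools[name]["outputs"]
    return True


trajectory = sample_trajectory(TOOLS, {"email"}, max_steps=4, rng=random.Random(0))
print(trajectory, is_executable(TOOLS, {"email"}, trajectory))
```

The sampled call list plays the role of the over-specified instruction path; the refinement step that rewrites it into a natural query would be a separate LLM call and is not shown here.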
How EnvFactory compares to scale-first baselines
The paper benchmarks against two concurrent baselines that take the opposite bet — scale up the number of environments, accept that each one is less verified.
| Method | Envs | Stateful | Dependency graph | Query style | Qwen3-4B BFCL v3 lift |
|---|---|---|---|---|---|
| EnvScaler (concurrent baseline) | ~425, unverified (EnvFactory §5; illustrative count from the paper's 5× claim) | partial | not enforced | over-specified | ~+8 pp (illustrative, paper-reported range) |
| AWM (concurrent baseline) | ~425, unverified (EnvFactory §5; illustrative count from the paper's 5× claim) | partial | not enforced | over-specified | ~+6 pp (illustrative, paper-reported range) |
| EnvFactory | 85 verified (EnvFactory §3) | yes | built per env | natural, calibrated | +15 pp (33.50 → 48.50 on Qwen3-4B, §6) |
The headline isn't "fewer envs is better" in some absolute sense — the headline is that verified, stateful environments with enforced dependency topology produce trajectories whose training signal isn't washed out by env-level noise. Scale only helps once each unit is teaching the agent something the verifier can grade.
A worked example with named numbers
Concretely, here's how the math comes together for Qwen3-4B (the numbers below are the paper's headline figures from §6; the per-step decomposition is illustrative). The reported BFCL v3 multi-turn score moves from 33.50 at the Qwen3-4B baseline to 48.50 after SFT + GRPO on EnvFactory trajectories, a +15 pp absolute lift. The training mix is 1,622 SFT trajectories plus 953 RL trajectories, 2,575 in total. Divide by 85 environments and you get roughly 30 trajectories per environment, the average per-env yield those totals imply. Scaling the raw environment count by 5× while holding the trajectory total fixed would mean 425 environments at roughly 6 trajectories each (2,575 / 425 ≈ 6.06): the same 2,575 trajectories, but every trajectory now lives in a less-verified environment, and the per-env tool-dependency signal is correspondingly weaker. The binding constraint was how much each unit of training data contributed, not the number of environments.
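The arithmetic in this paragraph can be checked directly. The figures are the ones quoted in the text; the variable names are just for illustration.

```python
# Figures quoted in the text.
sft_trajectories = 1622
rl_trajectories = 953
environments = 85
baseline_score = 33.50
trained_score = 48.50

total = sft_trajectories + rl_trajectories        # 2575
per_env = total / environments                    # about 30.3
scaled_envs = environments * 5                    # 425
per_env_scaled = total / scaled_envs              # about 6.06
lift_pp = trained_score - baseline_score          # 15.0

print(f"total trajectories: {total}")
print(f"per env at {environments} envs: {per_env:.1f}")
print(f"per env at {scaled_envs} envs: {per_env_scaled:.2f}")
print(f"absolute lift: {lift_pp:.1f} pp")
```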
That decomposition rhymes with the compounding-error lens that already shapes most multi-turn agent training: every step in a trajectory either survives the verifier or it doesn't, and a long chain of marginal-quality steps performs much worse than a short chain of well-graded ones. The same logic applies one level up — every environment either has a graded dependency topology or it doesn't, and a large pile of partially-graded environments performs worse than a small pile of well-graded ones. EnvFactory's bet is that the "fewer but better" tradeoff wins at the environment level, not just the trajectory level.
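The compounding-error intuition can be illustrated with a small calculation. It assumes each step succeeds independently with the same probability, which is a simplification, and the per-step rates and chain lengths below are made-up values chosen only to show the shape of the effect.

```python
def chain_success(per_step: float, steps: int) -> float:
    """Probability that every step succeeds, assuming independent steps."""
    return per_step ** steps


# Illustrative values only, not from the paper.
long_marginal = chain_success(0.90, 20)  # long chain of marginal-quality steps
short_good = chain_success(0.99, 5)      # short chain of well-graded steps

print(f"20 steps at 0.90 each: {long_marginal:.3f}")
print(f"5 steps at 0.99 each:  {short_good:.3f}")
```

Under these assumptions the long chain succeeds about 12% of the time and the short one about 95% of the time, which is the gap the "fewer but better" argument leans on.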
What this changes for agent training pipelines
For practitioners running their own tool-use RL stack, the actionable read is that the upstream work of constructing verified environments and resolving the dependency topology pays off more downstream than scaling rollout count or model size. The MCP-Atlas number is particularly telling: Qwen3-4B's pass rate goes from 4.12 to 7.90, a relative improvement of about 92% on a benchmark where even strong frontier models struggle to break 15. That suggests the training signal from EnvFactory environments transfers to real-world MCP servers the model never saw during training. The implication for production evals is that the gains from synthetic environments are not just memorisation: the topology-aware structure is general enough to compose against unseen tools.
The trade-off, of course, is that environment construction is expensive. Each of EnvFactory's 85 environments was built by an autonomous explore-and-verify pipeline that itself uses LLM calls. The paper doesn't publish the per-env construction cost, but it is likely the dominant line in the training budget. The "fewer but better" result holds only if each env's construction cost is amortised across many trajectories, which is exactly what topology-aware sampling enables.
Related explainers
- RLVR — Reinforcement Learning with Verifiable Rewards — the same kind of binary verifier signal EnvFactory uses to grade tool-call sequences, but explained in the CoPD context
- CoPD — co-evolving policy distillation — a different training-time topology that also bets on quality of training signal over raw scale
- MSR delegation — Cascading fidelity loss — long-horizon agent loops drift when no in-loop verifier anchors the chain; EnvFactory provides the per-env verifier that drift-detection wants
- MCP SEP-2663 — async task handles — the protocol surface MCP-Atlas grades agents against; EnvFactory's stateful envs match this contract