What is the EnvFactory paper actually proposing?

EnvFactory is a two-stage pipeline from HKUST and Huawei for training tool-use agents. Stage one autonomously builds 85 stateful tool environments by exploring real online resources, recursively resolving tool dependencies, and verifying each environment against a calibration set, yielding 842 tools across 7 domains. Stage two uses topology-aware sampling to draw 2,575 multi-turn trajectories from those environments, then runs a calibrated-refinement step that rewrites the over-specified raw queries into natural, human-like requests. Qwen3 backbones trained with SFT on 1,622 trajectories followed by GRPO RL on 953 trajectories show lifts of up to 15 pp on BFCL v3 multi-turn (33.50 → 48.50 on Qwen3-4B) and up to 8.6 pp on MCP-Atlas across model sizes. For Qwen3-4B specifically, MCP-Atlas moves from 4.12 to 7.90, a +3.78 pp absolute lift. The headline is that the quality of the training environment, not the raw environment count, was the binding constraint.

How does EnvFactory differ from EnvScaler or AWM?

EnvScaler and AWM are concurrent baselines that scale tool-use training by adding more environments: by the EnvFactory paper's account, roughly 5× as many as EnvFactory uses. The trade-off is that each environment is less thoroughly verified and the tool-dependency graph isn't enforced, so trajectories sampled in those environments are more likely to contain calls the environment doesn't actually support. EnvFactory inverts the trade-off: fewer environments, but each is stateful, has an enforced dependency topology, and is verified against a calibration set before any trajectory is sampled. The result is that EnvFactory's 85 verified environments produce more useful training signal per trajectory than the baselines, and the BFCL v3 lift on Qwen3-4B is the largest reported in this family of methods.

Why does topology-aware sampling matter for tool-use training?

Real tool-use sequences respect dependency topology — you can't save a file before you open it, you can't read from a database before connecting to it. A uniform random sampler over the tool set will produce trajectories that violate these dependencies often enough to wash out the training signal. Topology-aware sampling walks the env's dependency graph from a root to a goal, so every sampled trajectory is by construction executable inside the env. That guarantees the verifier has something to grade — either the agent reaches the goal or it deviates from a known-valid path — which is the basic precondition for any RLVR-style training loop. The calibrated-refinement step then rewrites the over-specified path into a natural human-like query, so the model learns to map ambiguous user intents to the right call sequence, not to follow precise machine-generated instructions.
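A toy illustration of the claim about uniform sampling. The four-tool dependency map `DEPS` and both function names are invented for this sketch and are not taken from the paper; the only assumption is that a call is executable when all of its prerequisite tools have already been called.

```python
import random

# Hypothetical dependency map: tool -> tools that must already have been called.
DEPS = {
    "open_file": set(),
    "read_file": {"open_file"},
    "edit_file": {"open_file"},
    "save_file": {"edit_file"},
}

def is_executable(sequence, deps=DEPS):
    """True if every call's prerequisites appear earlier in the sequence."""
    seen = set()
    for tool in sequence:
        if not deps[tool] <= seen:
            return False
        seen.add(tool)
    return True

def uniform_valid_fraction(length=3, trials=10_000, seed=0):
    """Fraction of uniformly sampled sequences that respect the dependencies."""
    rng = random.Random(seed)
    tools = sorted(DEPS)
    valid = sum(
        is_executable([rng.choice(tools) for _ in range(length)])
        for _ in range(trials)
    )
    return valid / trials

if __name__ == "__main__":
    print(f"executable under uniform sampling: {uniform_valid_fraction():.1%}")
```

In this toy map only 10 of the 64 possible three-call sequences are executable, so most uniformly sampled trajectories would be rejected before the verifier has anything to grade. Real environments with more tools and deeper dependency chains would likely fare worse.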

EnvFactory paper — Synthetic envs for tool-use agent training

learnaivisually.com/ai-explained/envfactory-tool-env-synthesis

TL;DR

What is it: The EnvFactory paper from HKUST and Huawei autonomously builds 85 stateful, verified tool environments from real online resources and uses topology-aware sampling to generate the multi-turn trajectories that fine-tune Qwen3 backbones for tool use.
Why it’s needed: Tool-use agents need stateful environments to roll out against — without a verifiable environment, RL training has no reward signal and SFT has no realistic trajectories. The binding constraint on tool-use lift turned out to be environment quality, not raw environment count.
vs previous: Concurrent baselines like EnvScaler and AWM scale by adding more environments; EnvFactory uses roughly 5× fewer environments but verifies each one, enforces its tool-dependency graph over a stateful backend, samples trajectories that respect the graph, then rewrites over-specified instructions into natural-language queries.

Jargon

BFCL v3: The Berkeley Function-Calling Leaderboard v3 — a benchmark scoring how often a model emits correct function calls in multi-turn tool-use conversations. The "multi-turn" split is the load-bearing one for agents because it tests whether the model maintains state across calls, not just whether it emits one correct call in isolation.
MCP-Atlas: An agent benchmark over the Model Context Protocol ecosystem — tasks require the agent to discover and call real MCP servers, not a hand-rolled tool registry. Pass rates are typically low single digits for small open models, so any pp lift is meaningful.
VitaBench: A tool-use benchmark testing realistic multi-step workflows (book a flight, file an expense report) rather than single function calls. Like MCP-Atlas, baselines are low so improvements there usually indicate the model actually got better at composing tool calls.
RLVR: Reinforcement Learning with Verifiable Rewards — post-training where a deterministic verifier (unit tests, equality checks, programmatic env state checks) replaces the learned reward model. Already covered in RLVR — verifiable rewards in the CoPD loop.
GRPO: Group Relative Policy Optimization, a value-free PPO variant that estimates each rollout's advantage as its reward minus the mean reward of the group of rollouts sampled for the same prompt (usually also divided by the group's standard deviation). Widely used for LLM RL because it drops the learned value model entirely.
SFT cold-start: Supervised fine-tuning applied as a warmup before RL, so the policy already emits reasonable outputs before any reward signal arrives. Without it, early RL rollouts rarely earn any reward, so compute is spent on samples that carry no learning signal.
Stateful environment: An environment whose state changes as tools mutate it — opening a file actually opens it, editing actually writes, listing a directory reflects the previous mutations. Contrast with stateless lookup APIs where every call resolves against a frozen database.
Topology-aware sampling: Paper-specific term: when sampling trajectories from a constructed environment, follow the env's tool-dependency graph rather than picking tool calls uniformly at random. A topology-aware sampler will never propose "save file" before "open file" because the dependency edge forbids it.
Calibrated refinement: Paper-specific term: the post-processing step that converts a generator's over-specified instruction list ("call open_file with path="/tmp/a.pdf", byte_offset=1266...") into a natural, ambiguous, human-like query ("could you check that PDF I just dropped in /tmp?"). Without this step the agent learns to follow precise instructions, not real users.
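
The GRPO entry above can be made concrete with a small sketch of the group-relative advantage. It assumes the common formulation in which the centred reward is also divided by the group's standard deviation; the function name and the `eps` term are illustrative choices, not details from the EnvFactory paper.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Advantage of each rollout relative to its own sampling group.

    rewards: verifier scores for rollouts sampled from the same prompt.
    eps: small constant (an assumption here) that avoids division by zero
         when every rollout in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four rollouts for one query, graded 1.0 (goal reached) or 0.0 (not reached).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Rollouts that beat their group's average get a positive advantage and the rest get a negative one, with no learned value model involved. When every rollout in a group scores the same, all advantages are zero and that group contributes no learning signal, which is one reason the SFT cold-start matters.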

The news. On May 19, 2026, researchers from HKUST and Huawei posted EnvFactory — a two-stage pipeline that autonomously synthesizes 85 stateful tool environments by exploring real online resources, recursively resolves logical dependencies among the tools, builds executable interfaces over a stateful database, and verifies each environment for correctness. A topology-aware sampler then produces 2,575 multi-turn trajectories, and a calibrated-refinement step rewrites them into natural human-like queries. Training Qwen3 backbones with SFT cold-start on 1,622 trajectories followed by GRPO RL on 953 trajectories lifts BFCL v3 multi-turn by up to 15 percentage points and MCP-Atlas by up to 8.6 — with roughly 5× fewer environments than concurrent baselines EnvScaler and AWM.

Picture a culinary school choosing how to train its cooks. The obvious option is to rent time at twenty-five real restaurants and have students rotate through them — lots of variety, but every kitchen is set up slightly differently, half of them are missing one critical appliance, and nobody can guarantee that the recipes practiced there actually finish on time. The less obvious option is to build five test kitchens from scratch, each one fully stocked, fully working, and verified before a single student walks in. Every recipe practiced in those kitchens is one the kitchen genuinely supports, and every cooking session ends with a measurable result — the dish either came out or it didn't. The paper's claim is that for tool-use training, the five-test-kitchen recipe wins.

The technical reason matches the metaphor. A tool-use agent learns by rolling out inside an environment and getting graded on whether the call sequence achieves the user's goal. If the environment isn't stateful — if "opening a file" doesn't actually change what the next "list directory" call returns — then the rollout produces no signal about whether the agent's choices composed correctly. And if the environment isn't verified — if its declared tools don't actually run, or if their dependencies aren't enforced — then the reward function grades on noise. The paper's contribution is to treat the environment itself as a first-class artifact to construct, verify, and version.
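
As a minimal sketch of what "stateful" and "verified" mean here, the toy environment below keeps a backing store that tool calls mutate, enforces one dependency at execution time, and is graded by checking the final state. `ToyFileEnv`, its methods, and `verify` are hypothetical names for illustration, not the paper's interfaces.

```python
class ToyFileEnv:
    """Hypothetical stateful environment: tool calls mutate shared state."""

    def __init__(self):
        self.files = {"notes.txt": "draft"}  # stands in for the backing database
        self.open_handles = set()

    def list_directory(self):
        return sorted(self.files)

    def create_file(self, name):
        self.files.setdefault(name, "")

    def open_file(self, name):
        if name not in self.files:
            raise FileNotFoundError(name)
        self.open_handles.add(name)

    def write_file(self, name, text):
        # Dependency enforced at execution time: a file must be opened first.
        if name not in self.open_handles:
            raise PermissionError(f"{name} is not open")
        self.files[name] = text


def verify(env, expected_files):
    """State-based check: grade the final state, not the text of the calls."""
    return env.files == expected_files


env = ToyFileEnv()
env.create_file("report.txt")
env.open_file("report.txt")
env.write_file("report.txt", "done")

# Later calls observe earlier mutations, and the verifier has a state to grade.
assert env.list_directory() == ["notes.txt", "report.txt"]
assert verify(env, {"notes.txt": "draft", "report.txt": "done"})
```

Calling `write_file` before `open_file` raises an error instead of silently succeeding, so a rollout that skips the dependency cannot end in a state the verifier accepts.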

Two stages: environment construction, then trajectory synthesis

Stage one is environment construction. EnvFactory's pipeline starts from a real online resource (say, a documentation site for a SaaS API), autonomously explores it to enumerate the tools the resource exposes, then recursively resolves the logical dependencies between those tools — every tool's input schema is matched against every other tool's possible outputs, building a directed graph that encodes which calls can follow which. The graph plus an executable backing database becomes the environment. Each environment is then verified by running a small calibration set through it and confirming the verifier accepts what it should and rejects what it shouldn't.
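
The schema-matching step can be sketched as follows. This is a simplification under stated assumptions: the paper's resolver is described only at a high level here, so exact field-name matching stands in for whatever matching it really uses, and `ToolSpec`, `build_dependency_graph`, and the flight-booking tools are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ToolSpec:
    name: str
    inputs: frozenset   # field names the tool consumes
    outputs: frozenset  # field names the tool produces

def build_dependency_graph(tools):
    """Map each tool to the tools that can consume one of its outputs."""
    graph = {t.name: set() for t in tools}
    for producer in tools:
        for consumer in tools:
            if producer is not consumer and producer.outputs & consumer.inputs:
                graph[producer.name].add(consumer.name)
    return graph

# Hypothetical tools, as might be enumerated from an API documentation site.
TOOLS = [
    ToolSpec("search_flights", frozenset({"origin", "destination"}), frozenset({"flight_id"})),
    ToolSpec("book_flight", frozenset({"flight_id"}), frozenset({"booking_id"})),
    ToolSpec("cancel_booking", frozenset({"booking_id"}), frozenset({"refund_id"})),
]

graph = build_dependency_graph(TOOLS)
```

Here `graph` links `search_flights` to `book_flight` and `book_flight` to `cancel_booking`, which is the kind of directed structure a sampler can later walk. The pairwise comparison is quadratic in the number of tools, which is acceptable at the scale of a single environment.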

Stage two is trajectory synthesis. A topology-aware sampler walks the dependency graph from a root tool to a goal state, producing a sequence of calls that is guaranteed to be executable inside the environment. The sampler emits an over-specified instruction list — the exact path it took, with literal arguments — and then a calibrated-refinement step rewrites that into a natural query a real user might type. The pairing of "noisy human-like request" with "verified ground-truth call sequence" is exactly the training signal both SFT and RLVR need: SFT learns to imitate the call sequence on the noisy query, and the verifier grades RL rollouts against the same call sequence.
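
A minimal sketch of the sampling idea, assuming the dependency graph is acyclic and that "goal" means a tool with no successors. The graph, tool names, and `sample_trajectory` are hypothetical, and the calibrated-refinement rewrite (an LLM step) is not modelled.

```python
import random

# Hypothetical dependency graph: tool -> tools that may follow it.
GRAPH = {
    "connect_db": ["run_query", "list_tables"],
    "list_tables": ["run_query"],
    "run_query": ["export_csv"],
    "export_csv": [],
}

def sample_trajectory(graph, root, rng, max_steps=10):
    """Random walk along dependency edges, from root until a tool with no successors."""
    path = [root]
    while graph[path[-1]] and len(path) < max_steps:
        path.append(rng.choice(graph[path[-1]]))
    return path

rng = random.Random(0)
trajectory = sample_trajectory(GRAPH, "connect_db", rng)

# Every consecutive pair is a dependency edge, so the path is valid by construction.
assert all(b in GRAPH[a] for a, b in zip(trajectory, trajectory[1:]))
```

Because each step is drawn only from the current tool's successors, no post-hoc filtering of invalid sequences is needed. The `max_steps` cap is a guard in case a real graph contains cycles.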

How EnvFactory compares to scale-first baselines

The paper benchmarks against two concurrent baselines that take the opposite bet — scale up the number of environments, accept that each one is less verified.

Method	Environments	Stateful	Dependency graph	Query style	BFCL v3 multi-turn lift (Qwen3-4B)
EnvScaler (concurrent baseline)	~425, unverified (illustrative count from the ~5× claim in EnvFactory §5)	partial	not enforced	over-specified	~+8 pp (illustrative, paper-reported range)
AWM (concurrent baseline)	~425, unverified (illustrative count from the ~5× claim in EnvFactory §5)	partial	not enforced	over-specified	~+6 pp (illustrative, paper-reported range)
EnvFactory	85, verified (EnvFactory §3)	yes	built per environment	natural, calibrated	+15 pp (33.50 → 48.50, §6)

The headline isn't "fewer envs is better" in some absolute sense — the headline is that verified, stateful environments with enforced dependency topology produce trajectories whose training signal isn't washed out by env-level noise. Scale only helps once each unit is teaching the agent something the verifier can grade.

A worked example with named numbers

Concretely, here's how the math comes together for Qwen3-4B (numbers below are the paper's headline figures from §6; the per-step decomposition is illustrative). The reported BFCL v3 multi-turn score moves from 33.50 at the Qwen3-4B baseline to 48.50 after SFT + GRPO on EnvFactory trajectories, a +15 pp absolute lift. The training mix is 1,622 SFT trajectories plus 953 RL trajectories, 2,575 in total. Divide by 85 environments and you get roughly 30 trajectories per environment, which is the average yield of topology-aware sampling in the paper's setup. Scaling the raw environment count by 5× while holding the trajectory budget fixed would mean 425 environments at roughly 6 trajectories each: about the same 2,575 trajectories, but every trajectory now lives in a less-verified environment, and the per-environment tool-dependency signal is correspondingly weaker. The binding constraint was how much each unit of training data contributes, not the number of environments.
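
The arithmetic in this example can be checked in a few lines. The figures are the ones quoted above; treating the "roughly 5×" comparison as an exact multiplier is an assumption of this sketch, and the variable names are illustrative.

```python
SFT_TRAJECTORIES = 1_622
RL_TRAJECTORIES = 953
ENVFACTORY_ENVS = 85
SCALE_FACTOR = 5  # assumption: the "roughly 5x" claim used as an exact multiplier

total = SFT_TRAJECTORIES + RL_TRAJECTORIES      # 2,575 trajectories
per_env = total / ENVFACTORY_ENVS               # about 30.3 per environment
scaled_envs = ENVFACTORY_ENVS * SCALE_FACTOR    # 425 environments
per_env_scaled = total / scaled_envs            # about 6.06 per environment
bfcl_lift_pp = 48.50 - 33.50                    # 15 percentage points

print(f"{total} trajectories, {per_env:.1f} per env at 85 envs, "
      f"{per_env_scaled:.2f} per env at {scaled_envs} envs, "
      f"BFCL v3 lift {bfcl_lift_pp:+.1f} pp")
```

Note that 425 × 6 is 2,550, so "6 per environment" is a rounding of about 6.06; the comparison holds the trajectory budget fixed only approximately.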

That decomposition rhymes with the compounding-error lens that already shapes most multi-turn agent training: every step in a trajectory either survives the verifier or it doesn't, and a long chain of marginal-quality steps performs much worse than a short chain of well-graded ones. The same logic applies one level up — every environment either has a graded dependency topology or it doesn't, and a large pile of partially-graded environments performs worse than a small pile of well-graded ones. EnvFactory's bet is that the "fewer but better" tradeoff wins at the environment level, not just the trajectory level.

What this changes for agent training pipelines

For practitioners running their own tool-use RL stack, the actionable read is that the upstream work of constructing verified environments and resolving the dependency topology pays back more downstream than scaling rollout count or model size. The MCP-Atlas number is particularly telling: Qwen3-4B's pass rate goes from 4.12 to 7.90 (+3.78 pp, a relative improvement of about 92%, on a benchmark where even strong frontier models struggle to break 15), suggesting that the training signal from EnvFactory environments transfers to real-world MCP servers the model never saw during training. The implication for production evals is that the gains from synthetic environments are not purely memorised: the topology-aware structure appears general enough to compose against unseen tools.
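
Because the MCP-Atlas baseline is so low, it helps to keep the absolute and relative readings of the same result apart. The helper names below are illustrative; the two scores are the ones quoted above.

```python
def absolute_lift_pp(before, after):
    """Difference in percentage points."""
    return after - before

def relative_lift_pct(before, after):
    """Improvement as a percentage of the starting score."""
    return (after - before) / before * 100

mcp_before, mcp_after = 4.12, 7.90
abs_pp = absolute_lift_pp(mcp_before, mcp_after)    # about 3.78 pp
rel_pct = relative_lift_pct(mcp_before, mcp_after)  # about 91.7 %
```

A near-doubling in relative terms is still under four percentage points in absolute terms, so both figures are worth reporting together when comparing against benchmarks with higher baselines.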

The trade-off, of course, is that environment construction is expensive. Each of EnvFactory's 85 environments was built by an autonomous explore-and-verify pipeline that itself uses LLM calls. The paper doesn't publish the per-environment construction cost, but it is likely a major line in the training budget. The fewer-but-verified result only pays off if each environment's construction cost is amortised across many trajectories, which is exactly what topology-aware sampling enables.

Related explainers

RLVR — Reinforcement Learning with Verifiable Rewards — the same kind of binary verifier signal EnvFactory uses to grade tool-call sequences, but explained in the CoPD context
CoPD — co-evolving policy distillation — a different training-time topology that also bets on quality of training signal over raw scale
MSR delegation — Cascading fidelity loss — long-horizon agent loops drift when no in-loop verifier anchors the chain; EnvFactory provides the per-env verifier that drift-detection wants
MCP SEP-2663 — async task handles — the protocol surface MCP-Atlas grades agents against; EnvFactory's stateful envs match this contract
