What is Forge RL prefix-tree merging?

Forge RL is the reinforcement-learning training infrastructure behind MiniMax-M2 (arXiv 2605.26494, May 2026). Prefix-tree merging is its core efficiency trick: when RL samples many multi-turn rollouts that share a common opening — the same system prompt and first turns — Forge merges them into a tree so the shared opening becomes a single trunk computed once, with the points where rollouts diverge becoming branches. Only the branches cost extra work, and the trunk's forward pass and KV cache are reused by every rollout hanging off it. The paper reports this delivers a 40× training speedup.

Why does merging rollouts matter for RL training?

Reinforcement-learning post-training spends most of its compute generating and re-processing rollouts, and for multi-turn agentic tasks many rollouts begin identically before they diverge. A naive pipeline re-runs that shared opening through the model once per rollout, so the same work is repeated dozens of times. Merging the rollouts into a prefix tree computes the shared opening exactly once, which is what lets MiniMax-M2 run large-scale agentic RL — including a self-evolution loop that reportedly gained about 30% over 100 autonomous iteration rounds — at a practical cost.

How is it different from prefix caching at serving time?

They are the same idea pointed at different workloads. Prefix caching reuses the KV cache of a matching opening across separate inference requests, so a new request that starts like a cached one skips re-computing the shared prefix. Forge's prefix-tree merging applies that reuse inside the training loop instead: the shared opening of many sampled rollouts is computed once and reused across the branches of one tree. Serving caches across requests; Forge merges across rollouts — but both win by never computing a shared prefix twice.
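The serving-side half of that comparison can be sketched in a few lines. This is a toy illustration, not any real serving stack's API: the `PrefixCache` class and its method names are hypothetical, tokens are plain integers, and the cache only records which prefixes have been seen, whereas a real system would store KV tensors.

```python
class PrefixCache:
    """Toy cross-request prefix cache (hypothetical; stores no real KV tensors)."""

    def __init__(self):
        self.root = {}  # nested dicts: token -> child node

    def match_len(self, tokens):
        """Length of the longest already-cached prefix of `tokens`."""
        node, matched = self.root, 0
        for tok in tokens:
            if tok not in node:
                break
            node = node[tok]
            matched += 1
        return matched

    def serve(self, tokens):
        """Return how many tokens need fresh compute, then cache the request."""
        hit = self.match_len(tokens)
        node = self.root
        for tok in tokens:
            node = node.setdefault(tok, {})
        return len(tokens) - hit


cache = PrefixCache()
system_prompt = [1, 2, 3, 4]
first = cache.serve(system_prompt + [10, 11])   # cold cache: all 6 tokens computed
second = cache.serve(system_prompt + [20, 21])  # warm cache: only the 2 new tokens
print(first, second)
```

The second request pays only for its own tail. The training-side version applies the same lookup to a batch of sampled rollouts instead of a stream of separate requests.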

MiniMax-M2 ships a 230B open MoE with 40× faster RL training — Forge RL prefix-tree merging

TL;DR

What is it: The MiniMax-M2 release (arXiv 2605.26494) is a 229.9-billion-parameter open mixture-of-experts model, and its deepest contribution is Forge RL — the reinforcement-learning training system whose prefix-tree merging trick this article explains.
Why it’s needed: Reinforcement-learning post-training spends most of its time generating and re-processing multi-turn rollouts; when many of them share the same opening, that shared work gets done again and again, so deduplicating it is what makes large-scale agentic RL affordable.
vs previous: A naive rollout pipeline treats every trajectory as an independent flat sequence and recomputes the shared opening once per rollout; prefix-tree merging computes that opening once and forks only the divergent tails, for a reported 40× training speedup.

Jargon

Rollout: One full multi-turn playthrough of a task that the model samples during reinforcement learning — the model acts, the environment responds, repeat. RL improves the model by scoring many rollouts, so generating and processing them is most of the training cost.
Prefix tree (radix tree): A tree in which paths that begin with the same sequence share a single branch, splitting only where they differ. Forge stores rollouts this way so the shared opening is one trunk; it is the same structure the serving stack uses for prefix caching.
Forge RL: MiniMax-M2's reinforcement-learning training infrastructure. It separates the agent-and-environment side from the training-and-inference side behind a Gateway, schedules rollouts, and merges shared prefixes into trees — the source of the 40× speedup.
KV cache: The stored keys and values for tokens already processed, so attention never recomputes them. Merging shares the trunk's KV cache across branches. The KV cache is also often the dominant memory cost of inference.
Mixture of Experts (MoE): An architecture where each token is routed to only a few of many expert sub-networks. MiniMax-M2 has 256 experts but activates only 8 (9.8B of 229.9B parameters) per token, keeping compute low for a large model.
Windowed FIFO scheduling: Forge only feeds the freshest rollouts (a window of W = 0.3N) into each training update, rather than all of them — a knob that trades training stability against throughput.

The news. On May 26, 2026, MiniMax released the M2 technical report (arXiv 2605.26494), documenting a 229.9-billion-parameter open model that activates only 9.8 billion parameters per token, using 8 of its 256 experts. Alongside the architecture, the paper details Forge RL, the reinforcement-learning infrastructure used to train it, and its headline efficiency trick: merging multi-turn rollouts that share a prefix into trees for a 40× training speedup.

Picture a choose-your-own-adventure book. Dozens of readers start on the same first page and read the same opening chapters word for word — the story only branches when someone hits a "turn to page 40" choice. A wasteful publisher would print a separate, complete book for every possible ending, re-typesetting those shared opening chapters dozens of times over. The thrifty publisher typesets the shared spine once, then forks the paper only at the choice points. Same stories reach the same readers, with a fraction of the printing.

Reinforcement-learning training of an agent has exactly this shape. To improve the model you sample many rollouts — full multi-turn playthroughs of a task — and most of them branch from the same starting state: the same system prompt, the same first user turn, often the same opening model replies. A naive pipeline treats each rollout as an independent sequence and pushes it through the model from the top, which means the shared opening is run through the model again for every single rollout — the publisher reprinting the opening chapters, dozens of times.

Forge RL, MiniMax-M2's training system, merges every rollout that shares a prefix into a single tree. The shared opening becomes one trunk, computed a single time; the points where rollouts diverge become branches, and only those branches cost extra work. Because the keys and values for the shared opening tokens live in the KV cache, the trunk is computed once and every branch hanging off it reuses it. This is the same insight as serving-time prefix caching, turned inward on the training loop instead of the request stream.
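A minimal sketch of the merge itself, assuming rollouts are plain token sequences and that two rollouts share a node only when their tokens match exactly from the start. The helper names `merge_rollouts` and `count_nodes` are illustrative and do not come from the report, which may implement the tree quite differently.

```python
def merge_rollouts(rollouts):
    """Merge token sequences into a nested-dict prefix tree."""
    root = {}
    for seq in rollouts:
        node = root
        for tok in seq:
            # Reuse the existing child if this prefix was already seen.
            node = node.setdefault(tok, {})
    return root


def count_nodes(node):
    """Each node is one token position that has to be computed once."""
    return sum(1 + count_nodes(child) for child in node.values())


opening = ["sys", "user", "reply"]          # shared trunk
rollouts = [
    opening + ["search", "done"],
    opening + ["search", "retry"],
    opening + ["ask", "done"],
]

tree = merge_rollouts(rollouts)
flat_tokens = sum(len(r) for r in rollouts)  # work if every rollout runs alone
merged_tokens = count_nodes(tree)            # work after merging shared prefixes
print(flat_tokens, merged_tokens)
```

Here the three rollouts hold 15 tokens between them, but the merged tree has only 8 nodes: the 3-token trunk, one shared `search` step, and the tokens where the paths actually differ.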

The tree is not the only piece of Forge. It places a Gateway between the agent-and-environment side and the training-and-inference side, and uses windowed FIFO scheduling — only the freshest W = 0.3N rollouts feed each update — to balance training stability against throughput. But the tree is the piece this explainer focuses on, and the report attributes the 40× speedup specifically to it: the more the rollouts share, the more of the forward pass it spares.
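The windowing idea can be illustrated with a short sketch. Everything beyond the ratio W = 0.3N is an assumption here: the text does not define N, so this sketch takes it to be the number of rollouts currently held in a bounded FIFO buffer, and `freshest_window` is a hypothetical helper, not Forge's scheduler.

```python
from collections import deque


def freshest_window(buffer, ratio=0.3):
    """Return the newest W = ratio * N rollouts (N = current buffer size, assumed)."""
    n = len(buffer)
    if n == 0:
        return []
    w = max(1, round(ratio * n))  # always feed at least one rollout
    return list(buffer)[-w:]


# FIFO buffer: once full, appending a new rollout evicts the oldest one.
buffer = deque(maxlen=10)
for rollout_id in range(14):      # rollout ids stand in for real trajectories
    buffer.append(rollout_id)

window = freshest_window(buffer)
print(list(buffer), window)
```

With ten rollouts buffered, only the three newest feed the update. A larger ratio would use more of the buffer per update at the price of training on older samples, which is the trade-off the text describes.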

Walk the numbers on one group of rollouts. Say RL samples a group of 8 rollouts that all share a 2,000-token opening, then each diverges for another 500 tokens (illustrative; real openings and groups are far larger). The naive pipeline runs the full sequence for each: 8 × (2,000 + 500) = 20,000 token-forward-passes, of which the opening alone is 8 × 2,000 = 16,000, and 14,000 of those are redundant repeats of the same trunk. Tree-merged, the 2,000-token trunk is computed once, then 8 branches × 500 = 4,000, for a total of 6,000 token-passes, about 3.3× fewer on this toy example. Real agent rollout trees branch far more and share far longer prefixes, which is how Forge reaches its reported 40× training speedup.
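The same arithmetic as a small script, using only the illustrative numbers from the paragraph above. The function names are made up for this sketch, and the cost model (one unit per token-forward-pass, one trunk, equal-length tails) is a simplification rather than anything measured.

```python
def naive_passes(group, prefix, tail):
    """Every rollout replays the shared prefix plus its own tail."""
    return group * (prefix + tail)


def merged_passes(group, prefix, tail):
    """The prefix is computed once; only the tails scale with the group."""
    return prefix + group * tail


group, prefix, tail = 8, 2_000, 500

naive = naive_passes(group, prefix, tail)
merged = merged_passes(group, prefix, tail)
redundant = (group - 1) * prefix   # prefix repeats beyond the first
ratio = naive / merged

print(naive, merged, redundant, round(ratio, 2))
```

Changing `group`, `prefix`, or `tail` shows how strongly the ratio depends on the shape of the rollouts rather than on the merging trick alone.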

Strategy	What it does with the shared opening	Repeated work	Where it earns its keep
Flat per-rollout replay (naive)	Runs it through the model once for every rollout	Grows with the rollout count	Simple, but most of the forward pass is duplicated (conceptual baseline — no source figure)
Prefix-tree merging (Forge RL)	Computes it once, forks only the divergent tails [paper]	Bounded by the tree's branching, not the rollout count	Large-scale multi-turn agentic RL — reported 40× training speedup [paper]
Prefix caching (serving cousin)	Reuses a matching opening across requests at inference	None, when openings match	The same idea applied to serving, not training (see the Prefix Caching module)

Two honest caveats. The 40× is the paper's own reported figure, obtained inside its Forge RL system, and the win scales with how much rollouts actually share: tasks whose trajectories diverge almost immediately leave little trunk to merge. And a tree only helps the prefix: the moment two rollouts pick different actions, everything after is genuinely different work the model still has to do. But the lesson generalizes past one model. Once you notice that sampled rollouts mostly retrace the same opening, the question stops being "how do we run more rollouts" and becomes "how little of each rollout is actually new", and a prefix tree answers it by computing the shared part exactly once.
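The first caveat can be made explicit under a simplifying assumption that is not from the paper: a one-level tree with G rollouts, a shared prefix of P tokens, a divergent tail of T tokens per rollout, and cost counted in token-forward-passes. The ratio S of naive to merged work is then:

```latex
\[
S(G) = \frac{G\,(P + T)}{P + G\,T},
\qquad
\lim_{G \to \infty} S(G) = 1 + \frac{P}{T},
\qquad
S(G)\,\big|_{P = 0} = 1 .
\]
```

For the toy group used earlier (G = 8, P = 2,000, T = 500) this gives S = 20,000 / 6,000 ≈ 3.3, with a ceiling of 1 + 2,000/500 = 5 however many rollouts are added. In this simplified model, a much larger ratio needs prefixes far longer than the tails, or sharing at more than one level of the tree. With no shared prefix the ratio falls to 1, which is the "little trunk to merge" case.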


Related explainers

This explainer is the only one drawn from the MiniMax-M2 report, so it has no sibling explainers from the same report, but it sits next to several explainers that attack the same shared-prefix redundancy from different angles:

CacheWeaver — prefix-cache-aware evidence reordering — the serving-time cousin: reuse a matching opening across RAG requests instead of across RL rollouts.
CacheRL — cached rollouts for agent RL — also cuts redundant RL rollout work, but by caching whole rollouts rather than merging their shared prefixes into a tree.
EfficientRollout — quantized self-drafters — speeds the generation of each RL rollout with self-speculative decoding; Forge speeds how the batch of rollouts is processed.
MiniMax-M3 MSA — block-sparse attention — the same lab's later model, attacking efficiency at the attention layer rather than the training loop.

