The news. On May 26, 2026, MiniMax released the M2 technical report (arXiv 2605.26494), documenting a 229.9-billion-parameter open model that activates only 9.8 billion parameters per token across 256 experts (8 active per token). Alongside the architecture, the paper details Forge RL, the reinforcement-learning infrastructure used to train it, and its headline efficiency trick: merging multi-turn rollouts that share a prefix into trees for a reported 40× training speedup.

Picture a choose-your-own-adventure book. Dozens of readers start on the same first page and read the same opening chapters word for word — the story only branches when someone hits a "turn to page 40" choice. A wasteful publisher would print a separate, complete book for every possible ending, re-typesetting those shared opening chapters dozens of times over. The thrifty publisher typesets the shared spine once, then forks the paper only at the choice points. Same stories reach the same readers, with a fraction of the printing.

Reinforcement-learning training of an agent has exactly this shape. To improve the model you sample many rollouts — full multi-turn playthroughs of a task — and most of them branch from the same starting state: the same system prompt, the same first user turn, often the same opening model replies. A naive pipeline treats each rollout as an independent sequence and pushes it through the model from the top, which means the shared opening is run through the model again for every single rollout — the publisher reprinting the opening chapters, dozens of times.

Forge RL, MiniMax-M2's training system, merges every rollout that shares a prefix into a single tree. The shared opening becomes one trunk, computed a single time; the points where rollouts diverge become branches, and only those branches cost extra work. This works because attention is causal: the keys and values for the opening tokens depend only on the opening itself, so they are identical in every rollout that shares it. Held in the KV cache, they are computed once and reused by every branch hanging off the trunk. This is the same insight as serving-time prefix caching, turned inward on the training loop instead of the request stream.
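The merging step can be sketched as a plain prefix tree (trie) over token ids. This is a minimal illustration, not Forge RL's implementation: the names `Node`, `merge_rollouts`, and `count_nodes` are hypothetical, integers stand in for tokens, and the sketch only counts how many token positions would need computing, without running any model.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One token position in the merged tree (hypothetical structure)."""
    children: dict = field(default_factory=dict)  # token id -> Node
    rollouts: int = 0  # how many rollouts pass through this position


def merge_rollouts(rollouts):
    """Merge token sequences into a prefix tree and return its root."""
    root = Node()
    for tokens in rollouts:
        node = root
        for tok in tokens:
            # Reuse the existing child if another rollout already took
            # this token here; otherwise start a new branch.
            node = node.children.setdefault(tok, Node())
            node.rollouts += 1
    return root


def count_nodes(root):
    """Count token positions in the tree (the root itself holds no token)."""
    total, stack = 0, [root]
    while stack:
        node = stack.pop()
        total += len(node.children)
        stack.extend(node.children.values())
    return total


# Toy data: three rollouts share a 4-token opening, two of them also
# share the first token after the opening.
shared = [1, 2, 3, 4]
rollouts = [shared + [10, 11], shared + [20, 21], shared + [10, 12]]

tree = merge_rollouts(rollouts)
flat_tokens = sum(len(r) for r in rollouts)  # every rollout replayed in full
tree_tokens = count_nodes(tree)              # each distinct position once

print(flat_tokens, tree_tokens)  # 18 9
```

Here the flat replay touches 18 token positions while the tree holds 9: the 4-token trunk once, plus 5 positions across the branches. A tree also captures sharing below the first branch point (the two rollouts that both continue with token 10), which a single "shared opening" count would miss.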

The tree is not the only piece of Forge. It places a Gateway between the agent-and-environment side and the training-and-inference side, and uses windowed FIFO scheduling — only the freshest W = 0.3N rollouts feed each update — to balance training stability against throughput. But the tree is the piece this explainer focuses on, and the report attributes the 40× speedup specifically to it: the more the rollouts share, the more of the forward pass it spares.

Walk the numbers on one group of rollouts. Say RL samples a group of 8 rollouts that all share a 2,000-token opening, then each diverges for another 500 tokens (illustrative; real openings and groups are far larger). The naive pipeline runs the full sequence for each: 8 × (2,000 + 500) = 20,000 token forward passes. The opening alone accounts for 8 × 2,000 = 16,000 of them, and 14,000 of those are redundant repeats of the same trunk. Tree-merged, the 2,000-token trunk is computed once, then 8 branches × 500 = 4,000, for a total of 6,000 token passes: about 3.3× fewer on this toy example. Real agent rollout trees branch far more and share far longer prefixes, which is how Forge reaches its reported 40× training speedup.
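The same arithmetic as a short script, for readers who want to vary the inputs. It assumes the toy shape described above (one shared trunk, one branch point, equal-length tails); the function name `token_passes` is made up for this illustration.

```python
def token_passes(group_size, prefix_len, tail_len):
    """Return (flat, merged) token-pass counts for one group of rollouts.

    Assumes every rollout shares one prefix of `prefix_len` tokens and
    then diverges for `tail_len` tokens of its own.
    """
    flat = group_size * (prefix_len + tail_len)   # each rollout replayed in full
    merged = prefix_len + group_size * tail_len   # trunk once, then the tails
    return flat, merged


flat, merged = token_passes(group_size=8, prefix_len=2000, tail_len=500)
redundant = flat - merged
ratio = round(flat / merged, 2)

print(flat, merged, redundant, ratio)  # 20000 6000 14000 3.33
```

With a single branch point, the ratio can never exceed (prefix + tail) / tail, which is 5× for these lengths no matter how large the group grows. Bigger savings in this model require longer shared prefixes relative to the tails, or sharing that continues below the first branch point.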

| Strategy | What it does with the shared opening | Repeated work | Where it earns its keep |
| --- | --- | --- | --- |
| Flat per-rollout replay (naive) | Runs it through the model once for every rollout | Grows with the rollout count | Simple, but most of the forward pass is duplicated (conceptual baseline; no source figure) |
| Prefix-tree merging (Forge RL) | Computes it once, forks only the divergent tails [paper] | Bounded by the tree's branching, not the rollout count | Large-scale multi-turn agentic RL; reported 40× training speedup [paper] |
| Prefix caching (serving cousin) | Reuses a matching opening across requests at inference | None, when openings match | The same idea applied to serving, not training (see the Prefix Caching module) |

Two honest caveats. The 40× is the paper's reported figure for its full Forge RL system, and the win scales with how much rollouts actually share — tasks whose trajectories diverge almost immediately leave little trunk to merge. And a tree only helps the prefix: the moment two rollouts pick different actions, everything after is genuinely different work the model still has to do. But the lesson generalizes past one model. Once you notice that sampled rollouts mostly retrace the same opening, the question stops being "how do we run more rollouts" and becomes "how little of each rollout is actually new" — and a prefix tree answers it by computing the shared part exactly once.

