What are cached rollouts for agent RL?

Cached rollouts replace the live tool calls an agent makes during reinforcement-learning practice with results served from a cache. CacheRL (arXiv 2606.14179) uses a three-tier fuzzy cache — exact, similar, and approximate matches with token-level masking — so most tool calls become lookups instead of real API hits or code runs, cutting training compute by roughly 100×.

Why does CacheRL matter?

Training tool-using agents with RL is dominated by the cost of executing every tool call on every rollout, which makes it expensive and slow. By caching those results, CacheRL trains a compact agent to 92% process accuracy — within two points of GPT-5's 94% — at about 100× less compute, sharply lowering the cost barrier to training tool-using agents with RL.

How does the cache-tier-aware reward avoid corrupting training?

A fuzzy cache sometimes returns an approximate result, and treating it as ground truth would teach the agent the wrong lesson. The cache-tier-aware reward scales the credit for each step by how trustworthy its cache tier was — full credit for an exact replay, less for an approximate match — which the paper's ablation shows adds about 17% on top of caching alone.

CacheRL trains tool-calling agents via cached rollouts at 100× less compute — Cached rollouts for agent RL

TL;DR

What is it: The CacheRL paper (arXiv 2606.14179) trains compact multi-step tool-calling agents with reinforcement learning, but replaces live tool execution during each rollout with a three-tier fuzzy cache — reaching 92% process accuracy, within two points of GPT-5's 94%.
Why it’s needed: RL on tool-using agents is dominated not by the learning but by the cost of actually running every tool call on every practice rollout. Caching those results turns that cost into a lookup, making it possible to train a small agent at ~100× less compute.
vs previous: Standard agent RL executes tools live on every rollout — slow, expensive, and often not reproducible. CacheRL serves them from a cache instead and calibrates the reward to how good each cache match is, so the speedup doesn't poison the training signal.

Jargon

RL rollout: One full practice attempt during reinforcement learning: the agent runs its whole think → call a tool → read the result → repeat loop to the end, and the outcome becomes a training signal. Tool-using rollouts are expensive because each step actually executes a tool.
Multi-step tool-calling agent: An agent that solves a task by making a sequence of tool calls — search, then code, then a lookup — each depending on the last. See AI Agents → Tool Use.
Process accuracy: Grading whether the agent took the right steps along the way, not just whether the final answer matched. It's a stricter bar than outcome accuracy, which a lucky end-state can pass.
Three-tier fuzzy cache: A store of past tool results matched at three levels of strictness — exact, similar, and approximate — so a near-miss query can still be served from memory instead of a live call.
Token-level masking: When comparing a new query to cached ones, the system ignores the tokens that don't change the result (formatting, irrelevant arguments), so cosmetically-different calls still hit the cache.
Cache-tier-aware reward: The reward is down-weighted when the served result came from a looser cache tier, so the agent isn't credited (or blamed) for outcomes built on a shaky approximation. In effect, it is a result cache with a confidence score attached to each entry.
Hybrid thinking trajectory: A data-prep step that augments recorded agent trajectories with LLM-generated reasoning before RL, transferring a stronger model's chain of thought into the small agent. The paper's ablation shows it carries a large share of the final score.
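To make the gap between process accuracy and outcome accuracy concrete, here is a minimal Python sketch. The source does not spell out the paper's grading rule, so the strict "every tool call matches the reference, in order" check below is an assumption, and the tool names and episodes are made up for illustration.

```python
def outcome_correct(final_answer, expected_answer):
    """Outcome grading: only the final answer is compared."""
    return final_answer == expected_answer


def process_correct(calls, expected_calls):
    """Process grading (assumed rule): every tool call must match the
    reference sequence, in order."""
    return list(calls) == list(expected_calls)


def accuracy(flags):
    """Fraction of True values; 0.0 for an empty list."""
    flags = list(flags)
    return sum(flags) / len(flags) if flags else 0.0


# Illustrative reference: search, then code, then a lookup.
reference_calls = ["search", "run_code", "lookup"]
reference_answer = "42"

# Two made-up episodes that both end on the right answer.
episodes = [
    (["search", "run_code", "lookup"], "42"),  # right steps, right answer
    (["search", "lookup"], "42"),              # skipped a step, lucky answer
]

outcome_acc = accuracy(outcome_correct(ans, reference_answer) for _, ans in episodes)
process_acc = accuracy(process_correct(calls, reference_calls) for calls, _ in episodes)
```

Both episodes pass the outcome check, but only the first passes the process check, which is why process accuracy is the stricter bar.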

The news. On June 12, 2026, researchers released CacheRL, a recipe for training small multi-step tool-calling agents with reinforcement learning at roughly 100× less compute. The trick: during RL rollouts, live tool execution is replaced by a three-tier fuzzy cache (exact, similar, and approximate matches served with token-level masking), paired with a cache-tier-aware reward that down-weights credit when the cached result is a loose match. A hybrid thinking-trajectory pipeline first seeds the agent with LLM-generated reasoning. The reported result: 92% process accuracy on multi-step tool calling, within two points of GPT-5's 94%.

Picture teaching a pilot to fly. Every real practice flight burns fuel and books a runway, and reinforcement learning needs the trainee to fly the same kind of mission over and over. The expensive part was never the learning; it was running a real flight for every single attempt. Train a tool-using agent the same way and the bill has the same shape: each rollout fires real tool calls (hitting an API, running code in a sandbox, querying a database), and the compute drains into the calls, not the policy.

CacheRL's move is to put the trainee in a flight simulator. Most of those tool calls repeat: the same lookups, the same code, only slightly reworded between rollouts. So instead of executing each one live, CacheRL serves it from a three-tier fuzzy cache built from earlier executions — an exact match replays the recorded result, a similar match adapts a close one, and an approximate match returns a best-effort guess. Token-level masking is what lets a cosmetically-different call still land a cache hit — it compares the parts of the call that matter and ignores the formatting that doesn't.
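The lookup described above can be sketched in a few lines of Python. This is not the paper's implementation: the source does not say how matches are scored, so the string-similarity measure (`difflib.SequenceMatcher`), the two thresholds, the `IGNORED_ARGS` set, and the class and function names are all assumptions chosen for illustration. The sketch also returns the closest cached result unchanged, where the paper's similar tier is described as adapting it.

```python
import json
from difflib import SequenceMatcher

# Hypothetical argument names assumed not to change a tool's result.
IGNORED_ARGS = {"request_id", "timestamp", "verbose"}


def masked_key(tool, args):
    """Canonical string for a call, with result-irrelevant parts masked out."""
    kept = {}
    for name, value in args.items():
        if name in IGNORED_ARGS:
            continue
        if isinstance(value, str):
            value = " ".join(value.lower().split())  # collapse case and whitespace
        kept[name] = value
    return tool + ":" + json.dumps(kept, sort_keys=True)


class ThreeTierCache:
    def __init__(self, similar_at=0.9, approximate_at=0.6):
        self.entries = {}                     # masked key -> recorded result
        self.similar_at = similar_at          # assumed threshold
        self.approximate_at = approximate_at  # assumed threshold

    def put(self, tool, args, result):
        self.entries[masked_key(tool, args)] = result

    def get(self, tool, args):
        """Return (tier, result); tier is 'exact', 'similar', 'approximate' or 'miss'."""
        key = masked_key(tool, args)
        if key in self.entries:
            return "exact", self.entries[key]
        best_score, best_result = 0.0, None
        for cached_key, result in self.entries.items():
            if not cached_key.startswith(tool + ":"):
                continue  # never reuse a result from a different tool
            score = SequenceMatcher(None, key, cached_key).ratio()
            if score > best_score:
                best_score, best_result = score, result
        if best_score >= self.similar_at:
            return "similar", best_result
        if best_score >= self.approximate_at:
            return "approximate", best_result
        return "miss", None


cache = ThreeTierCache()
cache.put("search", {"query": "capital of France"}, "Paris")
tier_a, result_a = cache.get("search", {"query": "  Capital of   FRANCE ", "request_id": "abc"})
tier_b, result_b = cache.get("search", {"query": "capital of france?"})
tier_c, result_c = cache.get("calculator", {"expr": "1+1"})
```

Here the first lookup is an exact hit because masking removes the casing, spacing, and `request_id` differences; the second differs by one character and lands in the similar tier; the third has no cached entry for that tool and would fall through to a live call.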

A simulator is never perfect, though, and a naive trainer would treat a fuzzy-cached result as if it were ground truth. CacheRL's fix is a cache-tier-aware reward: the credit assigned to a step is scaled by how trustworthy its cache tier was. An outcome built on an exact replay counts in full; one built on an approximate guess counts for less — so the ~100× speedup doesn't quietly corrupt the training signal. The three pieces stack like this:

Piece	What it does	What the ablation shows
Hybrid thinking trajectory	Seeds RL with LLM-generated reasoning added to recorded trajectories	Removing it costs ~41% of performance (paper ablation)
Three-tier fuzzy cache	Serves tool results from cache instead of live execution	~100× less rollout compute (paper)
Cache-tier-aware reward	Reweights credit by how good the cache match was	Adds ~17% on top (paper ablation)
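The reward row in the table can be illustrated with a small Python sketch. The source says only that credit is scaled by how trustworthy the cache tier was, so the numeric weights in `TIER_WEIGHTS` and the function name are assumptions, not values from the paper.

```python
# Assumed trust weights per tier; the source gives no numeric values.
TIER_WEIGHTS = {"live": 1.0, "exact": 1.0, "similar": 0.7, "approximate": 0.4}


def tier_aware_rewards(step_rewards, step_tiers, weights=TIER_WEIGHTS):
    """Scale each step's raw reward by the trust weight of its cache tier."""
    if len(step_rewards) != len(step_tiers):
        raise ValueError("need exactly one tier per step reward")
    return [reward * weights[tier] for reward, tier in zip(step_rewards, step_tiers)]


# Illustrative three-step rollout: raw rewards and the tier that served each step.
raw = [1.0, 1.0, -1.0]
tiers = ["exact", "approximate", "similar"]
scaled = tier_aware_rewards(raw, tiers)
```

Multiplying rather than clipping shrinks both credit and blame, so a penalty earned on a loose match is discounted in the same way as a reward.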

Put the compute on a scale. Say a multi-step task fires 8 tool calls per rollout, and RL grinds through run after run to converge. If each live call is a real API hit or a sandboxed code run, slow and metered, then the live-execution bill dwarfs everything else. CacheRL serves those 8 calls from the three-tier cache instead, collapsing each rollout's tool cost to a lookup. Stack that across every rollout and the paper reports roughly 100× less compute than baseline rollout training (the 8-calls figure is illustrative; the 100× is the paper's). The surprise is how little accuracy that costs: 92% process accuracy, within two points of GPT-5's 94%, from a model small enough to train this cheaply, with the validation reward climbing from 0.43 to 0.78 over training.
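The same arithmetic can be run as a back-of-envelope Python model. Every number here is an assumption for illustration (the per-call costs and the 99% hit rate are not from the paper), and the model counts only tool cost, not the cost of running the policy itself.

```python
def rollout_tool_cost(calls, hit_rate, live_cost, lookup_cost):
    """Expected tool cost of one rollout when a fraction of calls hit the cache."""
    hits = calls * hit_rate
    misses = calls * (1.0 - hit_rate)
    return hits * lookup_cost + misses * live_cost


calls = 8            # illustrative, as in the text
live_cost = 1.0      # assumed cost units per live tool call
lookup_cost = 0.001  # assumed cost units per cache lookup

live = rollout_tool_cost(calls, hit_rate=0.0, live_cost=live_cost, lookup_cost=lookup_cost)
cached = rollout_tool_cost(calls, hit_rate=0.99, live_cost=live_cost, lookup_cost=lookup_cost)
speedup = live / cached  # about 91x under these assumed numbers
```

Under this simple model the miss rate caps the saving: with 1% of calls still executed live, the tool-cost speedup stays below 100× however cheap a lookup becomes.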

Goes deeper in: Agent Engineering → Cost & Latency Engineering → Result caching

Related explainers

GrepSeek — GRPO over shell-command search — the other end of the dial: RL that trains the agent's tool use itself, where CacheRL makes that training cheap
SearchSwarm — distilled delegation — like CacheRL's hybrid-thinking step, it moves a stronger model's behavior into a smaller, cheaper one
EvoMem — patch-based agent memory — caching at inference time rather than training time: reuse what the agent already learned instead of recomputing it

Frequently Asked Questions
