The news. On June 12, 2026, researchers released CacheRL, a recipe for training small multi-step tool-calling agents with reinforcement learning at roughly 100× less compute. The trick: during RL rollouts, live tool execution is replaced by a three-tier fuzzy cache — exact, similar, and approximate matches served with token-level masking — paired with a cache-tier-aware reward that down-weights credit when the cached result is a loose match. A hybrid thinking-trajectory pipeline first seeds the agent with LLM-generated reasoning. The reported result: 92% process accuracy on multi-step tool calling, within two points of GPT-5's 94%. Read the paper →
Picture teaching a pilot to fly. Every real practice flight burns fuel and books a runway — and reinforcement learning needs the trainee to fly the same kind of mission over and over. The expensive part was never the learning; it was running a real flight for every single attempt. Train a tool-using agent the same way and the bill is identical: each rollout fires real tool calls — hitting an API, running code in a sandbox, querying a database — and the compute drains into the calls, not the policy.
CacheRL's move is to put the trainee in a flight simulator. Most of those tool calls repeat: the same lookups, the same code, only slightly reworded between rollouts. So instead of executing each one live, CacheRL serves it from a three-tier fuzzy cache built from earlier executions — an exact match replays the recorded result, a similar match adapts a close one, and an approximate match returns a best-effort guess. Token-level masking is what lets a cosmetically-different call still land a cache hit — it compares the parts of the call that matter and ignores the formatting that doesn't.
A simulator is never perfect, though, and a naive trainer would treat a fuzzy-cached result as if it were ground truth. CacheRL's fix is a cache-tier-aware reward: the credit assigned to a step is scaled by how trustworthy its cache tier was. An outcome built on an exact replay counts in full; one built on an approximate guess counts for less — so the ~100× speedup doesn't quietly corrupt the training signal. The three pieces stack like this:
| Piece | What it does | What the ablation shows |
|---|---|---|
| Hybrid thinking trajectory | Seeds RL with LLM-generated reasoning added to recorded trajectories | Removing it costs ~41% of performance (paper ablation) |
| Three-tier fuzzy cache | Serves tool results from cache instead of live execution | ~100× less rollout compute (paper) |
| Cache-tier-aware reward | Reweights credit by how good the cache match was | Adds ~17% on top (paper ablation) |
Put the compute on a scale. Say a multi-step task fires 8 tool calls per rollout, and RL grinds through run after run to converge; if each live call is a real API hit or a sandboxed code run — slow and metered — then the live-execution bill dwarfs everything else. CacheRL serves those 8 calls from the three-tier cache instead, collapsing each rollout's tool cost to a lookup. Stack that across every rollout and the paper reports roughly 100× less compute than baseline rollout training (the 8-calls figure is illustrative; the 100× is the paper's). What you buy back is the surprise: 92% process accuracy, within two points of GPT-5's 94%, from a model small enough to train this cheaply — with the validation reward climbing from 0.43 to 0.78 over training.
Goes deeper in: Agent Engineering → Cost & Latency Engineering → Result caching
Related explainers
- GrepSeek — GRPO over shell-command search — the other end of the dial: RL that trains the agent's tool use itself, where CacheRL makes that training cheap
- SearchSwarm — distilled delegation — like CacheRL's hybrid-thinking step, it moves a stronger model's behavior into a smaller, cheaper one
- EvoMem — patch-based agent memory — caching at inference time rather than training time: reuse what the agent already learned instead of recomputing it