The news. On June 25, 2026, researchers posted OPID (On-Policy Skill Distillation), a recipe that adds dense guidance to sparse-reward agentic RL. Rather than relying on an outside teacher or a hand-written rubric, OPID mines skills directly from the agent's own completed on-policy trajectories at two granularities: an episode-level strategy for the whole task (the global workflows that worked, plus failure-avoidance rules) and step-level decisions for the moments that mattered. A critical-first routing mechanism then chooses which skill applies at each step, and the chosen skill is blended into training as a token-level self-distillation signal. OPID is evaluated on ALFWorld, WebShop, and search-based QA.
Picture a team that just won a hard match. Under the usual rule, all they get back is the final scoreboard: a win. Train an agent with outcome-based reinforcement learning and the feedback is just as thin. After a long run of tool calls and decisions, it learns one thing: the task succeeded, or it didn't. That single end-of-run signal is a sparse reward, and on a long task it cannot say which of the dozen intermediate decisions actually earned the win. The credit smears across the whole trajectory, which is the credit-assignment problem that makes long-horizon agents hard to train.
OPID's move is to roll the game film. The team's own match film already contains the answer: the whole-game plan that worked, the few clutch moves that swung it, and the mistakes worth never repeating. OPID extracts these from the agent's own completed trajectories — an episode-level skill (the high-level strategy, including the failure-avoidance rules for what to skip) and step-level skills (which intermediate actions mattered). Crucially, the film is the team's own: there is no star team's playbook to copy and no human-written rubric to score against. The supervision comes from the agent's own runs, distilled back into itself.
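To make the two granularities concrete, here is a minimal sketch of how the mined skills for one trajectory might be stored. The class names (`EpisodeSkill`, `StepSkill`, `TrajectorySkills`), their fields, and the example task text are all hypothetical; the source describes what is extracted, not a data format.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class EpisodeSkill:
    """Hypothetical container for the whole-task strategy."""
    workflow: str                                          # the high-level plan that worked
    avoid_rules: List[str] = field(default_factory=list)   # failure-avoidance rules


@dataclass
class StepSkill:
    """Hypothetical container for one decision that mattered."""
    step_index: int
    guidance: str


@dataclass
class TrajectorySkills:
    """Both skill levels mined from a single completed run (assumed layout)."""
    episode: EpisodeSkill
    steps: Dict[int, StepSkill] = field(default_factory=dict)

    def add_step(self, step_index: int, guidance: str) -> None:
        # Record a step-level skill, keyed by the step it applies to.
        self.steps[step_index] = StepSkill(step_index, guidance)


# Illustrative content only; not taken from the paper.
skills = TrajectorySkills(
    episode=EpisodeSkill(
        workflow="locate the object, transform it, then place it",
        avoid_rules=["do not re-open containers already searched"],
    )
)
skills.add_step(4, "check the object is in hand before moving on")
```

Keeping the episode-level skill separate from the per-step entries mirrors the text: one strategy covers the whole run, while step-level skills exist only for the moments that mattered.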
Two skill levels raise a question — at any given decision, which one should guide the agent? OPID answers with a critical-first routing mechanism, like a coach who reviews the pivotal moments first, that selects whether the episode-level strategy or a step-level move applies at each step. That chosen skill becomes a token-level self-distillation target, blended with the usual end-of-run reward signal from RL. The result is dense guidance on the intermediate steps the sparse final reward could never reach — and because it is distilled from the agent's own runs, it costs no separate reward model and no labelled rubric.
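The routing and blending described above can be sketched in a few lines. This is one reading of "critical-first", not the paper's implementation: it assumes a step flagged as critical prefers its step-level skill and every other step falls back to the episode-level strategy. The names `Step`, `route_skill`, `token_kl`, `blended_loss`, the `is_critical` flag, the KL form of the distillation term, and the `distill_weight` value are all assumptions made for illustration.

```python
import math
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Step:
    index: int
    is_critical: bool                  # hypothetical flag set by the router
    step_skill: Optional[str] = None   # step-level skill, if one was mined


def route_skill(step: Step, episode_skill: str) -> str:
    """Assumed critical-first rule: step-level skill on critical steps,
    otherwise the episode-level strategy."""
    if step.is_critical and step.step_skill is not None:
        return step.step_skill
    return episode_skill


def token_kl(teacher_probs: List[float], student_probs: List[float]) -> float:
    """KL(teacher || student) over one token's distribution.
    Assumes every student probability is strictly positive."""
    return sum(p * math.log(p / q)
               for p, q in zip(teacher_probs, student_probs) if p > 0)


def blended_loss(rl_loss: float,
                 teacher_token_probs: List[List[float]],
                 student_token_probs: List[List[float]],
                 distill_weight: float = 0.5) -> float:
    """Outcome-RL loss plus a weighted, token-averaged distillation term.
    The 'teacher' here stands for the same policy conditioned on the routed skill."""
    kls = [token_kl(t, s) for t, s in zip(teacher_token_probs, student_token_probs)]
    distill = sum(kls) / len(kls) if kls else 0.0
    return rl_loss + distill_weight * distill


chosen = route_skill(Step(3, True, "verify the item before buying"), "search, compare, buy")
fallback = route_skill(Step(5, False), "search, compare, buy")
```

The design point the sketch tries to capture is that the sparse reward is kept, not replaced: the distillation term only adds per-token pressure on top of it.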
Put the supervision on a scale, holding the run fixed. Take a successful 12-step agent trajectory (illustrative). Plain outcome RL hands back 1 signal for the whole thing: one scalar to explain twelve decisions, so no individual decision gets targeted feedback. OPID instead emits an episode-level skill plus a step-level skill at each decision the critical-first router flags, so the same trajectory now carries roughly a dozen graded touch-points instead of one (illustrative). The trajectory count never changed; what changed is the resolution of the signal riding on it, from a single scalar at the end to guidance threaded through the whole run.
| Aspect | Outcome-based RL | OPID (On-Policy Skill Distillation) |
|---|---|---|
| Where the signal lands | Final outcome only | Every flagged step + the outcome |
| Supervision source | The environment's end reward | The agent's own completed runs (self-distilled) |
| Skill granularity | None — one scalar | Episode-level + step-level (two) |
| Which guidance applies | n/a | Critical-first routing picks per step |
| Intermediate steps | Credit smeared across the run | Dense, per flagged step |
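The comparison in the table can be tallied for the illustrative 12-step run. This is a rough counting sketch, assuming every step receives one routed skill target (step-level where the router flags it, episode-level elsewhere); the function name `supervision_tally` and the choice of flagged steps are hypothetical.

```python
from typing import Dict, Iterable


def supervision_tally(num_steps: int, flagged_steps: Iterable[int]) -> Dict[str, int]:
    """Count supervision signals on one fixed trajectory under each scheme."""
    flagged = set(flagged_steps)
    step_level = sum(1 for i in range(num_steps) if i in flagged)
    return {
        "outcome_rl_signals": 1,                               # one scalar at the end
        "opid_step_level_targets": step_level,                 # flagged decisions
        "opid_episode_level_targets": num_steps - step_level,  # assumed fallback
        "opid_outcome_signals": 1,                             # the end reward is kept
    }


# Illustrative: three of the twelve decisions are flagged as critical.
tally = supervision_tally(12, flagged_steps=[2, 6, 10])
```

The trajectory is the same in both columns; only the number of places where a training signal lands changes.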
Related explainers
- LongTraceRL — Rubric reward — the other way to densify a sparse RL signal: an external rubric grades each hop, where OPID instead mines the signal from the agent's own runs.
- CacheRL — Cached rollouts for agent RL — a different lever on the same agentic-RL pipeline: CacheRL makes the rollouts cheaper to run, where OPID makes the learning signal on them denser.
- CoPD — Reinforcement Learning with Verifiable Rewards (RLVR) — the outcome-only reward baseline OPID layers dense guidance on top of.