The news. On June 25, 2026, researchers posted OPID (On-Policy Skill Distillation), a recipe that adds dense guidance to sparse-reward agentic RL. Rather than relying on an outside teacher or a hand-written rubric, OPID mines skills directly from the agent's own completed on-policy trajectories at two granularities: an episode-level strategy for the whole task (the global workflows that worked, plus failure-avoidance rules) and step-level decisions for the moments that mattered. A critical-first routing mechanism then chooses which skill applies at each step, and the chosen skill is blended into training as a token-level self-distillation signal. OPID is evaluated on ALFWorld, WebShop, and search-based QA.

Picture a team that just won a hard match. Under the usual rule, all they get back is the final scoreboard: a win. Train an agent with outcome-based reinforcement learning and the feedback is just as thin. After a long run of tool calls and decisions, it learns one thing: the task succeeded, or it didn't. That single end-of-run signal is a sparse reward, and on a long task it cannot say which of the dozen intermediate decisions actually earned the win. The credit smears across the whole trajectory. This is the classic credit-assignment problem, and on long horizons, where early errors compound, it is a large part of what makes agents brittle.

OPID's move is to roll the game film. The team's own match film already contains the answer: the whole-game plan that worked, the few clutch moves that swung it, and the mistakes worth never repeating. OPID extracts these from the agent's own completed trajectories — an episode-level skill (the high-level strategy, including the failure-avoidance rules for what to skip) and step-level skills (which intermediate actions mattered). Crucially, the film is the team's own: there is no star team's playbook to copy and no human-written rubric to score against. The supervision comes from the agent's own runs, distilled back into itself.
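One way to picture what comes off the film is as two small record types, one per granularity. The sketch below is illustrative only: the class names `EpisodeSkill`, `StepSkill`, and `SkillBank`, their fields, and the household-task example text are all hypothetical and are not taken from the paper.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class EpisodeSkill:
    """Whole-task strategy mined from completed runs (hypothetical schema)."""
    workflow: str                                          # the global plan that worked
    avoid_rules: list[str] = field(default_factory=list)   # failure-avoidance rules


@dataclass
class StepSkill:
    """Guidance for one pivotal decision (hypothetical schema)."""
    step_index: int   # which step of the trajectory it was mined from
    situation: str    # what the agent was facing at that step
    decision: str     # the move that mattered


@dataclass
class SkillBank:
    """Both granularities for one task, keyed so a router can look them up."""
    episode: EpisodeSkill
    steps: dict[int, StepSkill] = field(default_factory=dict)

    def add_step(self, skill: StepSkill) -> None:
        self.steps[skill.step_index] = skill


# Invented example content, for illustration only.
bank = SkillBank(
    EpisodeSkill(
        workflow="locate the object, clean it, then place it",
        avoid_rules=["do not re-search containers already checked"],
    )
)
bank.add_step(StepSkill(3, "mug found in cabinet 2", "take the mug before walking to the sink"))
```

Keeping the two levels in separate records mirrors the split described above: one strategy per episode, and zero or more step-level entries for the moments that mattered.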

Two skill levels raise a question — at any given decision, which one should guide the agent? OPID answers with a critical-first routing mechanism, like a coach who reviews the pivotal moments first, that selects whether the episode-level strategy or a step-level move applies at each step. That chosen skill becomes a token-level self-distillation target, blended with the usual end-of-run reward signal from RL. The result is dense guidance on the intermediate steps the sparse final reward could never reach — and because it is distilled from the agent's own runs, it costs no separate reward model and no labelled rubric.
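The routing and blending described above can be sketched in a few lines. This is a minimal sketch under stated assumptions, not the paper's implementation: the rule "use a step-level skill if one exists for this step, otherwise fall back to the episode-level strategy" is an assumed reading of critical-first routing; the teacher distribution is assumed to be the same policy conditioned on the routed skill; and the KL direction, the weight `beta`, and all function names are hypothetical.

```python
import math


def route_skill(step, step_skills, episode_skill):
    """Assumed critical-first rule: a step-level skill wins when one
    exists for this step; otherwise the episode-level strategy applies."""
    return step_skills.get(step, episode_skill)


def token_kl(student_probs, teacher_probs):
    """KL(teacher || student) over the vocabulary for a single token.
    Student probabilities are assumed strictly positive (e.g. softmax output)."""
    return sum(
        t * math.log(t / s)
        for t, s in zip(teacher_probs, student_probs)
        if t > 0
    )


def blended_loss(rl_loss, student, teacher, beta=0.5):
    """Outcome-based RL loss plus the mean per-token distillation term.
    `student` and `teacher` are lists of per-token distributions."""
    distill = sum(token_kl(s, t) for s, t in zip(student, teacher)) / len(student)
    return rl_loss + beta * distill


# Step 3 has its own skill; step 4 falls back to the episode strategy.
step_skills = {3: "take the mug before walking to the sink"}
episode_skill = "locate the object, clean it, then place it"
chosen_at_3 = route_skill(3, step_skills, episode_skill)
chosen_at_4 = route_skill(4, step_skills, episode_skill)
```

The point of the sketch is the shape of the objective: the sparse RL term is left in place, and the distillation term adds a per-token target wherever a skill has been routed.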

Put the supervision on a scale, holding the run fixed. Take a successful 12-step agent trajectory (illustrative). Plain outcome RL hands back 1 signal for the whole thing: one scalar to explain twelve decisions, so no individual decision gets targeted feedback. OPID instead emits an episode-level skill plus a step-level skill at each decision the critical-first router flags, so the same trajectory now carries roughly a dozen graded touch-points instead of one (illustrative). The trajectory count never changed; what changed is the resolution of the signal riding on it, from one bit at the end to guidance threaded through the whole run.
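The same accounting can be written out as a tiny worked example. It assumes, purely for illustration, that the router flags every one of the 12 steps; the function name `touch_points` and the counting rule (one signal for the outcome, plus one per flagged step) are hypothetical simplifications.

```python
def touch_points(flagged_steps, dense):
    """Count feedback signals on one trajectory: the final outcome,
    plus one per router-flagged step when dense guidance is on."""
    return 1 + (len(flagged_steps) if dense else 0)


steps = range(12)        # the 12-step run from the text
flagged = set(steps)     # assumption: the router flags every step

sparse_count = touch_points(flagged, dense=False)  # outcome-only RL
dense_count = touch_points(flagged, dense=True)    # outcome + per-step guidance
```

If the router flags fewer steps, the dense count shrinks accordingly, but it never drops below the single outcome signal that plain RL provides.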

| Aspect | Outcome-based RL | OPID (On-Policy Skill Distillation) |
| --- | --- | --- |
| Where the signal lands | Final outcome only | Every flagged step + the outcome |
| Supervision source | The environment's end reward | The agent's own completed runs (self-distilled) |
| Skill granularity | None (one scalar) | Episode-level + step-level (two) |
| Which guidance applies | n/a | Critical-first routing picks per step |
| Intermediate steps | Credit smeared across the run | Dense, per flagged step |
