Why does OPID matter for agentic RL?

In outcome-based agentic RL, long-horizon agents are trained on a single success-or-failure reward at the end of a run, so credit for the right intermediate decisions smears across the whole trajectory. OPID gives dense, per-step guidance mined from the agent's own runs, which targets exactly the credit-assignment gap that makes long-horizon agents brittle.

How is OPID different from a reward model or a rubric?

A learned reward model or a hand-written rubric supplies process feedback from an outside source. OPID's supervision is self-distilled from the agent's own on-policy runs, so it needs no separate grader and stays on the policy's current distribution. A critical-first router then decides which skill guides each step.

OPID extracts step- and episode-level skills to guide agentic RL training — On-Policy Skill Distillation

What is On-Policy Skill Distillation (OPID)?

OPID is a training recipe for agentic reinforcement learning that extracts reusable skills from an agent's own completed trajectories — an episode-level strategy (including failure-avoidance rules) and step-level decisions — and distills them back into the model as a dense, token-level signal. It adds intermediate guidance on top of the usual sparse outcome reward, with no external teacher or rubric.

Jargon

Agentic RL: Reinforcement learning where the thing being trained is a tool-using agent running a long think → act → observe loop, not a single-shot answer. The reward usually lands only when the whole task finishes.
Sparse / outcome reward: A reward delivered once, at the very end of a run — success or failure — with nothing in between. It is stable and hard to game, but it is a single bit of feedback for an entire trajectory.
Credit assignment: The problem of working out which earlier decision earned a later reward. With only an end-of-run signal, credit smears across every step, so the agent cannot tell the decisive moves from the lucky ones.
On-policy: Training on trajectories the current policy itself just generated, rather than logged or third-party data. OPID's skills are mined from the agent's own fresh runs, so the supervision stays on-distribution.
Episode-level vs step-level skill: OPID's two granularities of extracted skill: the episode-level skill is the high-level strategy — the global workflow that worked and failure-avoidance rules for what to skip; step-level skills are the specific intermediate actions that mattered.
Critical-first routing: OPID's mechanism for deciding which skill applies at each decision point — prioritising the pivotal steps first — so the episode strategy or a step-level move is selected where it helps most.
Token-level self-distillation: Turning the chosen skill into a per-token training target the model learns from its own outputs, blended with the usual RL outcome advantages. No separate teacher network is involved.
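
To make the last entry concrete, here is a minimal sketch of how a per-token distillation term might be blended with an outcome advantage. It is an assumption-laden illustration, not the paper's loss: the function name `blended_token_loss`, the `distill_weight` parameter, the use of a KL term, and the idea that the target distribution comes from the same model conditioned on the chosen skill are all hypothetical choices made for this example.

```python
import math


def softmax(logits: list[float]) -> list[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def kl_divergence(p: list[float], q: list[float]) -> float:
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def blended_token_loss(
    student_logits: list[float],
    skill_logits: list[float],
    sampled_token: int,
    advantage: float,
    distill_weight: float = 0.5,  # hypothetical blending coefficient
) -> float:
    """Loss for one token: policy-gradient term plus a distillation term.

    student_logits: the policy's logits without the skill in context.
    skill_logits:   logits from the same model with the skill in context
                    (an assumption about how the target is produced).
    """
    student = softmax(student_logits)
    target = softmax(skill_logits)
    # Sparse outcome signal: the same advantage for every token in the run.
    pg_term = -advantage * math.log(student[sampled_token])
    # Dense signal: pull the policy toward the skill-conditioned target.
    distill_term = kl_divergence(target, student)
    return pg_term + distill_weight * distill_term


baseline = blended_token_loss([1.0, 2.0, 0.5], [1.0, 2.0, 0.5], 1, 1.0)
guided = blended_token_loss([1.0, 2.0, 0.5], [0.0, 5.0, 0.0], 1, 1.0)
print(baseline, guided)
```

When the skill-conditioned target matches the policy, the distillation term vanishes and only the outcome term remains; the further the two diverge, the more the dense term contributes.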

The news. On June 25, 2026, researchers posted OPID (On-Policy Skill Distillation), a recipe that adds dense guidance to sparse-reward agentic RL. Rather than relying on an outside teacher or a hand-written rubric, OPID mines skills directly from the agent's own completed on-policy trajectories at two granularities: an episode-level strategy for the whole task (the global workflows that worked, plus failure-avoidance rules) and step-level decisions for the moments that mattered. It then uses a critical-first routing mechanism to choose which skill applies at each step and blends that skill into training as a token-level self-distillation signal. It is evaluated on ALFWorld, WebShop, and search-based QA.

Picture a team that just won a hard match. Under the usual rule, all they get back is the final scoreboard: a win. Train an agent with outcome-based reinforcement learning and the feedback is just as thin: after a long run of tool calls and decisions, it learns one thing, whether the task succeeded or not. That single end-of-run signal is a sparse reward, and on a long task it cannot say which of the dozen intermediate decisions actually earned the win. The credit smears across the whole trajectory, which is exactly the credit-assignment gap that makes long-horizon agents brittle.

OPID's move is to roll the game film. The team's own match film already contains the answer: the whole-game plan that worked, the few clutch moves that swung it, and the mistakes worth never repeating. OPID extracts these from the agent's own completed trajectories — an episode-level skill (the high-level strategy, including the failure-avoidance rules for what to skip) and step-level skills (which intermediate actions mattered). Crucially, the film is the team's own: there is no star team's playbook to copy and no human-written rubric to score against. The supervision comes from the agent's own runs, distilled back into itself.
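
The two skill granularities described above could be held in a structure like the following. This is only a sketch of the shape of the data: the class names `EpisodeSkill`, `StepSkill`, and `SkillSet`, their fields, and the example strings are hypothetical, and how OPID actually extracts and stores skills is not specified here.

```python
from dataclasses import dataclass, field


@dataclass
class EpisodeSkill:
    """High-level strategy for the whole task (hypothetical schema)."""
    strategy: str
    avoid_rules: list[str] = field(default_factory=list)  # failure-avoidance rules


@dataclass
class StepSkill:
    """Guidance for one intermediate decision that mattered (hypothetical schema)."""
    step_index: int
    guidance: str


@dataclass
class SkillSet:
    """Everything distilled from one completed on-policy trajectory."""
    episode: EpisodeSkill
    steps: dict[int, StepSkill] = field(default_factory=dict)


# Made-up example content in the style of a household task.
skills = SkillSet(
    episode=EpisodeSkill(
        strategy="Locate the object, clean it, then place it at the target.",
        avoid_rules=["Do not reopen containers that were already searched."],
    )
)
skills.steps[3] = StepSkill(3, "Go to the sink before attempting to clean.")
print(skills.episode.strategy, len(skills.steps))
```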

Two skill levels raise a question — at any given decision, which one should guide the agent? OPID answers with a critical-first routing mechanism, like a coach who reviews the pivotal moments first, that selects whether the episode-level strategy or a step-level move applies at each step. That chosen skill becomes a token-level self-distillation target, blended with the usual end-of-run reward signal from RL. The result is dense guidance on the intermediate steps the sparse final reward could never reach — and because it is distilled from the agent's own runs, it costs no separate reward model and no labelled rubric.
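
One way to read the routing rule above is sketched below, under a strong simplifying assumption: steps flagged as critical receive their step-level skill when one exists, and every other step falls back to the episode-level strategy. The function `route_skill`, its arguments, and the way critical steps are identified are hypothetical; the actual criterion OPID uses is not described here.

```python
def route_skill(
    step_index: int,
    step_skills: dict[int, str],
    episode_skill: str,
    critical_steps: set[int],
) -> tuple[str, str]:
    """Pick the skill that guides one step (illustrative rule only).

    Critical steps are served first: if the step is flagged and a
    step-level skill exists for it, that skill wins. Otherwise the
    episode-level strategy is used.
    """
    if step_index in critical_steps and step_index in step_skills:
        return ("step", step_skills[step_index])
    return ("episode", episode_skill)


# Made-up inputs for a short trajectory.
step_skills = {3: "Go to the sink before attempting to clean."}
episode_skill = "Locate the object, clean it, then place it at the target."
critical = {3}

for t in range(5):
    level, text = route_skill(t, step_skills, episode_skill, critical)
    print(t, level, text)
```

In this toy run only step 3 is routed to a step-level skill; the remaining steps are guided by the episode-level strategy.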

Put the supervision on a scale, holding the run fixed. Take a successful 12-step agent trajectory (illustrative). Plain outcome RL hands back one signal for the whole thing: a single scalar to explain twelve decisions, so no individual decision gets targeted feedback. OPID instead supplies an episode-level skill plus a step-level skill at each decision the critical-first router flags, so the same trajectory now carries roughly a dozen guided touch-points instead of one (illustrative). The trajectory count never changed; what changed is the resolution of the signal riding on it, from one bit at the end to guidance threaded through the whole run.
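
The same comparison can be written as a small count. The numbers are the illustrative ones from the paragraph above, and the sketch assumes every step ends up guided by either the episode-level or a step-level skill; the helper `signal_resolution` is hypothetical.

```python
def signal_resolution(num_steps: int, guided_steps: set[int]) -> dict[str, int]:
    """Count how many steps in one trajectory receive their own guidance."""
    return {
        "outcome_signals": 1,  # the single end-of-run reward is always present
        "guided_steps": len(guided_steps),
        "unguided_steps": num_steps - len(guided_steps),
    }


NUM_STEPS = 12  # illustrative trajectory length

# Plain outcome RL: no step gets guidance of its own.
plain = signal_resolution(NUM_STEPS, set())

# OPID-style training, assuming the router assigns a skill to every step.
opid = signal_resolution(NUM_STEPS, set(range(NUM_STEPS)))

print(plain)
print(opid)
```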

Aspect	Outcome-based RL	OPID (On-Policy Skill Distillation)
Where the signal lands	Final outcome only	Every flagged step + the outcome
Supervision source	The environment's end reward	The agent's own completed runs (self-distilled)
Skill granularity	None — one scalar	Episode-level + step-level (two)
Which guidance applies	n/a	Critical-first routing picks per step
Intermediate steps	Credit smeared across the run	Dense, per flagged step


Related explainers

LongTraceRL — Rubric reward — the other way to densify a sparse RL signal: an external rubric grades each hop, where OPID instead mines the signal from the agent's own runs.
CacheRL — Cached rollouts for agent RL — a different lever on the same agentic-RL pipeline: CacheRL makes the rollouts cheaper to run, where OPID makes the learning signal on them denser.
CoPD — Reinforcement Learning with Verifiable Rewards (RLVR) — the outcome-only reward baseline OPID layers dense guidance on top of.

