Why do outcome-only rewards mislead agent training?

A single end-of-run reward is what the paper calls 'structurally incomplete.' Because it grades every action by the final result, it punishes useful exploration that happened to sit inside a failed run and rewards redundant or regressive actions that happened to sit inside a successful one. Over a long trajectory the credit lands on the wrong moves, so the agent learns to keep its busywork and drop its good ideas.

How does TRIAGE relate to GRPO?

TRIAGE keeps GRPO's setup and changes only credit assignment. GRPO scores a group of rollouts against each other using the outcome reward; TRIAGE adds a role-typing judge that produces segment-level process rewards on top of that outcome signal. The paper reports higher success than GRPO on ALFWorld, Search-QA, and WebShop while cutting environment-facing turns by 10.4% to 14.8% on completed rollouts: the same tasks solved in fewer, cleaner steps.

TRIAGE cuts agent turns up to 14.8% — Role-typed credit assignment

What is role-typed credit assignment (TRIAGE)?

TRIAGE (arXiv 2606.32017) is a reinforcement-learning method for LLM agents that augments GRPO with a structured judge. Instead of handing the whole run one win-or-lose reward, the judge labels each action segment with one of four roles — decisive progress, useful exploration, no-progress infrastructure, or regression — and fixed rules turn each role into a per-segment process reward. The task's actual success still drives optimization; the roles decide which individual actions get the credit for it.

Jargon

GRPO (Group Relative Policy Optimization): A reinforcement-learning method that runs a batch of attempts, scores them against each other, and trains toward the better ones — using the final outcome as the reward. TRIAGE keeps GRPO and only changes how credit is spread across the actions inside each run. A different GRPO fix attacks the reward from another angle.
Credit assignment: The long-standing RL problem of deciding which of an agent’s many actions deserve the credit (or blame) for how the whole run turned out. Over a long trajectory this is genuinely hard.
Outcome reward (sparse reward): A single win-or-lose signal handed out only at the end of a run, with nothing in between. Simple, but blunt: it cannot tell a good action in a failed run from a bad one.
Process reward (dense supervision): A reward given per step or per segment during the run, not just at the end. TRIAGE manufactures these from each action’s role. Rubric-based process rewards are a related take.
Rollout (trajectory): One full attempt — the whole sequence of actions the agent takes from the start of a task to its finish. RL learns by comparing rollouts.
Action segment: A short chunk of the rollout — one action or a few — that TRIAGE’s judge labels with a single role (progress, exploration, no-progress, or regression) before assigning it credit.
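
To make the GRPO entry above concrete, here is a minimal sketch of group-relative scoring. It assumes the commonly described form of the advantage (each rollout's outcome reward minus the group mean, divided by the group standard deviation); the function name, the `eps` term, and the choice of population standard deviation are illustrative assumptions, not details taken from the TRIAGE paper.

```python
from statistics import mean, pstdev

def group_relative_advantages(outcome_rewards, eps=1e-8):
    """Score each rollout against the rest of its group.

    Assumption: population standard deviation plus a small eps to avoid
    division by zero; real implementations differ in these details.
    """
    mu = mean(outcome_rewards)
    sigma = pstdev(outcome_rewards)
    return [(r - mu) / (sigma + eps) for r in outcome_rewards]

# Four rollouts of the same task: two succeed (1.0), two fail (0.0).
advantages = group_relative_advantages([1.0, 0.0, 1.0, 0.0])

# With outcome-only credit, every action inside a rollout inherits that
# rollout's single advantage, whatever the action actually did.
actions_in_failed_rollout = ["search", "click", "read", "answer"]
per_action_credit = [advantages[1]] * len(actions_in_failed_rollout)
print(advantages)
print(per_action_credit)
```

The second list is the problem the glossary calls credit assignment: all four actions in the failed rollout receive the same negative number.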

The news. On July 1, 2026, the TRIAGE (Role-Typed Credit Assignment) paper (arXiv 2606.32017) proposed augmenting GRPO with a structured judge that sorts each of an agent’s action segments into one of four roles (decisive progress, useful exploration, no-progress infrastructure, or regression) and shapes a segment-level process reward from that role. Across ALFWorld, Search-QA, and WebShop it reports higher success than GRPO while cutting environment-facing turns by 10.4% to 14.8% on completed rollouts.

Picture a chess coach reviewing a game the way annotators really do — a ! for a strong move, a !? for a bold probe that didn't quite work, an = for a move that just marked time, a ?? for a blunder. Now imagine grading every move by the scoreboard instead: you lost, so the coach frowns at all of them; you won, so the coach nods at all of them. That is roughly how most reinforcement learning trains an agent, and it is exactly the trap TRIAGE is built to escape.

An LLM agent solves a task as a sequence of actions — search, click, read, answer — and standard GRPO grades the whole run with a single number: did it succeed? The authors call that signal "structurally incomplete." A failed run can hold the right idea explored a few moves too late, and a successful run can drag in redundant clicks or a step that quietly undid progress. Hand the same win-or-lose reward to every action, and the agent is trained to repeat the wasted moves buried in its wins and to shy away from the smart probes buried in its losses — the credit lands on the wrong actions, and small errors compound into the wrong policy.

TRIAGE's move is to put the coach back in. A structured judge reads the rollout and grades each action, labeling every action segment with one of four roles: decisive progress, useful exploration, no-progress infrastructure, or regression. Fixed role rules then turn each label into a small per-segment reward — credit for real progress and for a genuine probe, nothing for filler, a penalty for a step that gave back ground — while the task's actual success stays the optimization signal on top. The final score still decides whether the run was good; the roles decide which actions get the credit for it.
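
The role rules described above can be sketched as a lookup table. This is a hedged illustration only: the reward magnitudes, the `process_weight` parameter, and the additive way of combining the process reward with the outcome reward are all assumptions made for the example. The text fixes only the signs (credit for progress and probes, nothing for filler, a penalty for regression).

```python
from enum import Enum

class Role(Enum):
    DECISIVE_PROGRESS = "decisive progress"
    USEFUL_EXPLORATION = "useful exploration"
    NO_PROGRESS = "no-progress infrastructure"
    REGRESSION = "regression"

# Hypothetical magnitudes; only the signs follow the description in the text.
ROLE_REWARD = {
    Role.DECISIVE_PROGRESS: 1.0,
    Role.USEFUL_EXPLORATION: 0.5,
    Role.NO_PROGRESS: 0.0,
    Role.REGRESSION: -1.0,
}

def process_rewards(roles):
    """Turn the judge's role labels into one process reward per segment."""
    return [ROLE_REWARD[role] for role in roles]

def shaped_rewards(roles, outcome_reward, process_weight=0.2):
    """Add a small role-based term on top of the run's outcome reward.

    Assumption: a simple weighted sum. The paper may combine the two
    signals differently.
    """
    return [outcome_reward + process_weight * r for r in process_rewards(roles)]

# Labels a judge might assign to four segments of a successful run.
judged = [Role.USEFUL_EXPLORATION, Role.NO_PROGRESS,
          Role.DECISIVE_PROGRESS, Role.REGRESSION]
print(process_rewards(judged))  # [0.5, 0.0, 1.0, -1.0]
print(shaped_rewards(judged, outcome_reward=1.0))
```

Keeping `process_weight` small in this sketch reflects the article's framing: the outcome still decides whether the run was good, and the roles only redistribute credit among its segments.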

Walk one failed run through it (illustrative). Say the agent took 10 action segments and lost. Outcome-only credit sends a negative signal to all 10 — including the 3 that were useful exploration and the 2 that made decisive progress before a late slip. TRIAGE relabels them: the 5 progress-and-probe segments earn positive credit, the 3 no-progress ones get zero, and only the 2 regressions are penalized. Same trajectory, opposite lesson — 5 of the 10 segments now push the policy the right way instead of all 10 pushing it away. On real tasks that re-crediting is what trims the fat: the paper reports 10.4% to 14.8% fewer environment-facing turns on completed rollouts across ALFWorld, Search-QA, and WebShop (arXiv 2606.32017).
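
The same walk-through can be checked in a few lines. This sketch tracks only the sign of the credit each segment receives (+1, 0, or -1), using the illustrative 10-segment failed run from the paragraph above; the helper names are hypothetical and no reward magnitudes from the paper are implied.

```python
from collections import Counter

# The illustrative failed run: 2 decisive progress, 3 useful exploration,
# 3 no-progress infrastructure, 2 regression.
segments = (["decisive progress"] * 2
            + ["useful exploration"] * 3
            + ["no-progress infrastructure"] * 3
            + ["regression"] * 2)

# Sign of the role-typed credit, as described in the text.
ROLE_SIGN = {
    "decisive progress": 1,
    "useful exploration": 1,
    "no-progress infrastructure": 0,
    "regression": -1,
}

def outcome_only_signs(segments, run_succeeded):
    """Every segment inherits the sign of the final result."""
    sign = 1 if run_succeeded else -1
    return [sign for _ in segments]

def role_typed_signs(segments):
    """Each segment gets the sign attached to its own role."""
    return [ROLE_SIGN[role] for role in segments]

outcome_counts = Counter(outcome_only_signs(segments, run_succeeded=False))
role_counts = Counter(role_typed_signs(segments))
print(outcome_counts)  # Counter({-1: 10})
print(role_counts)     # Counter({1: 5, 0: 3, -1: 2})
```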

Segment role	What it is	Outcome-only credit	TRIAGE role-typed credit
decisive progress (!)	a move that clearly advances the task	credited only if the run won	credited on its own merit
useful exploration (!?)	a reasonable probe that didn't pan out	punished if the run lost	protected: the right idea still earns credit
no-progress infrastructure (=)	setup or busywork that neither helps nor hurts	rewarded if the run won	zeroed out: no free credit for filler
regression (??)	a move that undoes earlier progress	rewarded if the run won	penalized even inside a win

Because the agent stops being rewarded for busywork and stops being punished for good ideas, it converges on shorter, cleaner solutions — the same success reached in fewer turns, which is the whole cost story for a production agent that pays per tool call and per token. The bigger point isn't the benchmark delta: it's that a single end-of-run reward can be too blunt to teach a long-horizon agent, and typing each action by its role is a cheap way to hand the policy the credit it was missing (arXiv 2606.32017).
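
As a rough back-of-the-envelope illustration of the cost point, the sketch below applies the reported 10.4% and 14.8% turn reductions to a made-up baseline. The 20-turn baseline and the per-turn cost are hypothetical placeholders, not figures from the paper; only the two percentages come from the source.

```python
def turns_after_reduction(baseline_turns, reduction):
    """Average turns per completed rollout after a fractional reduction."""
    return baseline_turns * (1.0 - reduction)

BASELINE_TURNS = 20.0  # hypothetical average turns per completed rollout
COST_PER_TURN = 0.01   # hypothetical cost per tool call, in arbitrary units

for reduction in (0.104, 0.148):  # range reported in the paper
    turns = turns_after_reduction(BASELINE_TURNS, reduction)
    saved = (BASELINE_TURNS - turns) * COST_PER_TURN
    print(f"{reduction:.1%} fewer turns: {turns:.2f} turns, "
          f"{saved:.4f} saved per rollout")
```

Actual savings depend on how many turns a task takes and on what each tool call and token costs in a given deployment.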


Related explainers

LongTraceRL — rubric process reward — another way to hand an agent dense, per-step reward instead of one end-of-run score
DRPO — smooth trust-region penalty — a different fix to how GRPO shapes its gradient at the trust-region boundary
VPO — vector-reward vs GRPO — attacks GRPO's blunt scalar reward by making the advantage a vector
CacheRL — cached rollouts for agent RL — how agent RL trims the cost of collecting all those rollouts


