LongTraceRL — Rubric process reward
The news. On May 29, 2026, researchers from Tsinghua's THU-KEG group posted LongTraceRL, a recipe for generating long-context reasoning training data from search-agent trajectories. It builds multi-hop questions by random-walking a knowledge graph, mines graded distractors from the agent's own runs, and supervises training with an entity-level rubric reward. The recipe is evaluated on three reasoning models from 4B to 30B parameters across five long-context benchmarks.
Picture the examiner again. Under the old rule, you drive a long route and get one stamp at the destination: passed or failed. If you fail, the examiner cannot tell you whether you blew the third turn or the seventh — the verdict is smeared across the whole drive. That single end-of-route stamp is exactly what plain outcome reward gives a multi-hop reasoning model: a correct four-hop answer earns one scalar, and the gradient cannot separate the hop that did the real work from the hops that coasted. On a long chain, errors compound turn by turn, and a single bit of feedback is too coarse to fix them.
LongTraceRL's rubric reward puts a grader at every turn. Because each question is grown from a knowledge-graph random walk, the paper knows which entity each hop is supposed to surface. The reward then checks, step by step, whether that entity actually shows up where it should — turning one end-of-chain signal into a dense, per-hop signal. This is the difference between process supervision and outcome supervision: reward the reasoning, not just the string at the end.
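To make the "the paper knows which entity each hop is supposed to surface" point concrete, here is a minimal sketch of a random walk over a toy knowledge graph. The graph contents, the `random_walk` helper, and the rule of never revisiting an entity are all assumptions made for illustration, not the paper's construction; the only point is that the walk itself yields the ordered list of entities a rubric can later check.

```python
import random

# Hypothetical toy knowledge graph: entity -> list of (relation, neighbor).
TOY_KG = {
    "Marie Curie": [("born_in", "Warsaw")],
    "Warsaw": [("capital_of", "Poland")],
    "Poland": [("member_of", "European Union")],
    "European Union": [("headquartered_in", "Brussels")],
    "Brussels": [],
}


def random_walk(kg, start, hops, rng):
    """Follow up to `hops` random edges from `start`.

    Returns the ordered (relation, entity) steps. Visited entities are
    skipped so the chain never loops back on itself (an assumption of
    this sketch).
    """
    path, current, visited = [], start, {start}
    for _ in range(hops):
        options = [(rel, ent) for rel, ent in kg.get(current, []) if ent not in visited]
        if not options:
            break  # dead end: the walk is shorter than requested
        relation, nxt = rng.choice(options)
        path.append((relation, nxt))
        visited.add(nxt)
        current = nxt
    return path


walk = random_walk(TOY_KG, "Marie Curie", hops=4, rng=random.Random(0))
# The entity reached at each hop is the per-hop target for the rubric.
gold_entities = [entity for _, entity in walk]
gold_answer = gold_entities[-1]
```

Because this toy graph is a single chain, the walk always produces the same four entities; on a real graph the branching at each node is what makes the generated questions varied.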
The clever part is the catch. If you reward "the right entity appeared at this step," a lazy model could just name-drop all the entities and never solve anything — classic reward hacking. LongTraceRL closes that door by gating the rubric to correct rollouts: the per-hop rewards only count when the final answer is right. Drive recklessly but happen to clip every checkpoint, and you still fail — the turns only score if you actually arrive. So the rubric densifies the signal for genuine solutions while the pass/fail check on the answer stays the hard gate that keeps the model honest.
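The gating rule can be sketched in a few lines. This is an illustration under stated assumptions, not the paper's implementation: the function name `rubric_reward`, the case-insensitive substring match, the one-reasoning-step-per-hop alignment, and returning the outcome and per-hop scores separately (rather than combining them with some weighting) are all choices made here for clarity.

```python
def rubric_reward(steps, gold_entities, answer, gold_answer):
    """Score one rollout; returns (outcome, per_hop_scores).

    Assumes steps[i] is the text of the reasoning step for hop i and
    that an entity "shows up" if its name appears as a substring.
    """
    outcome = 1.0 if answer.strip().lower() == gold_answer.strip().lower() else 0.0

    hits = []
    for i, entity in enumerate(gold_entities):
        step_text = steps[i] if i < len(steps) else ""  # missing step scores 0
        hits.append(1.0 if entity.lower() in step_text.lower() else 0.0)

    # The gate: per-hop credit only counts when the final answer is right.
    per_hop = hits if outcome == 1.0 else [0.0] * len(hits)
    return outcome, per_hop


gold = ["Warsaw", "Poland", "European Union", "Brussels"]

honest = rubric_reward(
    steps=[
        "Marie Curie was born in Warsaw.",
        "Warsaw is the capital of Poland.",
        "Poland is a member of the European Union.",
        "The European Union is headquartered in Brussels.",
    ],
    gold_entities=gold,
    answer="Brussels",
    gold_answer="Brussels",
)

# A name-dropping rollout: every entity appears in every step, wrong answer.
name_dropper = rubric_reward(
    steps=["Warsaw, Poland, European Union, Brussels."] * 4,
    gold_entities=gold,
    answer="Paris",
    gold_answer="Brussels",
)
```

Here `honest` earns the outcome reward plus all four per-hop scores, while `name_dropper` would have hit every checkpoint but is zeroed out by the gate.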
Here is the bookkeeping, holding the rollouts fixed. Take 1,000 training rollouts, each a correct four-hop answer. Under outcome reward you collect 1,000 reward signals, one scalar per trajectory, and none of them localizes credit to a hop. Rubric reward scores four per-hop checks plus the answer, or 5 signals per rollout, so the same 1,000 rollouts now carry roughly 5,000 graded signals (illustrative): five times the resolution at the same trajectory count. The gate cuts the other way for wrong answers. Every per-hop signal from a rollout whose final answer is wrong is discarded, so a trajectory that name-drops all four entities but answers incorrectly contributes 0. Denser where it helps, zero where it would cheat.
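The same count, written out as a tiny sketch. The helper name `graded_signals` is hypothetical, and the numbers are the illustrative ones from the paragraph above, not measurements from the paper.

```python
def graded_signals(rollouts, hops, rubric):
    """Count graded signals across correct rollouts.

    Outcome reward yields one scalar per rollout; rubric reward adds
    one check per hop on top of that scalar.
    """
    per_rollout = 1 + (hops if rubric else 0)
    return rollouts * per_rollout


outcome_total = graded_signals(rollouts=1_000, hops=4, rubric=False)
rubric_total = graded_signals(rollouts=1_000, hops=4, rubric=True)
resolution_gain = rubric_total / outcome_total
```

With these inputs `outcome_total` is 1,000, `rubric_total` is 5,000, and `resolution_gain` is 5.0. The gain grows with chain length, since each extra hop adds one more check per rollout.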
| Aspect | Outcome reward (RLVR) | Rubric reward (LongTraceRL) |
|---|---|---|
| Where the reward lands | Final answer only | Every reasoning hop + the answer |
| Signal per 4-hop rollout | ~1 scalar (illustrative) | ~5 checks (illustrative) |
| Credit assignment | Smeared across the whole chain | Localized to each hop |
| Reward-hacking guard | Inherent (only correct answers score) | Gated to correct rollouts |
Related explainers
- CoPD paper — Reinforcement Learning with Verifiable Rewards (RLVR) — the outcome-reward baseline this rubric reward extends.
- VPO paper — Vector-reward advantage vs GRPO scalar collapse — another angle on making the reward signal richer than one scalar.
- Agent-harness scaling law — Effective Feedback Compute (EFC) — why the quality of feedback, not its raw amount, drives learning.