LongTraceRL — Rubric process reward

[Figure: Reward every reasoning step, not just the final answer. A four-hop rollout (Hop 1 through Hop 4, then Answer) is checked against the rubric's required entity at each hop. Under outcome reward the answer is the only reward: 1 reward check per rollout. Re-scored hop by hop, the same correct rollout yields 5 reward checks. The reward gate grants the per-hop rewards only when the answer is correct (answer ✓ → grant, answer ✗ → 0), so process supervision densifies the signal without opening the door to reward hacking.]
learnaivisually.com/ai-explained/longtracerl-rubric-process-reward

The news. On May 29, 2026, researchers from Tsinghua's THU-KEG group posted LongTraceRL, a recipe for generating long-context reasoning training data from search-agent trajectories. It builds multi-hop questions by random-walking a knowledge graph, mines graded distractors from the agent's own runs, and supervises training with an entity-level rubric reward. The authors evaluate it on three reasoning models from 4B to 30B parameters across five long-context benchmarks.

Picture the examiner again. Under the old rule, you drive a long route and get one stamp at the destination: passed or failed. If you fail, the examiner cannot tell you whether you blew the third turn or the seventh — the verdict is smeared across the whole drive. That single end-of-route stamp is exactly what plain outcome reward gives a multi-hop reasoning model: a correct four-hop answer earns one scalar, and the gradient cannot separate the hop that did the real work from the hops that coasted. On a long chain, errors compound turn by turn, and a single bit of feedback is too coarse to fix them.

LongTraceRL's rubric reward puts a grader at every turn. Because each question is grown from a knowledge-graph random walk, the data pipeline already knows which entity each hop is supposed to surface. The reward then checks, hop by hop, whether that entity actually shows up where it should, turning one end-of-chain signal into a dense, per-hop signal. This is the difference between process supervision and outcome supervision: reward the reasoning, not just the string at the end.
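The per-hop check can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the function name `hop_checks`, the case-insensitive substring match, and the example entities are all assumptions made here for clarity.

```python
from typing import List


def hop_checks(steps: List[str], rubric: List[str]) -> List[bool]:
    """Return one flag per hop: does reasoning step i mention rubric entity i?

    Assumption: a case-insensitive substring match stands in for whatever
    entity matching the real reward uses.
    """
    if len(steps) != len(rubric):
        raise ValueError("expected one reasoning step per rubric entity")
    return [entity.lower() in step.lower() for step, entity in zip(steps, rubric)]


# Hypothetical 3-hop rollout; the rubric lists the entity required at each hop.
rubric = ["Marie Curie", "Sorbonne", "Paris"]
steps = [
    "The discoverer of polonium is Marie Curie.",
    "She taught at the Sorbonne.",
    "That university is located in Lyon.",  # misses the required entity
]
flags = hop_checks(steps, rubric)
print(flags)  # [True, True, False]
```

The output localizes the failure to the third hop, which a single end-of-chain scalar cannot do.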

The clever part is the catch. If you reward "the right entity appeared at this step," a lazy model could just name-drop all the entities and never solve anything — classic reward hacking. LongTraceRL closes that door by gating the rubric to correct rollouts: the per-hop rewards only count when the final answer is right. Drive recklessly but happen to clip every checkpoint, and you still fail — the turns only score if you actually arrive. So the rubric densifies the signal for genuine solutions while the pass/fail check on the answer stays the hard gate that keeps the model honest.
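One way to picture the gate is as a single early return. The sketch below is an assumption-laden illustration: the article does not say how the answer reward and the per-hop rewards are combined, so the `1.0 + rubric_weight * fraction_of_hops_hit` formula and the `rubric_weight` parameter are hypothetical choices, not the authors' formula.

```python
from typing import Sequence


def gated_rubric_reward(
    hop_hits: Sequence[bool],
    answer_correct: bool,
    rubric_weight: float = 1.0,  # hypothetical knob, not from the paper
) -> float:
    """Grant per-hop credit only when the final answer is right."""
    if not answer_correct:
        # Gate closed: name-dropping every entity earns nothing.
        return 0.0
    hop_score = sum(hop_hits) / len(hop_hits) if hop_hits else 0.0
    # Assumed combination: 1.0 for the answer plus a weighted hop bonus.
    return 1.0 + rubric_weight * hop_score


name_dropper = gated_rubric_reward([True, True, True, True], answer_correct=False)
full_solution = gated_rubric_reward([True, True, True, True], answer_correct=True)
partial_trace = gated_rubric_reward([True, True, False, False], answer_correct=True)
print(name_dropper, full_solution, partial_trace)  # 0.0 2.0 1.5
```

Under this assumed scoring, two correct rollouts can still be ranked by how cleanly they hit the required entities, while any rollout with a wrong answer scores zero regardless of its hop flags.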

Here is the bookkeeping, holding the rollouts fixed. Take 1,000 training rollouts, each a correct four-hop answer. Under outcome reward you collect 1,000 reward signals, one scalar per trajectory, and none of them localizes credit to a hop. Rubric reward scores four per-hop checks plus the answer, for 5 signals per rollout, so the same 1,000 rollouts now carry roughly 5,000 graded signals: five times the resolution at the same trajectory count. These counts are illustrative, not measured. Every signal from a rollout whose final answer is wrong is discarded, so a trajectory that name-drops all four entities but answers incorrectly contributes 0. Denser where it helps, zero where it would cheat.
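The same arithmetic, written out as a small sketch. The function `reward_checks` is a hypothetical helper that follows the article's illustrative accounting (one check per hop plus one for the answer, and zero for a gated-out rollout). It counts checks and does not compute rewards.

```python
def reward_checks(n_hops: int, answer_correct: bool, use_rubric: bool) -> int:
    """Number of checks that can contribute reward for one rollout."""
    if not use_rubric:
        return 1  # outcome reward: the single final-answer check
    if not answer_correct:
        return 0  # gate closed: the per-hop checks are discarded
    return n_hops + 1  # one check per hop, plus the answer


n_rollouts, n_hops = 1000, 4

# All 1,000 rollouts are correct four-hop answers, as in the text.
outcome_total = sum(reward_checks(n_hops, True, use_rubric=False) for _ in range(n_rollouts))
rubric_total = sum(reward_checks(n_hops, True, use_rubric=True) for _ in range(n_rollouts))

print(outcome_total, rubric_total, rubric_total / outcome_total)  # 1000 5000 5.0

# A rollout that name-drops all four entities but answers incorrectly.
print(reward_checks(n_hops, answer_correct=False, use_rubric=True))  # 0
```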

| Aspect | Outcome reward (RLVR) | Rubric reward (LongTraceRL) |
| --- | --- | --- |
| Where the reward lands | Final answer only | Every reasoning hop + the answer |
| Signal per 4-hop rollout | ~1 scalar (illustrative) | ~5 checks (illustrative) |
| Credit assignment | Smeared across the whole chain | Localized to each hop |
| Reward-hacking guard | Inherent (only correct answers score) | Gated to correct rollouts |

