What is a rubric reward?

A rubric reward is a process-supervision signal: for each question, a rubric lists the entity every reasoning hop should surface, and the reward checks whether that entity appears at the right step. Instead of one pass/fail at the end, the model gets a graded check at every hop.

Why does process supervision matter for long-context reasoning?

A long multi-hop chain earns only a single scalar under outcome reward, so credit is smeared across the whole trajectory and the model cannot tell which hops carried the work. Process supervision gives a denser, per-hop signal, which is far more informative as chains get longer.

How does the rubric reward avoid reward hacking?

LongTraceRL gates the rubric reward to correct rollouts: the per-hop checks only count when the final answer is right. A model that name-drops every required entity but answers incorrectly earns zero, so it cannot farm the rubric without actually solving the task.

LongTraceRL — Rubric reward (entity-level process supervision)

LongTraceRL — Rubric process reward

learnaivisually.com/ai-explained/longtracerl-rubric-process-reward

Jargon

RLVR: Reinforcement Learning with Verifiable Rewards — train a model by rewarding it only when a checker confirms the final answer is correct. Simple and hack-resistant, but the reward is a single bit at the end. See the RLVR explainer.
Process reward: A reward attached to intermediate steps of a solution, not just the final answer. The opposite is an outcome reward, which scores the endpoint alone. Process rewards give denser, higher-resolution feedback along a chain.
Rubric reward: LongTraceRL's process reward. A per-question rubric lists which entity each hop should surface; the reward checks whether the right entity appears at the right step of the chain.
Reward hacking: When a model maximizes the reward without doing the real task — here, name-dropping the rubric's entities without actually solving the question. LongTraceRL blocks this by gating the rubric reward to rollouts whose final answer is correct.
Multi-hop reasoning: A question whose answer needs several linked lookups — hop to fact A, then to fact B it implies, then to C. LongTraceRL builds these from random walks over a knowledge graph, so the supporting entity at each hop is known in advance.
Tiered distractors: Wrong-but-tempting passages mined from the agent's own runs: documents it read but did not cite (hard negatives) and results it never opened (easy negatives). Graded difficulty, for free.
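
The tiered-distractor definition above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's code: the `Trajectory` record and its fields (`retrieved`, `opened`, `cited`) are hypothetical names chosen for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Trajectory:
    """Hypothetical record of one search-agent run (field names are assumptions)."""
    retrieved: set[str] = field(default_factory=set)  # every search result returned
    opened: set[str] = field(default_factory=set)     # results the agent actually read
    cited: set[str] = field(default_factory=set)      # documents it used as evidence


def tier_distractors(traj: Trajectory) -> dict[str, list[str]]:
    """Split non-evidence documents into hard and easy negatives."""
    hard = traj.opened - traj.cited      # read but not cited
    easy = traj.retrieved - traj.opened  # returned by search, never opened
    return {"hard": sorted(hard), "easy": sorted(easy)}


traj = Trajectory(
    retrieved={"d1", "d2", "d3", "d4", "d5"},
    opened={"d1", "d2", "d3"},
    cited={"d1"},
)
tiers = tier_distractors(traj)
print(tiers)  # {'hard': ['d2', 'd3'], 'easy': ['d4', 'd5']}
```

Both tiers fall out of logs the agent already produced, which is why the graded difficulty costs nothing extra.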

The news. On May 29, 2026, researchers from Tsinghua's THU-KEG group posted LongTraceRL, a recipe for generating long-context reasoning training data from search-agent trajectories. It builds multi-hop questions by random-walking a knowledge graph, mines graded distractors from the agent's own runs, and supervises training with an entity-level rubric reward. The recipe is evaluated on three reasoning models from 4B to 30B parameters across five long-context benchmarks.
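
To make the question-building step concrete, here is a rough sketch of a random walk over a toy knowledge graph. The graph, the `random_walk` and `path_to_rubric` helpers, and the no-revisit rule are all assumptions made for illustration; the source does not specify how the walk is actually implemented.

```python
import random

# Toy knowledge graph: head entity -> list of (relation, tail entity).
# A hand-made chain for illustration only.
TOY_KG = {
    "Marie Curie": [("born_in", "Warsaw")],
    "Warsaw": [("capital_of", "Poland")],
    "Poland": [("currency", "Polish zloty")],
    "Polish zloty": [],
}


def random_walk(graph, start, hops, rng):
    """Walk up to `hops` edges from `start`, never revisiting an entity."""
    path, current, seen = [], start, {start}
    for _ in range(hops):
        edges = [(r, t) for r, t in graph.get(current, []) if t not in seen]
        if not edges:
            break  # dead end: return the shorter path
        relation, tail = rng.choice(edges)
        path.append((current, relation, tail))
        seen.add(tail)
        current = tail
    return path


def path_to_rubric(path):
    """Entity each hop should surface, in order; the last one is the answer."""
    return [tail for _, _, tail in path]


path = random_walk(TOY_KG, "Marie Curie", hops=3, rng=random.Random(0))
rubric = path_to_rubric(path)
print(rubric)  # ['Warsaw', 'Poland', 'Polish zloty']
```

Because the walk itself produces the path, the per-hop entities come for free as a by-product of generating the question.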

Picture the examiner again. Under the old rule, you drive a long route and get one stamp at the destination: passed or failed. If you fail, the examiner cannot tell you whether you blew the third turn or the seventh — the verdict is smeared across the whole drive. That single end-of-route stamp is exactly what plain outcome reward gives a multi-hop reasoning model: a correct four-hop answer earns one scalar, and the gradient cannot separate the hop that did the real work from the hops that coasted. On a long chain, errors compound turn by turn, and a single bit of feedback is too coarse to fix them.

LongTraceRL's rubric reward puts a grader at every turn. Because each question is grown from a knowledge-graph random walk, the pipeline knows in advance which entity each hop is supposed to surface. The reward then checks, step by step, whether that entity actually shows up where it should, turning one end-of-chain signal into a dense, per-hop signal. This is the difference between process supervision and outcome supervision: reward the reasoning, not just the string at the end.
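
A per-hop check of this kind might look like the sketch below. It assumes the rollout has already been split into one text chunk per reasoning step and that a case-insensitive substring match is good enough; neither detail comes from the source, and `rubric_hits` is a hypothetical helper name.

```python
def rubric_hits(steps, rubric):
    """Return one bool per hop: does step i mention the entity expected at hop i?"""
    hits = []
    for i, entity in enumerate(rubric):
        step_text = steps[i] if i < len(steps) else ""  # missing step counts as a miss
        hits.append(entity.lower() in step_text.lower())
    return hits


rubric = ["Warsaw", "Poland", "Polish zloty"]
steps = [
    "Marie Curie was born in Warsaw.",
    "That city is the capital of a country in central Europe.",  # never names the country
    "Its currency is the Polish zloty.",
]
hits = rubric_hits(steps, rubric)
print(hits)  # [True, False, True]
```

The output is a vector with one entry per hop, so a miss at hop two is visible as a miss at hop two rather than as a lower overall score.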

The clever part is the catch. If you reward "the right entity appeared at this step," a lazy model could just name-drop all the entities and never solve anything — classic reward hacking. LongTraceRL closes that door by gating the rubric to correct rollouts: the per-hop rewards only count when the final answer is right. Drive recklessly but happen to clip every checkpoint, and you still fail — the turns only score if you actually arrive. So the rubric densifies the signal for genuine solutions while the pass/fail check on the answer stays the hard gate that keeps the model honest.
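
The gate can be written as a small function. This is a sketch under assumptions: the source says only that per-hop rewards count when the final answer is right, so the additive form, the `rubric_weight` parameter, and the 1.0 base reward are illustrative choices, not the paper's formula.

```python
def gated_reward(answer_correct, hop_hits, rubric_weight=0.5):
    """Outcome reward plus a rubric bonus that only counts for correct rollouts.

    `rubric_weight` and the additive combination are assumptions for illustration.
    """
    if not answer_correct:
        return 0.0  # gate closed: per-hop hits are ignored entirely
    rubric_score = sum(hop_hits) / len(hop_hits) if hop_hits else 0.0
    return 1.0 + rubric_weight * rubric_score


# Names every required entity but gets the answer wrong: earns nothing.
name_dropper = gated_reward(False, [True, True, True, True])
# Correct answer, one hop missed: base reward plus a partial bonus.
partial = gated_reward(True, [True, False, True, True])
# Correct answer, every hop surfaced: full bonus.
full = gated_reward(True, [True, True, True, True])
print(name_dropper, partial, full)  # 0.0 1.375 1.5
```

Whatever the exact weighting, the ordering is the point: a wrong answer with perfect hop coverage must score below any correct answer.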

Here is the bookkeeping for a fixed set of rollouts. Take 1,000 training rollouts, each a correct four-hop answer. Under outcome reward you collect 1,000 reward signals, one scalar per trajectory, and none of them localizes credit to a hop. Rubric reward scores four per-hop checks plus the answer, so 5 signals per rollout, and the same 1,000 rollouts now carry roughly 5,000 graded signals (illustrative): five times the resolution at the same trajectory count. The gate cuts the other way for wrong answers: the per-hop checks from a rollout whose final answer is wrong are discarded, so a trajectory that name-drops all four entities but answers incorrectly earns 0. Denser where it helps, zero where it would cheat.
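
The same arithmetic as a short script, for readers who want to vary the numbers. The `graded_signals` helper is hypothetical, and the mixed batch of 600 correct and 400 wrong rollouts is an invented example, not a figure from the paper.

```python
def graded_signals(n_correct, n_wrong, hops, use_rubric):
    """Count the reward signals a batch of rollouts yields."""
    outcome_signals = n_correct + n_wrong  # one pass/fail scalar per rollout
    if not use_rubric:
        return outcome_signals
    hop_signals = n_correct * hops  # gate: wrong rollouts add no per-hop checks
    return outcome_signals + hop_signals


outcome_only = graded_signals(1000, 0, hops=4, use_rubric=False)
with_rubric = graded_signals(1000, 0, hops=4, use_rubric=True)
print(outcome_only, with_rubric, with_rubric / outcome_only)  # 1000 5000 5.0

# Invented mixed batch: only the 600 correct rollouts keep their hop checks.
mixed = graded_signals(600, 400, hops=4, use_rubric=True)
print(mixed)  # 3400
```

The five-fold figure is therefore an upper bound for four-hop questions; the lower the share of correct rollouts, the fewer per-hop checks survive the gate.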

Aspect	Outcome reward (RLVR)	Rubric reward (LongTraceRL)
Where the reward lands	Final answer only	Every reasoning hop + the answer
Signal per 4-hop rollout	1 scalar	~5 checks: 4 hops + the answer (illustrative)
Credit assignment	Smeared across the whole chain	Localized to each hop
Reward-hacking guard	Inherent (only correct answers score)	Gated to correct rollouts

Goes deeper in: AI Agents → Evals & Diagnostics → Compounding errors

Related explainers

CoPD paper — Reinforcement Learning with Verifiable Rewards (RLVR) — the outcome-reward baseline this rubric reward extends.
VPO paper — Vector-reward advantage vs GRPO scalar collapse — another angle on making the reward signal richer than one scalar.
Agent-harness scaling law — Effective Feedback Compute (EFC) — why the quality of feedback, not its raw amount, drives learning.
