RELEX — Rank-1 RLVR trajectory extrapolation
Rank-1 RLVR weight-trajectory extrapolation (RELEX) is the technique introduced in You Only Need Minimal RLVR Training: Extrapolating LLMs via Rank-1 Trajectories. The paper's diagnostic finding, that weight updates under RLVR (Reinforcement Learning with Verifiable Rewards) concentrate their energy along a single direction in parameter space, is what makes the shortcut work: if the path is straight, you only need to watch the beginning to predict the end.
The news. On May 20, 2026, Zhepei Wei, Xinyu Zhu, Wei-Lin Chen, Chengsong Huang, Jiaxin Huang, and Yu Meng posted You Only Need Minimal RLVR Training: Extrapolating LLMs via Rank-1 Trajectories. The paper reports that RLVR fine-tuning trajectories are empirically near-rank-1, that fitting a line through a short observation window of checkpoints and projecting forward matches or exceeds full-RLVR quality with as few as 15% of the training steps, and that the same line can be extrapolated 10–20× beyond the observation window. The results are validated on Qwen2.5-Math-1.5B, Qwen3-4B-Base, and Qwen3-8B-Base.
Picture a car cruising down a long, dead-straight highway on cruise control. After you've watched it pass the first hundred or so mile markers, you don't need a camera at every subsequent marker to know roughly where it'll be at mile 1000: the path is a line, and a line takes only two points to define. The RELEX paper's empirical claim is that a model being RLVR-trained behaves the same way: from a starting checkpoint, the path the weights trace through their billion-dimensional parameter space stays essentially in one lane. Stack the consecutive weight deltas into a matrix and the singular value spectrum is dominated by a single singular value, whose singular vector is that one direction; the rest is small enough to ignore.
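The rank-1 claim is something you can check numerically on any sequence of checkpoints. The sketch below is an illustration on synthetic data, not the paper's code. It assumes each checkpoint has been flattened to a single vector, stacks the consecutive deltas as rows of a matrix, and reports the share of squared singular values carried by the top one. The function name `rank1_energy_fraction` and the toy trajectory are hypothetical.

```python
import numpy as np

def rank1_energy_fraction(checkpoints):
    """Share of delta energy (squared singular values) in the top direction.

    checkpoints: array of shape (num_checkpoints, num_params), one flattened
    weight vector per row, in training order.
    """
    W = np.asarray(checkpoints, dtype=float)
    deltas = np.diff(W, axis=0)                   # consecutive weight deltas
    s = np.linalg.svd(deltas, compute_uv=False)   # singular values, descending
    energy = s ** 2
    return float(energy[0] / energy.sum())

# Toy trajectory (an assumption, not real training data): a straight walk
# along one unit direction plus a little isotropic noise.
rng = np.random.default_rng(0)
num_params, num_checkpoints = 200, 31
direction = rng.standard_normal(num_params)
direction /= np.linalg.norm(direction)
steps = np.arange(num_checkpoints)[:, None]
noise = 1e-4 * rng.standard_normal((num_checkpoints, num_params))
trajectory = 0.1 * steps * direction + noise

fraction = rank1_energy_fraction(trajectory)
print(f"top-direction energy fraction: {fraction:.4f}")
```

A value close to 1 means the deltas are nearly collinear. An isotropic random walk in the same number of dimensions would score far lower, which is what makes the measurement informative.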
That observation is what lets RELEX skip the middle of training. The method reads in a short observation window of checkpoints — the paper's headline runs use about 15% of the total step count — fits a rank-1 linear regression to the weight deltas in that window, and then writes a synthetic checkpoint by walking that line out to the end of training. No additional rollouts are generated, no extra verifier checks are run; the only cost beyond the observation window is the linear-algebra fit itself, which is negligible next to RLVR's per-step rollout compute.
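One plausible reading of that procedure can be sketched in a few lines of NumPy. This is an assumption-laden illustration, not the paper's implementation: it takes deltas relative to the first observed checkpoint, uses the top right singular vector as the shared direction, fits the scalar position along that direction as a linear function of the step index, and evaluates the fit at the target step. The function name `extrapolate_rank1`, the flattened-vector checkpoint layout, and the synthetic 1000-step run are all hypothetical.

```python
import numpy as np

def extrapolate_rank1(checkpoints, steps, target_step):
    """Extrapolate flattened weights to `target_step` along a rank-1 line.

    checkpoints: (k, num_params) array; row 0 is the starting checkpoint.
    steps: length-k training step index for each row.
    """
    W = np.asarray(checkpoints, dtype=float)
    t = np.asarray(steps, dtype=float)
    base = W[0]
    deltas = W[1:] - base                        # displacement from the start
    # Dominant direction of motion = top right singular vector of the deltas.
    _, _, vt = np.linalg.svd(deltas, full_matrices=False)
    direction = vt[0]
    coeffs = deltas @ direction                  # position along the direction
    # Linear regression of position against step (assumed linear in step).
    slope, intercept = np.polyfit(t[1:], coeffs, deg=1)
    return base + (slope * target_step + intercept) * direction

# Synthetic run (assumption): weights move 0.01 per step along a unit vector u.
rng = np.random.default_rng(0)
num_params = 50
w0 = rng.standard_normal(num_params)
u = rng.standard_normal(num_params)
u /= np.linalg.norm(u)
observed_steps = np.arange(0, 151, 10)           # first 15% of 1000 steps
observed = w0 + 0.01 * observed_steps[:, None] * u

predicted = extrapolate_rank1(observed, observed_steps, target_step=1000)
true_final = w0 + 0.01 * 1000 * u
print("extrapolation error:", np.linalg.norm(predicted - true_final))
```

The sign of a singular vector is arbitrary, but the sketch multiplies the fitted coefficient back by the same vector, so the result does not depend on it. On a real run the trajectory is only approximately linear, so the error grows with the extrapolation distance.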
The setup has a built-in sanity check. If RLVR trajectories were not rank-1, the extrapolation would diverge from the true path quickly and the resulting checkpoint would fail evals. Instead, the paper reports that the extrapolated checkpoints match or exceed full-RLVR quality on MATH-500, AIME-style benchmarks, and code reasoning suites for the three Qwen models it evaluated. That the extrapolated checkpoint is a working model is itself the strongest evidence that the rank-1 structure isn't an artifact of a single configuration.
Most of RLVR's cost lives in the per-step rollout loop: for every training step, the model has to produce a complete worked answer one token at a time, the verifier has to score it, and the policy gradient has to be computed. Skipping 85% of those steps is a ~6–7× reduction in rollout count for the headline 15% recipe (illustrative — wall-clock and dollar savings depend on cluster, scheduler, and verifier cost, which the paper does not break out). The paper also reports a more aggressive setting — 50 observed steps extrapolated out to step 1000 — which is described as roughly a 20× reduction, with the trade-off being progressively wider error bars on downstream eval.
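The reduction factors quoted above follow from simple division, sketched below. It counts rollout steps only and assumes every step costs the same, which real runs need not satisfy. The helper name `rollout_reduction` is hypothetical.

```python
def rollout_reduction(observed_steps, total_steps):
    """Factor by which rollout steps shrink when only `observed_steps` are run."""
    if not 0 < observed_steps <= total_steps:
        raise ValueError("need 0 < observed_steps <= total_steps")
    return total_steps / observed_steps

headline = rollout_reduction(15, 100)      # 15% observation window
aggressive = rollout_reduction(50, 1000)   # 50 observed steps, target step 1000
print(f"{headline:.2f}x, {aggressive:.0f}x")  # prints "6.67x, 20x"
```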
How RELEX compares to other post-training recipes
| Recipe | Reward source | Rollouts run per training step | Reported cost vs full RLVR (illustrative; varies by setup) |
|---|---|---|---|
| RLHF | Learned reward model | Every step | Comparable order of magnitude |
| RLVR (baseline) | Tiny verifier (0/1) | Every step | 1× (the reference point) |
| CoPD | Tiny verifier (per expert) | Every step, per expert | ~N× for N parallel experts; trades extra rollouts for a stronger policy |
| RELEX | Tiny verifier | Only inside the 15% observation window | ~6–7× fewer rollouts at the 15% setting; ~20× at 50 steps |
The contrast is methodological: RLHF, RLVR, and CoPD all keep the rollout loop running every step; RELEX is the only row that skips the rollouts entirely past the observation window. That's the line the rank-1 finding lets the paper cross.
Goes deeper in: LLM Internals → Text Generation → One Token at a Time — every "step" of RLVR is one full rollout, which is itself a stream of token-by-token generations.
Related explainers
- CoPD paper — Reinforcement Learning with Verifiable Rewards (RLVR) — the verifier-graded reward signal RELEX shortcuts around
- CoPD paper — Co-evolving Policy Distillation between parallel experts — a different angle on cheaper RLVR-class post-training
- TIM paper — Training-Inference Mismatch in RL — what can go subtly wrong inside the RL training loop RELEX is approximating