RELEX — Rank-1 RLVR trajectory extrapolation

[Interactive figure: RLVR weight trajectories sit on a near-1D line; fit 15% and extrapolate the rest. Weight-space view of the checkpoint trajectory: PCA direction 1 ≈ training step 0 → 1000; other PCA directions ≈ 0. Cost and quality readout at 1,000 of 1,000 training steps: relative rollout cost 1.00×, MATH-500 eval 73.2 (same either way).]
learnaivisually.com/ai-explained/relex-rank-1-extrapolation

Rank-1 RLVR weight-trajectory extrapolation (RELEX) is the technique introduced in the paper "You Only Need Minimal RLVR Training: Extrapolating LLMs via Rank-1 Trajectories." The paper's diagnostic finding is that weight updates under RLVR (reinforcement learning with verifiable rewards) concentrate their energy along a single direction in parameter space. That is what makes the shortcut work: if the path is straight, you only need to watch the beginning to predict the end.

The news. On May 20, 2026, Zhepei Wei, Xinyu Zhu, Wei-Lin Chen, Chengsong Huang, Jiaxin Huang, and Yu Meng posted "You Only Need Minimal RLVR Training: Extrapolating LLMs via Rank-1 Trajectories." The paper reports that RLVR fine-tuning trajectories are empirically near-rank-1, that fitting a line through a short observation window of checkpoints and projecting forward matches or exceeds full-RLVR quality with as few as 15% of the training steps, and that the same line can be extrapolated 10–20× beyond the observation window. The results are validated on Qwen2.5-Math-1.5B, Qwen3-4B-Base, and Qwen3-8B-Base.

Picture a car cruising down a long, dead-straight highway on cruise control. After you've watched it pass the first hundred or so mile markers, you don't need a camera at every subsequent marker to know roughly where it'll be at mile 1000 — the path is a line and a line takes two points to define. The RELEX paper's empirical claim is that a model being RLVR-trained behaves the same way: from a starting checkpoint, the path the weights trace through their billions-of-dimensions parameter space stays essentially in one lane. Stack the consecutive weight deltas into a matrix and the singular value spectrum is dominated by one direction; the rest is small enough to ignore.
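The "one lane" claim can be checked numerically. The sketch below is a minimal illustration rather than the paper's code: it stacks consecutive weight deltas into a matrix, takes the singular values, and reports how much of the squared spectrum the top direction carries. The function name `top1_energy_fraction` and the synthetic trajectory (steady drift along one direction plus small noise) are assumptions made for the example; a real check would load flattened checkpoints from an actual RLVR run.

```python
import numpy as np

def top1_energy_fraction(checkpoints: np.ndarray) -> float:
    """Share of squared singular-value energy carried by the top direction.

    checkpoints: shape (T, D), one flattened weight vector per saved step.
    """
    deltas = np.diff(checkpoints, axis=0)        # (T-1, D) consecutive weight deltas
    s = np.linalg.svd(deltas, compute_uv=False)  # singular values, largest first
    return float(s[0] ** 2 / np.sum(s ** 2))

# Synthetic stand-in (an assumption, not real RLVR data):
# steady drift along one unit direction plus small isotropic noise.
rng = np.random.default_rng(0)
T, D = 20, 500
direction = rng.normal(size=D)
direction /= np.linalg.norm(direction)
checkpoints = np.arange(T)[:, None] * direction + 1e-3 * rng.normal(size=(T, D))

frac = top1_energy_fraction(checkpoints)
print(f"top-1 energy fraction: {frac:.4f}")
```

A fraction close to 1 means a single direction explains almost all of the movement, which is the pattern the paper reports for RLVR runs.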

That observation is what lets RELEX skip the middle of training. The method reads in a short observation window of checkpoints — the paper's headline runs use about 15% of the total step count — fits a rank-1 linear regression to the weight deltas in that window, and then writes a synthetic checkpoint by walking that line out to the end of training. No additional rollouts are generated, no extra verifier checks are run; the only cost beyond the observation window is the linear-algebra fit itself, which is negligible next to RLVR's per-step rollout compute.
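One plausible reading of "fit a rank-1 line and walk it out" is sketched below. It is an illustration under stated assumptions, not the paper's exact procedure: the dominant direction comes from an SVD of the offsets from the first observed checkpoint, the scalar position along that direction is fit as a linear function of the step number, and the line is evaluated at the target step. The function name `extrapolate_rank1` and the synthetic trajectory are hypothetical.

```python
import numpy as np

def extrapolate_rank1(steps, checkpoints, target_step):
    """Extrapolate flattened weights to target_step along one fitted direction.

    steps: shape (T,), training-step index of each observed checkpoint.
    checkpoints: shape (T, D), flattened weights at those steps.
    """
    steps = np.asarray(steps, dtype=float)
    base = checkpoints[0]
    offsets = checkpoints - base                     # (T, D) movement since first checkpoint
    _, _, vt = np.linalg.svd(offsets, full_matrices=False)
    v = vt[0]                                        # dominant direction (unit norm)
    coords = offsets @ v                             # scalar position along v per checkpoint
    slope, intercept = np.polyfit(steps, coords, deg=1)
    return base + (slope * target_step + intercept) * v

# Synthetic trajectory (assumption): a straight path plus small noise.
rng = np.random.default_rng(0)
D = 200
theta0 = rng.normal(size=D)
d = rng.normal(size=D)
d /= np.linalg.norm(d)

def true_weights(t):
    return theta0 + 0.01 * t * d

obs_steps = np.arange(0, 151, 10)                    # observe steps 0..150 of 1000 (15%)
observed = np.stack([true_weights(t) for t in obs_steps])
observed += 1e-4 * rng.normal(size=observed.shape)

pred = extrapolate_rank1(obs_steps, observed, target_step=1000)
rel_err = np.linalg.norm(pred - true_weights(1000)) / np.linalg.norm(true_weights(1000) - theta0)
print(f"relative error at step 1000: {rel_err:.4f}")
```

The sign of the singular vector is arbitrary, but it cancels because the same vector is used both to project the offsets and to rebuild the weights. Nothing in the fit touches rollouts or a verifier, which is why its cost is negligible next to the training steps it replaces.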

The setup comes with a built-in sanity check. If RLVR trajectories were not rank-1, the extrapolation would quickly diverge from the true path and the resulting checkpoint would fail evals. Instead, the paper reports that the extrapolated checkpoints match or exceed full-RLVR quality on MATH-500, AIME-style benchmarks, and code reasoning suites for the three Qwen models it evaluated. That the extrapolated checkpoint is a working model is itself the strongest evidence that the rank-1 structure isn't an artifact of a single configuration.
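The divergence argument is easy to see on toy data. The sketch below is an illustration only and uses a simpler per-coordinate line fit rather than a rank-1 fit: it fits the first few checkpoints of a synthetic path, predicts the final checkpoint, and compares the error for a straight path against a path that bends into a second direction. Both paths, the curvature constant, and the function name `linear_extrapolation_error` are assumptions chosen for the demonstration.

```python
import numpy as np

def linear_extrapolation_error(steps, path, n_obs):
    """Fit a per-coordinate line to the first n_obs checkpoints, predict the
    last checkpoint, and return the error relative to the total displacement."""
    slope, intercept = np.polyfit(steps[:n_obs], path[:n_obs], deg=1)  # each has shape (D,)
    pred = slope * steps[-1] + intercept
    return float(np.linalg.norm(pred - path[-1]) / np.linalg.norm(path[-1] - path[0]))

steps = np.arange(0, 1001, 10, dtype=float)          # checkpoints every 10 steps
D = 50
d = np.zeros(D); d[0] = 1.0                          # main direction of travel
e = np.zeros(D); e[1] = 1.0                          # orthogonal direction

straight = 0.01 * steps[:, None] * d                 # stays in one lane
curved = straight + 5.0 * (steps[:, None] / 1000.0) ** 2 * e  # drifts into a second lane

n_obs = 16                                           # steps 0..150, about 15% of the run
straight_err = linear_extrapolation_error(steps, straight, n_obs)
curved_err = linear_extrapolation_error(steps, curved, n_obs)
print(f"straight path error: {straight_err:.2e}")
print(f"curved path error:   {curved_err:.2f}")
```

On the straight path the prediction lands on the final checkpoint up to floating-point error, while on the curved path it misses by a large fraction of the total distance travelled. That gap is the failure the paper's evals would expose if the rank-1 structure did not hold.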

Most of RLVR's cost lives in the per-step rollout loop: for every training step, the model has to produce a complete worked answer one token at a time, the verifier has to score it, and the policy gradient has to be computed. Skipping 85% of those steps is a ~6–7× reduction in rollout count for the headline 15% recipe (illustrative — wall-clock and dollar savings depend on cluster, scheduler, and verifier cost, which the paper does not break out). The paper also reports a more aggressive setting — 50 observed steps extrapolated out to step 1000 — which is described as roughly a 20× reduction, with the trade-off being progressively wider error bars on downstream eval.
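The reduction factors quoted above follow from simple division, under the assumption that every training step costs the same number of rollouts. A small sketch (the helper name `rollout_reduction` is hypothetical):

```python
def rollout_reduction(total_steps: int, observed_steps: int) -> float:
    """Ratio of full-run training steps to the steps actually trained,
    assuming each step costs the same number of rollouts."""
    if not 0 < observed_steps <= total_steps:
        raise ValueError("observed_steps must be in (0, total_steps]")
    return total_steps / observed_steps

headline = rollout_reduction(1000, 150)    # 15% observation window
aggressive = rollout_reduction(1000, 50)   # 50 observed steps

print(f"15% window: {headline:.2f}x fewer rollouts")    # about 6.67x
print(f"50 steps:   {aggressive:.2f}x fewer rollouts")  # 20.00x
```

These are step-count ratios only; as noted above, wall-clock and dollar savings depend on factors the paper does not break out.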

How RELEX compares to other post-training recipes

| Recipe | Reward source | Rollouts run per training step | Reported cost vs full RLVR (illustrative; varies by setup) |
| --- | --- | --- | --- |
| RLHF | Learned reward model | Every step | Comparable order of magnitude |
| RLVR (baseline) | Tiny verifier (0/1) | Every step | 1× (the reference point) |
| CoPD (explainer) | Tiny verifier (per expert) | Every step, per expert | ~N× wider; trades extra rollouts for a stronger policy |
| RELEX | Tiny verifier | Only inside the 15% observation window | ~6–7× fewer rollouts at the 15% setting; ~20× at 50 steps |

The contrast is methodological: RLHF, RLVR, and CoPD all keep the rollout loop running every step; RELEX is the only row that skips the rollouts entirely past the observation window. That's the line the rank-1 finding lets the paper cross.

Goes deeper in: LLM Internals → Text Generation → One Token at a Time — every "step" of RLVR is one full rollout, which is itself a stream of token-by-token generations.

Frequently Asked Questions