TIM paper — Training-Inference Mismatch in RL

[Figure: Same weights, same tokens: rollout and policy compute slightly different probabilities. Panels: training loss vs. RL update step; per-token drift |log(p_rollout / p_policy)| with an illustrative collapse threshold; PPO importance ratio ρ = p_new / p_old (axis from 0.7 to 1.3, centered at 1.0) over the logged tokens, with per-token p_rollout and p_policy.]
learnaivisually.com/ai-explained/tim-training-inference-mismatch

The news. On May 14, 2026, Zhong et al. posted arXiv 2605.14220, introducing VeXact, a controlled diagnostic for one suspected source of LLM RL instability. VeXact suppresses the main confounders (off-policy drift and common stabilization mechanisms) so that the per-token probability disagreement between the rollout pipeline and the policy-update pipeline can be isolated. The paper shows that this Training-Inference Mismatch (TIM), which arises even with the same weights and the same logged tokens, can warp the effective optimization problem enough to collapse a run, and argues that TIM should be treated as a primary stability factor, not a benign systems quirk.

Picture two kitchen scales, both calibrated to the same spec. Scale A sits on the prep counter; the baker uses it to weigh today's flour for today's loaf. Scale B sits in the recipe office; the recipe-keeper re-weighs the same bag of flour an hour later, logging it into tomorrow's recipe sheet. Both scales should read the same — same flour, same nominal calibration. In practice scale A reads 100.0 g and scale B reads 100.2 g. The 0.2 g gap looks like noise.

It would be noise, if the recipe-keeper used scale A's reading. But the recipe-keeper trusts scale B's log, then applies a "correction" the next day based on the difference between the log and the new measurement. After a week of feeding back the gap, the recipe has drifted and the bread keeps getting worse. That feedback loop — small disagreement gets multiplied back into the system — is the TIM story, transposed into RL post-training.

In a real RL loop the two scales are two numerical pipelines that share the same model weights but compute through different code paths. The rollout pipeline runs inference: it generates tokens one at a time, often using FlashAttention or a similar fused kernel, often in BF16 or FP8, and records each sampled token together with the probability the sampler assigned to it. The policy-update pipeline then takes those logged tokens and recomputes the model's probabilities on them in a fresh forward pass, typically with a different batch shape and sometimes with a different attention kernel. The logged (old) and recomputed (new) probabilities form the PPO importance ratio ρ = p_new / p_old. PPO's correctness argument assumes that when rollout and update share the same weights, ρ = 1 up to negligible rounding. TIM is the observation that this fails in practice, and that the gap is structural (kernel choice, batch size, reduction order) rather than random.
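The structural nature of the gap can be illustrated with a toy experiment. The sketch below is not the paper's setup and does not model any real kernel: it computes the softmax probability of one token twice from identical logits, once in ordinary double precision and once with every intermediate rounded to half precision, as a rough stand-in for a low-precision rollout path. The helper names `round_fp16` and `token_prob`, the 64-entry vocabulary, and the seed are all assumptions made for illustration.

```python
import math
import random
import struct


def round_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def token_prob(logits, index, low_precision=False):
    """Softmax probability of logits[index].

    With low_precision=True every intermediate is rounded to fp16, which
    imitates (very roughly) a reduced-precision inference path.
    """
    rnd = round_fp16 if low_precision else (lambda v: v)
    top = max(logits)                      # subtract the max for stability
    total = 0.0
    for z in logits:
        total = rnd(total + rnd(math.exp(z - top)))
    return rnd(rnd(math.exp(logits[index] - top)) / total)


rng = random.Random(0)                     # fixed seed: same logits every run
logits = [rng.gauss(0.0, 2.0) for _ in range(64)]
token = logits.index(max(logits))          # pretend the sampler picked the top token

p_rollout = token_prob(logits, token, low_precision=True)  # logged at rollout time
p_policy = token_prob(logits, token)                       # recomputed at update time
rho = p_policy / p_rollout                                 # PPO ratio p_new / p_old

print(f"p_rollout={p_rollout:.6f}  p_policy={p_policy:.6f}  rho={rho:.6f}")
```

Rerunning the script reproduces the same ratio every time, because nothing in either code path is random. That is the sense in which the mismatch behaves like a bias rather than like noise.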

The paper's contribution is the VeXact diagnostic itself. Previous analyses of RL instability typically blamed one of three culprits: reward noise, optimizer-state drift, or shifts in the data distribution. VeXact builds a controlled environment that suppresses those confounders so that any remaining collapse can be attributed to TIM rather than to the usual suspects. That's the "aha" move: the prior list of culprits was incomplete, and TIM should be on it.

The mechanism is easiest to see in a worked example with named numbers (illustrative; the source describes the mechanism in qualitative terms and does not publish a per-token TIM-magnitude → loss-spike table). Suppose the rollout pipeline samples a token with probability p_rollout = 0.7234. The policy-update pipeline recomputes that same token's probability and gets p_policy = 0.7268, a relative drift of about 4.7 × 10⁻³. PPO multiplies the per-token advantage by ρ = 0.7268 / 0.7234 ≈ 1.0047. A 0.47% perturbation looks harmless per token. Over a 4,096-token sequence with the same drift biased in one direction on every token, the cumulative log ratio climbs to 4096 × log(1.0047) ≈ 19.2, which corresponds to a sequence-level importance weight of exp(19.2) ≈ 2.2 × 10⁸ for that single trajectory before any clipping. PPO's standard per-token clipping at [0.8, 1.2] (ε = 0.2) does not remove the bias: a ratio of 1.0047 sits well inside the clip range, so the clip never triggers, and the biased per-token push still reshapes the gradient direction over many trajectories. The optimizer follows a slope that isn't the actual reward slope.
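The arithmetic in this worked example can be checked in a few lines. The sketch below uses the illustrative numbers from the paragraph above and assumes a clip range of [0.8, 1.2] (ε = 0.2); the conclusion is the same for any clip range that contains 1.0047. The variable names are chosen here for readability and do not come from the paper.

```python
import math

p_rollout = 0.7234                 # probability logged by the rollout pipeline
p_policy = 0.7268                  # same token, recomputed by the update pipeline
seq_len = 4096                     # tokens in one trajectory
clip_low, clip_high = 0.8, 1.2     # assumed clip range (epsilon = 0.2)

rho = p_policy / p_rollout                          # per-token importance ratio
clipped_rho = min(max(rho, clip_low), clip_high)    # PPO-style clip of the ratio

# Worst case from the text: the same one-directional drift on every token.
cumulative_log_ratio = seq_len * math.log(rho)
sequence_weight = math.exp(cumulative_log_ratio)

print(f"rho per token         = {rho:.4f}")
print(f"clip changes rho?     = {clipped_rho != rho}")
print(f"cumulative log ratio  = {cumulative_log_ratio:.1f}")
print(f"sequence-level weight = {sequence_weight:.2e}")
```

The per-token ratio never leaves the clip range, so the clip leaves it unchanged, while the product over the whole sequence grows to the order of 10⁸.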

Where TIM sits next to other RL post-training instability factors

| Instability factor | What goes wrong | Standard mitigation | Detected by VeXact? |
| --- | --- | --- | --- |
| Training-Inference Mismatch (this paper) | Rollout pipeline and update pipeline compute slightly different probabilities on the same logged tokens; the gap is biased, not random | Open question; paper proposes "a set of remedies" without prescribing one default | Yes; VeXact's whole point is to isolate it |
| Reward noise | Reward signal is stochastic or mis-specified, so the policy chases the wrong target | Reward model averaging, hold-out evals, reward shaping | Suppressed by VeXact (fixed deterministic reward) |
| Optimizer state drift | Adam moments accumulate stale state across many small updates | Periodic optimizer reset, KL constraint to a frozen reference | Suppressed by VeXact (frozen optimizer reference) |
| Distribution shift | Rollouts at step N look statistically different from rollouts at step N+k, so the policy is updated on stale data | Smaller batch sizes, on-policy refresh, prioritized sampling | Suppressed by VeXact (identical data distribution between runs) |

The interesting compositional point: TIM is invisible to the eval-and-iterate loop most teams already run. Validation loss on a held-out set won't surface it — the same drift that biases the PPO gradient produces a forward pass on validation that "looks fine" because the metric isn't the importance ratio. The signal lives inside the training pipeline itself: distributions of log(p_rollout / p_policy) per token, the realized ρ distribution before and after clipping, and the variance of those ratios across micro-batches. A team that wants to detect TIM in their own stack would log those distributions as a training-side metric rather than rely on downstream evals.
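One way such a training-side metric could look is sketched below. It assumes the training loop can hand over, for each micro-batch, the per-token log-probabilities recorded at rollout time and the ones recomputed by the update pipeline. The function name `tim_metrics`, the dictionary keys, and the default clip range are hypothetical choices for this sketch, not something the paper specifies.

```python
import math
import statistics


def tim_metrics(rollout_logprobs, policy_logprobs, clip_low=0.8, clip_high=1.2):
    """Summarize per-token mismatch for one micro-batch of logged tokens."""
    # drift = log(p_rollout / p_policy); the sign is kept so a bias shows up.
    drifts = [lr - lp for lr, lp in zip(rollout_logprobs, policy_logprobs)]
    rhos = [math.exp(-d) for d in drifts]        # rho = p_policy / p_rollout
    clipped = [r < clip_low or r > clip_high for r in rhos]
    return {
        "mean_drift": statistics.fmean(drifts),
        "mean_abs_drift": statistics.fmean([abs(d) for d in drifts]),
        "max_abs_drift": max(abs(d) for d in drifts),
        "mean_rho": statistics.fmean(rhos),
        "clipped_fraction": sum(clipped) / len(clipped),
    }


# Two made-up micro-batches of four tokens each.
batch_a = tim_metrics([math.log(0.7234)] * 4, [math.log(0.7268)] * 4)
batch_b = tim_metrics([math.log(0.5000)] * 4, [math.log(0.5010)] * 4)

# Spread of the signed mean drift across micro-batches.
spread = statistics.pvariance([batch_a["mean_drift"], batch_b["mean_drift"]])

print(batch_a)
print("variance of mean drift across micro-batches:", spread)
```

Keeping the signed mean alongside the absolute mean is a deliberate choice: random disagreement tends to average toward zero in the signed mean, while a one-directional bias does not.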

The paper does not prescribe a single fix. Its key claims frame TIM as a primary stability factor for LLM RL, something to treat with the same seriousness as the other three causes in the table above, and note that the authors recommend a set of remedies while leaving the engineering trade-offs to follow-up work. Plausible directions named in the literature the paper cites include forcing rollout and update to share the exact same kernel and batch shape (often expensive, since rollout wants throughput and update wants gradient-friendly shapes), running the policy update at higher precision than the rollout (memory cost), or adding a per-token TIM bound that downweights or skips updates on tokens where the gap exceeds a threshold (loses data). The right pick is workload-dependent, and the paper deliberately stops short of declaring a winner.
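The last of these directions, a per-token bound, is simple enough to sketch. The code below is one possible reading of "skip updates on tokens where the gap exceeds a threshold" and is not taken from the paper. The names `tim_token_mask` and `masked_mean`, and the threshold of 0.01 in log space, are hypothetical and would need tuning per workload.

```python
def tim_token_mask(rollout_logprobs, policy_logprobs, max_abs_drift=0.01):
    """Return 1.0 for tokens whose |log(p_rollout / p_policy)| is within
    the bound and 0.0 for tokens that should be skipped."""
    return [
        1.0 if abs(lr - lp) <= max_abs_drift else 0.0
        for lr, lp in zip(rollout_logprobs, policy_logprobs)
    ]


def masked_mean(values, mask):
    """Average values over kept tokens only; 0.0 if every token was skipped."""
    kept = sum(mask)
    if kept == 0:
        return 0.0
    return sum(v * m for v, m in zip(values, mask)) / kept


# Made-up log-probabilities for three tokens; the middle one drifts by 0.05.
rollout_lp = [-0.10, -0.50, -2.00]
policy_lp = [-0.101, -0.55, -2.004]
per_token_loss = [1.0, 10.0, 3.0]

mask = tim_token_mask(rollout_lp, policy_lp)
print(mask, masked_mean(per_token_loss, mask))
```

The cost named in the text is visible here: the middle token contributes nothing to the averaged loss, so whatever learning signal it carried is discarded.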
