TIM paper — Training-Inference Mismatch in RL
The news. On May 14, 2026, Zhong et al. posted arXiv 2605.14220, introducing VeXact, a controlled diagnostic for one suspected source of LLM RL instability. VeXact suppresses the main confounders (off-policy drift and common stabilization mechanisms) so that the per-token probability disagreement between the rollout pipeline and the policy-update pipeline can be isolated. The paper shows that this Training-Inference Mismatch (TIM), which arises even with the same weights and the same logged tokens, can warp the effective optimization problem enough to collapse a run. It argues that TIM should be treated as a primary stability factor, not a benign systems quirk.
Picture two kitchen scales, both calibrated to the same spec. Scale A sits on the prep counter; the baker uses it to weigh today's flour for today's loaf. Scale B sits in the recipe office; the recipe-keeper re-weighs the same bag of flour an hour later, logging it into tomorrow's recipe sheet. Both scales should read the same — same flour, same nominal calibration. In practice scale A reads 100.0 g and scale B reads 100.2 g. The 0.2 g gap looks like noise.
It would be noise, if the recipe-keeper used scale A's reading. But the recipe-keeper trusts scale B's log, then applies a "correction" the next day based on the difference between the log and the new measurement. After a week of feeding back the gap, the recipe has drifted and the bread keeps getting worse. That feedback loop — small disagreement gets multiplied back into the system — is the TIM story, transposed into RL post-training.
In a real RL loop the two scales are two numerical pipelines that share the same model weights but compute through different code paths. The rollout pipeline runs inference: it generates tokens one at a time, often using FlashAttention or a similar fused kernel, often in BF16 or FP8, and records each sampled token together with the probability the sampler assigned to it. The policy-update pipeline then takes those logged tokens and recomputes the model's probabilities on them via a fresh forward pass, typically in a different batch shape and sometimes with a different attention kernel. The logged and recomputed probabilities form the PPO importance ratio ρ = p_new / p_old. PPO's whole correctness story assumes that when rollout and update share the same weights, ρ = 1 up to negligible rounding. TIM says it is not, and that the gap is structural (kernel choice, batch size, reduction order), not random.
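A minimal sketch of how reduction order alone can move a recomputed log-probability. This is an illustration, not the paper's setup: the logits are random, the float32 accumulator simulated with `struct` stands in for a low-precision kernel, and the helper names `to_f32` and `log_prob` are hypothetical.

```python
import math
import random
import struct


def to_f32(x: float) -> float:
    """Round a Python float (float64) to the nearest float32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def log_prob(logits, index, order):
    """log softmax(logits)[index], accumulating the normalizer in float32.

    `order` is the sequence of positions the reduction visits; a real kernel
    fixes this order through its tiling and batch shape.
    """
    m = max(logits)
    total = 0.0
    for i in order:
        total = to_f32(total + to_f32(math.exp(logits[i] - m)))
    return (logits[index] - m) - math.log(total)


rng = random.Random(0)  # fixed seed; the logits are made up for illustration
logits = [rng.gauss(0.0, 2.0) for _ in range(1000)]
token = 7  # index of the "sampled" token

forward = range(len(logits))
backward = range(len(logits) - 1, -1, -1)

logp_rollout = log_prob(logits, token, forward)   # stand-in for the rollout path
logp_policy = log_prob(logits, token, backward)   # stand-in for the update path

# Same weights, same token: any deviation of rho from 1 comes from summation order.
rho = math.exp(logp_policy - logp_rollout)
print(f"log-prob gap: {logp_policy - logp_rollout:.3e}, rho = {rho:.9f}")
```

The gap here is tiny because only one reduction differs. A full transformer forward pass stacks many such reductions, in lower precision, which is where the structural mismatch described above would come from.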
The paper's contribution is the VeXact diagnostic itself. Previous analyses of RL instability typically blamed one of three culprits: reward noise, optimizer-state drift, or shifts in the data distribution. VeXact builds a controlled environment that suppresses those confounders so that any remaining collapse can be attributed to TIM rather than to the usual suspects. That's the "aha" move: the prior list of culprits was incomplete, and TIM should be on it.
The mechanism is easiest to see in a worked example with named numbers (illustrative; the source describes the mechanism in qualitative terms and does not publish a table mapping per-token TIM magnitude to loss spikes). Suppose the rollout pipeline samples a token with probability p_rollout = 0.7234. The policy-update pipeline recomputes that same token's probability and gets p_policy = 0.7268, a relative drift of about 4.7 × 10⁻³. PPO multiplies the per-token advantage by ρ = 0.7268 / 0.7234 ≈ 1.0047. A 0.47% perturbation looks harmless per token. Over a 4,096-token sequence with the drift biased in one direction, the cumulative log ratio climbs to roughly 4096 × log(1.0047) ≈ 19.2, which translates into a sequence-level importance weight of e^19.2 ≈ 2.2 × 10⁸ for that single trajectory before any clipping. PPO's standard per-token clipping at [0.8, 1.2] (ε = 0.2) does not remove the problem: a per-token ratio of 1.0047 sits well inside the clip range, so the biased push passes through unclipped, reshapes the gradient direction over many trajectories, and the optimizer follows a slope that isn't the actual reward slope.
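The arithmetic in the worked example can be checked in a few lines. The probabilities and sequence length are the illustrative ones from the paragraph above; the clip parameter `clip_eps = 0.2` is an assumption (the common PPO default), and the "every token drifts the same way" accumulation is a worst case, not a measured result.

```python
import math

p_rollout = 0.7234   # probability logged by the rollout pipeline
p_policy = 0.7268    # same token, recomputed by the update pipeline
seq_len = 4096
clip_eps = 0.2       # assumed PPO clip parameter, giving the range [0.8, 1.2]

rho = p_policy / p_rollout
relative_drift = rho - 1.0

# A per-token ratio this close to 1 is never touched by the clip.
inside_clip = (1.0 - clip_eps) <= rho <= (1.0 + clip_eps)

# Worst case: every token in the sequence drifts by the same ratio.
cum_log_ratio = seq_len * math.log(rho)
seq_weight = math.exp(cum_log_ratio)

print(f"rho per token        : {rho:.4f}")
print(f"relative drift       : {relative_drift:.2e}")
print(f"inside clip range    : {inside_clip}")
print(f"cumulative log ratio : {cum_log_ratio:.1f}")
print(f"sequence-level weight: {seq_weight:.2e}")
```

The cumulative log ratio comes out at roughly 19 and the sequence-level weight on the order of 10⁸, while the per-token ratio stays inside the clip range throughout.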
Where TIM sits next to other RL post-training instability factors
| Instability factor | What goes wrong | Standard mitigation | Detected by VeXact? |
|---|---|---|---|
| Training-Inference Mismatch (this paper) | Rollout pipeline and update pipeline compute slightly different probabilities on the same logged tokens — biased, not random | Open question; paper proposes "a set of remedies" without prescribing one default | Yes — VeXact's whole point is to isolate it |
| Reward noise | Reward signal is stochastic or mis-specified, so the policy chases the wrong target | Reward model averaging, hold-out evals, reward shaping | Suppressed by VeXact (fixed deterministic reward) |
| Optimizer state drift | Adam moments accumulate stale state across many small updates | Periodic optimizer reset, KL constraint to a frozen reference | Suppressed by VeXact (frozen optimizer reference) |
| Distribution shift | Rollouts at step N look statistically different from rollouts at step N+k, so the policy is updated on stale data | Smaller batch sizes, on-policy refresh, prioritized sampling | Suppressed by VeXact (identical data distribution between runs) |
The interesting compositional point: TIM is invisible to the eval-and-iterate loop most teams already run. Validation loss on a held-out set won't surface it — the same drift that biases the PPO gradient produces a forward pass on validation that "looks fine" because the metric isn't the importance ratio. The signal lives inside the training pipeline itself: distributions of log(p_rollout / p_policy) per token, the realized ρ distribution before and after clipping, and the variance of those ratios across micro-batches. A team that wants to detect TIM in their own stack would log those distributions as a training-side metric rather than rely on downstream evals.
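A sketch of what logging those training-side signals could look like, assuming the trainer already has per-token log-probabilities from both pipelines as plain lists. The function name `tim_metrics`, the metric keys, and the sample numbers are hypothetical; they are not taken from the paper or from any particular RL framework.

```python
import math
import statistics


def tim_metrics(logp_rollout, logp_policy, clip_eps=0.2):
    """Summarize per-token disagreement between rollout and update log-probs."""
    if not logp_rollout or len(logp_rollout) != len(logp_policy):
        raise ValueError("need two non-empty sequences of equal length")
    # log(p_rollout / p_policy) per token
    log_ratio = [r - p for r, p in zip(logp_rollout, logp_policy)]
    # realized importance ratio rho = p_policy / p_rollout
    rho = [math.exp(-d) for d in log_ratio]
    clipped = [min(max(x, 1.0 - clip_eps), 1.0 + clip_eps) for x in rho]
    return {
        "mean_log_ratio": statistics.fmean(log_ratio),
        "std_log_ratio": statistics.pstdev(log_ratio),
        "max_abs_log_ratio": max(abs(d) for d in log_ratio),
        "mean_rho_before_clip": statistics.fmean(rho),
        "mean_rho_after_clip": statistics.fmean(clipped),
        "clip_fraction": sum(c != x for c, x in zip(clipped, rho)) / len(rho),
    }


# Two made-up micro-batches of (rollout log-probs, update log-probs).
micro_batches = [
    ([math.log(0.7234), math.log(0.5)], [math.log(0.7268), math.log(0.5)]),
    ([math.log(0.1), math.log(0.9)], [math.log(0.13), math.log(0.9)]),
]

per_batch = [tim_metrics(r, p) for r, p in micro_batches]
all_rollout = [x for r, _ in micro_batches for x in r]
all_policy = [x for _, p in micro_batches for x in p]
overall = tim_metrics(all_rollout, all_policy)

# Spread of the mean log ratio across micro-batches.
across_batch_var = statistics.pvariance([m["mean_log_ratio"] for m in per_batch])

print(overall)
print(f"variance across micro-batches: {across_batch_var:.3e}")
```

A mean log ratio that stays away from zero across steps would suggest a biased gap rather than symmetric noise, which is the pattern the text above describes.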
The paper does not prescribe a single fix. Its key claims frame TIM as a primary stability factor for LLM RL, something to treat with the same seriousness as the other three causes in the table above, and the authors recommend a set of remedies while leaving the engineering trade-offs to follow-up work. Plausible directions named in the literature the paper cites include forcing rollout and update to share the exact same kernel and batch shape (often expensive, since rollout wants throughput and update wants gradient-friendly shapes), running the policy update at higher precision than the rollout (at a memory cost), or adding a per-token TIM bound that downweights or skips updates on tokens where the gap exceeds a threshold (at the cost of discarded data). The right pick is workload-dependent, and the paper deliberately stops short of declaring a winner.
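The third direction, a per-token bound, could be sketched as a simple mask. This is an assumption-laden illustration rather than a remedy the paper specifies: the helper names, the threshold `max_abs_log_ratio = 0.05`, and the sample losses are all hypothetical.

```python
import math


def tim_token_weights(logp_rollout, logp_policy, max_abs_log_ratio=0.05):
    """Weight 1.0 for tokens whose rollout/update gap is within the bound, else 0.0."""
    return [
        1.0 if abs(r - p) <= max_abs_log_ratio else 0.0
        for r, p in zip(logp_rollout, logp_policy)
    ]


def masked_mean(per_token_loss, weights):
    """Average the loss over kept tokens only; returns 0.0 if every token is skipped."""
    kept = sum(weights)
    if kept == 0:
        return 0.0
    return sum(l * w for l, w in zip(per_token_loss, weights)) / kept


# Made-up example: the third token has a large rollout/update gap.
logp_rollout = [math.log(0.7234), math.log(0.5), math.log(0.1)]
logp_policy = [math.log(0.7268), math.log(0.5), math.log(0.13)]
per_token_loss = [0.2, 0.4, 3.0]

weights = tim_token_weights(logp_rollout, logp_policy)
loss = masked_mean(per_token_loss, weights)
kept_fraction = sum(weights) / len(weights)

print(weights, loss, kept_fraction)
```

The kept fraction makes the trade-off visible: a tight threshold bounds the gap but discards more of each batch. It also does nothing about small same-direction drifts that stay under the bound, such as the 0.47% case in the worked example.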
Related explainers
- CoPD — Co-evolving Policy Distillation between parallel experts — a different angle on RL post-training, where the source of instability is the teacher–student gap rather than the rollout–update gap
- RLVR — Reinforcement Learning with Verifiable Rewards — the RL post-training family within which TIM is diagnosed
- PPOW — window-level RL for speculative drafters — another paper that re-examines what an RL "step" is when the inference pipeline doesn't match the training pipeline