What is Training-Inference Mismatch (TIM)?

TIM is the small per-token probability disagreement between the rollout pipeline (the inference path that samples behavior from the current policy) and the policy-update pipeline (the training path that recomputes the model's probabilities on those same logged tokens during the gradient step). At full precision on identical weights the two should be identical. In practice they differ because the two pipelines use different kernels (e.g. FlashAttention vs. eager), different batch shapes, and different mixed-precision paths — and the differences are biased, not random. The TIM paper (Zhong et al., arXiv 2605.14220) introduces a controlled diagnostic called VeXact that isolates this drift from every other RL instability source and shows that TIM alone, with reward noise, optimizer drift, and distribution shift all suppressed, is enough to collapse a training run.

Why does numerical drift matter to PPO?

PPO uses the per-token importance ratio ρ = p_new / p_old to correct for the gap between the policy that generated a rollout and the policy being updated. When the rollout and update share the same weights, the standard assumption is that ρ sits at 1.0 and the correction is a no-op. TIM violates that assumption: the rollout's logged probabilities and the update's recomputed probabilities differ by small but biased amounts, so ρ drifts away from 1.0 in a consistent direction. PPO's clipping at the standard [0.8, 1.2] range (ε = 0.2) does not catch this, because a 0.5% per-token bias sits well inside the clip range. Summed over thousands of tokens and many trajectories, that bias yields a gradient direction that doesn't match the actual reward slope, and the run can collapse without any of the usual culprits showing up in the eval suite.

How is TIM different from reward noise or optimizer drift?

Reward noise comes from a stochastic or mis-specified reward signal — the policy chases the wrong target. Optimizer drift comes from Adam moments accumulating stale state. Distribution shift comes from the policy moving faster than the rollout buffer can refresh. TIM is none of those: it's a systems-level perturbation that exists even when the reward is fixed and deterministic, the optimizer is reset every step, and the data distribution is held identical between runs. The VeXact diagnostic specifically suppresses the first three so the only remaining moving part is TIM, then shows the run can still collapse — which makes TIM an independent fourth cause that prior stability analyses miss because they implicitly assume rollout and update produce identical probabilities on identical weights.

TIM paper — Training-Inference Mismatch in RL

learnaivisually.com/ai-explained/tim-training-inference-mismatch

TL;DR

What is it: The TIM paper introduces a controlled diagnostic — VeXact — that isolates one suspected source of RL training instability: Training-Inference Mismatch, the small numerical disagreement between probabilities the rollout pipeline produced and probabilities the policy-update pipeline recomputes from the same logged tokens.
Why it’s needed: Most RL post-training pipelines treat the rollout/update probability gap as benign noise — something to log but not to fix. The paper argues this assumption is wrong: drift alone, with the diagnostic's main confounders suppressed, can warp the optimization landscape enough to collapse a run.
vs previous: Previous stability work blamed reward noise, optimizer instability, or distribution shift when an RL run diverged. TIM is an additional cause that those analyses can miss because rollout and update agree at full precision; they only diverge through kernel choices, batch sizes, and mixed-precision paths.

Jargon

TIM: Training-Inference Mismatch. The disagreement between the per-token probabilities produced during a rollout (the inference pipeline used to sample new behavior) and the per-token probabilities recomputed during a policy-gradient update on those same tokens. At full precision the two should be identical because the weights are identical; in practice they differ.
VeXact: The paper's controlled diagnostic. It isolates TIM by zeroing out every other suspected source of RL instability — reward noise, optimizer state drift, data-distribution shift — so any remaining training collapse can only be explained by the rollout/update probability gap.
Rollout: One pass of the inference pipeline that samples behavior from the current policy: feed a prompt, sample tokens one at a time, record each token together with the probability the model assigned to it. RL uses rollouts as the training data for the next update.
Policy update: The training step that takes a logged rollout, recomputes the model's probabilities on those exact tokens via a forward pass, computes a loss against the rollout's reward, and backpropagates. The recomputed probabilities are the policy's "new" estimate; the rollout's logged probabilities are its "old" estimate.
PPO importance ratio: The factor ρ = p_new / p_old used by Proximal Policy Optimization to correct for the gap between the policy that generated a rollout and the policy being updated. When rollout and update should agree exactly (same weights, same tokens), ρ should sit at 1.0. Drift away from 1.0 changes the effective objective. See LLM Internals → Text Generation → Sampling for the per-token probability surface this ratio is built on.
Numerical pipeline: The exact sequence of kernels, precisions, batch shapes, and reduction orders that turn weights and inputs into output logits. Two pipelines with the same weights can still produce slightly different logits if they use different attention implementations (e.g. FlashAttention vs. eager), different batch sizes (changing reduction order), or different mixed-precision paths.
Optimization landscape: The geometry of the loss as a function of the model's parameters. PPO climbs this landscape using rollouts as samples; if TIM warps the effective objective those samples report, the optimizer follows the wrong slope.
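
The reduction-order point in the Numerical pipeline entry can be seen without any model at all. The following is a minimal sketch in plain Python; it uses double precision rather than BF16 or FP8, and the three values are chosen only for illustration. It shows that floating-point addition is not associative, so summing the same numbers in a different order can give a different result.

```python
# Floating-point addition is not associative: regrouping a sum changes
# where rounding happens, and therefore the final value.
left_to_right = (0.1 + 0.2) + 0.3
right_to_left = 0.1 + (0.2 + 0.3)

print(left_to_right == right_to_left)       # False
print(abs(left_to_right - right_to_left))   # tiny, but not zero
```

A change in batch size or kernel reorders much larger sums than this one, and lower-precision formats round more coarsely, so the same effect is plausibly larger in a real rollout or update pass.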

The news. On May 14, 2026, Zhong et al. posted arXiv 2605.14220, introducing VeXact — a controlled diagnostic for one suspected source of LLM RL instability. VeXact suppresses the main confounders the diagnostic targets (off-policy drift and common stabilization mechanisms), so the per-token probability disagreement between the rollout pipeline and the policy-update pipeline can be isolated. The paper shows this Training-Inference Mismatch — same weights, same logged tokens — can warp the effective optimization problem enough to collapse a run, and argues TIM should be treated as a primary stability factor, not a benign systems quirk.

Picture two kitchen scales, both calibrated to the same spec. Scale A sits on the prep counter; the baker uses it to weigh today's flour for today's loaf. Scale B sits in the recipe office; the recipe-keeper re-weighs the same bag of flour an hour later, logging it into tomorrow's recipe sheet. Both scales should read the same — same flour, same nominal calibration. In practice scale A reads 100.0 g and scale B reads 100.2 g. The 0.2 g gap looks like noise.

It would be noise, if the recipe-keeper used scale A's reading. But the recipe-keeper trusts scale B's log, then applies a "correction" the next day based on the difference between the log and the new measurement. After a week of feeding back the gap, the recipe has drifted and the bread keeps getting worse. That feedback loop — small disagreement gets multiplied back into the system — is the TIM story, transposed into RL post-training.

In a real RL loop the two scales are two numerical pipelines that share the same model weights but compute through different code paths. The rollout pipeline runs inference: it generates tokens one at a time, often using FlashAttention or a similar fused kernel, often in BF16 or FP8, and records each sampled token together with the probability the sampler assigned to it. The policy-update pipeline then takes those logged tokens, recomputes the model's probabilities on them via a fresh forward pass — typically in a different batch shape, sometimes with a different attention kernel — and feeds the gap between old and new probabilities into the PPO importance ratio ρ = p_new / p_old. PPO's whole correctness story assumes that when rollout and update share the same weights, ρ ≈ 1.0. TIM says they don't, and the gap is structural — kernel choice, batch size, reduction order — not random.
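
The ratio described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the paper's code: the function names, the ε = 0.2 clip width, and the two probabilities are assumptions chosen for the example.

```python
import math

def importance_ratio(logp_new: float, logp_old: float) -> float:
    """rho = p_new / p_old, computed from log-probabilities."""
    return math.exp(logp_new - logp_old)

def clipped_surrogate(rho: float, advantage: float, eps: float = 0.2) -> float:
    """Per-token PPO objective: min(rho * A, clip(rho, 1 - eps, 1 + eps) * A)."""
    clipped = max(1.0 - eps, min(1.0 + eps, rho))
    return min(rho * advantage, clipped * advantage)

# Hypothetical log-probs for one token, with the same weights in both pipelines.
logp_rollout = math.log(0.5000)   # logged by the inference path
logp_update = math.log(0.5025)    # recomputed by the training path

rho = importance_ratio(logp_update, logp_rollout)
print(rho)                          # close to 1.005 rather than 1.0
print(clipped_surrogate(rho, 1.0))  # inside the clip range, so rho passes through
```

Because a ratio near 1.005 sits inside [0.8, 1.2], the clip leaves it untouched, which is why a small biased gap passes straight into the gradient.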

The paper's contribution is the VeXact diagnostic itself. Previous analyses of RL instability typically blamed one of three culprits: reward noise, optimizer-state drift, or shifts in the data distribution. VeXact builds a controlled environment that suppresses those confounders so that any remaining collapse can be attributed to TIM rather than to the usual suspects. That's the "aha" move: the prior list of culprits was incomplete, and TIM should be on it.

A worked example with concrete numbers makes the mechanism tangible (illustrative; the source describes the mechanism in qualitative terms and does not publish a per-token TIM-magnitude → loss-spike table). Suppose the rollout pipeline samples a token with probability p_rollout = 0.7234. The policy-update pipeline recomputes that same token's probability and gets p_policy = 0.7268, a relative drift of about 4.7 × 10⁻³. PPO multiplies the per-token advantage by ρ = 0.7268 / 0.7234 ≈ 1.0047. A 0.47% perturbation looks harmless per token, and it sits far inside PPO's standard clip range of [0.8, 1.2], so clipping never touches it. Over a 4,096-token sequence with the drift biased in one direction, the cumulative log ratio climbs to roughly 4096 × log(1.0047) ≈ 19.2. Taken as a sequence-level importance weight (the product of the per-token ratios), that is exp(19.2) ≈ 2.2 × 10⁸ for that single trajectory. Standard token-level PPO does not form that product, but the same biased per-token push, repeated over many trajectories, reshapes the gradient direction, and the optimizer follows a slope that isn't the actual reward slope.
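
The arithmetic in this example can be checked directly. The sketch below uses only the illustrative numbers from the paragraph above, and it assumes, as the paragraph does, that every token in the sequence drifts by the same ratio in the same direction.

```python
import math

p_rollout = 0.7234   # probability logged by the rollout pipeline
p_policy = 0.7268    # probability recomputed by the update pipeline
seq_len = 4096

rho = p_policy / p_rollout                # per-token importance ratio
cum_log_ratio = seq_len * math.log(rho)   # assumes identical drift on every token
seq_weight = math.exp(cum_log_ratio)      # product of the per-token ratios

print(f"rho = {rho:.4f}")
print(f"cumulative log ratio = {cum_log_ratio:.1f}")
print(f"sequence-level weight = {seq_weight:.2e}")
```

The sequence-level weight comes out on the order of 10⁸, even though each individual ratio is within half a percent of 1.0.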

Where TIM sits next to other RL post-training instability factors

Instability factor	What goes wrong	Standard mitigation	How VeXact treats it
Training-Inference Mismatch (this paper)	Rollout pipeline and update pipeline compute slightly different probabilities on the same logged tokens (biased, not random)	Open question; paper proposes "a set of remedies" without prescribing one default	Isolated (exposing it is VeXact's whole point)
Reward noise	Reward signal is stochastic or mis-specified, so the policy chases the wrong target	Reward model averaging, hold-out evals, reward shaping	Suppressed by VeXact (fixed deterministic reward)
Optimizer state drift	Adam moments accumulate stale state across many small updates	Periodic optimizer reset, KL constraint to a frozen reference	Suppressed by VeXact (frozen optimizer reference)
Distribution shift	Rollouts at step N look statistically different from rollouts at step N+k, so the policy is updated on stale data	Smaller batch sizes, on-policy refresh, prioritized sampling	Suppressed by VeXact (identical data distribution between runs)

The interesting compositional point: TIM is invisible to the eval-and-iterate loop most teams already run. Validation loss on a held-out set won't surface it — the same drift that biases the PPO gradient produces a forward pass on validation that "looks fine" because the metric isn't the importance ratio. The signal lives inside the training pipeline itself: distributions of log(p_rollout / p_policy) per token, the realized ρ distribution before and after clipping, and the variance of those ratios across micro-batches. A team that wants to detect TIM in their own stack would log those distributions as a training-side metric rather than rely on downstream evals.
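
The training-side metrics listed above could be gathered along these lines. This is a hedged sketch, not tooling from the paper: `tim_stats`, its field names, the ε = 0.2 clip width, and the sample log-probabilities are all hypothetical, and a real stack would compute the same quantities on tensors inside the training loop.

```python
import math
import statistics

def tim_stats(logp_rollout, logp_policy, eps=0.2):
    """Summarize rollout/update disagreement for one micro-batch of tokens."""
    # log(p_rollout / p_policy) per token
    log_gap = [r - p for r, p in zip(logp_rollout, logp_policy)]
    # realized importance ratio p_policy / p_rollout per token
    rho = [math.exp(-g) for g in log_gap]
    clipped = [max(1 - eps, min(1 + eps, x)) for x in rho]
    return {
        "mean_log_gap": statistics.fmean(log_gap),   # nonzero mean suggests bias
        "std_log_gap": statistics.pstdev(log_gap),
        "mean_rho": statistics.fmean(rho),
        "clip_fraction": sum(c != x for c, x in zip(clipped, rho)) / len(rho),
    }

# Hypothetical (rollout, update) log-probs for two micro-batches.
micro_batches = [
    ([-0.30, -1.20, -0.05], [-0.29, -1.19, -0.05]),
    ([-0.70, -2.00, -0.10], [-0.70, -1.98, -0.09]),
]
per_batch = [tim_stats(r, p) for r, p in micro_batches]
across_batch_var = statistics.pvariance([s["mean_rho"] for s in per_batch])

for stats in per_batch:
    print(stats)
print("variance of mean rho across micro-batches:", across_batch_var)
```

A mean log gap that stays on one side of zero across many steps is the pattern the paragraph above describes as biased rather than random; a clip fraction near zero alongside it indicates the bias is passing through PPO's clip unmodified.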

The paper does not prescribe a single fix. It frames TIM as a primary stability factor for LLM RL, to be treated with the same seriousness as the other three causes in the table above, and the authors recommend a set of remedies while leaving the engineering trade-offs to follow-up work. Plausible directions named in the literature the paper cites include forcing rollout and update to share the exact same kernel and batch shape (often expensive, since rollout wants throughput and update wants gradient-friendly shapes), running the policy update at higher precision than the rollout (memory cost), or adding a per-token TIM bound that downweights or skips updates on tokens where the gap exceeds a threshold (loses data). The right pick is workload-dependent, and the paper deliberately stops short of declaring a winner.
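
The third direction, a per-token TIM bound, might look like the sketch below. It is one possible reading of "skips updates on tokens where the gap exceeds a threshold", not the paper's method: the helper names, the 0.01 threshold on the absolute log-probability gap, and the sample values are assumptions made for illustration.

```python
def tim_mask(logp_rollout, logp_policy, max_abs_log_gap=0.01):
    """1.0 for tokens whose |log p_rollout - log p_policy| is within the bound, else 0.0."""
    return [
        1.0 if abs(r - p) <= max_abs_log_gap else 0.0
        for r, p in zip(logp_rollout, logp_policy)
    ]

def masked_mean(values, mask):
    """Average only over tokens the mask keeps; 0.0 if every token was dropped."""
    kept = sum(mask)
    return sum(v * m for v, m in zip(values, mask)) / kept if kept else 0.0

# Hypothetical per-token values for one short sequence.
logp_rollout = [-0.30, -1.20, -0.05]
logp_policy = [-0.30, -1.15, -0.05]    # the middle token disagrees by 0.05
per_token_loss = [0.2, 0.9, 0.4]

mask = tim_mask(logp_rollout, logp_policy)
print(mask)                               # the middle token is dropped
print(masked_mean(per_token_loss, mask))  # mean over the two kept tokens
```

The trade-off named above is visible here: the dropped token carried the largest loss, so a hard mask discards exactly the data where the two pipelines disagree most. A soft downweighting would keep some of that signal at the cost of letting some bias through.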

Goes deeper in: Agent Engineering → Production Evals & Shadow Mode → Drift Detection

Related explainers

CoPD (Co-evolving Policy Distillation between parallel experts): a different angle on RL post-training, where the source of instability is the teacher–student gap rather than the rollout–update gap
Reinforcement Learning with Verifiable Rewards (RLVR): the RL post-training family in which TIM is diagnosed
PPOW (window-level RL for speculative drafters): another paper that re-examines what an RL "step" is when the inference pipeline doesn't match the training pipeline
