The news. On June 8, 2026, researchers from Tencent-Hunyuan posted "Rethinking the Divergence Regularization in LLM RL." It targets a small but load-bearing piece of how reasoning models are trained: the trust-region term that keeps each RL update from moving the policy too far. PPO and GRPO enforce it by clipping an importance ratio; DPPO improved on that with a divergence mask. The paper's contribution, DRPO, keeps DPPO's trust-region geometry but swaps the hard mask for a smooth, advantage-weighted penalty, so a token that strays past the boundary is pulled back instead of dropped.
Picture walking a dog on a leash. Where you stand is the old policy — the model as it was before this update. The dog is the new policy, and every training step lets it wander a little. The leash is the trust region: the rule that says don't let it stray too far. You need that rule because the dog is half-trained on yesterday's walk — RL post-training is off-policy, learning from slightly stale rollouts — so without a leash one excited lunge can drag the whole policy somewhere unstable. (The "policy" here is nothing exotic: it's just the model's next-token probability distribution, the thing every RL update reshapes.)
The leash everyone reached for first is ratio clipping (PPO, GRPO): measure how much more or less likely the new policy makes each token — the importance ratio — and clip it to a band. The trouble is that over a long-tailed vocabulary, that ratio is a bad ruler for how much the distribution actually moved. DPPO fixed the ruler: it measures a token's absolute probability shift and masks any token that crosses the line. But DPPO's mask is hard — once a token crosses the boundary in a harmful direction, you drop the leash: its gradient is discarded, and the diverging token gets zero signal to come back. It just stays wherever it drifted, which is exactly the kind of training-vs-inference gap the trust region was supposed to keep in check.
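To make the "bad ruler" point concrete, here is a minimal sketch comparing the two measurements on two made-up tokens. The probabilities, the clip band `eps`, the radius `delta`, and all function names are assumptions chosen for illustration, not values from the paper. The checks also ignore the advantage sign, which real clipping and masking rules take into account.

```python
# Illustrative only: probabilities, eps, and delta are invented for this sketch.

def importance_ratio(p_old: float, p_new: float) -> float:
    """How many times more (or less) likely the new policy makes the token."""
    return p_new / p_old


def absolute_shift(p_old: float, p_new: float) -> float:
    """How much probability mass actually moved for this token."""
    return abs(p_new - p_old)


def outside_ratio_band(p_old: float, p_new: float, eps: float = 0.2) -> bool:
    """Ratio-style check: is the ratio outside [1 - eps, 1 + eps]?"""
    r = importance_ratio(p_old, p_new)
    return r < 1.0 - eps or r > 1.0 + eps


def outside_shift_radius(p_old: float, p_new: float, delta: float = 0.05) -> bool:
    """Shift-style check: did the token's probability move by more than delta?"""
    return absolute_shift(p_old, p_new) > delta


# (old probability, new probability) for a long-tail token and a common token.
examples = {
    "rare_token": (1e-5, 1e-4),
    "common_token": (0.50, 0.58),
}

for name, (p_old, p_new) in examples.items():
    print(
        name,
        "ratio:", round(importance_ratio(p_old, p_new), 3),
        "shift:", round(absolute_shift(p_old, p_new), 6),
        "ratio flags it:", outside_ratio_band(p_old, p_new),
        "shift flags it:", outside_shift_radius(p_old, p_new),
    )
```

The rare token's ratio is about 10, so the ratio check flags it even though almost no probability mass moved. The common token moves eight points of probability mass yet stays inside the ratio band. The two rulers disagree on both tokens, which is the mismatch the paragraph above describes.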
DRPO swaps the dropped leash for a bungee. Instead of a binary keep-or-discard mask, it adds a smooth, advantage-weighted quadratic penalty on policy shift: the farther a token strays past the boundary, the firmer — but always bounded — the pull back toward the old policy, scaled by how much that token's advantage says it matters. Same trust-region geometry as DPPO, but the cliff becomes a ramp: a diverging update is attenuated and corrected rather than thrown away. For the reasoning models that lean hardest on long RL runs, that continuous corrective signal is what keeps training from quietly going off the rails.
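One way to picture a penalty with these properties is a squared hinge on the probability shift, scaled by the size of the advantage. The sketch below is an assumption made for illustration: the exact functional form, the coefficient `lam`, the radius `delta`, and the function names are not taken from the paper.

```python
# Hypothetical penalty shape; not the paper's exact objective.

def shift_penalty(p_old: float, p_new: float, advantage: float,
                  delta: float = 0.05, lam: float = 1.0) -> float:
    """Squared-hinge penalty: zero inside the radius, quadratic past it."""
    excess = max(0.0, abs(p_new - p_old) - delta)
    return lam * abs(advantage) * excess ** 2


def penalty_gradient(p_old: float, p_new: float, advantage: float,
                     delta: float = 0.05, lam: float = 1.0) -> float:
    """Derivative of shift_penalty with respect to p_new.

    Gradient descent moves p_new against this value, i.e. back toward p_old.
    """
    shift = p_new - p_old
    excess = max(0.0, abs(shift) - delta)
    sign = 1.0 if shift > 0 else -1.0
    return 2.0 * lam * abs(advantage) * excess * sign


# A token drifting upward from p_old = 0.30 with a negative advantage.
for p_new in (0.33, 0.40, 0.50):
    print(
        "p_new:", p_new,
        "penalty:", round(shift_penalty(0.30, p_new, advantage=-1.0), 5),
        "gradient:", round(penalty_gradient(0.30, p_new, advantage=-1.0), 5),
    )
```

Under this assumed form the gradient is zero inside the radius, grows linearly with the distance past it, and cannot exceed `2 * lam * abs(advantage) * (1 - delta)`, because a probability can shift by at most 1. That gives a pull that gets firmer with distance while staying bounded.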
| Method | Trust-region ruler | Past the boundary | What the diverging token gets |
|---|---|---|---|
| PPO / GRPO | importance ratio | ratio clipped to a band | gradient zeroed past the clip edge; the ratio is also a poor proxy over a long-tailed vocabulary |
| DPPO | absolute probability shift | hard in/out divergence mask | gradient discarded, no corrective signal |
| DRPO | absolute probability shift (same geometry as DPPO) | smooth, advantage-weighted quadratic penalty, bounded and continuous | a corrective pull-back, firmer the farther it strays |
Walk a single token through one update (the gradient weights here are illustrative; the paper reports stability and efficiency, not these figures). Say the trust-region radius is δ, and this token's policy shift lands at 1.5δ: half again past the boundary, in the harmful direction. Under DPPO's hard mask the gradient weight is a step function: 1 inside δ, 0 past it. This token therefore contributes a gradient weight of exactly 0. It stays frozen wherever it drifted, and nothing pulls it back. Under DRPO the weight is a smooth, bounded curve of the same shift: near 1 at the boundary, attenuated to roughly 0.4 (illustrative) at 1.5δ. That is smaller than a fully in-region token's weight, but decisively non-zero. The diverging token the hard mask abandoned is exactly the one DRPO keeps a hand on. Multiply that across the handful of long-tail tokens that cross the boundary each step, and the difference is a training run that drifts versus one that self-corrects.
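The walkthrough can be reproduced with two small functions. The smooth curve below, `1 / (1 + k * x**2)` with `k = 6`, is a hypothetical shape picked only because it equals 1 at the boundary and about 0.4 at 1.5δ, matching the illustrative numbers above. It is not the paper's weighting function, and the function names are invented for this sketch.

```python
# Shifts are expressed in units of the trust-region radius delta.

def hard_mask_weight(shift_over_delta: float) -> float:
    """Step function: full weight inside the radius, none past it."""
    return 1.0 if shift_over_delta <= 1.0 else 0.0


def smooth_weight(shift_over_delta: float, k: float = 6.0) -> float:
    """Hypothetical smooth, bounded alternative to the step function.

    Equals 1 up to the boundary, then decays as 1 / (1 + k * x^2),
    where x is how far past the boundary the token is (in units of delta).
    """
    x = max(0.0, shift_over_delta - 1.0)
    return 1.0 / (1.0 + k * x * x)


for s in (0.5, 1.0, 1.5, 2.0):
    print(
        "shift / delta:", s,
        "hard mask:", hard_mask_weight(s),
        "smooth:", round(smooth_weight(s), 3),
    )
```

Any curve that is continuous at the boundary and stays strictly positive past it tells the same story. The constant `k` only controls how quickly the weight falls off.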
Related explainers
- VPO — vector-reward advantage vs GRPO — a different lever on the same RL post-training step: the advantage estimator, not the trust-region term
- TIM — Training-Inference Mismatch — the off-policy gap that makes a trust region necessary in the first place
- RLVR — Reinforcement Learning with Verifiable Rewards — the post-training setup DRPO is built to stabilize