The news. In June 2026, researchers (arXiv 2606.24133) reframed LLM pretraining data mixing as a sequential decision problem and learned the mixture on the fly with reinforcement learning. Using Soft Actor-Critic, a scheduler adjusts how much each data domain contributes at every step, guided by a reward that blends data quality, inter-domain influence, and model-weight signals. It reaches target validation perplexity on The Pile with 44% fewer training iterations and improves MMLU 0-shot by 7.2% over fixed-blend mixing.

Think about feeding an athlete in training. The simple way is a fixed meal plan — the same proportions of protein, carbs, and greens every single day, decided up front and never revisited. It works, but it ignores the obvious: what the body needs on a heavy-lifting week is not what it needs while tapering for a race. A good coach watches how training is going and adjusts the plate — more of this now, less of that later. This paper builds exactly that coach for LLM pretraining: instead of fixing the data mixture before training, it lets the mixture change as the model learns.

Concretely, the "coach" is a reinforcement-learning agent — Soft Actor-Critic — whose action at each step is the per-domain sampling mixture, and whose reward combines three signals: how good the data is, how much each domain helps or hurts the others (inter-domain influence), and statistics from the model's own weights. Because the action is a set of fractions that must sum to one, a continuous-control algorithm like SAC fits naturally. The result is a mixture that concentrates budget wherever the marginal value is highest right now, rather than a blend frozen at the start.
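To make the "fractions that must sum to one" point concrete, here is a minimal sketch of how an unconstrained continuous action could be turned into a sampling mixture, and how three signals could be folded into one scalar reward. Everything specific here is an assumption for illustration: the softmax mapping, the equal-weight sum in `blended_reward`, the function names, and the three domain names are not taken from the paper, which may parameterize the action and the reward differently.

```python
import math
import random

def action_to_mixture(raw_action):
    """Map an unconstrained action vector to fractions that sum to 1.

    Assumption: a softmax is used here; the paper may use another mapping.
    """
    peak = max(raw_action)  # subtract the max for numerical stability
    exps = [math.exp(a - peak) for a in raw_action]
    total = sum(exps)
    return [e / total for e in exps]

def blended_reward(quality, influence, weight_signal, coeffs=(1.0, 1.0, 1.0)):
    """Hypothetical scalar reward: a weighted sum of the three signals."""
    c_q, c_i, c_w = coeffs
    return c_q * quality + c_i * influence + c_w * weight_signal

def sample_batch_domains(mixture, domains, batch_size, rng):
    """Pick the source domain of each example according to the mixture."""
    return rng.choices(domains, weights=mixture, k=batch_size)

domains = ["web", "code", "papers"]  # illustrative names, not the paper's
mixture = action_to_mixture([0.2, 1.0, -0.5])  # stand-in for a policy output
batch = sample_batch_domains(mixture, domains, 8, random.Random(0))
```

In a full training loop, the agent would emit a new raw action at each scheduling step, the data loader would draw the next batch from the resulting mixture, and the observed reward would be fed back to update the policy. The sketch covers only the first two pieces and leaves the SAC update itself out.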

Why does adapting beat a good fixed recipe here? Because the value of each domain changes over training: early on the model needs broad coverage; later it gains more from harder, higher-quality data. A fixed blend has to compromise across all of training at once; the learned schedule does not. On The Pile, it reaches the same target perplexity as a fixed-blend baseline in 44% fewer training iterations. When per-step cost is comparable, fewer iterations means less GPU time, and GPU time is the compute that dominates the bill.

Sit with what 44% buys. If a fixed blend needs, say, 100,000 steps to hit the target, the scheduler gets there in about 56,000 (the paper reports the 44% reduction; the round numbers are illustrative) — and it does not trade quality away to do it, posting a 7.2% gain on MMLU 0-shot at the same time. The lesson is that the data schedule, long treated as a fixed knob you set once, is itself something worth learning — and learning it pays in both compute and capability.
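The arithmetic behind that illustration is short enough to write out. The 44% reduction is the paper's figure; the 100,000-step baseline is the same illustrative round number used above, and the helper name `steps_with_reduction` is made up for this sketch.

```python
def steps_with_reduction(baseline_steps, reduction):
    """Steps needed after cutting the baseline by a fractional reduction."""
    return round(baseline_steps * (1.0 - reduction))

baseline_steps = 100_000  # illustrative baseline, not a number from the paper
scheduler_steps = steps_with_reduction(baseline_steps, 0.44)
steps_saved = baseline_steps - scheduler_steps
speedup = baseline_steps / scheduler_steps

print(scheduler_steps)  # 56000
print(steps_saved)      # 44000
```

Note that a 44% cut in steps is a speedup of roughly 1.8x rather than 1.44x, because the baseline is divided by 0.56. That translates into wall-clock savings only if the scheduler's own per-step overhead is small, which this sketch does not model.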

| Approach | How the data mixture is chosen | Result vs the fixed baseline |
| --- | --- | --- |
| Fixed pretraining blend | Hand-tuned proportions, frozen for all of training | The default: one compromise across the whole run |
| RL data scheduler (SAC) | Adapts the mixture every step, guided by a multi-objective reward [paper] | Target perplexity in 44% fewer steps; +7.2% MMLU 0-shot |
