The news. On June 4, 2026, the MLEvolve paper (arXiv 2606.06473, from Shanghai AI Laboratory and East China Normal University) introduced an LLM-based, self-evolving multi-agent framework for end-to-end machine-learning algorithm discovery. On MLE-Bench it reaches a state-of-the-art average medal rate under a 12-hour budget, half the standard runtime, and it outperforms AlphaEvolve on mathematical algorithm optimization.

Picture a mountain you are trying to climb, where every distinct route to the top is a different candidate algorithm. Nobody can try every route, because there are far too many, so you send out scouts: each one heads up a promising line and reports back how far it got. Each such trip is a rollout, and its report is the score. Growing a tree of routes scored by rollouts is the core of Monte Carlo Tree Search, the search-and-score loop behind a lot of game-playing agents.

The trouble with a plain tree is that every scout explores alone. A clever shortcut that one scout discovers high up on the left face never makes it onto anyone else's map — so a scout on the right face burns its own steps re-finding the very same thing, and three other routes independently dead-end at the same cliff. MLEvolve's Progressive Monte Carlo Graph Search fixes this with graph reference edges: when a branch finds a useful sub-result, it gets pinned to one shared map where every other branch can read it. The search is now a graph, not a tree — discoveries flow sideways between routes instead of being re-derived on each.
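To make the shared map concrete, here is a minimal sketch of the idea, not MLEvolve's actual implementation. The names `Node`, `SharedMap`, and `expand` are hypothetical, and the `derive` callback stands in for whatever rollouts a real search would spend finding the sub-result.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Node:
    """One candidate in the search (hypothetical structure for illustration)."""
    name: str
    children: list[Node] = field(default_factory=list)
    # Reference edges: sideways links to sub-results found on other branches.
    references: list[str] = field(default_factory=list)


class SharedMap:
    """Hypothetical store of pinned sub-results that every branch can read."""

    def __init__(self) -> None:
        self._pinned: dict[str, str] = {}
        self.rediscoveries_avoided = 0

    def pin(self, key: str, sub_result: str) -> None:
        # Keep the first discovery; later pins of the same key are ignored.
        self._pinned.setdefault(key, sub_result)

    def lookup(self, key: str) -> str | None:
        return self._pinned.get(key)


def expand(node: Node, needed: str, shared: SharedMap,
           derive: Callable[[str], str]) -> str:
    """Reuse a pinned sub-result if one exists, otherwise derive and pin it."""
    found = shared.lookup(needed)
    if found is not None:
        node.references.append(needed)  # the sideways edge that makes it a graph
        shared.rediscoveries_avoided += 1
        return found
    result = derive(needed)  # in a real search this is where rollouts are spent
    shared.pin(needed, result)
    return result
```

If two branches both call `expand` for the same key, only the first one invokes `derive`; the second records a reference edge and reads the pinned result instead.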

The second half of "progressive" is when to spend your scouts. Early on, MCGS deliberately scouts widely — an entropy-style schedule that keeps the search exploring many faces of the mountain. As evidence accumulates, the schedule tightens onto the most promising route and pours the remaining budget into climbing it. That shift from exploration to exploitation is the same budget-allocation problem every planning agent faces — MCGS just schedules it explicitly instead of using one fixed knob. MLEvolve also keeps the planner that picks routes separate from the workers that write the code, an orchestrator-and-workers split that lets each part specialize.
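One way to picture an explore-then-exploit schedule is a softmax over branch scores whose temperature shrinks as the budget is used up. This is a generic sketch, not the schedule from the paper: the geometric decay, the default temperatures, and the function names are all assumptions made for illustration.

```python
import math


def temperature(step: int, total_steps: int,
                t_start: float = 2.0, t_end: float = 0.05) -> float:
    """Geometric decay from t_start to t_end across the budget (assumed form)."""
    frac = step / max(total_steps - 1, 1)
    return t_start * (t_end / t_start) ** frac


def selection_probs(scores: list[float], temp: float) -> list[float]:
    """Softmax over branch scores: high temp is near-uniform, low temp is near-greedy."""
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp((s - top) / temp) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]


def entropy(probs: list[float]) -> float:
    """Shannon entropy in nats; higher means the budget is spread more widely."""
    return -sum(p * math.log(p) for p in probs if p > 0)


scores = [0.50, 0.55, 0.60, 0.70]  # illustrative branch scores
early = selection_probs(scores, temperature(0, 100))
late = selection_probs(scores, temperature(99, 100))
```

With these made-up scores, `early` is close to uniform across the four branches, while `late` puts most of its probability on the best-scoring branch, so `entropy(early)` is larger than `entropy(late)`.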

Put numbers on the shortcut-sharing to see why it pays (all numbers here are illustrative). Say a run has a search budget of 1,000 rollouts spread across 50 sibling branches, so each branch gets about 20. In a plain tree, suppose 10 of those branches each spend their whole allotment independently deriving the same normalization trick: that is roughly 200 rollouts, a fifth of the whole budget, spent on one fact. With a graph edge, only the first branch to find the trick pays that cost of about 20 rollouts. It pins the result for the other 49 to read without spending rollouts, so about 180 of those 200 rollouts become fresh exploration on routes nobody has tried. The budget is the same, but it is aimed at new ground.
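The arithmetic above can be checked in a few lines. This sketch uses the same illustrative figures and assumes each derivation of the trick costs one branch's full allotment; the function name is hypothetical. Note that the first discovery still has to be paid for once, so the amount freed comes out slightly under the 200 rollouts a plain tree would spend.

```python
def rollouts_spent_on_trick(duplicating_branches: int,
                            cost_per_derivation: int,
                            shared: bool) -> int:
    """Rollouts spent deriving one sub-result, with or without a shared map."""
    if duplicating_branches == 0:
        return 0
    # With sharing, only the first branch derives; the rest read the pinned result.
    derivations = 1 if shared else duplicating_branches
    return derivations * cost_per_derivation


budget = 1_000                   # illustrative total rollouts
branches = 50                    # illustrative sibling branches
per_branch = budget // branches  # 20 rollouts each

tree_cost = rollouts_spent_on_trick(10, per_branch, shared=False)   # 200
graph_cost = rollouts_spent_on_trick(10, per_branch, shared=True)   # 20
freed = tree_cost - graph_cost                                      # 180

print(f"tree: {tree_cost}, graph: {graph_cost}, freed: {freed} "
      f"({freed / budget:.0%} of the budget)")
# prints "tree: 200, graph: 20, freed: 180 (18% of the budget)"
```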

| Approach | How it searches | Cross-branch sharing | Explore vs exploit |
| --- | --- | --- | --- |
| Monte Carlo Tree Search | a tree of candidates, scored by rollouts | none: each branch is on its own | one fixed balance (e.g. a constant) |
| AlphaEvolve (evolutionary) | mutate & select a population of programs | only via the surviving population | set by mutation/selection pressure |
| Progressive MCGS (MLEvolve, arXiv) | a graph of candidates, scored by rollouts | reference edges share sub-results | a schedule: explore wide → exploit best |

None of this is free: a graph needs the bookkeeping to decide which sub-results are worth pinning and which branches should read them, and a schedule needs tuning so it neither commits too early nor wanders too long. But once that machinery works, the search stops paying the same toll over and over, which is how MLEvolve gets to state-of-the-art on MLE-Bench in roughly half the usual runtime and outperforms AlphaEvolve in a different domain entirely, mathematical algorithm optimization.


Frequently Asked Questions