Maestro — RL orchestrator over frozen experts
The news. On May 21, 2026, the authors posted Maestro: Reinforcement Learning to Orchestrate Hierarchical Model-Skill Ensembles on arXiv. The headline claims: a 4B orchestrator averages 70.1% across 10 multimodal benchmarks, ahead of GPT-5 (69.3%) and Gemini 2.5 Pro (68.7%), while never updating a single expert weight. On four held-out hard benchmarks with expert pools the policy was never trained against, it still reportedly scores 59.5%, which is the evidence cited for the "generalizes to unseen experts" framing.
Picture the metaphor: a busy ER, one triage nurse at the front desk, three specialist doors behind her. A patient with chest pain walks in; she sends them to cardiology. A patient with a sprained wrist; ortho. A patient with a sudden migraine; neuro. The nurse is not curing anyone — she's just routing. Years of seeing which specialist actually fixed which complaint trained her to read the clues fast: a hand on the chest, the way someone limps in, the description of the headache. The expensive expertise is downstream of her decision, not in her head. That asymmetry is the whole Maestro argument: the orchestrator can be tiny because the heavy lifting is already sitting in the specialist pool.
The mechanism is a policy network — call it π — whose state is the task plus the available expert pool, and whose action is the joint (expert, skill) tuple. The paper trains π with reinforcement learning against an end-to-end reward — did the system, after Maestro routed the task and the chosen expert answered with the chosen skill, get the right answer? Over the training run π learns to favor the routes that historically paid off, while the experts themselves are never touched. The decision is which expert AND which skill on that expert — both matter, because a good expert invoked through the wrong skill still wastes the call.
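
As a rough illustration of the loop described above, here is a minimal sketch of a tabular softmax policy over joint (expert, skill) actions, trained with plain REINFORCE on a binary end-to-end reward. It is not the paper's method. The expert names, skill names, task types, and success probabilities are all hypothetical, and a real orchestrator would be a learned 4B policy reading rich task inputs rather than a lookup table. The frozen experts are stood in for by a fixed table of success probabilities that the training loop never modifies.

```python
import math
import random

# All names and numbers below are hypothetical, chosen only for illustration.
EXPERTS = ["vision_expert", "math_expert"]
SKILLS = ["direct_answer", "step_by_step"]
ACTIONS = [(e, s) for e in EXPERTS for s in SKILLS]  # joint action space
TASKS = ["chart_question", "geometry_problem"]

# Stand-in for the frozen experts: probability that (expert, skill) answers
# a task of the given type correctly. Training never changes this table.
SUCCESS = {
    ("chart_question", "vision_expert", "direct_answer"): 0.9,
    ("chart_question", "vision_expert", "step_by_step"): 0.6,
    ("geometry_problem", "math_expert", "step_by_step"): 0.9,
    ("geometry_problem", "math_expert", "direct_answer"): 0.4,
}
DEFAULT_SUCCESS = 0.1  # any route not listed above


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [x / z for x in exps]


def train(steps=20000, lr=0.1, seed=0):
    rng = random.Random(seed)
    logits = {t: [0.0] * len(ACTIONS) for t in TASKS}  # policy parameters
    baseline = {t: 0.0 for t in TASKS}                 # running mean reward
    for _ in range(steps):
        task = rng.choice(TASKS)
        probs = softmax(logits[task])
        a = rng.choices(range(len(ACTIONS)), weights=probs)[0]
        expert, skill = ACTIONS[a]
        # End-to-end reward: 1 if the routed call produced a right answer.
        p_ok = SUCCESS.get((task, expert, skill), DEFAULT_SUCCESS)
        reward = 1.0 if rng.random() < p_ok else 0.0
        advantage = reward - baseline[task]
        baseline[task] += 0.05 * (reward - baseline[task])
        # REINFORCE: d log pi(a) / d logit_i = 1[i == a] - probs[i]
        for i in range(len(ACTIONS)):
            grad = (1.0 if i == a else 0.0) - probs[i]
            logits[task][i] += lr * advantage * grad
    return logits


def best_route(logits, task):
    i = max(range(len(ACTIONS)), key=lambda k: logits[task][k])
    return ACTIONS[i]


logits = train()
for task in TASKS:
    print(task, "->", best_route(logits, task))
```

In this toy setup the right expert with the wrong skill pays off noticeably less than the right expert with the right skill, which is the reason the action is the joint pair rather than the expert alone.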
The harder claim, and the one worth dwelling on, is generalization to unseen experts. Most learned routers do not survive a new model showing up: the policy has memorized "GPT-4 wins on math, Claude wins on summarization," and the moment you swap in a different code model, that memorized routing goes stale. The Maestro paper reports that the policy still scores 59.5% on four held-out hard benchmarks with expert pools it was never trained against. That number is well below the in-distribution 70.1%, though the held-out benchmarks are also described as hard, so the gap cannot be attributed to the unseen experts alone. It is still well above what an identity-keyed router would be expected to deliver, which is essentially noise on an unseen roster. The paper's framing is that the policy is routing by capability shape rather than memorized expert identity, though the specific representation it uses to do this is not enumerated in the abstract.
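
To make the identity-versus-capability distinction concrete, here is a small sketch contrasting the two kinds of scoring on a roster of experts that neither router has seen. Everything in it is an assumption made for illustration: the abstract does not say how Maestro represents experts, so the capability axes, descriptor vectors, expert names, and scores are invented and should not be read as the paper's representation.

```python
# Identity-keyed router: learned scores stored per (task type, expert name).
# Hypothetical values standing in for whatever training produced.
IDENTITY_SCORES = {
    ("math", "expert_a"): 0.9,
    ("math", "expert_b"): 0.3,
}


def identity_score(task_type, expert_name):
    # A name never seen in training has no entry, so it carries no signal.
    return IDENTITY_SCORES.get((task_type, expert_name), 0.0)


# Capability-keyed router: a task is described by what it needs, and an
# expert by what it offers, over the same (hypothetical) capability axes.
TASK_NEEDS = {"math": {"symbolic_reasoning": 1.0, "vision": 0.2}}


def capability_score(task_type, capabilities):
    needs = TASK_NEEDS[task_type]
    return sum(w * capabilities.get(axis, 0.0) for axis, w in needs.items())


# A new pool: neither expert name appears in IDENTITY_SCORES.
NEW_POOL = {
    "expert_x": {"symbolic_reasoning": 0.8, "vision": 0.1},
    "expert_y": {"symbolic_reasoning": 0.2, "vision": 0.9},
}

identity_ranking = {name: identity_score("math", name) for name in NEW_POOL}
capability_ranking = {
    name: capability_score("math", caps) for name, caps in NEW_POOL.items()
}
best_by_capability = max(capability_ranking, key=capability_ranking.get)

print("identity scores:  ", identity_ranking)
print("capability scores:", capability_ranking)
print("capability pick:  ", best_by_capability)
```

The identity-keyed scores tie at zero for every unseen expert, so any pick among them is arbitrary, while the capability-keyed scores still separate the two newcomers. Whether Maestro's policy does anything resembling this is exactly what the abstract leaves open.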
Where it earns its keep against existing options is concrete. A frontier monolith pays the full forward pass on every token of every task, even ones a much smaller model could have answered correctly. A contextual-bandit tool router gets you per-(task × provider) routing but only on a fixed roster: add a provider and the bandit has to re-explore, starting cold on the new arm. Maestro sits in between. Its RL policy learns a richer state representation than the bandit, and, unlike the monolith, it dispatches the bulk of the compute to whichever frozen expert suits the task. The accuracy comparison is the headline payoff: the 4B Maestro orchestrator at 70.1% versus frontier baselines at 69.3% (GPT-5) and 68.7% (Gemini 2.5 Pro), all paper-reported, with most of the inference compute pushed onto the selected expert rather than spent on every token through a single giant model.
What changes vs. an LLM-as-router or a bandit router
Routers shipped to date fall into a few recognizable shapes, and the Maestro contribution is best read by comparison with them.
| Approach | Action space | State representation | Adding a new expert means… |
|---|---|---|---|
| Hand-written router | expert only (no skill) | hand-coded rules on the task string | edit the rules; ship a deploy |
| LLM-as-router (zero-shot) | expert only (no skill) | the prompt the router model reads | edit the router's expert list and prompt; no learned correction |
| Contextual bandit router | expert only | handful of task features + provider history | re-explore — bandit starts cold on the new arm |
| Maestro (this paper) | joint (expert, skill) | task + the available expert pool (representation not enumerated) | policy generalizes — no retraining required (paper claim, held-out 59.5%) |
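
The "bandit starts cold on the new arm" row can be illustrated with a minimal per-context UCB1 router. This is a generic textbook-style sketch, not any particular product's router; the provider names, contexts, and deterministic reward rule are hypothetical.

```python
import math
from collections import defaultdict


class ContextualUCB:
    """Per-context UCB1 over a roster of providers (illustrative only)."""

    def __init__(self, arms):
        self.arms = list(arms)
        self.counts = defaultdict(int)   # (context, arm) -> number of pulls
        self.means = defaultdict(float)  # (context, arm) -> mean reward

    def add_arm(self, arm):
        # The new arm has no history in any context.
        self.arms.append(arm)

    def select(self, context):
        # Any untried arm must be pulled before its estimate means anything.
        for arm in self.arms:
            if self.counts[(context, arm)] == 0:
                return arm
        total = sum(self.counts[(context, a)] for a in self.arms)

        def ucb(arm):
            n = self.counts[(context, arm)]
            return self.means[(context, arm)] + math.sqrt(2 * math.log(total) / n)

        return max(self.arms, key=ucb)

    def update(self, context, arm, reward):
        key = (context, arm)
        self.counts[key] += 1
        self.means[key] += (reward - self.means[key]) / self.counts[key]


# Hypothetical ground truth: which provider is right for which context.
GOOD_ROUTES = {("math", "provider_a"), ("summary", "provider_b")}

bandit = ContextualUCB(["provider_a", "provider_b"])
for _ in range(50):
    for ctx in ("math", "summary"):
        arm = bandit.select(ctx)
        bandit.update(ctx, arm, 1.0 if (ctx, arm) in GOOD_ROUTES else 0.0)

# A new provider joins the roster: nothing learned so far transfers to it.
bandit.add_arm("provider_c")
for ctx in ("math", "summary"):
    print(ctx, "-> next pull:", bandit.select(ctx))
```

After `add_arm`, the next selection in every context is the new provider, because the bandit has no estimate for it and no way to infer one from the arms it already knows. That per-context re-exploration is the cost the last column of the table refers to.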
A further implication is that Maestro composes cleanly with multi-agent topologies. The Workflow Patterns module describes orchestrator-worker setups where a supervisor dispatches subtasks to worker agents. Maestro is a learned orchestrator in that pattern, a drop-in for the hand-written supervisor, and it changes the failure modes the Agent Teams module catalogs. A hand-written supervisor that picks the wrong worker fails predictably (the same wrong worker every time); an RL supervisor's failures are probabilistic and reward-shaped, which makes them harder to debug but easier to reduce with more rollouts.
There is a real cost worth stating out loud. RL on a 4B policy with multimodal experts is not free — the paper does not publish a training-compute figure, so the actual amortization point depends on traffic volume and the specifics of the rollout setup. Bandit routing remains the right answer when you have a stable roster of under ten providers, a fast reward signal, and don't need a joint (expert, skill) action space; Maestro becomes the right answer when the pool is large, the skills matter, and you expect the pool to grow.
Goes deeper in: AI Agents → Workflow Patterns → Orchestrator-Workers