What does Maestro actually decide on each task?

A joint (expert, skill) tuple. Given the task description and the available expert pool, the 4B RL policy outputs a distribution over experts and over each expert's available skills, samples the tuple, and dispatches the task. The expert answers with the chosen skill and the system returns the answer. The reward signal during training is whether that combination produced the right final answer. Prior tool routers — hand-written rules, LLM-as-router, contextual bandits — act only over expert identity; Maestro's larger action space is part of why it captures more of the available capability across a heterogeneous pool.

How does Maestro generalize to expert pools it never trained against?

The paper reports that the policy reaches 59.5% on four held-out hard benchmarks where the expert pool was never seen during training, and frames this as evidence that the policy routes by capability shape rather than memorized expert identity. The exact representation the policy uses to identify experts at the input is not enumerated in the paper's abstract. The 59.5% figure sits 10.6 points below the in-distribution 70.1%, but it is still well above what an identity-keyed router would achieve on an unseen roster, where memorized per-expert preferences carry no information and the routing is essentially noise.

When should I reach for Maestro instead of a contextual-bandit tool router?

Bandit routing wins when the roster is small and stable (under ten providers), the reward signal is fast and cheap to compute per turn, and the action is just 'pick an expert.' Maestro becomes the right answer when the pool is large and growing, when skills inside an expert matter (so the action space has to be the joint (expert, skill) tuple), and when you can afford an offline RL training run that amortizes across the traffic that will use the resulting policy. They aren't strict substitutes — a production stack could use a fast bandit as a warm-start prior on a small roster and switch to an RL orchestrator once the pool stabilizes and the bandit's per-arm exploration starts to feel expensive.

Maestro paper — RL orchestrator over frozen experts


learnaivisually.com/ai-explained/maestro-rl-orchestrator-frozen-experts

Jargon

Orchestrator: The component in a multi-model system that decides who handles each task. In Maestro it is a 4B policy network; the Workflow Patterns module covers the broader orchestrator-workers pattern in the agent track.
Frozen experts: Expert models whose weights are never updated during training. Only the orchestrator sees gradient updates. That separation is what lets the orchestrator generalize — it never overfits to a specific expert's quirks.
RL policy π(action | state): A function that outputs the probability of each action given the current state. For Maestro the state is the task plus the available expert pool; the action is the joint (expert, skill) tuple. The specific input representation the policy uses for the pool is not enumerated in the paper's abstract.
Skill: A named capability an expert exposes — "summarize," "execute code," "describe an image," "answer with citations." Most experts expose several skills. Picking an expert without picking its skill leaves capability on the table; that's why the action space is the joint tuple.
Held-out benchmark: A benchmark whose tasks and whose expert pool the orchestrator never saw during training. Maestro hits 59.5% on four such benchmarks — the headline evidence that the policy isn't just memorizing which expert is best on the training set.
Contextual bandit: An older online-learning recipe for the same kind of decision: pick an arm (provider, tool, expert) per context, observe a reward, update the arm-selection policy. Sample-efficient on narrow setups. The tool-router explainer covers the bandit version; Maestro uses end-to-end RL instead, which trades sample efficiency for a larger action space and richer state representation.

The news. On May 21, 2026, the authors posted Maestro: Reinforcement Learning to Orchestrate Hierarchical Model-Skill Ensembles on arXiv. The headline claims: a 4B orchestrator averages 70.1% across 10 multimodal benchmarks, ahead of GPT-5 (69.3%) and Gemini 2.5 Pro (68.7%), while never updating a single expert weight. On four held-out hard benchmarks with expert pools the policy was never trained against, it still reportedly scores 59.5% — the evidence cited for the "generalizes to unseen experts" framing.

Picture the metaphor: a busy ER, one triage nurse at the front desk, three specialist doors behind her. A patient with chest pain walks in; she sends them to cardiology. A patient with a sprained wrist; ortho. A patient with a sudden migraine; neuro. The nurse is not curing anyone — she's just routing. Years of seeing which specialist actually fixed which complaint trained her to read the clues fast: a hand on the chest, the way someone limps in, the description of the headache. The expensive expertise is downstream of her decision, not in her head. That asymmetry is the whole Maestro argument: the orchestrator can be tiny because the heavy lifting is already sitting in the specialist pool.

The mechanism is a policy network — call it π — whose state is the task plus the available expert pool, and whose action is the joint (expert, skill) tuple. The paper trains π with reinforcement learning against an end-to-end reward — did the system, after Maestro routed the task and the chosen expert answered with the chosen skill, get the right answer? Over the training run π learns to favor the routes that historically paid off, while the experts themselves are never touched. The decision is which expert AND which skill on that expert — both matter, because a good expert invoked through the wrong skill still wastes the call.
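To make the joint action concrete, here is a minimal sketch of sampling an (expert, skill) tuple and nudging the policy with a right-or-wrong reward. It is a toy under loud assumptions: the policy is a table of logits that ignores the task entirely, the update is plain REINFORCE with a running baseline, and the names `JointPolicy`, `train`, `best_route`, and the example pool are all hypothetical. The source does not say which RL algorithm the paper uses, and the real policy is a 4B network conditioned on the task and the pool.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


class JointPolicy:
    """Toy tabular policy over (expert, skill) tuples. Hypothetical, not the paper's network."""

    def __init__(self, pool):
        # pool maps an expert name to the list of skills it exposes.
        self.pool = pool
        self.experts = list(pool)
        self.expert_logits = [0.0] * len(self.experts)
        self.skill_logits = {e: [0.0] * len(pool[e]) for e in self.experts}

    def sample(self, rng):
        """Sample an expert index, then a skill index among that expert's skills."""
        p_e = softmax(self.expert_logits)
        i = rng.choices(range(len(p_e)), weights=p_e)[0]
        p_s = softmax(self.skill_logits[self.experts[i]])
        j = rng.choices(range(len(p_s)), weights=p_s)[0]
        return i, j

    def update(self, i, j, reward, baseline, lr=0.5):
        """REINFORCE step: the gradient of log-softmax is one_hot(choice) - probs."""
        adv = reward - baseline
        p_e = softmax(self.expert_logits)
        for k in range(len(p_e)):
            self.expert_logits[k] += lr * adv * ((1.0 if k == i else 0.0) - p_e[k])
        skills = self.skill_logits[self.experts[i]]
        p_s = softmax(skills)
        for k in range(len(p_s)):
            skills[k] += lr * adv * ((1.0 if k == j else 0.0) - p_s[k])


def train(policy, is_correct, steps=2000, seed=0):
    """Roll out, score the final answer as 0 or 1, and update only the policy."""
    rng = random.Random(seed)
    baseline = 0.0
    for _ in range(steps):
        i, j = policy.sample(rng)
        expert = policy.experts[i]
        skill = policy.pool[expert][j]
        reward = 1.0 if is_correct(expert, skill) else 0.0
        policy.update(i, j, reward, baseline)
        baseline += 0.05 * (reward - baseline)  # running average of recent reward
    return policy


def best_route(policy):
    """Greedy route: highest-logit expert, then that expert's highest-logit skill."""
    i = max(range(len(policy.experts)), key=lambda k: policy.expert_logits[k])
    expert = policy.experts[i]
    logits = policy.skill_logits[expert]
    j = max(range(len(logits)), key=lambda k: logits[k])
    return expert, policy.pool[expert][j]


# Hypothetical pool: the "experts" here are just names and are never modified.
demo_pool = {"vision": ["describe", "ocr"], "coder": ["execute", "explain"]}
demo_policy = train(JointPolicy(demo_pool), lambda e, s: (e, s) == ("coder", "execute"))
print(best_route(demo_policy))
```

Note that the reward only says whether the whole tuple worked. A good expert called through the wrong skill earns zero, which is why the expert logit and the skill logit are updated together from the same signal.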

The harder claim, and the one worth dwelling on, is generalization to unseen experts. Most learned routers do not survive a new model showing up: the policy has memorized "GPT-4 wins on math, Claude wins on summary," and the moment you swap in a different code model, that memorized routing goes stale. The Maestro paper reports that the policy still scores 59.5% on four held-out hard benchmarks against expert pools it was never trained against. That number is 10.6 points below the in-distribution 70.1%, but it is well above what an identity-keyed router would deliver on an unseen roster, which is essentially noise. The paper's framing is that the policy routes by capability shape rather than memorized expert identity, though the specific representation it uses to do this is not enumerated in the abstract.
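The difference between identity-keyed and capability-keyed routing can be shown with two tiny functions. This is an illustration only: the capability tags, expert names, and win rates are invented for the example, and the tag-overlap score is a stand-in, since the source says the paper's abstract does not specify how the policy represents experts.

```python
def identity_router(win_rates, task_type, roster):
    """Route by memorized (task_type, expert_name) win rate; unknown names score 0.0."""
    scores = {name: win_rates.get((task_type, name), 0.0) for name in roster}
    # max() returns the first best entry, so a full tie falls back to roster order.
    return max(roster, key=lambda n: scores[n]), scores


def capability_router(task_needs, roster_caps):
    """Route by overlap between what the task needs and what each expert declares."""
    scores = {
        name: len(task_needs & caps) / len(task_needs)
        for name, caps in roster_caps.items()
    }
    return max(roster_caps, key=lambda n: scores[n]), scores


# Win rates learned on a training roster (made-up numbers).
memorized = {("math", "expert_a"): 0.8, ("math", "expert_b"): 0.4}

# A roster the router has never seen: every memorized score is missing.
pick, scores = identity_router(memorized, "math", ["expert_x", "expert_y"])
print(pick, scores)

# The same unseen roster, described by hypothetical capability tags.
unseen = {"expert_x": {"ocr"}, "expert_y": {"symbolic_math", "code"}}
pick, scores = capability_router({"symbolic_math"}, unseen)
print(pick, scores)
```

On the unseen roster the identity router has nothing to go on, so its choice is decided by list order rather than by anything it learned. The capability router can still rank the new experts because its input describes what they do, not who they are.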

Against existing options, the trade-offs are concrete. A frontier monolith pays the full forward pass on every token of every task, even ones a much smaller model could have answered correctly. A contextual-bandit tool router gets you per-(task × provider) routing, but only on a fixed roster: add a provider and the bandit starts cold on the new arm and has to explore it before it can route to it well. Maestro sits in between. Its RL policy learns a richer state representation than the bandit, and unlike the monolith it dispatches the bulk of the compute to whichever frozen expert fits the task. The accuracy comparison is the headline payoff: the 4B Maestro at 70.1% versus frontier baselines at 69.3% (GPT-5) and 68.7% (Gemini 2.5 Pro), all paper-reported, with most of the inference compute pushed onto the selected expert rather than spent on every token through a single giant model.
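The cold-start problem on a new arm is easy to see in a stripped-down bandit router. The sketch below assumes an epsilon-greedy rule with one running average per (context, arm) pair. The class name `EpsilonGreedyRouter` and the provider names are hypothetical, and production bandit routers typically use richer context features than a single string.

```python
import random


class EpsilonGreedyRouter:
    """Minimal per-context epsilon-greedy router over a roster of providers."""

    def __init__(self, arms, epsilon=0.1, seed=0):
        self.arms = list(arms)
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        self.pulls = {}  # (context, arm) -> number of times tried
        self.wins = {}   # (context, arm) -> summed reward

    def estimate(self, context, arm):
        """Average observed reward, or 0.0 if this arm was never tried here."""
        n = self.pulls.get((context, arm), 0)
        return self.wins.get((context, arm), 0.0) / n if n else 0.0

    def choose(self, context):
        """Explore with probability epsilon, otherwise exploit the best estimate."""
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.arms)
        return max(self.arms, key=lambda a: self.estimate(context, a))

    def update(self, context, arm, reward):
        key = (context, arm)
        self.pulls[key] = self.pulls.get(key, 0) + 1
        self.wins[key] = self.wins.get(key, 0.0) + reward

    def add_arm(self, arm):
        """A new provider joins with no history in any context."""
        self.arms.append(arm)


router = EpsilonGreedyRouter(["provider_a", "provider_b"], epsilon=0.0)
for _ in range(5):
    router.update("code", "provider_a", 1.0)

router.add_arm("provider_c")
print(router.estimate("code", "provider_c"))  # no history yet for the new arm
print(router.choose("code"))
```

With exploration switched off the router never tries the new provider at all, and with exploration on it only learns about the newcomer at the rate epsilon allows. Either way the new arm's quality has to be discovered from live traffic.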

What changes vs. an LLM-as-router or a bandit router

Routers shipped to date fall into a few recognizable shapes, and Maestro's contribution is easiest to read by comparison.

Approach	Action space	State representation	Adding a new expert means…
Hand-written router	expert only (no skill)	hand-coded rules on the task string	edit the rules; ship a deploy
LLM-as-router (zero-shot)	expert only (no skill)	the prompt the router model reads	edit the router's expert list and prompt; no learned correction
Contextual bandit router	expert only	handful of task features + provider history	re-explore — bandit starts cold on the new arm
Maestro (this paper)	joint (expert, skill)	task + the available expert pool (representation not enumerated)	policy generalizes — no retraining required (paper claim, held-out 59.5%)

A further implication is that Maestro composes cleanly with multi-agent topologies. The Workflow Patterns module describes orchestrator-worker setups where a supervisor dispatches subtasks to worker agents. Maestro is a learned orchestrator in that pattern, a drop-in for the hand-written supervisor, and it changes the failure modes the Agent Teams module catalogs. A hand-written supervisor that picks the wrong worker fails predictably (the same wrong worker every time); an RL supervisor's failures are probabilistic and reward-shaped, which makes them harder to debug but easier to keep improving with more rollouts.
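The "drop-in" idea and the two failure styles can be sketched with a shared interface. Everything here is hypothetical: the `Supervisor` protocol, the keyword rule, and the fixed sampling weights are stand-ins, and a learned orchestrator would compute its weights from the task rather than store them in a dict.

```python
import random
from typing import Dict, Protocol, Sequence


class Supervisor(Protocol):
    """Anything that can pick a worker for a task can sit in the supervisor slot."""

    def pick(self, task: str, workers: Sequence[str]) -> str: ...


class RuleSupervisor:
    """Hand-written routing: a keyword check on the task string."""

    def pick(self, task: str, workers: Sequence[str]) -> str:
        return workers[0] if "image" in task else workers[-1]


class SampledSupervisor:
    """Stand-in for a learned policy: samples a worker from stored weights."""

    def __init__(self, weights: Dict[str, float], seed: int = 0):
        self.weights = weights
        self.rng = random.Random(seed)

    def pick(self, task: str, workers: Sequence[str]) -> str:
        w = [self.weights.get(name, 1.0) for name in workers]
        return self.rng.choices(list(workers), weights=w)[0]


def dispatch(supervisor: Supervisor, task: str, workers: Sequence[str]) -> str:
    """The surrounding workflow only sees the interface, not the routing logic."""
    return supervisor.pick(task, workers)


workers = ["vision_worker", "code_worker"]
task = "transcribe this scanned image of Python code"

rule_picks = {dispatch(RuleSupervisor(), task, workers) for _ in range(50)}
sampler = SampledSupervisor({"vision_worker": 1.0, "code_worker": 1.0})
sampled_picks = {dispatch(sampler, task, workers) for _ in range(50)}
print(rule_picks, sampled_picks)
```

The rule-based supervisor gives the same answer on every call, so a wrong route is reproducible and easy to trace. The sampled supervisor spreads its picks across workers, so a wrong route shows up as a rate rather than a repeatable bug.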

There is a real cost worth stating out loud. RL on a 4B policy with multimodal experts is not free — the paper does not publish a training-compute figure, so the actual amortization point depends on traffic volume and the specifics of the rollout setup. Bandit routing remains the right answer when you have a stable roster of under ten providers, a fast reward signal, and don't need a joint (expert, skill) action space; Maestro becomes the right answer when the pool is large, the skills matter, and you expect the pool to grow.
