Tool router — Contextual-bandit tool routing
AgentThe news. On May 16, 2026, the authors posted a research paper that reframes tool selection inside agent harnesses as a contextual bandit problem — arXiv 2605.14241. Instead of picking equivalent providers by static rule, the router observes context features (task type, prior tool outputs, query difficulty), picks one provider, and updates its policy from the downstream answer-quality signal. The reward is the paper's "answer quality per service cycle" — a scalar that bakes final answer fidelity together with latency into a single number, so the policy can prefer a slower-but-more-accurate provider when the fidelity gap is worth the time.
Picture a busy taxi stand. A passenger walks up and asks for a ride to the airport. The dispatcher could send the next driver in line — pure round-robin, no learning. Or they could check a notebook of past trips, route by route, driver by driver — "Driver A is reportedly faster on airport runs, Driver B on downtown drops" — and pick accordingly. Send a new passenger asking for a downtown drop and the notebook has a different recommendation, because downtown traffic is a different problem than airport runs. After every ride the notebook gets one more tick mark in the right cell.
The agent harness has the same problem. Two tool providers can claim equivalence — "we both do web search" — but they may not be equivalent on the request types that actually show up. One search API might be the right call for factual lookups because its index is fresher; the other might be the right call for code documentation because it ranks API references above blog spam. Picking by lowest latency alone misses this entirely — the slower provider can be the right one when the fidelity gap pays back the time. Picking by round-robin averages over everything, which is the wrong answer twice.
The paper's move is to model this as a contextual bandit. Every incoming request comes with a feature vector — task family ("search" / "code" / "math"), prior tool outputs (was the previous turn a failure?), query difficulty (length, structure). The router consults a learned policy P(provider | context), picks one provider, sends the request, and observes a reward at the end of the turn. The reward — the paper's answer quality per service cycle — combines downstream answer fidelity with the turn's latency cost in a single scalar. Given enough turns (illustrative; the source does not report convergence-rate numbers), the policy fills out a sparse matrix with one strong cell per row — a learned preference per (task family × provider).
The hero animation walks this in three beats — all numbers in the animation are illustrative figures picked to make the mechanism legible, not benchmark results from the paper. On the left, three task-family chips — search, code, math — feed into a router diamond. To the right of the router sits the policy matrix: a 3×3 grid of P(provider | task) weights. During the naive beat, the matrix is uniform 0.33 — the router picks providers round-robin, the three provider lanes on the right rack up uneven quality bars, and the cumulative answer-quality strip at the bottom climbs slowly. The matrix then learns: off-diagonal cells fade toward 0 and diagonal cells (the right provider per family) rise toward 1. Every routed request now lands on the optimal provider and the cumulative strip lifts to a higher number. The headline contrast in the hero — 0.42 → 0.78 — is one such illustrative figure, not a number reported in the source.
Where the bandit framing earns its keep is a worked example with named numbers (illustrative; the source does not report per-turn quality numbers). Say the agent handles 300 turns per hour split evenly across three families. Under round-robin routing, only ~33% of turns hit the optimal provider for their family, so the average quality is roughly 1.0 × 0.33 + 0.45 × 0.67 ≈ 0.63 per turn. (The wrong-provider quality is non-zero because most providers can still complete most tasks, just less well.) Under the learned bandit policy, ~95% of turns hit the optimal provider, lifting the average to 1.0 × 0.95 + 0.45 × 0.05 ≈ 0.97 per turn — a ~54% lift in per-turn answer quality at the same latency budget (illustrative). The actual production gain depends on how big the per-family fidelity gap is and how stable the providers' quality stays over time.
Where bandit routing sits next to other tool-selection patterns
| Selection scheme | What it consults | What it learns | Where it earns its keep |
|---|---|---|---|
| Static rule | Hard-coded preference per task family | Nothing — engineer-encoded | Small toolsets where the right call is obvious |
| Round-robin | Last-picked provider | Nothing — load-balancing only | Equivalent providers with no quality gap |
| Lowest-latency | Provider latency table | Latency only (no fidelity) | Latency-bound traffic where all providers are accurate enough |
| Contextual-bandit router (this paper) | Task family + prior context | P(provider | task) from answer-quality feedback | Heterogeneous traffic where some providers are reliably better on some task families |
| LLM-as-router workflow | A small LLM call per request | Nothing — picks per request, no memory | Cold-start regimes with no historical reward signal yet |
The interesting compositional point: the contextual bandit is complementary to an LLM-as-router workflow, not a replacement. An LLM-router gives you reasonable behavior with zero history — it reads the request and picks. The bandit takes over once you have outcomes to learn from. A production stack can use the LLM-router as the warm-start prior and the bandit as the online correction layer; the paper's framing focuses on the bandit step and does not benchmark this exact composition. Bandit routing also pairs naturally with parallel tool calls when budget allows — query the top-K providers in parallel and let answer-quality scoring pick the winner, which doubles as a reward signal.
There is one new operational surface. The reward signal — answer quality per service cycle — has to be cheap, fast, and trustworthy enough to update the policy online. The paper's framing assumes a working evaluator exists; in practice that's the same evaluator pipeline you'd need for production evals — a small judge model or a rule-based scorer that produces a quality scalar per turn. Without it, the bandit has nothing to learn from, and you're back to round-robin. The reward shape also encodes the policy's preferences implicitly: weighting latency too heavily collapses the router to lowest-latency-wins; weighting fidelity too heavily ignores cost. The paper does not prescribe a single weighting — it's an application-level dial.
Goes deeper in: Agent Engineering → Cost & Latency → The Cost Profile of an Agent
Related explainers
- MCP SEP-2663 — async task handles for long-running tool calls — the transport-layer change that makes "fire one provider, wait on the handle, then update reward" practical
- AsyncFC — symbolic futures in the decode stream — the inference-side counterpart: don't stall decode while tools run
- Is Grep All You Need? — Grep vs vector for agentic retrieval — same flavor of finding: the harness around the tool dominates the tool-choice question