What is a contextual-bandit tool router?

It's an agent-harness component that picks between equivalent tool providers (two search APIs, two code executors, two summarization endpoints) by treating each request as a contextual bandit decision. The router observes context features (task family, prior tool outputs, query difficulty), consults a learned policy P(provider | context), picks one provider, observes the downstream answer-quality reward, and updates the policy. Over time it learns which provider performs best on which task family — instead of routing by lowest latency or round-robin alone. The paper's reward is "answer quality per service cycle," a scalar that combines final answer fidelity with latency cost so the router can trade them off.

How is bandit routing different from lowest-latency or round-robin routing?

Lowest-latency routing always picks the fastest provider — fine if all providers are accurate enough on every task, but on heterogeneous traffic it sends complex queries to a fast-but-shallow provider and wastes the slow-but-accurate one. Round-robin spreads load evenly but learns nothing. Bandit routing learns from outcomes: requests with the "math" tag start routing to whichever provider has historically delivered the best answer-quality / latency tradeoff on math queries, and the same for other task families. The trade-off is operational: bandit routing needs a working reward signal — a fast, cheap evaluator that scores each turn — and an explore/exploit policy (ε-greedy, UCB, Thompson sampling) so the router keeps testing under-used providers instead of locking in on the first apparent winner.

How does this relate to an LLM-as-router workflow?

They're complementary. An LLM-router reads the incoming request and picks a provider per call without any memory — useful as a cold-start prior when you have no outcome data yet. A bandit router uses observed rewards to learn a per-(task × provider) policy that, given enough turns and a reliable reward signal, can in principle improve on a zero-shot picker on the traffic it has seen. A production stack could use the LLM-router as the warm-start prior and the bandit as the online correction layer that overrides the prior once confidence on a given task family is high. The paper focuses on the bandit step and does not benchmark this exact composition.

Tool router paper — Contextual-bandit tool routing

Tool router — Contextual-bandit tool routing

Agent

learnaivisually.com/ai-explained/tool-router-contextual-bandit

Jargon

Contextual bandit: A simplified reinforcement-learning setting where on each step the agent sees a context (features describing the request), picks one of K arms (here: tool providers), and observes a reward. Unlike the full RL setting, there is no long-term state to plan around — every decision is one-shot. The router has to balance exploration (try less-used providers) against exploitation (pick the historical winner).
Tool provider: Any backend service the agent can call to advance a turn — a web search API, a code execution sandbox, a summarization endpoint. Two providers are "equivalent" when they answer the same kind of question but with different latency / cost / quality trade-offs.
Task family: A coarse-grained tag on the incoming request — "factual lookup", "code completion", "math reasoning". The bandit learns one preference per (family × provider) pair instead of one global preference.
Answer quality per service cycle: The paper's reward signal. A single scalar that combines final answer fidelity with latency cost per turn — so a slow-but-correct provider can outscore a fast-but-wrong one when the fidelity gap is large enough. See Agent Engineering → Cost & Latency → Cost Profile.
Exploration vs exploitation: The classic bandit trade-off. Pure exploitation always picks the historical winner and never learns about better providers; pure exploration ignores what you've already learned. Standard schemes — ε-greedy, UCB, Thompson sampling — schedule the mix so confidence intervals shrink fast on common task families and stay wide on rare ones.
Tool selection: The pre-bandit baseline: pick a provider by a static rule (lowest published latency, round-robin, hard-coded preference). See AI Agents → Tool Use → Tool Design for how this is usually wired.

The news. On May 16, 2026, the authors posted a research paper that reframes tool selection inside agent harnesses as a contextual bandit problem — arXiv 2605.14241. Instead of picking equivalent providers by static rule, the router observes context features (task type, prior tool outputs, query difficulty), picks one provider, and updates its policy from the downstream answer-quality signal. The reward is the paper's "answer quality per service cycle" — a scalar that bakes final answer fidelity together with latency into a single number, so the policy can prefer a slower-but-more-accurate provider when the fidelity gap is worth the time.

Picture a busy taxi stand. A passenger walks up and asks for a ride to the airport. The dispatcher could send the next driver in line — pure round-robin, no learning. Or they could check a notebook of past trips, route by route, driver by driver — "Driver A is reportedly faster on airport runs, Driver B on downtown drops" — and pick accordingly. Send a new passenger asking for a downtown drop and the notebook has a different recommendation, because downtown traffic is a different problem than airport runs. After every ride the notebook gets one more tick mark in the right cell.

The agent harness has the same problem. Two tool providers can claim equivalence — "we both do web search" — but they may not be equivalent on the request types that actually show up. One search API might be the right call for factual lookups because its index is fresher; the other might be the right call for code documentation because it ranks API references above blog spam. Picking by lowest latency alone misses this entirely — the slower provider can be the right one when the fidelity gap pays back the time. Picking by round-robin averages over everything, which is the wrong answer twice.

The paper's move is to model this as a contextual bandit. Every incoming request comes with a feature vector — task family ("search" / "code" / "math"), prior tool outputs (was the previous turn a failure?), query difficulty (length, structure). The router consults a learned policy P(provider | context), picks one provider, sends the request, and observes a reward at the end of the turn. The reward — the paper's answer quality per service cycle — combines downstream answer fidelity with the turn's latency cost in a single scalar. Given enough turns (illustrative; the source does not report convergence-rate numbers), the policy fills out a sparse matrix with one strong cell per row — a learned preference per (task family × provider).

The hero animation walks this in three beats — all numbers in the animation are illustrative figures picked to make the mechanism legible, not benchmark results from the paper. On the left, three task-family chips — search, code, math — feed into a router diamond. To the right of the router sits the policy matrix: a 3×3 grid of P(provider | task) weights. During the naive beat, the matrix is uniform 0.33 — the router picks providers round-robin, the three provider lanes on the right rack up uneven quality bars, and the cumulative answer-quality strip at the bottom climbs slowly. The matrix then learns: off-diagonal cells fade toward 0 and diagonal cells (the right provider per family) rise toward 1. Every routed request now lands on the optimal provider and the cumulative strip lifts to a higher number. The headline contrast in the hero — 0.42 → 0.78 — is one such illustrative figure, not a number reported in the source.

Where the bandit framing earns its keep is a worked example with named numbers (illustrative; the source does not report per-turn quality numbers). Say the agent handles 300 turns per hour split evenly across three families. Under round-robin routing, only ~33% of turns hit the optimal provider for their family, so the average quality is roughly 1.0 × 0.33 + 0.45 × 0.67 ≈ 0.63 per turn. (The wrong-provider quality is non-zero because most providers can still complete most tasks, just less well.) Under the learned bandit policy, ~95% of turns hit the optimal provider, lifting the average to 1.0 × 0.95 + 0.45 × 0.05 ≈ 0.97 per turn — a ~54% lift in per-turn answer quality at the same latency budget (illustrative). The actual production gain depends on how big the per-family fidelity gap is and how stable the providers' quality stays over time.

Where bandit routing sits next to other tool-selection patterns

Selection scheme	What it consults	What it learns	Where it earns its keep
Static rule	Hard-coded preference per task family	Nothing — engineer-encoded	Small toolsets where the right call is obvious
Round-robin	Last-picked provider	Nothing — load-balancing only	Equivalent providers with no quality gap
Lowest-latency	Provider latency table	Latency only (no fidelity)	Latency-bound traffic where all providers are accurate enough
Contextual-bandit router (this paper)	Task family + prior context	P(provider \| task) from answer-quality feedback	Heterogeneous traffic where some providers are reliably better on some task families
LLM-as-router workflow	A small LLM call per request	Nothing — picks per request, no memory	Cold-start regimes with no historical reward signal yet

The interesting compositional point: the contextual bandit is complementary to an LLM-as-router workflow, not a replacement. An LLM-router gives you reasonable behavior with zero history — it reads the request and picks. The bandit takes over once you have outcomes to learn from. A production stack can use the LLM-router as the warm-start prior and the bandit as the online correction layer; the paper's framing focuses on the bandit step and does not benchmark this exact composition. Bandit routing also pairs naturally with parallel tool calls when budget allows — query the top-K providers in parallel and let answer-quality scoring pick the winner, which doubles as a reward signal.

There is one new operational surface. The reward signal — answer quality per service cycle — has to be cheap, fast, and trustworthy enough to update the policy online. The paper's framing assumes a working evaluator exists; in practice that's the same evaluator pipeline you'd need for production evals — a small judge model or a rule-based scorer that produces a quality scalar per turn. Without it, the bandit has nothing to learn from, and you're back to round-robin. The reward shape also encodes the policy's preferences implicitly: weighting latency too heavily collapses the router to lowest-latency-wins; weighting fidelity too heavily ignores cost. The paper does not prescribe a single weighting — it's an application-level dial.

Goes deeper in: Agent Engineering → Cost & Latency → The Cost Profile of an Agent

Related explainers

MCP SEP-2663 — async task handles for long-running tool calls — the transport-layer change that makes "fire one provider, wait on the handle, then update reward" practical
AsyncFC — symbolic futures in the decode stream — the inference-side counterpart: don't stall decode while tools run
Is Grep All You Need? — Grep vs vector for agentic retrieval — same flavor of finding: the harness around the tool dominates the tool-choice question

Frequently Asked Questions

Check what you knowMap your AI & GPU knowledge across every track — free, role-based