The news. On June 11, 2026, researchers released HarnessBridge: Learnable Bidirectional Controller for LLM Agent Harness. It argues that an agent's performance is shaped not only by the model and the environment but by the harness that connects them, and that today's hand-engineered harnesses get harder to scale as trajectories grow longer. HarnessBridge replaces that glue with a lightweight learnable module trained end-to-end. On Terminal-Bench 2.0 and SWE-bench Verified it reports matching or beating strong specialized harnesses while using fewer tokens and shorter trajectories.
Picture dropping into a city where you don't speak the language. You hire an interpreter, and suddenly you can act: they listen to the noise around you and tell you only what matters, and they turn your requests into things locals will actually do. A hand-built harness is the phrasebook version of that interpreter: a fixed set of rules an engineer wires up by hand — and like a phrasebook, it breaks the moment the conversation goes off-script. That brittleness is exactly why a harness fails in production: real tasks wander off the rehearsed path, and the longer the agent runs, the more often they do.
HarnessBridge makes the harness itself learnable: a bidirectional controller built from two projections, trained end-to-end, that sits between the agent and its environment. It is the interpreter who learned the city instead of memorizing a phrasebook. The observation projection is the listening half: it takes the raw trajectory (pages of tool output, stack traces, file dumps) and compresses it into a compact state the agent can actually use, the same scarce-context discipline a good engineer applies by hand. The action projection is the speaking half: it turns each proposed action into a well-formed, executable step or, when the run so far says the action can't succeed, into a trajectory-grounded rejection before it ever touches the environment. Both halves are trained together on a harness-supervision dataset, so the controller learns both what the agent sees and what it's allowed to do.
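To make the two directions concrete, here is a minimal sketch of the data flow only. It is not the paper's method: the real projections are learned, whereas this toy uses fixed, hand-written rules (ironically, the phrasebook version) purely to show where each half sits. Every name here (`ToyBridge`, `Step`, `Decision`, `observe`, `act`, `record`) is hypothetical and does not come from the paper.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Step:
    action: str   # command the agent ran
    output: str   # raw tool output returned by the environment


@dataclass
class Decision:
    accepted: bool
    payload: str  # executable command if accepted, rejection reason otherwise


@dataclass
class ToyBridge:
    """Hand-written stand-in for the two projections (the real ones are learned)."""
    max_chars: int = 80
    trajectory: List[Step] = field(default_factory=list)

    def record(self, action: str, output: str) -> None:
        """Append one executed step to the raw trajectory."""
        self.trajectory.append(Step(action, output))

    def observe(self) -> str:
        """Observation direction: compress the raw trajectory into a compact state."""
        lines = []
        for i, step in enumerate(self.trajectory):
            # Crude heuristic (an assumption): keep only the last output line per step.
            out_lines = step.output.strip().splitlines()
            tail = out_lines[-1] if out_lines else ""
            lines.append(f"[{i}] {step.action} -> {tail[:self.max_chars]}")
        return "\n".join(lines)

    def act(self, proposed: str) -> Decision:
        """Action direction: normalize the action, or reject it citing the trajectory."""
        command = " ".join(proposed.split())  # collapse stray whitespace
        parts = command.split()
        if parts and parts[0] == "rm":
            target = parts[-1]
            for i, step in enumerate(self.trajectory):
                past = step.action.split()
                # Reject an rm whose target an earlier rm in this run already removed.
                if past[:1] == ["rm"] and past[-1] == target:
                    return Decision(False, f"rejected: {target} was already removed at step {i}")
        return Decision(True, command)


bridge = ToyBridge()
bridge.record("ls build", "a.o\nb.o\nbuild/tmp")
bridge.record("rm -rf build/tmp", "")
print(bridge.observe())
print(bridge.act("  rm   -rf build/tmp "))  # rejected before touching the environment
print(bridge.act("pytest  -q"))             # normalized and passed through
```

The point of the sketch is the shape of the interface: the agent only ever reads `observe()` and only ever acts through `act()`. In HarnessBridge both of those mappings are trained rather than written by hand.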
Where this earns its keep is the token math (illustrative — the paper reports aggregate token and trajectory-length reductions but not this per-step trace). Suppose a 40-step coding task where each tool call dumps roughly 2,000 tokens of raw output. Replay that whole trajectory into context and you carry 80,000 tokens of environment noise. The observation projection distills each step to about 200 tokens of decision-relevant state — 40 × 200 ≈ 8,000 tokens, a 10× cut — leaving room for the model to actually reason. Meanwhile the action projection catches a doomed command (an rm on a path that's already gone) and returns a grounded rejection, turning a crash-then-recover detour of 2 wasted steps into 0.
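The arithmetic above can be checked with a few lines of Python. This is a sketch of the illustrative scenario only; the step count and per-step token figures are the article's assumed numbers, not measurements from the paper, and `context_budget` is a hypothetical helper name.

```python
def context_budget(steps: int, raw_per_step: int, compressed_per_step: int) -> dict:
    """Compare context cost of replaying raw tool output vs. a compressed state."""
    raw_total = steps * raw_per_step
    compressed_total = steps * compressed_per_step
    return {
        "raw_total": raw_total,
        "compressed_total": compressed_total,
        "reduction_factor": raw_total / compressed_total,
        "tokens_freed": raw_total - compressed_total,
    }


# Illustrative figures from the text: 40 steps, ~2,000 raw vs ~200 compressed tokens each.
budget = context_budget(steps=40, raw_per_step=2_000, compressed_per_step=200)
print(budget)
```

With these inputs the raw replay costs 80,000 tokens and the compressed state 8,000, which frees 72,000 tokens of context for reasoning. Changing `compressed_per_step` shows how sensitive the saving is to how aggressively each step is distilled.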
| Harness | Who builds it | Scales as trajectories grow? | Main cost |
|---|---|---|---|
| Hand-engineered glue | A human, per environment | No — fixed rules get brittle off-script | Cheap to start; expensive to maintain |
| Prompt-only scaffold (ReAct-style) | A prompt template | Partly — but raw observations still flood context | ~1 model call per step on full logs (illustrative) |
| HarnessBridge (learned) | Trained end-to-end on a harness-supervision dataset | Yes — projections compress state and vet each action | A training pass up front; fewer tokens at run time |
The honest read is that the paper reports its wins qualitatively: it "matches or surpasses" strong specialized, hand-built harnesses on Terminal-Bench 2.0 and SWE-bench Verified with fewer tokens and shorter trajectories, rather than posting a single headline number. It also reports that the same learned harness, trained once, transfers from smaller models to bigger commercial ones. That last part is the quiet payoff: if the harness generalizes, you don't rebuild the interface every time you swap the model underneath it, and the brittle, hand-tuned layer that used to cap agent performance becomes something you train instead of maintain.
Goes deeper in: AI Agents → The Agent Loop & State → The Anatomy of a Harness and Agent Engineering → Production Harness Architecture → Why a Harness Fails in Production
Related explainers
- Harness-1 — state-externalizing search harness — keeps working memory outside the transcript; HarnessBridge instead learns the whole interface
- Crafter — multi-agent refinement harness with a directive critic — another way to make the harness smarter, via typed critique rather than a learned projection
- FutureSim — harness-level evaluation — why the harness, not just the model, is what you should be measuring