Agent-harness scaling law — Effective Feedback Compute (EFC)
The news. On May 28, 2026, researchers posted an agent-harness scaling-law paper to arXiv introducing Effective Feedback Compute (EFC), a metric that predicts agent success from the quality of feedback the harness returns, not the compute it spends. Plotted against EFC, harness-run success rates fit a clean scaling law (reported R²≈0.94–0.99 across datasets); plotted against raw compute, the same runs barely fit (R²≈0.33–0.42, rising to ~0.88 only with a hand-built multivariate baseline). In one controlled comparison, lifting feedback quality moved success from 0.27 to 0.90 with token cost and tool calls held fixed.
Picture two students prepping for the same exam. The first logs ten hours re-reading the textbook cover to cover — enormous effort, page after page. The second spends one hour with a sharp tutor who, after each practice problem, points at the exact line where the reasoning went wrong, confirms the fix is correct, never repeats a note already written down, and makes sure it lands in the margin for next time. On exam day the second student wins, and it is not close. The hours-logged number — the raw compute — told you almost nothing. The number that predicted the grade was how much useful correction actually got absorbed. That second number is what this paper names Effective Feedback Compute, and the claim is that agent harnesses behave the same way.
The mechanism is a re-definition of the x-axis. Instead of counting tokens or tool invocations, EFC measures the useful signal the harness feeds back each step — scored on four axes (informativeness, validity, non-redundancy, retention) — and then normalizes by task demand so a crisp correction counts for more on a hard task than an easy one. That normalized quantity becomes the horizontal axis of a scaling law that fits success rates across the paper's datasets. The practical reading for anyone building agents: the lever is not your reasoning budget but what your harness chooses to log and return after every tool call.
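A minimal sketch of that definition, under stated assumptions. The explainer names the four axes and the task-demand normalization but not how the paper combines them, so the product rule, the [0, 1] score range, and the names `FeedbackScore` and `effective_feedback_compute` are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class FeedbackScore:
    """Per-step feedback quality on the four axes, each assumed to lie in [0, 1]."""
    informativeness: float
    validity: float
    non_redundancy: float
    retention: float

    def signal(self) -> float:
        # Assumption: the axes multiply, so a zero on any one axis
        # (for example, a false-positive warning) wipes out the step.
        return (self.informativeness * self.validity
                * self.non_redundancy * self.retention)


def effective_feedback_compute(steps: Iterable[FeedbackScore],
                               task_demand: float) -> float:
    """Sum the per-step signal, then normalize by task demand."""
    if task_demand <= 0:
        raise ValueError("task_demand must be positive")
    return sum(step.signal() for step in steps) / task_demand
```

With this shape, the same feedback yields a lower EFC on a task with higher demand, which is one way to read the normalization: a harder task needs more useful signal to reach the same point on the curve.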
This is why the raw-compute axis goes flat. A harness can burn an enormous budget returning low-quality feedback — a terse exit code 1 with no stack trace (low informativeness), a linter warning that is actually a false positive (low validity), the same "tests failed" string ten turns in a row (high redundancy), or an error the agent has already forgotten by the time it matters (low retention). All of that is real compute and real tool calls, and on the EFC axis it is worth almost nothing. The tutor who just says "study harder" for an hour spent the hour; the student learned nothing. Worse, in a long rollout the low-signal steps let compounding errors accumulate unchecked, so the spend actively buys you a longer path to the same failure.
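One of the four failure modes above, redundancy, is easy to approximate in code. The sketch below is a crude proxy, not the paper's scoring method: it counts a feedback message as redundant if an identical message (after case and whitespace normalization) was already returned earlier in the run. The function name `non_redundancy` and the normalization rule are assumptions made for illustration.

```python
from typing import Sequence


def non_redundancy(messages: Sequence[str]) -> float:
    """Fraction of feedback messages that were not already seen in this run.

    Hypothetical proxy: exact-match after lowercasing and collapsing whitespace.
    Returns 0.0 for an empty run, since no new signal was delivered.
    """
    if not messages:
        return 0.0
    seen = set()
    novel = 0
    for message in messages:
        key = " ".join(message.lower().split())
        if key not in seen:
            seen.add(key)
            novel += 1
    return novel / len(messages)
```

Ten identical "tests failed" strings score 0.1 under this proxy, because only the first one tells the agent anything new. Near-duplicates with different wording would slip through, so a real harness would likely need a fuzzier comparison.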
Where the feedback gap actually comes from
Hold three variables fixed: one agent, one task, one budget (40 tool calls, ~120K tokens per run). The only difference between the two runs is the harness's feedback quality. The numbers that follow are an illustrative decomposition calibrated to the paper's 0.27→0.90 and R² headline figures; the per-step unit values and the task-demand figure are stand-ins, not measured constants. In Run A, every step returns a terse pass/fail string; say each step carries about 0.1 units of useful, valid, non-redundant, retained signal, so over 40 steps the agent accumulates 40 × 0.1 = 4 units. The task demands roughly 30 units to solve, so EFC = 4 / 30 ≈ 0.13, low on the law's curve and near the 0.27 success rate from the paper's controlled comparison. In Run B, the harness returns the failing assertion, the offending input, and a one-line diff each step; call it 0.8 units per step, so 40 × 0.8 = 32 units and EFC = 32 / 30 ≈ 1.07, high on the curve and up near 0.90 success. Same cost, same tool count, ~8× the effective feedback. The success jump is the headline; the per-call yield jump is the deeper story.
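The arithmetic for the two runs can be reproduced in a few lines. This uses only the illustrative stand-in values from the paragraph above (0.1 and 0.8 units per step, 40 steps, a task demand of 30); the helper name `efc` is hypothetical, and none of these are constants from the paper.

```python
def efc(signal_per_step: float, steps: int, task_demand: float) -> float:
    """Accumulated useful signal divided by task demand."""
    return signal_per_step * steps / task_demand


STEPS = 40          # same tool-call budget for both runs
TASK_DEMAND = 30.0  # illustrative units of signal needed to solve the task

efc_run_a = efc(0.1, STEPS, TASK_DEMAND)  # terse pass/fail feedback
efc_run_b = efc(0.8, STEPS, TASK_DEMAND)  # assertion + input + diff feedback
ratio = efc_run_b / efc_run_a

print(f"Run A EFC = {efc_run_a:.2f}")  # prints: Run A EFC = 0.13
print(f"Run B EFC = {efc_run_b:.2f}")  # prints: Run B EFC = 1.07
print(f"Ratio     = {ratio:.1f}x")     # prints: Ratio     = 8.0x
```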
| Scaling-law x-axis | What it counts | Fit to success (R²) |
|---|---|---|
| Raw compute | tokens + tool calls spent | ~0.33–0.42 — poor (paper) |
| Multivariate compute baseline | several spend features combined | ~0.88 — better, hand-built (paper) |
| Effective Feedback Compute (EFC) | 4-axis feedback quality ÷ task demand | ~0.94–0.99 — tight (paper) |
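
For readers who want to run the same comparison on their own logs, here is a sketch of how an R² like those in the table can be computed. The functional form of the paper's scaling law is not given here, so the straight-line least-squares fit below is a stand-in assumption, and the function name `r_squared` is hypothetical. The formula itself, R² = 1 − SS_res / SS_tot, is the standard coefficient of determination.

```python
from typing import Sequence


def r_squared(xs: Sequence[float], ys: Sequence[float]) -> float:
    """R^2 of an ordinary least-squares line fitted to (xs, ys)."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or ss_tot == 0:
        raise ValueError("xs and ys must each vary")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual sum of squares around the fitted line.
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    return 1.0 - ss_res / ss_tot
```

Calling it once with raw compute as `xs` and once with an EFC estimate as `xs`, against the same success rates as `ys`, gives the kind of side-by-side comparison the table summarizes.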
A caveat worth stating plainly: this is a scaling-law fit on the paper's own datasets, and a tight fit is a strong correlation, not a guaranteed control knob. EFC is also harder to move than a token budget — "return better feedback" is a design problem, not a slider, and scoring the four axes reliably is itself non-trivial. The honest framing is that EFC gives you a yardstick and a direction: instrument the feedback your harness returns, A/B candidate changes in shadow, and treat feedback quality as a first-class number alongside latency and cost. Whether the exact coefficients transfer to your stack is exactly the kind of thing you should measure, not assume.
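As a sketch of what treating feedback quality as a first-class number could look like, the snippet below logs a per-call feedback score next to latency and token cost, then summarizes by harness variant for a shadow A/B comparison. The record fields, the variant labels, and the `summarize` helper are all hypothetical; how `feedback_signal` is scored is left to whatever rubric you adopt.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class ToolCallRecord:
    variant: str            # hypothetical label, e.g. "control" or "candidate"
    latency_ms: float
    tokens: int
    feedback_signal: float  # per-call quality score, assumed to lie in [0, 1]


def summarize(records: Iterable[ToolCallRecord]) -> Dict[str, Dict[str, float]]:
    """Group records by harness variant and average each tracked metric."""
    by_variant: Dict[str, List[ToolCallRecord]] = {}
    for record in records:
        by_variant.setdefault(record.variant, []).append(record)
    return {
        variant: {
            "calls": len(group),
            "mean_latency_ms": mean(r.latency_ms for r in group),
            "mean_tokens": mean(r.tokens for r in group),
            "mean_feedback_signal": mean(r.feedback_signal for r in group),
        }
        for variant, group in by_variant.items()
    }
```

Keeping the three metrics in one record makes the trade-off visible: a candidate harness that raises mean feedback signal at similar latency and token cost is the kind of change the scaling law suggests is worth testing further.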
Related explainers
- PushBench — Quantitative Goal Persistence (QGP) — another harness-level number for long-horizon agent reliability
- FutureSim — harness-level agent eval — why evaluating the harness, not the model alone, is the trend
- Cursor Composer 2.5 — targeted textual feedback RL — the training-time analogue: a sharp, targeted correction beats a blunt end-of-rollout reward