FutureSim — Harness-level agent eval vs single-shot QA

[Infographic: single-shot QA (one question in, one answer out, e.g. "Will the trade deal close by Q1?") vs harness-level eval, where the agent harness works over a 3-month news feed (Jan–Mar 2026) as articles arrive chronologically. Top-1 accuracy chart for frontier agents on the 3-month replay: best agent ~25%, other evaluated agents lower; many agents score worse Brier skill than no prediction.]
learnaivisually.com/ai-explained/futuresim-harness-level-eval

The news. On May 15, 2026, Max Planck Institute for Intelligent Systems released FutureSim — a benchmark that replays real-world news articles in chronological order and asks agents to forecast events that resolve over a 3-month horizon (Jan–Mar 2026). The benchmark runs each agent in its native harness so the eval scores end-to-end behaviour — retrieval, planning, reasoning over evolving evidence — not just single-shot model predictions. Headline result: the best frontier agent reaches only ~25% accuracy, and many score worse Brier skill than the no-prediction baseline — they would have done better by saying nothing. Read the paper →

Picture the pop quiz. The teacher hands you a single sheet, you fill in one answer, the teacher grades it, you move on. That is exactly what a single-shot QA benchmark does to an LLM: one prompt in, one answer out, one mark against an answer key. The whole agent harness — the tool calls, the planning loop, the retrieval, the reasoning over many turns — is invisible to that grade. Whatever the harness does well or badly inside the LLM-OS, the quiz can't see it.

Now picture the news-prediction contest. You enrol on January 1. Every day for three months, articles arrive in the paper in actual publication order — talks reopen, a tariff vote is set, markets dip, a joint statement is issued. You forecast events that will resolve over the coming weeks ("Will the trade deal close by Q1?"), and the calendar settles each forecast by simply waiting for the date to arrive. The contest grades the whole you: how you search, what notes you keep, which fresh evidence you weigh, when you revise. There is no fixed answer key — only the world unfolding.

That is the structural difference FutureSim is built around. It is harness-level, not model-level. The benchmark runs each agent under its provided harness — the retrieval tools, the planning loop, and the memory setup it ships with — and lets the whole stack run against a moving target. Two agents built on the same underlying model can score very differently under different harnesses, and a model that aces MMLU can collapse on FutureSim. The score is not the model alone; it is the agent.
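FutureSim's actual harness API isn't shown in this summary, but a minimal sketch of the replay structure described above looks something like the following. The `Article`, `Question`, and `Agent` names are hypothetical; the point is that retrieval, memory, and revision all live inside the loop, not in the grader.

```python
from dataclasses import dataclass
from datetime import date
from typing import Protocol

@dataclass
class Article:
    published: date
    text: str

@dataclass
class Question:
    text: str                     # e.g. "Will the trade deal close by Q1?"
    resolves_on: date             # the calendar settles the forecast
    outcome: bool | None = None   # filled in once resolves_on passes

class Agent(Protocol):
    def observe(self, article: Article) -> None: ...
    def forecast(self, question: Question) -> float: ...  # P(event)

def replay(agent: Agent, feed: list[Article], questions: list[Question]):
    """Stream articles in publication order; record the agent's live
    forecast for every still-open question after each article arrives."""
    history: dict[str, list[tuple[date, float]]] = {q.text: [] for q in questions}
    for article in sorted(feed, key=lambda a: a.published):
        agent.observe(article)  # retrieval, memory, planning happen in the harness
        for q in questions:
            if article.published < q.resolves_on:  # question still open
                history[q.text].append((article.published, agent.forecast(q)))
    return history
```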

Why "25%" is the punchline

Across the frontier agents Max Planck evaluated, the best reaches roughly 25% accuracy. That is a small absolute number, but the more revealing detail is Brier skill: a calibration-aware scoring rule that compares an agent's probabilistic forecasts against a do-nothing baseline. Many of the evaluated frontier agents score worse Brier skill than the no-prediction baseline. They would have done better by always saying "I don't know."
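The exact scoring rule FutureSim uses isn't reproduced here, but the standard Brier score and Brier skill score capture the idea. A minimal sketch, assuming a fixed p = 0.5 "no prediction" reference forecast (FutureSim's actual baseline may differ):

```python
def brier_score(forecasts: list[float], outcomes: list[int]) -> float:
    """Mean squared error between probabilistic forecasts and 0/1 outcomes.
    Lower is better; 0.0 is a perfect, fully confident forecaster."""
    return sum((p - o) ** 2 for p, o in zip(forecasts, outcomes)) / len(outcomes)

def brier_skill(forecasts: list[float], outcomes: list[int],
                reference: float = 0.5) -> float:
    """Skill relative to a fixed 'no prediction' reference forecast.
    > 0 beats the baseline, 0 matches it, < 0 is worse than saying nothing."""
    bs = brier_score(forecasts, outcomes)
    bs_ref = brier_score([reference] * len(outcomes), outcomes)
    return 1.0 - bs / bs_ref
```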

This is exactly the failure mode that single-shot QA cannot expose. On a single-shot quiz, "I don't know" is a wrong answer, graded the same as any other miss. On FutureSim it is the strongest baseline: silence earns zero skill, never negative skill. An agent that drifts, hallucinates, or over-commits earns negative skill. That is what makes a calibrated probabilistic forecast different from a fluent-sounding wrong one.

The benchmark also splits the field crisply — there is a clear separation between the top agent and the rest. In production-eval terms (Agent Engineering → Production Evals & Shadow Mode) that's the signal you'd want: a benchmark you can actually rank by, not one where everyone clusters at 95%.

Why the harness, not just the model, is what fails

FutureSim's authors lean on a specific framing: when the eval lets the agent run end-to-end, harness behaviour can materially affect the score. That maps onto a now-standard list of agent failure modes:

| Failure mode | What it looks like in production | Caught by single-shot QA? | Caught by harness-level eval? |
| --- | --- | --- | --- |
| Bad retrieval | Pulls the wrong articles; reasons over irrelevant context | No — single-shot prompts include the right context | Yes — the agent has to do its own retrieval over evolving evidence |
| Compounding errors | Turn-2 miss corrupts turn-3 plan corrupts turn-4 tool call | No — every prompt is fresh | Yes — the run continues over many turns |
| Plan abandonment / drift | Agent forgets the original forecast and chases new evidence | No | Yes — the prediction must be revised, not restated |
| Premature stopping | Locks in a forecast before key articles arrive | No — there are no later turns to skip | Yes — early lock-ins are graded against later events |
| Overconfidence | Confident wrong calls outweigh confident right calls | Partly — single-shot calibration is measurable but flat | Yes — Brier skill punishes confident wrongs over time |

A worked example makes the difference concrete. Suppose you have two agents, A and B, both built on the same base model. A's harness retrieves three articles per day and revises its forecast as new ones arrive; B's harness retrieves once on day 0 and locks the forecast. On a single-shot quiz over the same questions, A and B would score the same: the prompt was identical, the model was identical. On FutureSim, A revises after the day-25 markets-dip article and the day-38 joint statement; B is stuck on its day-0 view. If reality resolves to "deal does NOT close by Q1," A's revised forecast lands at ~0.3 while B is locked at ~0.8. With the event resolving to 0, A's Brier score is 0.3² = 0.09 and B's is 0.8² = 0.64, a gap of ~0.55 — roughly the gap that separates a competent forecaster from one scoring below the no-prediction baseline. (The numbers are illustrative but consistent with Brier-score arithmetic.) The harness, not the model, made the difference. Single-shot QA scored them identically; FutureSim separated them.
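Plugging the illustrative A/B numbers into the `brier_score` helper sketched earlier reproduces the gap:

```python
# The deal does NOT close by Q1, so the event resolves to 0.
outcome = [0]
bs_a = brier_score([0.3], outcome)   # A revised down after days 25 and 38 -> 0.09
bs_b = brier_score([0.8], outcome)   # B locked in on day 0                -> 0.64
print(round(bs_b - bs_a, 2))         # 0.55, the gap from the worked example
```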

This is also why FutureSim's framing pairs naturally with the Span-per-Tick observability model. Once a benchmark can score the harness end-to-end, the next question is: which tick caused the failure? That is a question single-shot QA cannot even pose.

Goes deeper in: AI Agents → Evals & Diagnostics → The 4 Eval Failure Modes
