FutureSim — Harness-level agent eval vs single-shot QA

[Infographic: single-shot QA (one question in, one answer out, e.g. "Will the trade deal close by Q1?") vs harness-level eval, where the agent harness works over a 3-month news feed (Jan–Mar 2026) as articles arrive chronologically. Top-1 accuracy chart for frontier agents on the 3-month replay: best agent ~25%, other evaluated agents lower; many agents score worse Brier skill than no prediction.]
learnaivisually.com/ai-explained/futuresim-harness-level-eval

The news. On May 15, 2026, Max Planck Institute for Intelligent Systems released FutureSim — a benchmark that replays real-world news articles in chronological order and asks agents to forecast events that resolve over a 3-month horizon (Jan–Mar 2026). The benchmark runs each agent in its native harness so the eval scores end-to-end behaviour — retrieval, planning, reasoning over evolving evidence — not just single-shot model predictions. Headline result: the best frontier agent reaches only ~25% accuracy, and many score worse Brier skill than the no-prediction baseline — they would have done better by saying nothing. Read the paper →

Picture the pop quiz. The teacher hands you a single sheet, you fill in one answer, the teacher grades it, you move on. That is exactly what a single-shot QA benchmark does to an LLM: one prompt in, one answer out, one mark against an answer key. The whole agent harness — the tool calls, the planning loop, the retrieval, the reasoning over many turns — is invisible to that grade. Whatever the harness does well or badly inside the LLM-OS, the quiz can't see it.

Now picture the news-prediction contest. You enrol on January 1. Every day for three months, articles arrive in the paper in actual publication order — talks reopen, a tariff vote is set, markets dip, a joint statement is issued. You forecast events that will resolve over the coming weeks ("Will the trade deal close by Q1?"), and the calendar settles each forecast by simply waiting for the date to arrive. The contest grades the whole you: how you search, what notes you keep, which fresh evidence you weigh, when you revise. There is no fixed answer key — only the world unfolding.

That is the structural difference FutureSim is built around. It is harness-level, not model-level. The benchmark runs each agent under its provided harness — the retrieval tools, the planning loop, and the memory setup it ships with — and lets the whole stack run against a moving target. Two agents built on the same underlying model can score very differently under different harnesses, and a model that aces MMLU can collapse on FutureSim. The score is not the model alone; it is the agent.
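FutureSim's actual harness API isn't shown in this summary, but a minimal sketch of the replay structure described above looks something like the following. The `Article`, `Question`, and `Agent` names are hypothetical; the point is that retrieval, memory, and revision all live inside the loop, not in the grader.

```python
from dataclasses import dataclass
from datetime import date
from typing import Protocol

@dataclass
class Article:
    published: date
    text: str

@dataclass
class Question:
    text: str                     # e.g. "Will the trade deal close by Q1?"
    resolves_on: date             # the calendar settles the forecast
    outcome: bool | None = None   # filled in once resolves_on passes

class Agent(Protocol):
    def observe(self, article: Article) -> None: ...
    def forecast(self, question: Question) -> float: ...  # P(event)

def replay(agent: Agent, feed: list[Article], questions: list[Question]):
    """Stream articles in publication order; record the agent's live
    forecast for every still-open question after each article arrives."""
    history: dict[str, list[tuple[date, float]]] = {q.text: [] for q in questions}
    for article in sorted(feed, key=lambda a: a.published):
        agent.observe(article)  # retrieval, memory, planning happen in the harness
        for q in questions:
            if article.published < q.resolves_on:  # question still open
                history[q.text].append((article.published, agent.forecast(q)))
    return history
```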

Why "25%" is the punchline

Across the frontier agents Max Planck evaluated, the best reaches roughly 25% accuracy. That is a small absolute number, but the more revealing detail is Brier skill: a calibration-aware scoring rule that compares an agent's probabilistic forecasts against a do-nothing baseline. Many of the evaluated frontier agents score worse Brier skill than the no-prediction baseline. They would have done better by always saying "I don't know."
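The exact scoring rule FutureSim uses isn't reproduced here, but the standard Brier score and Brier skill score capture the idea. A minimal sketch, assuming a fixed p = 0.5 "no prediction" reference forecast (FutureSim's actual baseline may differ):

```python
def brier_score(forecasts: list[float], outcomes: list[int]) -> float:
    """Mean squared error between probabilistic forecasts and 0/1 outcomes.
    Lower is better; 0.0 is a perfect, fully confident forecaster."""
    return sum((p - o) ** 2 for p, o in zip(forecasts, outcomes)) / len(outcomes)

def brier_skill(forecasts: list[float], outcomes: list[int],
                reference: float = 0.5) -> float:
    """Skill relative to a fixed 'no prediction' reference forecast.
    > 0 beats the baseline, 0 matches it, < 0 is worse than saying nothing."""
    bs = brier_score(forecasts, outcomes)
    bs_ref = brier_score([reference] * len(outcomes), outcomes)
    return 1.0 - bs / bs_ref
```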

This is exactly the failure mode that single-shot QA cannot expose. On a single-shot quiz, "I don't know" is a wrong answer, graded the same as any other miss. On FutureSim it is the strongest baseline: silence earns zero skill, never negative skill. An agent that drifts, hallucinates, or over-commits earns negative skill. That is what makes a calibrated probabilistic forecast different from a fluent-sounding wrong one.

The benchmark also splits the field crisply — there is a clear separation between the top agent and the rest. In production-eval terms (Agent Engineering → Production Evals & Shadow Mode) that's the signal you'd want: a benchmark you can actually rank by, not one where everyone clusters at 95%.

Why the harness, not just the model, is what fails

FutureSim's authors lean on a specific framing: when the eval lets the agent run end-to-end, harness behaviour can materially affect the score. That maps onto a now-standard list of agent failure modes:

| Failure mode | What it looks like in production | Caught by single-shot QA? | Caught by harness-level eval? |
| --- | --- | --- | --- |
| Bad retrieval | Pulls the wrong articles; reasons over irrelevant context | No — single-shot prompts include the right context | Yes — the agent has to do its own retrieval over evolving evidence |
| Compounding errors | Turn-2 miss corrupts turn-3 plan corrupts turn-4 tool call | No — every prompt is fresh | Yes — the run continues over many turns |
| Plan abandonment / drift | Agent forgets the original forecast and chases new evidence | No | Yes — the prediction must be revised, not restated |
| Premature stopping | Locks in a forecast before key articles arrive | No — there are no later turns to skip | Yes — early lock-ins are graded against later events |
| Overconfidence | Confident wrong calls outweigh confident right calls | Partly — single-shot calibration is measurable but flat | Yes — Brier skill punishes confident wrongs over time |

A worked example makes the difference concrete. Suppose you have two agents, A and B, both built on the same base model. A's harness retrieves three articles per day and revises its forecast as new ones arrive; B's harness retrieves once on day 0 and locks the forecast. On a single-shot quiz over the same questions, A and B would score the same: the prompt was identical, the model was identical. On FutureSim, A revises after the day-25 markets-dip article and the day-38 joint statement; B is stuck on its day-0 view. If reality resolves to "deal does NOT close by Q1," A's revised forecast lands at ~0.3 while B is locked at ~0.8. With the event resolving to 0, A's Brier score is 0.3² = 0.09 and B's is 0.8² = 0.64, a gap of ~0.55 — roughly the gap that separates a competent forecaster from one scoring below the no-prediction baseline. (The numbers are illustrative but consistent with Brier-score arithmetic.) The harness, not the model, made the difference. Single-shot QA scored them identically; FutureSim separated them.
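Plugging the illustrative A/B numbers into the `brier_score` helper sketched earlier reproduces the gap:

```python
# The deal does NOT close by Q1, so the event resolves to 0.
outcome = [0]
bs_a = brier_score([0.3], outcome)   # A revised down after days 25 and 38 -> 0.09
bs_b = brier_score([0.8], outcome)   # B locked in on day 0                -> 0.64
print(round(bs_b - bs_a, 2))         # 0.55, the gap from the worked example
```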

This is also why FutureSim's framing pairs naturally with the Span-per-Tick observability model. Once a benchmark can score the harness end-to-end, the next question is: which tick caused the failure? That is a question single-shot QA cannot even pose.

Goes deeper in: AI Agents → Evals & Diagnostics → The 4 Eval Failure Modes
