What is a latent failure in LLM planning?

A latent failure is a plan step that executes without raising any error yet silently undermines the goal — the plan "completes," but the end state is wrong. SIMMER (arXiv 2606.14574) distinguishes it from a precondition violation, which fails loudly and immediately, and shows that up to 56% of frontier-LLM plans contain a latent failure that an outcome-only check can miss.

Why do latent failures matter for AI agents?

Because the failures that don't throw are the ones a "did it crash / did the answer match" eval can quietly pass. An agent can run a long plan, have every step succeed, and still fail the task — and you only find out downstream. SIMMER finds only 17% of plans across six frontier models are fully error-free, so silent mid-plan errors are a dominant, under-measured failure mode.

How does counterfactual foresight reduce latent failures?

Counterfactual foresight makes the planner simulate a step's effect on the world state before committing it, then check whether the predicted end state still meets the goal. Because the simulated state reveals the silent corruption up front, SIMMER reports counterfactual foresight cuts latent failures by up to 72% and irreversible failures by up to 75%.

SIMMER: up to 56% of frontier-LLM plans hide latent failures (Latent failures in planning)

Jargon

Latent failure: A plan step that runs without raising an error but silently corrupts the goal — the plan "completes," yet the end state is wrong. Contrast with a crash, which stops execution and is obvious.
Symbolic world model: An explicit, rule-based model of the environment's state — here a kitchen of 77 actions and 262 objects — that tracks what is true after each action (what's thawed, what's in each pot). It is the ground truth a plan is checked against.
Precondition violation: An action attempted when its requirements aren't met — sautéing in an empty pan. This is an immediate failure: the world model rejects the step on the spot, so it's the easy kind to catch.
Irreversible failure: A latent failure that can't be undone — you can't un-burn a sauce. Once committed, no later step recovers the goal, which is why catching it before acting matters most.
Counterfactual foresight: Before committing a step, the planner simulates its effect on the world state ("if I do this, what becomes true?") and checks whether the predicted end state still meets the goal. See AI Agents → When to Pause and Observe.
State-machine executor: The benchmark's checker. It steps the world model forward one action at a time, flagging immediate precondition violations, latent hazards, and irreversible failures as they arise.

The news. On June 12, 2026, researchers released SIMMER, a planning benchmark built on a human-curated symbolic kitchen world model (77 actions, 262 objects, ~46,800 realistic interactions). Driving six frontier LLMs through it, they found that only 17% of plans are error-free and that up to 56% contain a "latent failure": a step that executes without halting but silently undermines the goal. A state-machine executor validates each step against the world model, and adding explicit counterfactual foresight sharply cuts the silent errors.

Picture a line cook working a recipe. Every step is physically doable — chop, sear, simmer, plate — so nothing ever throws an error; the timer dings and the plate goes out. But somewhere in the middle the cook salted twice, or thawed something that had to stay frozen. The meal is quietly ruined and no alarm ever sounded. That is a latent failure: the plan ran to completion, every step "succeeded," and the goal still failed — which is exactly the kind of error a single-shot plan is most likely to bury.

SIMMER catches it by giving the kitchen a symbolic world model — an explicit record of the real state after every action: what's thawed, what's in each pot, what can no longer be undone. A state-machine executor then walks the plan one step at a time and sorts what goes wrong into three kinds. Only the first kind — a precondition violation, like sautéing in an empty pan — fails loudly; the other two can slip past a check that only asks "did it crash" or "did the final answer match."

Failure kind	Does it halt execution?	Does the goal fail?	Caught by an outcome-only check?
Precondition violation	Yes — rejected on the spot	Yes	Yes (it's obvious)
Latent failure	No — runs clean	Yes, silently	Often no
Irreversible failure	No	Yes, and unrecoverable	Often no
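
The executor's three-way sort can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not SIMMER's implementation: the `Action` class, the `execute` function, and every fact name (`oversalted`, `sauce_burnt`, and so on) are hypothetical, and the real world model covers 77 actions and 262 objects rather than the handful here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    """A hypothetical symbolic action: facts it needs, adds, and removes."""
    name: str
    requires: frozenset = frozenset()
    adds: frozenset = frozenset()
    removes: frozenset = frozenset()


# Assumed toy rules: facts that silently ruin the goal once they become true.
HAZARD_FACTS = frozenset({"oversalted"})
IRREVERSIBLE_FACTS = frozenset({"sauce_burnt"})


def execute(plan, initial_state):
    """Step the world state forward one action at a time and log failures."""
    state = set(initial_state)
    events = []
    for step, action in enumerate(plan):
        missing = action.requires - state
        if missing:
            # Loud failure: the world model rejects the step and execution halts.
            events.append((step, action.name, "precondition_violation"))
            break
        new_facts = action.adds - state
        state = (state - action.removes) | action.adds
        # Silent failures: the step runs, so we only log it and keep going.
        if new_facts & IRREVERSIBLE_FACTS:
            events.append((step, action.name, "irreversible_failure"))
        elif new_facts & HAZARD_FACTS:
            events.append((step, action.name, "latent_failure"))
    return state, events


add_sauce = Action("add_sauce", adds=frozenset({"sauce_in_pot"}))
salt = Action("salt", requires=frozenset({"sauce_in_pot"}), adds=frozenset({"salted"}))
salt_again = Action("salt_again", requires=frozenset({"salted"}), adds=frozenset({"oversalted"}))
plate = Action("plate", requires=frozenset({"sauce_in_pot"}), adds=frozenset({"plated"}))
saute = Action("saute", requires=frozenset({"food_in_pan"}), adds=frozenset({"sauteed"}))

# The double-salted plan runs to completion without any step being rejected.
final_state, events = execute([add_sauce, salt, salt_again, plate], set())
outcome_only_pass = "plated" in final_state  # a naive "did it finish" check

# Sauteing in an empty pan is rejected on the spot.
_, loud_events = execute([saute], set())

print(events)
print(outcome_only_pass)
print(loud_events)
```

In this sketch the double-salted plan passes the outcome-only check on `plated`, while the step-level log still records a latent failure at the second salting. That is the gap the table's last column describes.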

The fix the paper studies is counterfactual foresight: before committing a step, the planner pauses to simulate its effect on the world state — "if I thaw this now, what becomes true at the end?" — and checks the predicted end state against the goal. The silent error stops being silent, because the simulated end state shows the dish is ruined before the cook ever commits the step. This is the same lesson the evals modules press: grading only the final plate can hide the mid-plan mistakes that compound into a quiet failure.
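
One way to picture the check is as a with-versus-without simulation. The sketch below is an assumption about how such a check could be wired up, not the paper's method: before committing a step, it simulates the rest of the plan both with and without that step, and vetoes the step only when skipping it would let the goal succeed. The names `Action`, `simulate`, `run_with_foresight`, and `goal_met`, and all the fact names, are hypothetical.

```python
from typing import NamedTuple


class Action(NamedTuple):
    """A hypothetical symbolic action described by the facts it adds and removes."""
    name: str
    adds: frozenset = frozenset()
    removes: frozenset = frozenset()


def simulate(state, actions):
    """Predict the end state without touching the real one (frozensets are immutable)."""
    for action in actions:
        state = (state - action.removes) | action.adds
    return state


def run_with_foresight(plan, initial_state, goal_met):
    """Commit each step only if it does not turn a reachable goal into a failed one."""
    state = frozenset(initial_state)
    committed, vetoed = [], []
    for i, action in enumerate(plan):
        rest = plan[i + 1:]
        end_with = simulate(state, [action] + rest)
        end_without = simulate(state, rest)
        if goal_met(end_without) and not goal_met(end_with):
            # The counterfactual shows this step is what breaks the goal.
            vetoed.append(action.name)
            continue
        state = simulate(state, [action])
        committed.append(action.name)
    return state, committed, vetoed


def goal_met(state):
    # Assumed toy goal: seared fish on the plate, and the sorbet still frozen.
    return {"fish_seared", "plated", "sorbet_frozen"} <= state


plan = [
    Action("sear_fish", adds=frozenset({"fish_seared"})),
    Action("thaw_sorbet", adds=frozenset({"sorbet_thawed"}),
           removes=frozenset({"sorbet_frozen"})),
    Action("plate", adds=frozenset({"plated"})),
]

end_state, committed, vetoed = run_with_foresight(plan, {"sorbet_frozen"}, goal_met)
print(committed, vetoed, goal_met(end_state))
```

Comparing the two simulated end states is a design choice made here so that blame lands on the step that actually breaks the goal: `sear_fish` is not vetoed just because a later step in the same plan ruins the dish.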

Put numbers on it. Imagine 100 plans from a frontier model: only about 17 come back error-free, and as many as 56 of the 100 hide a latent failure that never threw. Turn on counterfactual foresight and that 56 drops by up to 72%, to roughly 16 plans (an illustrative figure, since both percentages are upper bounds), while irreversible failures fall by up to 75%. The headline isn't that models can't plan; it's that most of their broken plans look fine until you simulate the world they act on.
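
The arithmetic behind that figure, written out as a short Python check. It simply multiplies the two reported upper bounds, which assumes they apply to the same model and setting; the source does not say that they do, so treat the result as a back-of-envelope illustration. The variable names are illustrative.

```python
plans = 100
latent_share = 0.56         # "up to 56%" of plans hide a latent failure
foresight_reduction = 0.72  # "up to 72%" fewer latent failures with foresight

latent_before = plans * latent_share
latent_after = latent_before * (1 - foresight_reduction)

print(f"latent failures before foresight: {latent_before:.0f} of {plans}")
print(f"latent failures after foresight:  {latent_after:.1f} of {plans}")
```

The second figure comes out at about 15.7, which rounds to 16 plans. No equivalent count is computed for irreversible failures because the source gives only the reduction (up to 75%), not a base rate.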

Goes deeper in: AI Agents → Planning & Reflection → When to Pause and Observe

Related explainers

AdaPlanBench — replanning under hidden constraints — the sibling planning benchmark: AdaPlanBench tests whether a model replans when a hidden rule bites; SIMMER tests whether it notices its plan silently broke
WeaveBench — trajectory-aware grading — the same insight one level up: grade the whole trajectory, because an outcome that "looks like a pass" can hide a mid-run shortcut
FutureSim — harness-level agent eval — the broader move to evaluating agents on what they do across a run, not a single-shot answer

