The news. On June 12, 2026, researchers released SIMMER, a planning benchmark built on a human-curated symbolic kitchen world model with 77 actions, 262 objects, and about 46,800 realistic interactions. Driving six frontier LLMs through it, they found that only 17% of plans are error-free, and up to 56% contain a "latent failure": a step that executes without halting but silently undermines the goal. A state-machine executor validates each step against the world model, and adding explicit counterfactual foresight sharply cuts the silent errors.

Picture a line cook working a recipe. Every step is physically doable — chop, sear, simmer, plate — so nothing ever throws an error; the timer dings and the plate goes out. But somewhere in the middle the cook salted twice, or thawed something that had to stay frozen. The meal is quietly ruined and no alarm ever sounded. That is a latent failure: the plan ran to completion, every step "succeeded," and the goal still failed — which is exactly the kind of error a single-shot plan is most likely to bury.

SIMMER catches it by giving the kitchen a symbolic world model — an explicit record of the real state after every action: what's thawed, what's in each pot, what can no longer be undone. A state-machine executor then walks the plan one step at a time and sorts what goes wrong into three kinds. Only the first kind — a precondition violation, like sautéing in an empty pan — fails loudly; the other two can slip past a check that only asks "did it crash" or "did the final answer match."

| Failure kind | Does it halt execution? | Does the goal fail? | Caught by an outcome-only check? |
| --- | --- | --- | --- |
| Precondition violation | Yes, rejected on the spot | Yes | Yes (it's obvious) |
| Latent failure | No, runs clean | Yes, silently | Often no |
| Irreversible failure | No | Yes, and unrecoverable | Often no |
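
To make the three failure kinds concrete, here is a minimal sketch of a step-by-step executor over a toy kitchen state. It is not SIMMER's implementation: the `Action` class, the `run_plan` function, and every fact name (`pan_has_oil`, `oversalted`, and so on) are hypothetical, and the world state is assumed to be a flat dictionary of facts.

```python
from dataclasses import dataclass

@dataclass
class Action:
    """Hypothetical action schema: facts required before, facts set after."""
    name: str
    preconditions: dict
    effects: dict
    irreversible: bool = False  # assumption: the modeller flags undoable-or-not

def run_plan(initial_state, plan, goal):
    """Walk the plan one step at a time and classify the outcome."""
    state = dict(initial_state)  # work on a copy of the world state
    broke_irreversibly = False
    for index, action in enumerate(plan):
        # Precondition violation: the step is rejected and execution halts.
        if any(state.get(k) != v for k, v in action.preconditions.items()):
            return ("precondition_violation", index)
        state.update(action.effects)
        # Note an irreversible step whose effects contradict the goal.
        if action.irreversible and any(
            k in goal and goal[k] != v for k, v in action.effects.items()
        ):
            broke_irreversibly = True
    if all(state.get(k) == v for k, v in goal.items()):
        return ("success", None)
    # The plan ran clean but the goal failed: a silent failure.
    kind = "irreversible_failure" if broke_irreversibly else "latent_failure"
    return (kind, None)

initial = {"pan_has_oil": False, "seared": False, "oversalted": False,
           "fish_frozen": True, "plated": False}
goal = {"plated": True, "oversalted": False, "fish_frozen": True}

add_oil = Action("add_oil", {}, {"pan_has_oil": True})
saute = Action("saute", {"pan_has_oil": True}, {"seared": True})
salt_again = Action("salt_again", {}, {"oversalted": True})
thaw_fish = Action("thaw_fish", {"fish_frozen": True},
                   {"fish_frozen": False}, irreversible=True)
plate = Action("plate", {"seared": True}, {"plated": True})

loud = run_plan(initial, [saute, plate], goal)
latent = run_plan(initial, [add_oil, saute, salt_again, plate], goal)
irreversible = run_plan(initial, [add_oil, saute, thaw_fish, plate], goal)
clean = run_plan(initial, [add_oil, saute, plate], goal)
```

Only `loud` stops mid-plan. The `latent` and `irreversible` plans both reach `plate` without complaint, so a check that only asks whether execution crashed would miss them.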

The fix the paper studies is counterfactual foresight: before committing a step, the planner pauses to simulate its effect on the world state — "if I thaw this now, what becomes true at the end?" — and checks the predicted end state against the goal. The silent error stops being silent, because the simulated end state shows the dish is ruined before the cook ever commits the step. This is the same lesson the evals modules press: grading only the final plate can hide the mid-plan mistakes that compound into a quiet failure.
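
The foresight check described above can be sketched as a simulation on a copy of the state. This is an illustration under stated assumptions, not the paper's method: steps are assumed to be `(preconditions, effects)` pairs of fact dictionaries, and `simulate` and `foresight_check` are hypothetical names.

```python
def simulate(state, steps):
    """Apply steps to a copy of the state.

    Returns the predicted end state, or None if a step's
    preconditions do not hold at the point it would run.
    """
    predicted = dict(state)  # never mutate the real world state
    for preconditions, effects in steps:
        if any(predicted.get(k) != v for k, v in preconditions.items()):
            return None
        predicted.update(effects)
    return predicted

def foresight_check(state, candidate, remaining, goal):
    """True if committing `candidate`, then `remaining`, still reaches the goal."""
    end_state = simulate(state, [candidate] + list(remaining))
    if end_state is None:
        return False
    return all(end_state.get(k) == v for k, v in goal.items())

state = {"fish_frozen": True, "plated": False}
goal = {"fish_frozen": True, "plated": True}

thaw = ({"fish_frozen": True}, {"fish_frozen": False})
plate = ({}, {"plated": True})

thaw_is_safe = foresight_check(state, thaw, [plate], goal)
plate_is_safe = foresight_check(state, plate, [], goal)
```

Here `thaw_is_safe` is `False` even though thawing is perfectly executable, because the predicted end state contradicts the goal. The real `state` is untouched, which is the point: the mistake is found before the step is committed.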

Put numbers on it. Imagine 100 plans from a frontier model: only about 17 come back error-free, and as many as 56 of the 100 hide a latent failure that never threw an error. Turn on counterfactual foresight and that 56 drops by up to 72%, to roughly 16 plans (an illustrative figure that combines both upper bounds: 56 × 0.28 ≈ 16), while irreversible failures fall by up to 75%. The headline isn't that models can't plan; it's that most of their broken plans look fine until you simulate the world they act on.
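
The arithmetic behind that illustration, as a small sketch. The function name is hypothetical, and the calculation assumes the maximum latent-failure rate and the maximum reduction apply to the same set of plans, which the reported "up to" figures do not guarantee.

```python
def remaining_after_reduction(count, reduction):
    """Plans still affected after a fractional reduction (0.72 means 72%)."""
    return count * (1 - reduction)

latent_before = 56  # upper bound: latent failures per 100 plans
latent_after = remaining_after_reduction(latent_before, 0.72)

print(round(latent_after, 2))  # 15.68
print(round(latent_after))     # 16
```

No comparable figure is computed for irreversible failures, because the source gives their reduction (up to 75%) but not their baseline rate.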

