What is failure-attribution-gated prompt optimization?

It's optimizing a multi-step LLM pipeline by first diagnosing which stage caused a failure, then making the smallest edit that addresses that stage. FAPO (arXiv 2606.19605) runs a loop — evaluate, inspect intermediate outputs, attribute the failure, propose a scoped change, validate against a score function — with Claude Code as the optimizing agent. It tries prompt edits first and only rewrites the chain's structure when attribution identifies a structural bottleneck.

A multi-step pipeline (retrieve → reason → answer) often fails at a specific stage or in the interaction between stages, but a blind optimizer can't tell where, so it wastes trials mutating the wrong thing. By attributing failure first, FAPO spends its edits where they count, and the payoff is largest in the six cases where it changed chain structure (a +33.8pp mean). It turns prompt-as-code pipeline tuning into a measurable, automatable step.

How does it relate to GEPA?

GEPA is the baseline: a Genetic-Pareto prompt optimizer that mutates prompts and keeps higher-scoring variants — evolutionary search with no explicit diagnosis. FAPO adds failure attribution and a prompt-edit-first, restructure-only-when-needed ladder. Across 18 model-benchmark comparisons FAPO beat GEPA in 15, with a +14.1pp mean over its 11 statistically significant wins.

FAPO auto-optimizes multi-step LLM pipelines, beating GEPA in 15 of 18 model-benchmark comparisons: failure-attribution-gated prompt optimization

TL;DR

What is it: FAPO (Fully Autonomous Prompt Optimization, arXiv 2606.19605) is a new paper in which Claude Code optimizes a multi-step LLM pipeline end to end. This article focuses on its method, failure-attribution-gated prompt optimization: diagnosing where a pipeline fails before editing it.
Why it’s needed: Multi-step LLM pipelines (retrieve → reason → answer) often fail at a specific stage, or in the handoffs between stages, and it is hard to tell where. FAPO locates the weak point and makes the smallest fix that addresses it, which turns pipeline optimization from guesswork into a loop you can run against production evals and a prompts-as-code deployment.
vs previous: GEPA, the previous approach, mutates prompts and keeps whatever scores higher: an evolutionary search with no diagnosis. FAPO adds failure attribution plus a prompt-edit-first, restructure-only-when-needed ladder, which avoids the wasted trials of blind mutation.

Jargon

FAPO: Fully Autonomous Prompt Optimization. A system that lets an LLM agent — here, Claude Code — optimize another LLM pipeline by itself: evaluate, diagnose, edit, validate, in a loop. arXiv 2606.19605.
GEPA: The baseline FAPO is measured against. A Genetic-Pareto prompt optimizer that mutates prompts and keeps the higher-scoring variants — evolutionary search, with no explicit failure diagnosis.
Multi-step LLM pipeline: A task split across several chained prompt stages (e.g. retrieve → reason → format), each feeding the next; the final score depends on every stage doing its part.
Failure attribution: Figuring out which stage of a multi-step pipeline actually caused a wrong answer — the diagnosis step that tells the optimizer where to edit.
Chain structure: The shape of the pipeline itself — how many stages, how they're wired — as opposed to the text inside any one prompt. Changing it is the bigger, riskier edit.
Scoped edit: A change confined to a permitted boundary — one prompt, or one structural rewrite — so the optimizer never rewrites the whole system at once.
Score function: The objective metric each candidate variant is validated against (task accuracy on a benchmark); the loop keeps a change only if the score improves.
Percentage point (pp): The absolute gap between two percentages — going 70% → 84% is +14pp, not +20%. FAPO's headline gain over GEPA is reported in pp.
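
To make the percentage-point distinction concrete, here is a minimal Python sketch of the arithmetic in the last entry. The two function names are made up for illustration; only the 70% → 84% example comes from the text above.

```python
def percentage_point_gain(before_pct: float, after_pct: float) -> float:
    """Absolute gap between two percentages, in percentage points."""
    return after_pct - before_pct


def relative_gain_pct(before_pct: float, after_pct: float) -> float:
    """Relative change, expressed as a percent of the starting value."""
    return (after_pct - before_pct) / before_pct * 100.0


before, after = 70.0, 84.0
print(f"{percentage_point_gain(before, after):+.1f}pp")  # +14.0pp
print(f"{relative_gain_pct(before, after):+.1f}%")       # +20.0%
```

The same accuracy change reads as +14pp in absolute terms and +20% in relative terms, which is why the unit matters when comparing reported gains.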

The news. On June 17, 2026, Cisco Foundation AI released FAPO: Fully Autonomous Prompt Optimization. The idea is to let Claude Code optimize an LLM pipeline inside a standardized codebase: it evaluates the pipeline, inspects the intermediate steps, diagnoses where a failure originates, proposes a scoped change, and validates the variant against a score function. It tries prompt edits first and rewrites chain structure only when attribution pins down a structural bottleneck. FAPO beat the GEPA baseline in 15 of 18 model-benchmark comparisons, with a +14.1-percentage-point mean over its 11 statistically significant wins.

Picture a mechanic facing an engine that runs rough. The amateur move is to swap a part, take it for a spin, swap another, spin again — mutate and hope. The professional plugs in a scanner first, reads the trouble code, and learns which stage is at fault before touching a single bolt — then makes the cheapest fix that addresses that stage. That gap — diagnose first vs. mutate blindly — is exactly the gap between two kinds of evaluator-optimizer loop. A multi-step LLM pipeline is the engine: several prompt stages chained together, each feeding the next, so when the final answer scores low the failure could be hiding in any of them.
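
The "failure could be hiding in any stage" problem can be sketched in a few lines of Python. This is a toy illustration, not the paper's method: the `Stage` class, the per-stage `looks_ok` checks, and the lambda stages are all hypothetical stand-ins, and FAPO's real attribution is an analysis the optimizing agent runs over intermediate outputs rather than a fixed rule. The point is only that keeping a trace of every stage's output is what makes a diagnosis possible.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Stage:
    name: str
    run: Callable[[str], str]        # stand-in for an LLM call
    looks_ok: Callable[[str], bool]  # hypothetical per-stage sanity check


def run_with_trace(stages: List[Stage], question: str) -> List[Tuple[str, str]]:
    """Run the chain and record every intermediate output, not just the last."""
    trace = []
    value = question
    for stage in stages:
        value = stage.run(value)
        trace.append((stage.name, value))
    return trace


def first_failing_stage(stages: List[Stage],
                        trace: List[Tuple[str, str]]) -> Optional[str]:
    """Attribute the failure to the earliest stage whose output fails its check."""
    for stage, (name, output) in zip(stages, trace):
        if not stage.looks_ok(output):
            return name
    return None


# Toy chain: the retriever finds nothing, so everything downstream degrades.
stages = [
    Stage("retrieve", lambda q: "", lambda out: len(out) > 0),
    Stage("reason", lambda ctx: f"reasoning over: {ctx}",
          lambda out: out.startswith("reasoning")),
    Stage("answer", lambda r: "unknown", lambda out: out != "unknown"),
]

trace = run_with_trace(stages, "Who wrote Dune?")
print(first_failing_stage(stages, trace))  # retrieve
```

Looking only at the final answer ("unknown") would suggest editing the answer prompt; the trace shows the retrieval stage is where the failure starts.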

FAPO is the scanner-first mechanic. It runs five steps in a loop — evaluate the pipeline, inspect the intermediate outputs, diagnose the failure source (an automatic error analysis the optimizing agent runs over the pipeline's intermediate outputs), propose a scoped edit, and validate against the score function. The optimizing agent is Claude Code, working over the pipeline's own codebase — it edits the real prompts and code, not an abstract spec. Crucially, it escalates in order: tweak a prompt first, and rewrite the chain's structure only when attribution says the bottleneck is structural. In the authors' words:

"It first tries prompt edits and, only when prompt optimization appears insufficient, changes chain structure within the permitted scope when attribution identifies a structural bottleneck."

This is the agent-engineering discipline of treating prompts as code, turned into an automatic optimization loop that you could point at a shadow or A/B test harness.
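
The escalation order described above (prompt edit first, structural rewrite only on a structural diagnosis, keep a change only if the score improves) can be sketched as a small loop. Everything here is an assumption made for illustration: the pipeline is reduced to a tuple of prompt strings, and `optimize`, `toy_score`, `toy_attribute`, `toy_edit_prompt`, and `toy_restructure` are hypothetical stand-ins for work that, in FAPO, Claude Code performs on a real codebase.

```python
from typing import Callable, Optional, Tuple

Pipeline = Tuple[str, ...]    # toy representation: one prompt per stage
Diagnosis = Tuple[int, bool]  # (stage index, is the bottleneck structural?)


def optimize(pipeline: Pipeline,
             score: Callable[[Pipeline], float],
             attribute: Callable[[Pipeline], Optional[Diagnosis]],
             edit_prompt: Callable[[Pipeline, int], Pipeline],
             restructure: Callable[[Pipeline, int], Pipeline],
             max_rounds: int = 5) -> Tuple[Pipeline, float]:
    best, best_score = pipeline, score(pipeline)
    for _ in range(max_rounds):
        diagnosis = attribute(best)
        if diagnosis is None:          # nothing left to blame
            break
        stage, structural = diagnosis
        # Ladder: always try the cheap prompt edit first; a structural
        # rewrite is only allowed when attribution says it is needed.
        proposals = [edit_prompt]
        if structural:
            proposals.append(restructure)
        for propose in proposals:
            candidate = propose(best, stage)
            candidate_score = score(candidate)
            if candidate_score > best_score:   # validate before keeping
                best, best_score = candidate, candidate_score
                break
        else:
            break                      # no proposal validated; stop
    return best, best_score


# Toy task: score rewards an evidence instruction and a third (verify) stage.
def toy_score(p: Pipeline) -> float:
    has_evidence = any("cite evidence" in prompt for prompt in p)
    return 0.5 * has_evidence + 0.5 * (len(p) >= 3)


def toy_attribute(p: Pipeline) -> Optional[Diagnosis]:
    if not any("cite evidence" in prompt for prompt in p):
        return (0, False)              # prompt-level problem in stage 0
    if len(p) < 3:
        return (len(p) - 1, True)      # structural: no verification stage
    return None


def toy_edit_prompt(p: Pipeline, i: int) -> Pipeline:
    return p[:i] + (p[i] + " cite evidence",) + p[i + 1:]


def toy_restructure(p: Pipeline, i: int) -> Pipeline:
    return p[:i + 1] + ("verify the draft answer",) + p[i + 1:]


final, final_score = optimize(("retrieve passages", "answer the question"),
                              toy_score, toy_attribute,
                              toy_edit_prompt, toy_restructure)
print(final_score)  # 1.0
print(final)
```

In this toy run the first round is fixed by a prompt edit alone; in the second round the prompt edit fails validation, and only then is the extra stage added, mirroring the "prompt edits first, structure only when needed" order.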

Why does diagnosing first beat mutating blindly? Across the 18 model-benchmark comparisons, FAPO won 15. Averaged over only the 11 statistically significant wins, the gain is +14.1 percentage points over GEPA. Isolate the subset where FAPO actually changed chain structure (six cases, including HoVer and IFBench) and the picture sharpens: it won all six, with a +33.8 percentage-point mean. The biggest gains land exactly where structural editing was the right diagnosis, which is the escalation ladder paying off. Even on security tasks (CTIBench-RCM) the scoped edits help: +4.0pp on GPT-5, +7.1pp on Foundation-Sec-8B-Instruct, and +2.0pp on the reasoning variant.

Approach	How it changes the pipeline	What guides the change	Reported result
GEPA (baseline)	Mutates prompts, keeps higher-scoring variants	A score signal alone — no explicit failure attribution	The baseline FAPO is measured against
FAPO (failure-attribution-gated)	Scoped prompt edit first; chain-structure rewrite only on a structural diagnosis	Per-step failure attribution from inspecting intermediate outputs	Won 15/18 comparisons; +14.1pp mean over 11 significant wins


Related explainers

Crafter — multi-agent refinement harness with a directive critic — an iterate-and-critique cousin; FAPO's "critic" is failure attribution
Maestro — RL orchestrator over frozen experts — also improves a pipeline without touching the base model's weights
Agent-harness scaling law: feedback quality predicts success — FAPO's edge comes from higher-quality attribution feedback, not more compute
