The news. On June 17, 2026, Cisco Foundation AI released FAPO: Fully Autonomous Prompt Optimization. The idea is to let Claude Code optimize an LLM pipeline inside a standardized codebase: it evaluates the pipeline, inspects the intermediate steps, diagnoses where a failure originates, proposes a scoped change, and validates the variant against a score function, trying prompt edits first and rewriting chain structure only when attribution pins down a structural bottleneck. FAPO beat the GEPA baseline in 15 of 18 model-benchmark comparisons, with a mean gain of +14.1 percentage points across its 11 statistically significant wins.

Picture a mechanic facing an engine that runs rough. The amateur move is to swap a part, take it for a spin, swap another, spin again — mutate and hope. The professional plugs in a scanner first, reads the trouble code, and learns which stage is at fault before touching a single bolt — then makes the cheapest fix that addresses that stage. That gap — diagnose first vs. mutate blindly — is exactly the gap between two kinds of evaluator-optimizer loop. A multi-step LLM pipeline is the engine: several prompt stages chained together, each feeding the next, so when the final answer scores low the failure could be hiding in any of them.

FAPO is the scanner-first mechanic. It runs five steps in a loop — evaluate the pipeline, inspect the intermediate outputs, diagnose the failure source (an automatic error analysis the optimizing agent runs over the pipeline's intermediate outputs), propose a scoped edit, and validate against the score function. The optimizing agent is Claude Code, working over the pipeline's own codebase — it edits the real prompts and code, not an abstract spec. Crucially, it escalates in order: tweak a prompt first, and rewrite the chain's structure only when attribution says the bottleneck is structural. In the authors' words:

"It first tries prompt edits and, only when prompt optimization appears insufficient, changes chain structure within the permitted scope when attribution identifies a structural bottleneck."

This is the agent-engineering discipline of treating prompts as code, turned into an automatic optimization loop that you could point at a shadow or A/B harness.

1. Evaluate: score the pipeline
2. Inspect: read each stage
3. Diagnose: attribute the failure
4. Scoped edit: prompt → structure
5. Validate: keep if score ↑
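
The five steps above can be sketched as a small loop. This is a minimal illustration, not FAPO's actual implementation: the `Pipeline` class, the function signatures, the `prompt_budget` threshold, and the toy scoring rule are all hypothetical, and in FAPO the diagnosis and edits come from Claude Code rather than hand-written functions. What the sketch does show is the escalation rule from the quote: a structural edit is attempted only when attribution says "structure" and prompt edits have already failed to help.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass(frozen=True)
class Pipeline:
    """Toy stand-in for a multi-stage LLM pipeline: one prompt string per stage."""
    stages: Tuple[str, ...]


# All of these signatures are hypothetical, chosen only for this sketch.
ScoreFn = Callable[[Pipeline], float]
DiagnoseFn = Callable[[Pipeline], Tuple[int, str]]  # (stage index, "prompt" or "structure")
EditFn = Callable[[Pipeline, int], Pipeline]


def optimize(pipeline: Pipeline, score: ScoreFn, diagnose: DiagnoseFn,
             edit_prompt: EditFn, edit_structure: EditFn,
             rounds: int = 5, prompt_budget: int = 2):
    best, best_score = pipeline, score(pipeline)          # Evaluate
    failed_prompt_edits = 0
    for _ in range(rounds):
        stage, kind = diagnose(best)                      # Inspect + Diagnose
        # Escalate only if attribution says "structure" AND prompt edits
        # have already been tried without improving the score.
        use_structure = kind == "structure" and failed_prompt_edits >= prompt_budget
        edit = edit_structure if use_structure else edit_prompt
        candidate = edit(best, stage)                     # Scoped edit
        candidate_score = score(candidate)                # Validate
        if candidate_score > best_score:
            best, best_score = candidate, candidate_score
            failed_prompt_edits = 0
        elif not use_structure:
            failed_prompt_edits += 1
    return best, best_score


# --- Toy demo (made-up scoring: only a "Verify" stage raises the score) ---
def has_verifier(p: Pipeline) -> bool:
    return any(s.startswith("Verify") for s in p.stages)

def toy_score(p: Pipeline) -> float:
    return 0.75 if has_verifier(p) else 0.25

def toy_diagnose(p: Pipeline) -> Tuple[int, str]:
    return (len(p.stages) - 1, "prompt") if has_verifier(p) else (0, "structure")

def toy_edit_prompt(p: Pipeline, i: int) -> Pipeline:
    stages = list(p.stages)
    stages[i] += " Think step by step."
    return Pipeline(tuple(stages))

def toy_edit_structure(p: Pipeline, i: int) -> Pipeline:
    stages = list(p.stages)
    stages.insert(i + 1, "Verify the claim against the evidence.")
    return Pipeline(tuple(stages))

start = Pipeline(("Retrieve evidence for the claim.", "Answer SUPPORTED or REFUTED."))
best, best_score = optimize(start, toy_score, toy_diagnose,
                            toy_edit_prompt, toy_edit_structure)
print(best_score, len(best.stages))
```

In the toy run, two prompt edits fail to move the score, so the third round escalates to a structural edit that inserts a verification stage. Later prompt edits that do not improve the score are discarded, which is the "keep if score rises" step.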

Why does diagnosing first beat mutating blindly? The reported numbers suggest an answer. Across the full suite of 18 model-benchmark comparisons, FAPO won 15. Averaged over just its 11 statistically significant wins, the gain is +14.1 percentage points over GEPA. Now isolate the subset where FAPO actually changed chain structure, the six comparisons on HoVer and IFBench, and the picture sharpens: it won all six, with a +33.8 percentage-point mean. The biggest gains land exactly where structural editing was the right diagnosis, which is the escalation ladder paying off. Even on security tasks (CTIBench-RCM) the scoped edits help: +4.0pp on GPT-5, +7.1pp on Foundation-Sec-8B-Instruct, and +2.0pp on the reasoning variant.

| Approach | How it changes the pipeline | What guides the change | Reported result |
| --- | --- | --- | --- |
| GEPA (baseline) | Mutates prompts, keeps higher-scoring variants | A score signal alone, with no explicit failure attribution | The baseline FAPO is measured against |
| FAPO (failure-attribution-gated) | Scoped prompt edit first; chain-structure rewrite only on a structural diagnosis | Per-step failure attribution from inspecting intermediate outputs | Won 15/18 comparisons; +14.1pp mean over 11 significant wins |
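
To make "per-step failure attribution" concrete, here is a minimal sketch of one way to blame a stage using only its intermediate outputs. It is an assumption-laden stand-in: in FAPO the attribution is an error analysis performed by the optimizing agent, whereas this sketch uses hand-written rule checks. The stage names, the checks, and the example traces are all hypothetical.

```python
from collections import Counter
from typing import Callable, Dict, List, Optional

Trace = Dict[str, str]         # stage name -> that stage's intermediate output
Check = Callable[[str], bool]  # True if a stage's output looks acceptable


def attribute_failure(trace: Trace, stage_order: List[str],
                      checks: Dict[str, Check]) -> Optional[str]:
    """Return the earliest stage whose output fails its check, else None."""
    for name in stage_order:
        if not checks[name](trace.get(name, "")):
            return name
    return None


def tally_failures(traces: List[Trace], stage_order: List[str],
                   checks: Dict[str, Check]) -> Counter:
    """Count how often each stage is the earliest point of failure."""
    counts: Counter = Counter()
    for trace in traces:
        stage = attribute_failure(trace, stage_order, checks)
        if stage is not None:
            counts[stage] += 1
    return counts


# Hypothetical three-stage pipeline and rule checks (illustrative only).
stage_order = ["retrieve", "summarize", "answer"]
checks: Dict[str, Check] = {
    "retrieve": lambda out: bool(out.strip()),
    "summarize": lambda out: "evidence" in out.lower(),
    "answer": lambda out: out.strip() in {"SUPPORTED", "REFUTED"},
}

# Made-up traces, not data from the paper.
traces = [
    {"retrieve": "", "summarize": "no evidence found", "answer": "REFUTED"},
    {"retrieve": "doc A", "summarize": "Evidence: doc A says X", "answer": "maybe"},
    {"retrieve": "   ", "summarize": "Evidence: none", "answer": "SUPPORTED"},
    {"retrieve": "doc B", "summarize": "Evidence: doc B says Y", "answer": "SUPPORTED"},
]

counts = tally_failures(traces, stage_order, checks)
worst_stage = counts.most_common(1)[0][0]
print(counts, worst_stage)
```

A score-only optimizer sees just the final number for each trace. The tally adds the missing piece of information: which stage to edit first. Here that is `retrieve`, which fails earliest in two of the three failing traces.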

