The news. On June 17, 2026, Cisco Foundation AI released FAPO: Fully Autonomous Prompt Optimization. The idea is to let Claude Code optimize an LLM pipeline inside a standardized codebase: it evaluates the pipeline, inspects the intermediate steps, diagnoses where a failure originates, proposes a scoped change, and validates the variant against a score function. It tries prompt edits first and rewrites chain structure only when attribution pins down a structural bottleneck. FAPO beat the GEPA baseline in 15 of 18 model-benchmark comparisons, with a mean gain of +14.1 percentage points over its 11 statistically significant wins.
Picture a mechanic facing an engine that runs rough. The amateur move is to swap a part, take it for a spin, swap another, spin again — mutate and hope. The professional plugs in a scanner first, reads the trouble code, and learns which stage is at fault before touching a single bolt — then makes the cheapest fix that addresses that stage. That gap — diagnose first vs. mutate blindly — is exactly the gap between two kinds of evaluator-optimizer loop. A multi-step LLM pipeline is the engine: several prompt stages chained together, each feeding the next, so when the final answer scores low the failure could be hiding in any of them.
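To make the "engine" concrete, here is a minimal sketch of a chained pipeline that records every stage's intermediate output, which is the raw material a diagnose-first optimizer would inspect. Everything in it is an illustrative assumption: `Trace`, `run_pipeline`, and the three toy stages are hypothetical names, and the stages are plain string functions standing in for prompted model calls.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# Hypothetical type: (stage name, function standing in for one prompted LLM call).
Stage = Tuple[str, Callable[[str], str]]


@dataclass
class Trace:
    """Per-stage outputs plus the final answer of one pipeline run."""
    steps: List[Tuple[str, str]] = field(default_factory=list)
    final: str = ""


def run_pipeline(stages: List[Stage], question: str) -> Trace:
    """Run the stages in order, feeding each output into the next stage."""
    trace = Trace()
    value = question
    for name, fn in stages:
        value = fn(value)
        trace.steps.append((name, value))  # keep the intermediate output
    trace.final = value
    return trace


# Toy stages (assumptions for illustration only, not real model calls).
stages: List[Stage] = [
    ("retrieve", lambda q: q + " | evidence: (none found)"),
    ("reason", lambda ctx: ctx + " | verdict: unsupported"),
    ("format", lambda r: r.split("verdict: ")[-1].upper()),
]

trace = run_pipeline(stages, "Is claim X supported?")
for name, output in trace.steps:
    print(f"{name}: {output}")
```

With only `trace.final` in hand, a low score says nothing about which stage went wrong. With `trace.steps`, the empty retrieval result is visible as the likely origin of the bad verdict.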
FAPO is the scanner-first mechanic. It runs five steps in a loop — evaluate the pipeline, inspect the intermediate outputs, diagnose the failure source (an automatic error analysis the optimizing agent runs over the pipeline's intermediate outputs), propose a scoped edit, and validate against the score function. The optimizing agent is Claude Code, working over the pipeline's own codebase — it edits the real prompts and code, not an abstract spec. Crucially, it escalates in order: tweak a prompt first, and rewrite the chain's structure only when attribution says the bottleneck is structural. In the authors' words:
"It first tries prompt edits and, only when prompt optimization appears insufficient, changes chain structure within the permitted scope when attribution identifies a structural bottleneck."
In other words, FAPO takes the agent-engineering discipline of treating prompts as code and turns it into an automatic optimization loop, one you could point at a shadow or A/B harness.
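The loop and its escalation order can be sketched in a few lines. This is a toy illustration under stated assumptions, not FAPO's implementation: the `optimize` function, the `Pipeline` type, and the four callbacks (`score`, `diagnose`, `edit_prompt`, `edit_structure`) are all hypothetical stand-ins for what the optimizing agent does inside a real codebase.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical representation: stage name -> prompt text, in chain order.
Pipeline = Dict[str, str]
# Hypothetical diagnosis: (stage at fault, "prompt" or "structural"), or None.
Finding = Optional[Tuple[str, str]]


def optimize(
    pipeline: Pipeline,
    score: Callable[[Pipeline], float],
    diagnose: Callable[[Pipeline], Finding],
    edit_prompt: Callable[[Pipeline, str], Pipeline],
    edit_structure: Callable[[Pipeline, str], Pipeline],
    max_rounds: int = 5,
) -> Tuple[Pipeline, float, List[str]]:
    best, best_score = dict(pipeline), score(pipeline)  # evaluate
    log: List[str] = []
    for _ in range(max_rounds):
        finding = diagnose(best)  # inspect intermediates, attribute the failure
        if finding is None:
            break
        stage, kind = finding
        # Escalation ladder: always try the cheap prompt edit first; allow a
        # structure rewrite only when the diagnosis says "structural".
        ladder = [("prompt", edit_prompt)]
        if kind == "structural":
            ladder.append(("structure", edit_structure))
        accepted = False
        for label, edit in ladder:
            variant = edit(best, stage)      # propose a scoped change
            variant_score = score(variant)   # validate against the score function
            if variant_score > best_score:
                best, best_score = variant, variant_score
                log.append(f"{label}:{stage}")
                accepted = True
                break
        if not accepted:
            break
    return best, best_score, log


# Toy problem (all made up): citing sources is worth 20 points,
# a separate "verify" stage is worth 30 more.
def toy_score(p: Pipeline) -> float:
    points = 50 + (20 if "cite" in p["answer"] else 0) + (30 if "verify" in p else 0)
    return float(points)


def toy_diagnose(p: Pipeline) -> Finding:
    if "cite" not in p["answer"]:
        return ("answer", "prompt")
    if "verify" not in p:
        return ("answer", "structural")
    return None


def toy_edit_prompt(p: Pipeline, stage: str) -> Pipeline:
    return {**p, stage: p[stage] + " cite sources"}


def toy_edit_structure(p: Pipeline, stage: str) -> Pipeline:
    return {**p, "verify": "check the answer against the evidence"}


start = {"retrieve": "find evidence", "answer": "answer the question"}
best, best_score, log = optimize(
    start, toy_score, toy_diagnose, toy_edit_prompt, toy_edit_structure
)
print(log, best_score)
```

In this toy run the first round is fixed by a prompt edit alone. In the second round the prompt edit no longer improves the score, so the loop escalates to a structure change, which mirrors the order the authors describe.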
Why does diagnosing first beat mutating blindly? Start with the full suite of 18 model-benchmark comparisons: FAPO won 15 of them. Average over just the 11 statistically significant wins, and the gain is +14.1 percentage points over GEPA. Now isolate the subset where FAPO actually changed chain structure, six comparisons on HoVer and IFBench, and the picture sharpens: it won all six, with a mean gain of +33.8 percentage points. The biggest gains land exactly where structural editing was the right diagnosis, which is the escalation ladder paying off. Even on security tasks (CTIBench-RCM) the scoped edits help: +4.0pp on GPT-5, +7.1pp on Foundation-Sec-8B-Instruct, and +2.0pp on the reasoning variant.
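For readers who want to see how headline figures of this shape are tallied, here is a small sketch. The rows are made-up placeholder data, not the paper's results, and `Comparison` and `summarize` are hypothetical names. It also assumes a "win" means any positive difference and that significance was decided by a separate test outside this snippet.

```python
from statistics import mean
from typing import Dict, List, NamedTuple, Optional, Union


class Comparison(NamedTuple):
    name: str          # one model-benchmark pair
    delta_pp: float    # optimizer score minus baseline score, in percentage points
    significant: bool  # outcome of a significance test run elsewhere (assumed given)


def summarize(rows: List[Comparison]) -> Dict[str, Union[int, float, None]]:
    wins = [r for r in rows if r.delta_pp > 0]
    sig_wins = [r for r in wins if r.significant]
    mean_gain: Optional[float] = (
        mean(r.delta_pp for r in sig_wins) if sig_wins else None
    )
    return {
        "comparisons": len(rows),
        "wins": len(wins),
        "significant_wins": len(sig_wins),
        "mean_gain_over_significant_wins_pp": mean_gain,
    }


# Placeholder data for illustration only.
rows = [
    Comparison("task-a/model-1", 12.0, True),
    Comparison("task-a/model-2", 3.0, False),
    Comparison("task-b/model-1", -1.0, False),
    Comparison("task-b/model-2", 20.0, True),
]
summary = summarize(rows)
print(summary)
```

Note that the mean is taken over significant wins only, so it describes how large the gains are where they are reliable, not the average change across every comparison.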
| Approach | How it changes the pipeline | What guides the change | Reported result |
|---|---|---|---|
| GEPA (baseline) | Mutates prompts, keeps higher-scoring variants | A score signal alone — no explicit failure attribution | The baseline FAPO is measured against |
| FAPO (failure-attribution-gated) | Scoped prompt edit first; chain-structure rewrite only on a structural diagnosis | Per-step failure attribution from inspecting intermediate outputs | Won 15/18 comparisons; +14.1pp mean over 11 significant wins |
Related explainers
- Crafter — multi-agent refinement harness with a directive critic — an iterate-and-critique cousin; FAPO's "critic" is failure attribution
- Maestro — RL orchestrator over frozen experts — also improves a pipeline without touching the base model's weights
- Agent-harness scaling law: feedback quality predicts success — FAPO's edge comes from higher-quality attribution feedback, not more compute