The news. In June 2026, researchers posted a large-scale empirical study across 67 frontier models from 21 providers that defines a co-failure ceiling of 1 − β, where β is the rate at which every model is wrong on the same query. The headline result: on queries where all models co-fail, routing, voting, and mixture-of-agents do not improve accuracy; the only gains come from models that fail on different questions. The paper also shows that pairwise error correlation ρ cannot reliably predict β, and that low-correlation heterogeneous ensembles beat high-correlation Self-MoA at matched quality.
Picture a panel of doctors giving second opinions. The instinct is comforting: add another doctor, then another, and surely the panel catches more. But if every doctor on the panel trained at the same school, they share the same blind spots — and on the rare disease they were all taught to overlook, a fifth or sixth opinion changes nothing. That panel of doctors is your set of models, and a vote or a router that combines their answers can only help on the cases where someone on the panel was right.
This is exactly the limit the paper makes precise. Define β as the co-failure rate: the share of questions on which every model in the set is wrong at once. Any combiner that picks among the models' own answers, whether by routing to the best model or by taking a majority vote, has no correct answer in the pool to choose on those questions, so it cannot score above 1 − β, the co-failure ceiling. A mixture-of-agents aggregator that fuses the drafts could in principle synthesize a right answer from wrong ones, but the paper finds that in practice it does not beat the ceiling either. Combining models can claw back questions where at least one model is right; it rarely recovers the questions that every model gets wrong.
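A minimal sketch of that definition, assuming a small hand-made answer table. The questions, answers, and function names are illustrative and do not come from the paper. It computes β, the ceiling 1 − β, and a plain majority vote, then checks that the vote stays at or below the ceiling.

```python
from collections import Counter

# Hypothetical toy data: gold[q] is the reference answer for question q,
# answers[q][m] is what model m said. None of this comes from the paper.
gold = ["4", "9", "7", "2", "5"]
answers = [
    ["4", "4", "3"],  # two of three models right
    ["9", "1", "9"],  # two of three models right
    ["8", "6", "8"],  # every model wrong: a co-failure
    ["2", "2", "2"],  # every model right
    ["0", "5", "1"],  # one model right, but the three-way tie picks "0"
]

def co_failure_rate(answers, gold):
    """beta: share of questions where no model's answer matches the gold one."""
    misses = sum(1 for row, g in zip(answers, gold) if g not in row)
    return misses / len(gold)

def majority_vote_accuracy(answers, gold):
    """Accuracy of picking the most common answer per question.

    Ties go to the answer that appears first in the row.
    """
    hits = 0
    for row, g in zip(answers, gold):
        winner, _count = Counter(row).most_common(1)[0]
        hits += winner == g
    return hits / len(gold)

beta = co_failure_rate(answers, gold)
ceiling = 1 - beta
vote_acc = majority_vote_accuracy(answers, gold)

# prints: beta = 0.20, ceiling = 0.80, majority vote = 0.60
print(f"beta = {beta:.2f}, ceiling = {ceiling:.2f}, majority vote = {vote_acc:.2f}")
assert vote_acc <= ceiling  # a selector can never beat 1 - beta
```

In this toy table the vote also loses the last question, where one model was right. The ceiling is an upper bound on what a selector can reach; a given selector may well fall short of it.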
The trap is in how people estimate that shared-failure rate. The usual shortcut is pairwise error correlation ρ, a measure of how strongly any two models' errors coincide, and it feels like a reasonable proxy. It isn't. The paper shows that estimates built from ρ systematically under-count β, because two-at-a-time agreement says little about the rare event where all of the models miss at once. Measure the joint failure directly, with finite-sample (Clopper–Pearson) bounds, and the picture is far less rosy than the correlation heuristic implies.
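Two small illustrations of this point, offered as a sketch and not as the paper's procedure. The first builds two toy ensembles with identical per-model error rates and identical pairwise co-error rates but different β, which shows that pairwise statistics alone do not determine the joint failure rate. The second computes a Clopper–Pearson interval for a directly counted co-failure rate using only the standard library. The counts `k=52` and `n=1000` are hypothetical, and the function names are my own.

```python
import math
from itertools import combinations

def pairwise_both_wrong(wrong):
    """Rate at which each pair of models is wrong on the same question."""
    n_models = len(wrong[0])
    return {
        (a, b): sum(1 for row in wrong if row[a] and row[b]) / len(wrong)
        for a, b in combinations(range(n_models), 2)
    }

def co_failure_rate(wrong):
    """beta: share of questions on which every model is wrong."""
    return sum(1 for row in wrong if all(row)) / len(wrong)

# wrong[q][m] is True when model m misses question q (toy data).
set_a = [[True, True, True], [True, False, False],
         [False, True, False], [False, False, True]]
set_b = [[True, True, False], [True, False, True],
         [False, True, True], [False, False, False]]

# Same per-model error rate (0.5) and same pairwise co-error rate (0.25),
# hence the same pairwise correlation, yet beta differs.
assert pairwise_both_wrong(set_a) == pairwise_both_wrong(set_b)
print(co_failure_rate(set_a), co_failure_rate(set_b))  # 0.25 0.0

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p), each term computed in log space."""
    if k < 0:
        return 0.0
    if p <= 0.0:
        return 1.0
    if p >= 1.0:
        return 1.0 if k >= n else 0.0
    log_p, log_q = math.log(p), math.log1p(-p)
    total = 0.0
    for i in range(k + 1):
        log_pmf = (math.lgamma(n + 1) - math.lgamma(i + 1)
                   - math.lgamma(n - i + 1) + i * log_p + (n - i) * log_q)
        total += math.exp(log_pmf)
    return min(total, 1.0)

def clopper_pearson(k, n, alpha=0.10):
    """Exact two-sided (1 - alpha) interval for a binomial rate, by bisection."""
    def solve(p_is_too_small):
        lo, hi = 0.0, 1.0
        for _ in range(60):
            mid = (lo + hi) / 2
            if p_is_too_small(mid):
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2
    # Lower bound: the p at which P(X >= k | p) equals alpha / 2.
    lower = 0.0 if k == 0 else solve(lambda p: 1 - binom_cdf(k - 1, n, p) < alpha / 2)
    # Upper bound: the p at which P(X <= k | p) equals alpha / 2.
    upper = 1.0 if k == n else solve(lambda p: binom_cdf(k, n, p) > alpha / 2)
    return lower, upper

# Hypothetical counts: 52 all-model misses observed on 1000 questions.
lower, upper = clopper_pearson(k=52, n=1000)
print(f"beta_hat = {52 / 1000:.3f}, 90% CI = [{lower:.3f}, {upper:.3f}]")
```

If SciPy is available, its exact binomial interval should give the same bounds; the bisection version here just keeps the example dependency-free.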
Hold one benchmark fixed and walk the numbers. On the math task, across all 67 models, the co-failure rate was measured at β = 0.052, so about 1 question in 20 defeats every model at once. That pins the ceiling at 1 − β = 0.948: no router or vote built from these models can exceed roughly 94.8% accuracy, and per the paper a mixture does not in practice either. Now the trap: estimate β from pairwise correlation ρ instead and you'd predict only 0.023, about 1 in 43. The measured co-failure rate was about 2.5× that prediction (90% confidence interval 1.7–3.4×). On harder tasks β is larger and the ceiling lower: code execution has β = 0.079, and GPQA-Diamond free-response has β = 0.127, roughly 1 question in 8. So on the math task, the questions your ensemble can never recover are about two-and-a-half times as common as the standard correlation shortcut suggests.
| Task | Co-failure rate β | ≈ how often all models miss | Ceiling (1 − β) | Source |
|---|---|---|---|---|
| Math | 0.052 | ~1 in 20 | ~0.948 | arXiv |
| Code execution | 0.079 | ~1 in 13 | ~0.921 | arXiv |
| GPQA-Diamond (free response, 5-judge panel) | 0.127 | ~1 in 8 | ~0.873 | arXiv |
| Math, predicted from error correlation ρ | 0.023 | ~1 in 43 (an under-count) | ~0.977 (false) | arXiv |
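The ceiling and "1 in N" columns follow directly from β, and a few lines of arithmetic reproduce them. This is a small sketch: the β values are copied from the table above, while the helper names `ceiling` and `one_in` are my own.

```python
# beta values as reported in the table above; everything else is derived.
reported_beta = {
    "Math": 0.052,
    "Code execution": 0.079,
    "GPQA-Diamond (free response)": 0.127,
    "Math, predicted from rho": 0.023,
}

def ceiling(beta):
    """Best accuracy an answer-selecting combiner can reach: 1 - beta."""
    return 1.0 - beta

def one_in(beta):
    """Express a rate as 'about 1 question in N'."""
    return 1.0 / beta

for task, beta in reported_beta.items():
    print(f"{task:30s} beta={beta:.3f}  "
          f"ceiling={ceiling(beta):.3f}  ~1 in {one_in(beta):.1f}")
```

For the math row the exact figure is about 1 in 19.2, which the table rounds to 1 in 20.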
So what actually lifts the ceiling? Not more copies of your best model; back to the panel, that is just the same doctor consulting himself. The only lever is diversity: models with genuinely different blind spots, which fail on different questions and therefore shrink the set they all miss together. The paper finds that low-correlation heterogeneous ensembles beat high-correlation Self-MoA at matched quality, which also explains why a router or a voting agent team helps in some setups and not in others. It also reframes a familiar eval failure mode: a leaderboard of correlated models can look like progress while the co-failure ceiling barely moves.
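A toy construction makes the diversity lever concrete. It is a hand-built sketch, not data from the paper: each model is represented by the set of question indices it gets wrong, every model misses exactly 20 of 100 questions, and only the overlap between blind spots changes.

```python
N_QUERIES = 100

def co_failure_rate(wrong_sets, n_queries=N_QUERIES):
    """beta: share of questions that appear in every model's set of misses."""
    return len(set.intersection(*wrong_sets)) / n_queries

# Three near-copies of one model: almost identical blind spots.
self_ensemble = [
    set(range(0, 20)),
    set(range(0, 18)) | {20, 21},
    set(range(0, 18)) | {22, 23},
]

# Three different models with the same individual accuracy (0.80)
# but blind spots that only partly overlap.
heterogeneous = [
    set(range(0, 20)),
    set(range(15, 35)),
    set(range(17, 20)) | set(range(35, 52)),
]

for name, ensemble in [("near-copies", self_ensemble),
                       ("heterogeneous", heterogeneous)]:
    assert all(len(w) == 20 for w in ensemble)  # matched per-model quality
    beta = co_failure_rate(ensemble)
    print(f"{name:14s} beta={beta:.2f}  ceiling={1 - beta:.2f}")
# near-copies    beta=0.18  ceiling=0.82
# heterogeneous  beta=0.03  ceiling=0.97
```

Per-model accuracy is identical in both ensembles, so the gap in ceilings comes entirely from how much the blind spots overlap.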
Related explainers
- Maestro — RL orchestrator over frozen experts — one way to combine many models; the co-failure ceiling is the limit any such combiner runs into.
- VibeThinker-3B — Diversity-driven RL — diversity as the lever again, but inside a single model's training rather than across an ensemble.
- Agent leaderboards mislead under distribution shift — why aggregate scores over-promise; a correlated leaderboard can hide a stuck co-failure ceiling.