What is the co-failure ceiling?

The co-failure ceiling is the highest accuracy any combination of a fixed set of models can reach. It equals 1 − β, where β is the co-failure rate — the fraction of questions on which every model in the set is wrong at the same time. A 67-model study across 21 providers (arXiv 2606.27288, June 2026) measures β directly: on those shared-failure questions no model has the right answer, so combiners that pick among the models' answers (routing, majority voting) cannot score above 1 − β, and the paper finds that mixture-of-agents does not beat the ceiling in practice either. On the math task the measured β was 0.052 (a ceiling near 94.8%); on code execution 0.079; on GPQA-Diamond free-response 0.127.

Why can't routing, voting, or mixture-of-agents beat it?

Combining models only helps where the models differ: it recovers questions on which at least one model is right. On a co-failure every model is wrong (whether or not they land on the same wrong answer), so a router has no good model to pick and a majority vote has no correct option to surface. A mixture-of-agents aggregator that fuses the drafts could in principle synthesize a right answer, but the paper finds it does not beat the ceiling in practice. The gains come from models that fail on different questions, not from adding more models.

Why is error correlation a misleading way to predict ensemble gains?

Pairwise error correlation ρ measures how often two models are wrong together, but the co-failure rate β is about all models being wrong at once — a much rarer, joint event that two-at-a-time statistics systematically under-count. In the study, the measured β on the math task was about 2.5× higher than ρ predicted (0.052 versus 0.023, with a 90% confidence interval of 1.7–3.4×), and the gap was worse on harder tasks. The practical fix is to measure joint failures directly and to add models with genuinely different blind spots — diversity, not ensemble size, is what lifts the ceiling.
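A toy sketch of why two-at-a-time statistics cannot pin down β. The two error matrices below are invented for illustration (they are not data from the paper, and this is not the paper's estimator). Both have identical per-model error rates and identical pairwise co-failure rates, so every pairwise ρ is the same, yet their joint co-failure rates differ.

```python
from itertools import combinations

def pairwise_stats(errors):
    """Per-model error rates and pairwise co-failure rates from a 0/1 error matrix."""
    n = len(errors[0])
    marginals = [sum(row) / n for row in errors]
    pair_rates = {
        (i, j): sum(a and b for a, b in zip(errors[i], errors[j])) / n
        for i, j in combinations(range(len(errors)), 2)
    }
    return marginals, pair_rates

def joint_failure_rate(errors):
    """Fraction of questions (columns) on which every model (row) is wrong."""
    n = len(errors[0])
    return sum(all(col) for col in zip(*errors)) / n

# Rows are models, columns are questions, 1 means "wrong". Hypothetical data.
spread = [
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 1, 0],
]
stacked = [
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [1, 0, 0, 1],
]

if __name__ == "__main__":
    for name, errors in (("spread", spread), ("stacked", stacked)):
        marginals, pairs = pairwise_stats(errors)
        print(name, marginals, pairs, "beta =", joint_failure_rate(errors))
```

In `spread` the misses never line up across all three models; in `stacked` they all pile onto one question. Any predictor that sees only the pairwise numbers must give both matrices the same answer, which is why measuring the joint event directly is the safer route.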

Routing, voting, and mixture-of-agents hit a shared ceiling across 67 models — Co-failure ceiling

Jargon

Co-failure rate (β): The fraction of queries on which every model in the set is wrong at the same time. The paper measures it directly per task rather than inferring it.
Co-failure ceiling (1 − β): The highest accuracy any combination of a fixed set of models can reach. If a fraction β of questions defeats all of them, no combiner that selects among their answers can score above 1 − β.
Routing: Send each query to the single best model for it instead of running them all. One of the three combiners the ceiling bounds.
Voting / majority vote: Run several models on the same query and take the most common answer. It helps only when the correct answer is among the votes and the wrong answers do not outnumber it.
Mixture-of-agents (MoA): Combine several models' or agents' outputs, often with an aggregator that reads all their drafts. Self-MoA uses copies of the same model as every agent.
Pairwise error correlation (ρ): How often two models are wrong on the same query, measured two at a time. The paper shows it systematically under-estimates β, the rate at which they are all wrong together.
Heterogeneous vs homogeneous ensemble: An ensemble of different models (diverse blind spots) versus copies of one model. Heterogeneous, low-correlation sets push the ceiling up; homogeneous sets share the same misses.

The news. In June 2026, researchers posted a large-scale empirical study across 67 frontier models from 21 providers that defines a co-failure ceiling of 1 − β, where β is the rate at which every model is wrong on the same query. The headline result: when models co-fail, routing, voting, and mixture-of-agents do not improve accuracy; the only gains come from models that fail on different questions. The paper also shows that pairwise error correlation ρ cannot reliably predict β, and that low-correlation heterogeneous ensembles beat high-correlation Self-MoA at matched quality.

Picture a panel of doctors giving second opinions. The instinct is comforting: add another doctor, then another, and surely the panel catches more. But if every doctor on the panel trained at the same school, they share the same blind spots — and on the rare disease they were all taught to overlook, a fifth or sixth opinion changes nothing. That panel of doctors is your set of models, and a vote or a router that combines their answers can only help on the cases where someone on the panel was right.

This is exactly the limit the paper makes precise. Define β as the co-failure rate: the share of questions on which the whole set of models is wrong at once. For any combiner that picks among the models' own answers (routing to the best model or taking a majority vote), there is simply no correct answer in the pool to choose on those questions, so it cannot score above 1 − β, the co-failure ceiling. A mixture-of-agents aggregator that fuses the drafts could in principle synthesize a right answer from wrong ones, but the paper finds that in practice it does not beat the ceiling either. Combining models can claw back the questions where at least one of them is right; it rarely recovers the questions where all of them are wrong.
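The definition above can be sketched in a few lines. This is a minimal illustration on invented answers for three hypothetical models and ten questions (none of it comes from the paper). It measures β, the best case for a router (an oracle that always picks a correct model when one exists), and a plain majority vote.

```python
from collections import Counter

def co_failure_rate(answers, gold):
    """Fraction of questions on which every model's answer is wrong."""
    n = len(gold)
    return sum(all(m[q] != gold[q] for m in answers) for q in range(n)) / n

def oracle_router_accuracy(answers, gold):
    """Best case for routing: right whenever at least one model is right."""
    n = len(gold)
    return sum(any(m[q] == gold[q] for m in answers) for q in range(n)) / n

def majority_vote_accuracy(answers, gold):
    """Accuracy of taking the most common answer per question."""
    n = len(gold)
    correct = 0
    for q in range(n):
        winner, _ = Counter(m[q] for m in answers).most_common(1)[0]
        correct += winner == gold[q]
    return correct / n

# Hypothetical gold labels and model answers, one character per question.
gold = "ABCDABCDAB"
answers = [
    "ABDDABCAAC",  # model 1: wrong on questions 2, 7, 9
    "ACDDABCAAC",  # model 2: wrong on questions 1, 2, 7, 9
    "ABCDBBCBAC",  # model 3: wrong on questions 4, 7, 9
]

if __name__ == "__main__":
    beta = co_failure_rate(answers, gold)
    print("beta:", beta, "ceiling:", 1 - beta)
    print("oracle router:", oracle_router_accuracy(answers, gold))
    print("majority vote:", majority_vote_accuracy(answers, gold))
```

On this toy set every model misses questions 7 and 9, so even the oracle router stops at the ceiling, and the majority vote lands below it because two models share a wrong answer on question 2.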

The trap is in how people estimate that shared-failure rate. The usual shortcut is pairwise error correlation ρ — how often any two models are wrong together — and it feels like a reasonable proxy. It isn't. The paper shows ρ systematically under-counts β, because two-at-a-time agreement says little about the rare event where all of them miss at once. Measure the joint failure directly, with finite-sample (Clopper–Pearson) bounds, and the picture is far less rosy than the correlation heuristic implies.
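For readers who want to see what a Clopper–Pearson bound on β looks like, here is a standard-library sketch of the exact binomial interval, found by bisection on the binomial CDF. The counts in the usage lines (26 co-failures out of 500 questions) are hypothetical, chosen only so the point estimate is 0.052; the source does not state the number of questions per task, and the paper's own procedure may differ in detail.

```python
import math

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p), with terms computed in log space."""
    if k < 0:
        return 0.0
    if k >= n:
        return 1.0
    log_p, log_q = math.log(p), math.log1p(-p)
    total = 0.0
    for i in range(k + 1):
        log_term = (math.lgamma(n + 1) - math.lgamma(i + 1)
                    - math.lgamma(n - i + 1) + i * log_p + (n - i) * log_q)
        total += math.exp(log_term)
    return min(total, 1.0)

def clopper_pearson(k, n, alpha=0.10, iters=100):
    """Exact two-sided (1 - alpha) interval for a proportion k / n."""
    # Lower bound: the p at which P(X >= k) equals alpha / 2 (0 if k == 0).
    lower = 0.0
    if k > 0:
        lo, hi = 0.0, 1.0
        for _ in range(iters):
            mid = (lo + hi) / 2
            if 1.0 - binom_cdf(k - 1, n, mid) < alpha / 2:
                lo = mid  # p too small: seeing k or more is still too unlikely
            else:
                hi = mid
        lower = (lo + hi) / 2
    # Upper bound: the p at which P(X <= k) equals alpha / 2 (1 if k == n).
    upper = 1.0
    if k < n:
        lo, hi = 0.0, 1.0
        for _ in range(iters):
            mid = (lo + hi) / 2
            if binom_cdf(k, n, mid) > alpha / 2:
                lo = mid  # p too small: seeing k or fewer is still too likely
            else:
                hi = mid
        upper = (lo + hi) / 2
    return lower, upper

if __name__ == "__main__":
    k, n = 26, 500  # hypothetical counts, not taken from the paper
    low, high = clopper_pearson(k, n, alpha=0.10)
    print(f"beta = {k / n:.3f}, 90% interval = [{low:.4f}, {high:.4f}]")
    print(f"ceiling 1 - beta lies in [{1 - high:.4f}, {1 - low:.4f}]")
```

The interval on β flips into an interval on the ceiling: the upper bound on β is the pessimistic end for 1 − β, which is the number to plan around.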

Hold one benchmark fixed and walk the numbers. On the math task, across all 67 models, the co-failure rate was measured at β = 0.052: about 1 question in 20 defeats every model at once. That pins the ceiling at 1 − β = 0.948, so no router or vote built from these models can exceed roughly 94.8% accuracy, and the paper finds that mixtures do not exceed it in practice either. Now the trap: estimate β from pairwise correlation ρ instead and you'd predict only 0.023, about 1 in 43. The paper reports the real co-failure rate as about 2.5× higher (90% confidence interval 1.7–3.4×). On harder tasks the gap widens and β itself is larger: code execution β = 0.079, and GPQA-Diamond free-response β = 0.127, roughly 1 question in 8. So on the math task, the questions your ensemble can never recover are about two-and-a-half times more common than the standard correlation shortcut suggests.

Task	Co-failure rate β	≈ how often all models miss	Ceiling (1 − β)	Source
Math	`0.052`	~1 in 20	~0.948	arXiv
Code execution	`0.079`	~1 in 13	~0.921	arXiv
GPQA-Diamond (free response, 5-judge panel)	`0.127`	~1 in 8	~0.873	arXiv
Math, predicted from error correlation ρ	`0.023`	~1 in 43 (an under-count)	~0.977 (false)	arXiv

So what actually lifts the ceiling? Not more copies of your best model — back to the panel, that is just the same doctor consulting himself. The only lever is diversity: models with genuinely different blind spots, which fail on different questions and therefore shrink the set they all miss together. The paper finds that low-correlation heterogeneous ensembles beat high-correlation Self-MoA at matched quality, and that this is the same reason a router or a voting agent team sometimes helps and sometimes doesn't. It also reframes a familiar eval failure mode: a leaderboard of correlated models can look like progress while the co-failure ceiling barely moves.
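A deliberately simplified sketch of the diversity lever. Each hypothetical model is represented by the set of question indices it gets wrong, and every model has the same individual accuracy. The failure sets are invented, and treating copies of one model as failing on exactly the same questions is an assumption made for clarity (real samples from one model vary somewhat).

```python
def co_failure_rate(failure_sets, n_questions):
    """Share of questions that appear in every model's failure set."""
    shared = set.intersection(*failure_sets)
    return len(shared) / n_questions

N_QUESTIONS = 20

# Hypothetical model that misses 4 of 20 questions (80% accuracy).
best = {0, 1, 2, 3}

# Homogeneous ensemble: three copies of the same model, same blind spots.
homogeneous = [best, best, best]

# Heterogeneous ensemble: each model also misses 4 of 20, but the
# misses overlap on only one question.
heterogeneous = [{0, 1, 2, 3}, {0, 4, 5, 6}, {0, 7, 8, 9}]

if __name__ == "__main__":
    for name, ensemble in (("homogeneous", homogeneous),
                           ("heterogeneous", heterogeneous)):
        beta = co_failure_rate(ensemble, N_QUESTIONS)
        print(f"{name}: beta = {beta:.2f}, ceiling = {1 - beta:.2f}")
```

Both ensembles contain three models of equal quality, yet the homogeneous one is capped at the single model's accuracy while the heterogeneous one is capped only by the one question all three miss.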


Related explainers

Maestro — RL orchestrator over frozen experts — one way to combine many models; the co-failure ceiling is the limit any such combiner runs into.
VibeThinker-3B — Diversity-driven RL — diversity as the lever again, but inside a single model's training rather than across an ensemble.
Agent leaderboards mislead under distribution shift — why aggregate scores over-promise; a correlated leaderboard can hide a stuck co-failure ceiling.
