What is adaptive solution-length control?

It's a model's ability to scale the length of its reasoning chain to the difficulty of the task. Instead of applying one fixed cap on reasoning tokens to every prompt, the model spends a short chain on easy tasks and a long one only on hard tasks, stopping once it has reached an answer. Microsoft reports that MAI-Code-1-Flash uses it to hit its benchmark scores with up to 60% fewer tokens than a flat budget would use.

Why does it save so much without losing accuracy?

Because the savings come from tasks that were over-thought, not under-thought. On an easy fix, a long reasoning chain reaches the answer early and then keeps generating tokens past it — those extra tokens cost latency and money but don't change the result. Trimming the chain to the point the answer was reached removes pure waste, while hard tasks that genuinely need a long chain are barely affected.

How is it different from a per-token compute controller?

They tune different dials. A per-token compute controller (as in the 'Compute Where It Counts' paper) changes how much compute each individual token gets — attention sparsity, layer pruning, bit-width. Adaptive solution-length control changes how many reasoning tokens the chain runs in total. One sizes the work per token; the other sizes the number of tokens. They're complementary.
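
One way to see why the two are complementary is to treat total reasoning cost as the product of the two dials. The sketch below is a toy cost model, not something taken from Microsoft or from the paper; the function name `reasoning_cost`, the token counts, and the per-token compute units are all hypothetical.

```python
def reasoning_cost(num_tokens: int, compute_per_token: float) -> float:
    """Toy model: total reasoning compute = chain length x work per token."""
    if num_tokens < 0 or compute_per_token < 0:
        raise ValueError("inputs must be non-negative")
    return num_tokens * compute_per_token


# Hypothetical settings, in arbitrary compute units.
baseline = reasoning_cost(2000, 1.0)        # fixed-length chain, full compute per token
shorter_chain = reasoning_cost(800, 1.0)    # adaptive solution length only
cheaper_tokens = reasoning_cost(2000, 0.5)  # per-token compute controller only
both = reasoning_cost(800, 0.5)             # both dials turned together

for name, cost in [
    ("baseline", baseline),
    ("shorter chain", shorter_chain),
    ("cheaper tokens", cheaper_tokens),
    ("both", both),
]:
    print(f"{name}: {cost / baseline:.0%} of baseline")
```

Because the two factors multiply, a saving on one dial does not use up the saving available on the other, at least in this simplified model.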

Microsoft MAI-Code-1-Flash — Adaptive solution-length control

learnaivisually.com/ai-explained/mai-code-1-flash-adaptive-solution-length

TL;DR

What is it: Microsoft's first in-house coding model, MAI-Code-1-Flash (launched at Build 2026 alongside the MAI-Thinking-1 reasoner), ships adaptive solution-length control — the model decides how many reasoning tokens to spend based on how hard the task is.
Why it’s needed: Reasoning tokens are the dominant cost of a thinking model: every token is a decode step you pay for in latency and dollars. Spending the same long chain on a one-line fix as on a cross-file refactor wastes most of that budget — adaptive length spends it only where it buys accuracy.
vs previous: A fixed-budget reasoner thinks to roughly the same length on every prompt; adaptive control stops at the point the answer is reached, so Microsoft reports the model hitting its scores with up to 60% fewer tokens than a flat budget would burn.

Jargon

Reasoning tokens: The intermediate "thinking" tokens a model generates one at a time before its final answer (the chain-of-thought). More reasoning tokens = more decode cost.
Solution length: How long that reasoning chain runs before the model commits to an answer. Adaptive solution-length control lets the model choose this length per task instead of using a fixed cap.
Test-time compute: Compute spent at inference (not training) — chiefly by generating more reasoning tokens. Spending more usually helps hard problems and is wasted on easy ones.
SWE-Bench Pro / Verified: Benchmarks of real GitHub issues a model must resolve with working code. Microsoft reports MAI-Code-1-Flash at 51.2% on SWE-Bench Pro vs Claude Haiku 4.5's 35.2%, using up to 60% fewer tokens on SWE-Bench Verified.
Sparse MoE: A sparse Mixture-of-Experts model routes each token through a small subset of "expert" sub-networks. MAI-Thinking-1, the reasoner launched alongside the coding model, is a sparse MoE with 35B active parameters and a 256K context.
Underthinking: Stopping the chain too early and committing to a wrong answer — the failure mode a fixed-minimum budget risks, and the reason a good stop signal is the hard part of adaptive length.

The news. On June 2, 2026, at Build 2026, Microsoft introduced its first in-house frontier models: MAI-Thinking-1 (a 35B-active sparse MoE reasoner with a 256K context, which Microsoft says was trained from scratch on licensed data with no distillation from third-party models) and MAI-Code-1-Flash, a small, inference-efficient coding model built end-to-end by Microsoft and rolling out to GitHub Copilot users in VS Code. MAI-Code-1-Flash reportedly leads Claude Haiku 4.5 by 16 points on SWE-Bench Pro (51.2% vs 35.2%) and uses up to 60% fewer tokens on SWE-Bench Verified, a saving Microsoft credits to adaptive solution-length control.

Picture the test-taker for a second. Two students sit the same exam. The first was told to spend exactly ten minutes per question — so she burns the full ten on "2 + 2," sits there second-guessing a settled answer, and runs short on the proof at the end. The second reads each question, sizes up the effort, and moves on the moment she's sure — thirty seconds on the arithmetic, the full ten on the proof. Same paper, same score, far less time. Adaptive solution-length control is the second student: the model spends its reasoning where difficulty actually demands it, instead of paying a flat tax on every task.

Under the hood, the "minutes" are reasoning tokens. A thinking model generates its chain-of-thought one token at a time before answering, and every one of those tokens is a decode step you pay for in latency and dollars. A fixed budget sets one length for all prompts; adaptive control instead decides how long to keep thinking and, crucially, when to stop. Microsoft hasn't disclosed the exact controller (whether the length is a learned behavior, a prediction made up front, or a learned stop signal fired mid-chain), so any specific mechanism is speculation. What's reported is the outcome: the same benchmark scores at a fraction of the tokens.
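
To make the "when to stop" idea concrete, here is a minimal sketch of a token-by-token reasoning loop that can run either as a fixed budget or with an early stop. It illustrates only one of the possibilities listed above (a stop signal mid-chain) and is not Microsoft's controller, which is undisclosed. The `StepFn` interface, the `think` and `toy_step` functions, and the confidence threshold are all hypothetical.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical interface: one decode step takes the chain so far and returns
# (next_token, confidence that the answer is already settled).
StepFn = Callable[[List[str]], Tuple[str, float]]


def think(step: StepFn, max_tokens: int,
          stop_threshold: Optional[float] = None) -> List[str]:
    """Generate reasoning tokens one at a time.

    stop_threshold=None imitates a fixed budget (always runs to max_tokens);
    a float enables an early stop once the confidence reaches it.
    """
    chain: List[str] = []
    while len(chain) < max_tokens:
        token, done_confidence = step(chain)
        chain.append(token)
        if stop_threshold is not None and done_confidence >= stop_threshold:
            break
    return chain


def toy_step(answer_at: int) -> StepFn:
    """Fake model whose confidence jumps to 1.0 once `answer_at` tokens exist."""
    def step(chain: List[str]) -> Tuple[str, float]:
        n = len(chain) + 1
        return f"t{n}", 1.0 if n >= answer_at else 0.0
    return step


easy_task = toy_step(answer_at=200)
fixed_len = len(think(easy_task, max_tokens=2000))
adaptive_len = len(think(easy_task, max_tokens=2000, stop_threshold=0.9))
print(fixed_len, adaptive_len)  # prints "2000 200"
```

The toy confidence signal is perfect by construction; in a real model, producing a trustworthy version of that signal is the difficult part.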

Where the tokens actually go

A back-of-envelope walk-through (illustrative numbers; the 60% figure is Microsoft's). Take three Copilot tasks: an easy one-line fix, a medium multi-step bug, and a hard cross-file refactor. A fixed budget of ~2,000 reasoning tokens spends the same on all three, for ~6,000 tokens total, even though the easy fix had its answer after ~200. Adaptive control stops each chain at its answer: ~200 + ~650 + ~1,650 ≈ 2,500 tokens for the same result. That's ~58% fewer tokens in this toy mix, in line with the up-to-60% reduction Microsoft reports. The hard task changes least (~1,650 vs ~2,000 tokens); about 90% of the savings come from not over-thinking the easy and medium ones.
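
The arithmetic above can be checked in a few lines. The task names and token counts are the illustrative numbers from the walk-through, not measurements, and the variable names are made up for this sketch.

```python
FIXED_BUDGET = 2000  # illustrative flat budget, in reasoning tokens

# Illustrative point at which each chain reaches its answer.
answer_reached_at = {
    "easy one-line fix": 200,
    "medium multi-step bug": 650,
    "hard cross-file refactor": 1650,
}

fixed_total = FIXED_BUDGET * len(answer_reached_at)
adaptive_total = sum(answer_reached_at.values())
savings = 1 - adaptive_total / fixed_total

print(f"fixed: {fixed_total} tokens, adaptive: {adaptive_total} tokens")
print(f"overall savings: {savings:.1%}")

# Per-task view: how many tokens each task stops wasting.
tokens_saved = {task: FIXED_BUDGET - used for task, used in answer_reached_at.items()}
for task, saved in tokens_saved.items():
    print(f"{task}: {saved} tokens saved ({saved / FIXED_BUDGET:.1%} of its budget)")

easy_medium_share = (
    tokens_saved["easy one-line fix"] + tokens_saved["medium multi-step bug"]
) / sum(tokens_saved.values())
print(f"share of savings from easy + medium tasks: {easy_medium_share:.0%}")
```

With these inputs the overall saving comes to about 58%, and the easy and medium tasks account for about 90% of the tokens saved. A different task mix would shift both numbers.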

Three ways to set the reasoning length

Strategy	Easy task	Hard task	Main risk
Fixed-max budget	thinks far past the answer	fits — has room	over-thinking: burns tokens it doesn't need
Fixed-min budget	fits — short is fine	cut off too early	underthinking: commits to wrong answers
Adaptive control	short chain	long chain	needs a reliable stop signal
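
The three rows of the table can be compared with a small simulation. This is a sketch under strong assumptions: each task is reduced to a single number (the tokens needed to reach its answer), a task counts as solved only if its chain is allowed to run that long, and the adaptive strategy is given an ideal stop signal that it would not have in practice. The names `Outcome`, `fixed_budget`, and `adaptive_ideal`, and the budgets of 2,000 and 300 tokens, are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Outcome:
    tokens: int  # total reasoning tokens spent across all tasks
    solved: int  # number of tasks whose chain reached its answer


def fixed_budget(needed: List[int], budget: int) -> Outcome:
    """Spend the full budget on every task; solve only tasks that fit in it."""
    return Outcome(
        tokens=budget * len(needed),
        solved=sum(n <= budget for n in needed),
    )


def adaptive_ideal(needed: List[int], cap: int) -> Outcome:
    """Idealized adaptive control: stop exactly at the answer, up to a cap."""
    return Outcome(
        tokens=sum(min(n, cap) for n in needed),
        solved=sum(n <= cap for n in needed),
    )


needed = [200, 650, 1650]  # illustrative easy / medium / hard tasks

results = {
    "fixed-max (2000)": fixed_budget(needed, 2000),
    "fixed-min (300)": fixed_budget(needed, 300),
    "adaptive (ideal stop)": adaptive_ideal(needed, 2000),
}
for name, outcome in results.items():
    print(f"{name}: {outcome.tokens} tokens, {outcome.solved}/{len(needed)} solved")
```

In this toy setup the fixed-max budget solves everything at the highest cost, the fixed-min budget is cheapest but solves only the easy task, and the idealized adaptive strategy matches fixed-max on accuracy at less than half the tokens.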

The catch lives in that last cell. A generous fixed budget is wasteful but safe; adaptive length is only as good as its sense of when it's done. Stop too early on a hard task and you get underthinking: a confident wrong answer that's worse than a slow right one. That's why the headline number is a coding model's: in software, a test or verifier can often tell the model whether it's actually done, giving the stop signal something concrete to lean on. The win is real and specific (fewer reasoning tokens for the same accuracy), and it rides entirely on getting that stop right.
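
As an illustration of a verifier acting as the stop signal, the sketch below keeps proposing candidate fixes until a test passes or a round limit is hit. Nothing here describes how MAI-Code-1-Flash works internally. `solve_with_verifier`, `propose`, and `verify` are hypothetical names, the candidate list stands in for a model's successive attempts, and `exec` is used only because the inputs are hard-coded toy strings.

```python
from typing import Callable, Optional, Tuple


def solve_with_verifier(
    propose: Callable[[int], str],
    verify: Callable[[str], bool],
    max_rounds: int,
) -> Tuple[Optional[str], int]:
    """Reason in rounds; stop as soon as the verifier accepts a candidate.

    Returns (accepted_candidate_or_None, rounds_used).
    """
    for round_no in range(1, max_rounds + 1):
        candidate = propose(round_no)
        if verify(candidate):
            return candidate, round_no
    return None, max_rounds


# Toy stand-in for a model: its third attempt is the correct fix.
CANDIDATES = [
    "def add(a, b): return a - b",
    "def add(a, b): return a * b",
    "def add(a, b): return a + b",
]


def propose(round_no: int) -> str:
    return CANDIDATES[min(round_no, len(CANDIDATES)) - 1]


def verify(source: str) -> bool:
    """Toy test suite: run the candidate and check two known cases."""
    namespace: dict = {}
    exec(source, namespace)  # acceptable only for these hard-coded strings
    add = namespace["add"]
    return add(2, 3) == 5 and add(-1, 1) == 0


fix, rounds_used = solve_with_verifier(propose, verify, max_rounds=5)
print(rounds_used, fix)
```

The loop stops after three rounds instead of running all five, which is the point: the test result, rather than a fixed budget, decides when the work is done.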


Related explainers

Compute Where It Counts — Per-token compute controller — the other axis of adaptive compute: how much work each token gets, vs how many tokens the chain runs
LongTraceRL — Rubric reward (process supervision) — how reasoning chains get trained, where a good stop signal would come from
Gemini 3.5 Flash — Agent-first model design — a related angle: building a model for the agent loop rather than retrofitting chat
