Gemini 3.5 Flash — Agent-first model design

[Figure: Same task, two model styles, for the user query "fetch alice's profile". Chat model + tool-use harness: 11 turns, 3 retries. Agent-first model: 4 turns, 0 retries. Legend: tool call, error, retry, done. Caption: per-task harness work; fewer turns ≈ fewer model dollars.]
learnaivisually.com/ai-explained/gemini-3-5-flash-agent-first-vs-chat-retrofit

The news. On May 25, 2026, Google DeepMind announced Gemini 3.5 Flash, the first model in the Gemini 3.5 series. The framing is explicitly agent-first: the blog pairs the model with the Antigravity harness for collaborative subagents and the Frontier Safety Framework for interpretability-based safety checks. Headline scores include Terminal-Bench 2.1 at 76.2%, MCP Atlas at 83.6%, and GDPval-AA at 1656 Elo, with a claim of roughly 4× the output tokens / second of competing frontier models. Architectural details — parameter count, context window, training recipe — are not disclosed.

Picture the metaphor for a moment. A tourist with a phrasebook can order coffee. They flip to the right page, read the phrase slowly, get the syllables roughly right. When the barista says something back, they look it up. When the response is unexpected — "what size?" — they fumble. They get there, but every interaction is a discrete look-up, and the cost of failure is a re-look-up. A native speaker doesn't translate; they hear the question and respond in the same beat, and when something unexpected lands they handle it without dropping the thread. That is the gap between a chat model with a tool-use harness and an agent-first model. The chat model speaks the language of tool calls one phrase at a time through a wrapper; the agent-first model speaks it natively because it learned in that environment.

What changes in training, at least in the version of the story the public framing implies, is the substrate. A chat-tuned base model has seen billions of words of human dialogue and next to no tool-call traces; function calling is typically taught at fine-tune time as a structured-output discipline. Agent-first models, by contrast, see tool-call traces (call, observation, next call, error, recovery, success) heavily in post-training, and in some cases earlier. The model's prior on "what happens next" after a 500 response is no longer "be a helpful chatbot"; it is "retry with backoff or pick a different tool." Google does not disclose Gemini 3.5 Flash's training recipe, so the loss-shaping argument here is an interpretation of the agent-first product positioning (Antigravity, MCP Atlas benchmarks, Frontier Safety Framework framing), not a quoted architectural claim.

The under-appreciated piece is the harness architecture on the other side. A chat-with-tools deployment needs an aggressive harness: JSON-schema validators that reject hallucinated function names, retry wrappers that catch tool errors and rephrase them as user messages, and a planner module that re-prompts on stuck loops. Each of those layers exists because the model itself does not natively know it is inside a loop; the harness has to keep telling it. As models move agent-first, harness mass tends to shift back into the model: fewer parsers, fewer retry wrappers, and simpler observability spans, because each turn is shorter and a lower per-turn error rate compounds less across the task. The harness becomes thin: it shuttles inputs and outputs; it does not police behaviour.
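To make the contrast concrete, here is a minimal sketch of the two harness styles. Everything in it is hypothetical and chosen for illustration: the `TOOLS` registry, the `fetch_user` tool, the JSON call format, and the scripted stand-in for a model. It is not the Antigravity harness or any vendor's API, only one way the "validate, rephrase the error, re-prompt" layer might look next to a pass-through dispatcher.

```python
import json
from typing import Callable

# Hypothetical tool registry; the tool name and signature are illustrative only.
TOOLS: dict[str, Callable[..., dict]] = {
    "fetch_user": lambda user_id: {"id": user_id, "name": "alice"},
}


def thin_dispatch(raw_call: str) -> dict:
    """Thin harness: parse the call, run the tool, return the observation."""
    call = json.loads(raw_call)
    return TOOLS[call["name"]](**call["arguments"])


def heavy_dispatch(model: Callable[[str], str], prompt: str,
                   max_retries: int = 3) -> tuple[dict, int]:
    """Heavy harness: validate every call and re-prompt the model on failure.

    Returns the tool observation and the number of model turns consumed.
    """
    for turn in range(1, max_retries + 2):
        raw_call = model(prompt)
        try:
            call = json.loads(raw_call)          # malformed JSON -> ValueError
            name = call["name"]                  # missing key -> KeyError
            if name not in TOOLS:
                raise ValueError(f"unknown tool {name!r}")
            return TOOLS[name](**call["arguments"]), turn  # bad args -> TypeError
        except (ValueError, KeyError, TypeError) as err:
            # Rephrase the failure as text for the model's next turn.
            prompt = (f"{prompt}\nPrevious call failed: {err}. "
                      f"Valid tools: {sorted(TOOLS)}")
    raise RuntimeError("model did not produce a valid tool call")


def scripted_model(replies: list[str]) -> Callable[[str], str]:
    """Stand-in for a model: returns canned replies in order."""
    it = iter(replies)
    return lambda prompt: next(it)


flaky = scripted_model([
    '{"name": "get_user", "arguments": {"user_id": 1}}',    # hallucinated name
    '{"name": "fetch_user", "arguments": {"id": 1}}',       # wrong argument name
    '{"name": "fetch_user", "arguments": {"user_id": 1}}',  # valid call
])
observation, turns = heavy_dispatch(flaky, "fetch alice's profile")
print(observation, turns)  # {'id': 1, 'name': 'alice'} 3
```

In this toy run the heavy harness spends three model turns on one tool call, while `thin_dispatch` would need a single turn if the first call were already valid. That is the shift in harness mass the paragraph describes.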

What "agent-first" actually changes — line by line

| Behaviour | Chat model + tool-use harness | Agent-first model |
| --- | --- | --- |
| Function name correctness | Hallucinated names appear; harness rejects them and re-prompts (setup-dependent, illustrative) | Function names are part of the training distribution: closer to the corpus, hallucinated less |
| Argument-shape correctness | JSON schema violations on first attempts; harness catches and retries (setup-dependent, illustrative) | Structured outputs are native; shape errors fall off (see structured outputs) |
| Tool-error recovery | Treats 500s as conversational surprise; may re-ask the user | Treats 500s as in-loop signal: backoff, alternative tool, fail-task |
| Multi-step planning horizon | Plans 1–3 turns ahead; long horizons drift | Trained against trajectories of 10+ tool calls; horizon stays coherent |
| Harness complexity | Heavy: parsers, validators, retry wrappers, planner modules | Thin: dispatch tool calls, format observations |
| Headline pitch | "Use this chat model for agents (with these wrappers)" | "This model is for agents", e.g. Gemini 3.5 Flash, Claude computer-use models |
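The "tool-error recovery" row names three moves: backoff, alternative tool, fail-task. As a rough illustration, that policy can be written out explicitly. This is a sketch under stated assumptions, not a description of how any model implements it internally: `ToolError`, `call_with_recovery`, and the two toy tools are hypothetical names, and the delays are arbitrary.

```python
import time
from typing import Callable


class ToolError(Exception):
    """Hypothetical error for a 5xx-style tool failure."""


def call_with_recovery(tools: list[Callable[[], dict]],
                       attempts_per_tool: int = 3,
                       base_delay: float = 0.5,
                       sleep: Callable[[float], None] = time.sleep) -> dict:
    """Try each tool in order; retry with exponential backoff before moving on."""
    for tool in tools:
        for attempt in range(attempts_per_tool):
            try:
                return tool()
            except ToolError:
                # Back off only if another attempt on this tool remains.
                if attempt < attempts_per_tool - 1:
                    sleep(base_delay * 2 ** attempt)
    # Every tool exhausted its attempts: fail the task instead of looping.
    raise RuntimeError("all tools failed; failing the task")


calls = {"primary": 0}


def primary() -> dict:
    calls["primary"] += 1
    raise ToolError("500 from primary")


def fallback() -> dict:
    return {"source": "fallback"}


delays: list[float] = []
# Record delays instead of sleeping so the example runs instantly.
result = call_with_recovery([primary, fallback], sleep=delays.append)
print(result, delays)  # {'source': 'fallback'} [0.5, 1.0]
```

In a chat-with-tools deployment, logic like this usually lives in the harness. The agent-first claim is that the model tends to choose the equivalent next action itself.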

Where the per-turn savings actually come from

A back-of-envelope walk-through (illustrative numbers; substitute your own task for a real plan). Suppose a task needs 4 distinct tool calls to complete: fetch a user, read a permission policy, write an audit log, return a response. A chat-with-tools model with a per-turn tool-call accuracy of ~85% on a complex schema will, if errors are independent across turns, succeed on the 4-step trajectory ~52% of the time on the first attempt (0.85⁴ ≈ 0.52). Every miss triggers a harness-level retry, which costs an extra 2–4 turns to get back on the rails. The expected turn count balloons to roughly 8–11 turns.
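The compounding step is easy to check directly. The sketch below assumes each tool call succeeds independently with the same probability, which is a simplification; the function name is ours, and 0.90 is added only as an intermediate point.

```python
def first_attempt_success(per_turn_accuracy: float, steps: int) -> float:
    """Chance that all `steps` tool calls succeed, assuming independent turns."""
    return per_turn_accuracy ** steps


for accuracy in (0.85, 0.90, 0.95):
    p = first_attempt_success(accuracy, steps=4)
    print(f"accuracy={accuracy:.2f}  clean 4-step run={p:.2f}  needs a retry={1 - p:.2f}")
# accuracy=0.85  clean 4-step run=0.52  needs a retry=0.48
# accuracy=0.90  clean 4-step run=0.66  needs a retry=0.34
# accuracy=0.95  clean 4-step run=0.81  needs a retry=0.19
```

Raising `steps` shows why long horizons are harder: the same per-turn accuracy yields a smaller chance of a clean run as the trajectory grows.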

Now the agent-first version. If post-training on tool-call traces lifts per-turn accuracy to ~95%, the 4-step success rate rises to ~81% (0.95⁴ ≈ 0.81). Expected turn count drops to roughly 5 turns, about 2× fewer turns per task. Combine that with Google's reported ~4× output tokens / second at the serving layer and the end-to-end wall-clock improvement is multiplicative (fewer turns and faster turns), even though each individual improvement is modest. That is the agentic-throughput story the Antigravity framing is pointing at, not raw single-shot benchmark wins.
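The "multiplicative" claim can be sketched with one extra knob. The function below is our own simplification, not a figure from Google: it assumes a fixed share of each baseline turn is token generation, which speeds up, and the rest is tool latency, which does not. The `generation_fraction` parameter and its 0.5 value are assumptions for illustration.

```python
def wall_clock_speedup(turn_ratio: float, token_speed_ratio: float,
                       generation_fraction: float = 1.0) -> float:
    """Estimate end-to-end speedup from fewer turns and faster generation.

    turn_ratio: baseline turns divided by new turns (e.g. 2.0).
    token_speed_ratio: new tokens/second divided by baseline (e.g. 4.0).
    generation_fraction: assumed share of baseline per-turn time spent
        generating tokens; the remainder is tool latency left unchanged.
    """
    if not 0.0 <= generation_fraction <= 1.0:
        raise ValueError("generation_fraction must be between 0 and 1")
    # New per-turn time, relative to a baseline turn of 1.0.
    per_turn_time = (1.0 - generation_fraction) + generation_fraction / token_speed_ratio
    return turn_ratio / per_turn_time


print(wall_clock_speedup(2.0, 4.0))       # 8.0
print(wall_clock_speedup(2.0, 4.0, 0.5))  # 3.2
```

The 8× figure is an upper bound that holds only when turns are dominated by generation time. If half of each turn is spent waiting on tools, the same inputs give about 3.2×.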

The catch, and the reason agent-first is not a free win: training on trajectories of 10+ tool calls is expensive. The traces need to be either synthesized in a closed-loop sandbox or harvested from a deployed harness, and either path adds infrastructure that pure chat post-training did not need. The serving cost story is also fragile: the ~4× tokens / second claim is not paired with public benchmark methodology in the Google blog, and the architecture that delivers it is not disclosed. Treat the throughput number as a directional headline rather than a guaranteed contract, in the same way that Jetson Thor's "7.5× compute" framing crossed precisions in the edge Blackwell explainer.


Frequently Asked Questions