Gemini 3.5 Flash — Agent-first model design
LLMThe news. On May 25, 2026, Google DeepMind announced Gemini 3.5 Flash, the first model in the Gemini 3.5 series. The framing is explicitly agent-first: the blog pairs the model with the Antigravity harness for collaborative subagents and the Frontier Safety Framework for interpretability-based safety checks. Headline scores include Terminal-Bench 2.1 at 76.2%, MCP Atlas at 83.6%, and GDPval-AA at 1656 Elo, with a claim of roughly 4× the output tokens / second of competing frontier models. Architectural details — parameter count, context window, training recipe — are not disclosed.
Picture the metaphor for a moment. A tourist with a phrasebook can order coffee. They flip to the right page, read the phrase slowly, get the syllables roughly right. When the barista says something back, they look it up. When the response is unexpected — "what size?" — they fumble. They get there, but every interaction is a discrete look-up, and the cost of failure is a re-look-up. A native speaker doesn't translate; they hear the question and respond in the same beat, and when something unexpected lands they handle it without dropping the thread. That is the gap between a chat model with a tool-use harness and an agent-first model. The chat model speaks the language of tool calls one phrase at a time through a wrapper; the agent-first model speaks it natively because it learned in that environment.
What changes in training — at least in the version of the story the public framing implies — is the substrate. A chat-tuned base model has seen billions of words of human dialogue and roughly nothing of tool-call traces; function-calling is typically taught at fine-tune time as a structured-output discipline. Agent-first models, by contrast, are characterized by tool-call traces — call, observation, next call, error, recovery, success — appearing in heavy post-training (and in some cases earlier). The model's prior on "what happens next" after a 500 response is no longer "be a helpful chatbot," it's "retry with backoff or pick a different tool." Google does not disclose Gemini 3.5 Flash's training recipe, so the loss-shaping argument here is an interpretation of the agent-first product positioning (Antigravity, MCP Atlas benchmarks, Frontier Safety Framework framing), not a quoted architectural claim.
The under-appreciated piece is the harness architecture on the other side. A chat-with-tools deployment needs an aggressive harness — JSON-schema validators that reject hallucinated function names, retry wrappers that catch tool errors and rephrase them as user messages, a planner module that re-prompts on stuck loops. Each of those layers exists because the model itself does not natively know it is inside a loop; the harness has to keep telling it. As models move agent-first, harness mass tends to shift back into the model: fewer parsers, fewer retry wrappers, simpler observability spans because each turn is shorter and the compounding error rate per turn is lower. The harness becomes thin — it shuttles inputs and outputs, it does not police behaviour.
What "agent-first" actually changes — line by line
| Behaviour | Chat model + tool-use harness | Agent-first model |
|---|---|---|
| Function name correctness | Hallucinated names appear; harness rejects them and re-prompts (setup-dependent, illustrative) | Function names are part of the training distribution — closer to the corpus, hallucinated less |
| Argument-shape correctness | JSON schema violations on first attempts — harness catches and retries (setup-dependent, illustrative) | Structured outputs are native; shape errors fall off — see structured outputs |
| Tool-error recovery | Treats 500s as conversational surprise; may re-ask the user | Treats 500s as in-loop signal: backoff, alternative tool, fail-task |
| Multi-step planning horizon | Plans 1–3 turns ahead; long horizons drift | Trained against trajectories of 10+ tool calls; horizon stays coherent |
| Harness complexity | Heavy: parsers, validators, retry wrappers, planner modules | Thin: dispatch tool calls, format observations |
| Headline pitch | "Use this chat model for agents (with these wrappers)" | "This model is for agents" — e.g. Gemini 3.5 Flash, Claude computer-use models |
Where the per-turn savings actually come from
A back-of-envelope walk-through (illustrative numbers; substitute your own task for a real plan). Suppose a task needs 4 distinct tool calls to complete — fetch a user, read a permission policy, write an audit log, return a response. A chat-with-tools model with a per-turn tool-call accuracy of ~85% on a complex schema will, by compounding, succeed on the 4-step trajectory ~52% of the time on the first attempt (0.85⁴ ≈ 0.52). Every miss triggers harness-level retry — an extra 2–4 turns to get back on the rails. The expected turn count balloons to roughly 8–11 turns.
Now the agent-first version. If post-training on tool-call traces lifts per-turn accuracy to ~95%, the 4-step success rate rises to ~81% (0.95⁴ ≈ 0.81). Expected turn count drops to roughly ~5 turns — about 2× fewer turns per task. Combine that with Google's reported ~4× output tokens / second at the serving layer and the end-to-end wall-clock improvement is multiplicative — fewer turns and faster turns — even though each individual improvement is modest. That is the agentic-throughput story the Antigravity framing is pointing at, not raw single-shot benchmark wins.
The catch, and the reason agent-first is not a free win: training trajectories of 10+ tool calls is expensive. The traces need to be either synthesized in a closed-loop sandbox or harvested from a deployed harness, and either path adds infrastructure that pure chat post-training did not need. The serving cost story is also fragile — the ~4× tokens / second claim is not paired with public benchmark methodology in the Google blog, and the architecture that delivers it is not disclosed. Treat the throughput number as a directional headline rather than a guaranteed contract, the same way Jetson Thor's "7.5× compute" framing crossed precisions in the edge Blackwell explainer.
Goes deeper in: AI Agents → The Agent Loop & State → Harness anatomy
Related explainers
- Tool-router contextual bandit — what the harness can still do for an agent-first model: choose the cheapest viable tool per turn, rather than burning the model's planning budget
- Pantheon-bench — HITL vs autonomous coding — the eval side of agent-first: trajectories matter more than single-turn scores
- MCP SEP-2663 — async task handles — what the transport layer looks like when the model on the other side actually expects to be in a loop