The news. On June 18, 2026, researchers posted S-Agent to arXiv: an LLM-agent framework for spatial intelligence over multi-view images and video. Instead of reasoning frame-by-frame, a vision-language model acts as a semantic planner that directs a hierarchy of spatial tools: it grounds objects in 2-D, lifts them into 3-D, and aggregates geometric evidence across frames. The tool hierarchy already improves multiple spatial benchmarks with no training; after fine-tuning on its own traces, S-Agent-8B rivals GPT-5.4 and Gemini 3 on spatial reasoning.

Picture a detective walking into a room they have never seen, handed a thick stack of photos — the same space shot from a dozen angles. The hopeless way to work is to riffle through the photos and try to picture the whole room: where the chair sits relative to the door, how far the table is from the window, which way the lamp is turned. Flat photos do not carry that, and the mental image slips with every page. What actually works is to build a small 3-D scale model on the table, and let each photo add one measurement to it — then answer every question by looking at the model, not by re-imagining it from the stack. That scale model is Scene Memory, the detective deciding what to measure next is the VLM planner, and "add one measurement per photo" is spatio-temporal evidence accumulation.

Underneath the metaphor, S-Agent is moving the 3-D scene out of the model's context window and into an explicit store. A plain VLM asked a spatial question sees only a sequence of flat frames and must re-build the geometry in its head on every step — exactly the frame-by-frame approach that loses track as the camera moves. S-Agent instead casts the VLM as a planner that directs a hierarchy of tools, the same orchestrator-and-workers shape the Agents track names: one tool grounds each object in 2-D, another lifts it into a 3-D position, another measures. Their outputs land in Scene Memory — the running 3-D model — while the planner's own reasoning lives in Agent Memory, keeping what the world looks like separate from what the agent has done.
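The split between the two memories can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the paper's implementation: the names `SceneMemory`, `AgentMemory`, `ground_2d`, `lift_3d`, `measure`, and `run_plan` are hypothetical, the tools are stand-ins (a lookup table for grounding, a pinhole back-projection with made-up intrinsics for lifting), and a fixed plan takes the place of the VLM planner.

```python
import math
from dataclasses import dataclass, field

Point3D = tuple[float, float, float]


@dataclass
class SceneMemory:
    """What the world looks like: object label -> estimated 3-D position."""
    objects: dict[str, Point3D] = field(default_factory=dict)


@dataclass
class AgentMemory:
    """What the agent has done: an ordered log of tool calls."""
    steps: list[str] = field(default_factory=list)


def ground_2d(frame: dict, label: str) -> tuple[float, float]:
    """Stand-in grounding tool: return the pixel centre recorded for a label."""
    return frame["pixels"][label]


def lift_3d(pixel, depth, focal=500.0, centre=(320.0, 240.0)) -> Point3D:
    """Stand-in lifting tool: pinhole back-projection with assumed intrinsics."""
    u, v = pixel
    return ((u - centre[0]) * depth / focal, (v - centre[1]) * depth / focal, depth)


def measure(a: Point3D, b: Point3D) -> float:
    """Stand-in measuring tool: Euclidean distance between two 3-D points."""
    return math.dist(a, b)


def run_plan(frame, labels, scene, agent):
    """Fixed plan standing in for the VLM planner: ground, lift, store, measure."""
    for label in labels:
        pixel = ground_2d(frame, label)
        agent.steps.append(f"ground_2d({label})")
        scene.objects[label] = lift_3d(pixel, frame["depth"][label])
        agent.steps.append(f"lift_3d({label})")
    first, second = labels
    agent.steps.append(f"measure({first}, {second})")
    return measure(scene.objects[first], scene.objects[second])


# Hypothetical single frame: pixel centres and depths in metres.
frame = {
    "pixels": {"chair": (320.0, 240.0), "table": (570.0, 240.0)},
    "depth": {"chair": 2.0, "table": 2.0},
}
scene, agent = SceneMemory(), AgentMemory()
distance = run_plan(frame, ["chair", "table"], scene, agent)
print(distance)       # 1.0
print(scene.objects)  # geometry only
print(agent.steps)    # actions only
```

The point of the sketch is the separation: the tools write geometry into `scene`, the plan writes its trace into `agent`, and the final answer is read off the stored 3-D positions rather than re-derived from the frame.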

Because the geometry now accumulates in a store the tools update, the same loop runs training-free: it changes how the agent acts, not the weights. The contrast with a code-as-action spatial agent such as SpatialClaw is instructive: both move beyond asking one VLM to answer directly from the frames, but where that agent writes executable code as its action, S-Agent routes the work to typed spatial experts and a shared 3-D memory.

| Approach | Where the 3-D scene lives | Spatial-reasoning result |
| --- | --- | --- |
| Large VLM answering directly (e.g. GPT-5.4) | re-derived in the model's context, every frame | strong, but heavyweight |
| Code-as-action agent (SpatialClaw) | a stateful kernel the agent writes code against | +11.2 pts to 59.9% across 20 benchmarks |
| S-Agent (planner + spatial tools + Scene Memory) | an explicit 3-D model the tools refine | 8B rivals GPT-5.4 & Gemini 3 |

Where it earns its keep

Picture a four-frame clip of a kitchen and the question "how many chairs?" A frame-by-frame counter tallies sightings — say it spots 3, then 3, then 2, then 3 chairs, and with no shared model it has no way to know which sightings are the same chair seen again, so it can drift toward 11. S-Agent places each detected chair at a 3-D coordinate in Scene Memory, so re-sightings from new angles collapse onto the same point — and the count resolves to 4. (The four-frame count is illustrative; only the 8-billion-parameter scale, the training-free gains, and the GPT-5.4 / Gemini 3 parity come from the paper.) That is the whole bet of accumulating evidence into one 3-D store rather than re-reasoning each flat frame: the geometry stops slipping, and an 8B agent reaches the neighborhood of frontier models built at far larger scale.
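The chair count can be worked through as a toy example. This is a minimal sketch under assumptions, not S-Agent's actual merging logic: the coordinates are invented, the 0.3 m merge radius is an arbitrary choice, and `naive_count` and `scene_memory_count` are hypothetical names. Each frame lists the 3-D positions of the chairs detected in it (3, 3, 2, and 3 sightings), already lifted into a shared world frame.

```python
import math

MERGE_RADIUS_M = 0.3  # assumed: sightings closer than this are the same chair


def naive_count(frames):
    """Frame-by-frame tally: every sighting counts as a new chair."""
    return sum(len(detections) for detections in frames)


def scene_memory_count(frames, radius=MERGE_RADIUS_M):
    """Merge each sighting into the first stored chair within `radius`."""
    memory = []  # each entry: [mean_position, number_of_sightings]
    for detections in frames:
        for point in detections:
            match = next(
                (entry for entry in memory if math.dist(entry[0], point) <= radius),
                None,
            )
            if match is None:
                memory.append([point, 1])  # first sighting of a new chair
            else:
                mean, n = match
                # Running mean refines the stored position with each re-sighting.
                match[0] = tuple((m * n + p) / (n + 1) for m, p in zip(mean, point))
                match[1] = n + 1
    return len(memory)


# Invented data: four chairs near (0,0,0), (1.5,0,0), (0,0,1.5), (1.5,0,1.5),
# each seen with a few centimetres of noise from different angles.
frames = [
    [(0.02, 0.0, 0.01), (1.48, 0.0, 0.03), (0.01, 0.0, 1.52)],
    [(1.51, 0.0, -0.02), (-0.03, 0.0, 1.49), (1.47, 0.0, 1.50)],
    [(-0.01, 0.0, 0.02), (1.52, 0.0, 1.48)],
    [(0.00, 0.0, -0.02), (1.50, 0.0, 0.01), (1.49, 0.0, 1.51)],
]

print(naive_count(frames))         # 11
print(scene_memory_count(frames))  # 4
```

The merge radius is the design choice that matters here: too small and noisy re-sightings split into duplicate chairs, too large and neighbouring chairs collapse into one.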
