The news. In June 2026, researchers posted SpatialClaw to arXiv. It gives a VLM-backed agent a code action interface for open-ended 3D/4D spatial reasoning: a persistent Python kernel is pre-loaded with the input frames plus perception and geometry primitives, and the agent writes one executable cell at a time, conditioning each on all prior text and visual outputs. Across 20 spatial-reasoning benchmarks it averages 59.9%, a gain of 11.2 points over the prior spatial agent, with consistent gains across six VLM backbones and no training.

Picture a surveyor sent to measure a site they have never seen. Hand them a fixed checklist (record the width, the height, the door count) and they can only report what the form asks; an angle the form forgot is simply unmeasurable. Hand them a blank survey form to fill in up front and they must guess the whole sequence of measurements before taking a single one. What actually works is the surveyor at the site: take one reading, look at it, and let it decide the next, with every measurement written into a notebook the next reading can build on. That notebook is the agent's stateful kernel, each reading is one cell of code, and "look at it, then decide" is the observe-then-act loop.

Underneath the metaphor, SpatialClaw is making a claim about an agent's action interface — the how-it-acts, not the what-model. The dominant interface today is structured tool-calls: the agent chooses a named function from a fixed schema and fills in typed arguments. It is dependable, but a fixed schema can only ever compose the operations it already lists, and open-ended spatial questions rarely decompose into the schema the designer guessed in advance. The other option, single-pass code, is more expressive but commits the entire script before any result comes back — one wrong assumption early and the whole analysis is wasted.
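To make the schema limit concrete, here is a minimal sketch of a structured tool-call interface. The tool names, the `call_tool` helper, and the box format are all hypothetical and are not taken from the paper; the point is only that a request outside the registry fails, while free-form code can compose the same primitives directly.

```python
# Hypothetical fixed schema: the only operations the agent may name.
TOOLS = {
    "width": lambda box: box["x_max"] - box["x_min"],
    "height": lambda box: box["y_max"] - box["y_min"],
}


def call_tool(name, **arguments):
    """Dispatch a structured tool call; anything outside the schema is rejected."""
    if name not in TOOLS:
        raise KeyError(f"tool {name!r} is not in the schema")
    return TOOLS[name](**arguments)


box = {"x_min": 1.0, "x_max": 4.0, "y_min": 0.0, "y_max": 2.0}

# Operations the schema lists work as expected.
width = call_tool("width", box=box)
height = call_tool("height", box=box)

# An operation the designer did not anticipate cannot be requested at all.
try:
    call_tool("aspect_ratio", box=box)
    rejected = False
except KeyError:
    rejected = True

# Written as code, the same quantity is a one-line composition of existing results.
aspect_ratio = width / height
```

The schema is not wrong here, only closed: every new question that needs a new combination needs a new tool entry first.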

Code-as-action keeps the expressiveness of code but adds the one thing the loop needs: feedback. The agent emits a single Python cell against a stateful kernel, runs it, reads the intermediate text and image it produced, and only then writes the next cell — so it can detect that a depth estimate looks off and recompute, or chain a detection into a measurement into a comparison without any of those being a predefined tool. Because the change lives entirely in how the agent acts rather than in the weights, it is training-free and transfers across backbones; the same loop that a fixed schema would have flattened into one blind shot becomes a stepwise plan the agent revises as it goes.
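The loop described above can be sketched in a few lines. This is a toy under stated assumptions, not SpatialClaw's implementation: `StatefulKernel`, `scripted_policy`, and the fake `estimate_depth` primitive are hypothetical stand-ins, a scripted function plays the role of the VLM, and only text output is observed (no images).

```python
import contextlib
import io


class StatefulKernel:
    """Toy persistent kernel: variables defined in one cell survive into the next."""

    def __init__(self, preloaded=None):
        self.namespace = dict(preloaded or {})

    def run_cell(self, source):
        """Execute one cell and return whatever it printed as the observation."""
        buffer = io.StringIO()
        try:
            with contextlib.redirect_stdout(buffer):
                exec(source, self.namespace)
        except Exception as exc:
            # Errors are observations too, so the next cell can react to them.
            return buffer.getvalue() + f"{type(exc).__name__}: {exc}\n"
        return buffer.getvalue()


def scripted_policy(history):
    """Hypothetical stand-in for the VLM: picks the next cell from prior outputs."""
    if not history:
        return "depth = estimate_depth(frames[0])\nprint(depth)"
    last_output = history[-1][1]
    if len(history) == 1 and float(last_output) < 0:
        # The observed depth is implausible, so emit a corrective cell.
        return "depth = abs(depth)\nprint(depth)"
    return None  # nothing left to do


def run_agent(kernel, policy, max_steps=5):
    """Observe-then-act: one cell per step, each conditioned on all earlier results."""
    history = []
    for _ in range(max_steps):
        cell = policy(history)
        if cell is None:
            break
        history.append((cell, kernel.run_cell(cell)))
    return history


# Pre-load the kernel with fake frames and a fake primitive that returns a bad value.
kernel = StatefulKernel({"frames": ["frame0"], "estimate_depth": lambda frame: -2.5})
history = run_agent(kernel, scripted_policy)
```

The second cell exists only because the first one's output was read before it was written, which is the difference from a single-pass script.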

Action interface | How the agent acts | Can it adapt mid-task?
Structured tool-calls | pick a named function from a fixed schema | only within the preset operations
Single-pass code | write one whole script, run it once | no, it commits before any result
Code-as-action (SpatialClaw) | one executable cell per step on a stateful kernel | yes, it observes each result, then writes the next

Where it earns its keep

Picture one illustrative spatial question that needs three operations composed in order: detect the two objects, measure the distance between them, then compare that distance to a reference. With structured tool-calls, those three only compose if the schema happens to expose a detect → measure → compare chain; miss one and the run dead-ends. Single-pass code can write all three at once, but if the detect step misreads an object, which the agent can't know until the code runs, the measure and compare steps built on top inherit the error and the final answer is wrong. Code-as-action runs detect, looks at the result, fixes it if the boxes are off, and only then measures and compares: three steps, each conditioned on the last. (The three-step example is illustrative; only the 59.9% average, +11.2-point gain, 20 benchmarks, and six backbones come from the paper.) The reported gain of 11.2 points, to a 59.9% average, holding across six different VLMs with no training, is what that per-step feedback buys.
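The illustrative detect → measure → compare chain can be played out on invented numbers. Everything in this sketch is hypothetical (the candidate positions, confidences, thresholds, and function names); it only shows how a misread detection propagates in a single pass and how one observation step between cells stops it.

```python
import math

# Toy scene: (3D centre in metres, confidence) per candidate. All values are invented.
CANDIDATES = {
    "chair": [((0.0, 0.0, 2.0), 0.9)],
    "table": [((9.0, 9.0, 9.0), 0.4), ((3.0, 4.0, 2.0), 0.8)],
}
REFERENCE_DISTANCE = 6.0


def detect(label, min_confidence):
    """Return the first candidate at or above min_confidence."""
    for centre, confidence in CANDIDATES[label]:
        if confidence >= min_confidence:
            return centre, confidence
    raise LookupError(f"no {label} at confidence >= {min_confidence}")


def measure(a, b):
    """Euclidean distance between two 3D points."""
    return math.dist(a, b)


def compare(distance, reference):
    return "closer" if distance < reference else "farther"


def single_pass():
    """Whole script committed up front: the weak table detection is never inspected."""
    chair, _ = detect("chair", 0.3)
    table, _ = detect("table", 0.3)
    return compare(measure(chair, table), REFERENCE_DISTANCE)


def stepwise():
    """Same operations, but the detection result is observed before measuring."""
    chair, _ = detect("chair", 0.3)
    table, table_confidence = detect("table", 0.3)
    if table_confidence < 0.5:
        # Observation: the match is weak, so redo detection with a stricter threshold.
        table, table_confidence = detect("table", 0.5)
    return compare(measure(chair, table), REFERENCE_DISTANCE)


single_pass_answer = single_pass()
stepwise_answer = stepwise()
```

In a real agent the "is this detection plausible?" check would be the model reading the cell's output rather than a hard-coded threshold; the threshold is here only so the sketch runs on its own.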
