Latent Space

Databricks' Zaharia and Xin on why data, not the model, is the moat

Matei Zaharia and Reynold Xin· Co-founders, Databricks at Databricks
·~70 min·English·Latent Space
AgentsAI InfrastructureOpen SourceBusiness Strategy
TL;DR

Databricks' co-founders on their one bet — that owning the data and the agent substrate, not the frontier model, is the durable moat in the AI era.

01Core Thesis

"Slap Some AGI on Top" — But Data First

Databricks' whole bet is that once your data sits in the right place, today's generic agents are already good enough to rewrite most software on top of it.

I think many of the traditional software will be sort of rewritten with this new paradigm which is just get the data to be there and then let's slap some AGI on top. Magic will come out.

Reynold Xin, Latent Space
Key Insight
The provocation hides the real claim: the model is now the commodity and the data plumbing is the moat. If intelligence is something you 'slap on top,' then whoever owns the substrate underneath captures the value — which is exactly the layer Databricks sells.

02Agent Infrastructure

Agent Cloud: One API Over Every Harness

Omnigent maps Claude Code, Codex, Cursor and raw SDKs to a single session interface, so teams can swap models and harnesses without rewriting their tooling.

Claude code, running in a terminal, Codex, you know, Phi, OpenAI SDK, all that stuff. We map them all to that same interface.

Matei Zaharia, Latent Space
Key Insight
This is the operating-system move applied to agents: standardize the syscall so the churn above it (a new model every few months) stops breaking everything below it. Whoever owns that interface owns where security, collaboration and billing get enforced.

03Origin Story

The Dark Ages of Programming

The push for persistent cloud sandboxes came from a co-founder literally checking his coding-agent sessions at red lights because closing his laptop would kill them.

I was tethering my laptop to my phone. Keeping on the side, whenever I hit a red light, I started looking at what's going on on my laptop. And I just felt that was ridiculous.

Reynold Xin, Latent Space
Key Insight
The tell is that agents got smart faster than the runtime around them grew up. A model that can code for an hour is useless if it dies when you shut your laptop — so the next frontier isn't a better model, it's boring infrastructure: persistence, sharing, and access control.

04Agent Security

Judge the Session, Not the Call

Blunt allow/deny rules can't catch an agent that reads secrets and then exfiltrates them, so Databricks tracks per-session state and raises the bar as risk accumulates.

the thing we decided we need is stateful or what we call contextual policies, where you keep track of the state of that session.

Matei Zaharia, Latent Space
Key Insight
This is the lethal-trifecta problem — access to private data, exposure to untrusted content, and a way to exfiltrate — made operational. No single permission is dangerous; the danger is the sequence, so the policy engine has to be stateful or it is theater. The same session state also caps spend: a runaway debug that burned $500 becomes a $5 sub-agent budget.

05Data Architecture

L-TAP: Unify Storage, Not the Query Engine

Instead of the impossible single OLTP-plus-OLAP engine, Databricks shares one open storage layer and transcodes row data into Parquet on idle CPUs, so analytics reads it with no pipeline.

We think you can get 99% of what you need by unifying the storage. And just have a single storage layer.

Reynold Xin, Latent Space
Key Insight
The insight is knowing which layer to merge. Chasing one engine for both workloads (true HTAP) has failed for decades because throughput and latency pull against each other. Merging only the storage kills the CDC pipeline — the 'continuous data corruption' that wakes engineers at 3am when a schema change breaks it — while leaving each query engine specialized. You get most of the dream and skip the compromise.

06Database Engineering

A Factory That Builds the Database

Rather than hand-picking algorithms from papers, Databricks trained a model on a decade of query traces that predicts how any plan will perform and routes to the best one at runtime.

the factory takes the decade of traces we have. I think they count as a quadrillion data points in the trace table.

Reynold Xin, Latent Space
Key Insight
Note it is deliberately not an LLM — it's a classical machine-learning model over structured traces, because the task is precise cost prediction, not language. This is the data-moat thesis turned inward: the decade of traces they already own becomes the training set no new entrant can match, so the moat compounds.

07Model Strategy

Specialize, Don't Chase the Frontier

After buying Mosaic, Databricks chose not to train frontier models and instead ships small specialized ones — a document parser roughly 100x cheaper than frontier models, and better.

the same model generates training environments and trains itself and beats like Opus and GPT 5.5 and stuff at a task.

Matei Zaharia, Latent Space
Key Insight
The frontier is a scale race you win by burning compute; specialization is a data-and-workflow race you win by owning the task. As base models get smarter they generate better training traces, so the cost of a purpose-built model keeps dropping — pushing customization from 'needs a hardcore researcher' toward 'anyone can describe a task.'

08Competitive Strategy

Start Open, Start Large, Start Upstream

Databricks beat Snowflake by starting with open formats and bulk ingestion upstream, then flowing downstream into fast serving — the easier direction to travel.

start open and start large. Like in some sense, we started upstream of them.

Reynold Xin, Latent Space
Key Insight
Direction of travel is destiny. Going from cheap, open, large-scale batch down to fast, curated serving is a matter of optimization; going the other way means re-architecting for openness and scale you never designed for. Openness wasn't a values flourish — it was the choice that let the water flow their way.