What is tabular in-context learning?

It is predicting a value in a spreadsheet by reading the whole table (past labeled rows plus the new row) as a single prompt, with no training on that table. Google's TabFM (announced June 30, 2026) works this way: it takes the entire dataset 'as a single unified prompt' and produces a zero-shot prediction in one forward pass, much as a language model answers better when you paste a few examples into its context.

How does TabFM handle a table with no natural order?

A table is two-dimensional and orderless, so TabFM uses alternating row and column attention. Row attention compares the target row against the historical example rows; column attention models how features relate. It also compresses each row into a dense vector and runs the main Transformer over that short sequence of row-summaries rather than over every cell, which sharply reduces compute versus attending over the raw grid.

How can TabFM be accurate on a dataset it never trained on?

It never trained on your data, but it pretrained on the shape of data: hundreds of millions of synthetic tables generated by structural causal models — small cause-and-effect recipes. That teaches the general pattern of 'features predict a target,' so a new spreadsheet is just another instance of something it has seen millions of times. On the TabArena benchmark (38 classification and 13 regression datasets, 700 to 150,000 samples), this lets a single forward pass compete without any gradient step on the target table.

Google releases TabFM for zero-shot tabular prediction — Tabular in-context learning

Jargon

Foundation model: One pretrained model reused across many tasks with no task-specific retraining. Common for text and images; TabFM brings the idea to tables.
In-context learning (ICL): Learning from examples placed in the prompt at inference time, with no weight updates. The model conditions on the examples the way an LLM conditions on a few worked examples you paste in.
Zero-shot: Predicting on a dataset the model never saw during training. TabFM is zero-shot per table: a brand-new spreadsheet needs no fitting step.
Row-and-column attention: Attending along rows (the examples) and along columns (the features) in alternating passes, so a table with no natural order is modeled without forcing a false one. Also called axial attention.
Row compression: Squeezing an entire row into a single dense vector before the main Transformer runs — so it reasons over one token per row instead of one per cell, which cuts the compute sharply.
Structural causal model (SCM): A recipe of cause → effect equations used to generate fake-but-realistic tables. TabFM pretrains on hundreds of millions of these synthetic datasets.
TabArena: A benchmark suite of real tabular datasets used to rank tabular models across classification and regression.

The news. On June 30, 2026, Google Research introduced TabFM, a foundation model for tabular classification and regression that works on previously unseen tables in a single forward pass. It ships on Hugging Face and GitHub, and Google says it will land in BigQuery behind an AI.PREDICT SQL command. It was measured on TabArena across dataset sizes from 700 to 150,000 samples.

Picture a fill-in doctor who starts a shift by reading the ward's whole patient log. Each past row is a solved case — symptoms in one set of columns, the confirmed diagnosis in another. At the bottom sits a new patient with the diagnosis column blank. The doctor does not go back to medical school for this ward; they predict the blank by reading the log they were just handed. That is the whole idea behind TabFM: as Google puts it, the model "takes the entire dataset — comprising both the historical training examples and the target testing rows — as a single unified prompt." This is in-context learning — the same trick where a language model answers better after you paste a few examples into the prompt — applied to a spreadsheet.

That reframing matters because it flips the usual tabular workflow. Normally a new table means a new model: you fit an XGBoost model or train a small neural net on that specific spreadsheet, then throw it away for the next one. TabFM does no per-table training at all — one pretrained model reads the table and predicts, so the "training run" disappears into a single forward pass.
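To make "the table is the prompt" concrete, here is a minimal sketch of an in-context predictor. It is a toy stand-in, not TabFM and not its API: the function name `in_context_predict`, the distance-based scoring, and the ward data are all assumptions made for illustration. The point it shows is only that nothing is fitted ahead of time; the labeled rows are read at prediction time and a softmax over row similarities decides the answer.

```python
import numpy as np

def in_context_predict(context_x, context_y, query_x, temperature=1.0):
    """Toy in-context classifier (hypothetical, NOT TabFM).

    No weights are trained: the labeled context rows are only read
    when a prediction is requested.
    """
    # Standardize with statistics of the context rows so no column dominates.
    mean = context_x.mean(axis=0)
    std = context_x.std(axis=0) + 1e-8
    cx = (context_x - mean) / std
    qx = (query_x - mean) / std

    # Score every query row against every context row (negative squared distance).
    diff = qx[:, None, :] - cx[None, :, :]
    scores = -np.sum(diff ** 2, axis=-1) / temperature

    # Softmax over context rows: attention-style weights that sum to 1 per query.
    scores = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)

    # Weighted vote over the context labels.
    classes = np.unique(context_y)
    one_hot = (context_y[:, None] == classes[None, :]).astype(float)
    probs = weights @ one_hot
    return classes[np.argmax(probs, axis=1)], probs

# Hypothetical ward log: temperature and heart rate, diagnosis 0 or 1.
context_x = np.array([[36.6, 70.0], [36.8, 72.0], [39.5, 110.0], [39.9, 115.0]])
context_y = np.array([0, 0, 1, 1])
query_x = np.array([[39.7, 112.0], [36.7, 71.0]])  # rows with a blank diagnosis

labels, probs = in_context_predict(context_x, context_y, query_x)
print(labels)
```

Handing the same function a different table requires no retraining step, which is the workflow change described above. A pretrained model like TabFM replaces the hand-written similarity rule with learned attention.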

Aspect	Train a model per table (XGBoost, bespoke net)	TabFM — in-context [Google]
A new table	fit a fresh model — its own training run	one forward pass, no training
Learns from	gradient updates on that one table	the table itself, read as a prompt
Reuse	one model serves one table	one model, previously unseen tables

But a table is not a sentence, and that raises a real question: in what order do you read it? A log has no natural first or last patient, and two directions carry meaning at once: rows are examples, columns are features. So TabFM attends both ways in alternating passes: row attention compares the new case against similar past ones (which patients look like this one?), and column attention weighs which fields move together (which vitals track the diagnosis?). Both passes are ordinary self-attention, the mechanism that lets a token look across a whole sequence; here the "sequence" is a set of examples or a set of features rather than left-to-right words.
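The alternating passes can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not TabFM's architecture: it uses single-head attention with no learned weights, and the names `attend_across_features`, `attend_across_examples`, and `axial_block` are hypothetical. Each cell is assumed to already be embedded as a small vector.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x):
    """Single-head self-attention over the second-to-last axis (no learned weights)."""
    d = x.shape[-1]
    scores = x @ np.swapaxes(x, -1, -2) / np.sqrt(d)
    return softmax(scores, axis=-1) @ x

def attend_across_features(cells):
    # "Column attention" in the article's terms: inside each row,
    # the cells of different features attend to one another.
    return cells + self_attention(cells)

def attend_across_examples(cells):
    # "Row attention" in the article's terms: inside each column,
    # the cells of different example rows attend to one another.
    t = np.swapaxes(cells, 0, 1)          # (n_cols, n_rows, d)
    t = t + self_attention(t)
    return np.swapaxes(t, 0, 1)           # back to (n_rows, n_cols, d)

def axial_block(cells):
    """One alternating pass over a table of shape (n_rows, n_cols, d)."""
    return attend_across_examples(attend_across_features(cells))

rng = np.random.default_rng(0)
cells = rng.normal(size=(5, 3, 4))        # 5 rows, 3 features, 4-dim cell embeddings
out = axial_block(cells)

# Shuffling the rows of the input just shuffles the rows of the output.
perm = rng.permutation(5)
print(np.allclose(axial_block(cells[perm]), out[perm]))
```

Because no positional information is added, reordering the rows only reorders the output. That is the property a table with no natural order needs.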

The catch is compute. If you flatten every cell into one long prompt, attention has to compare every cell with every other, and that bill explodes. TabFM's fix is row compression: it first boils each row down to a single dense vector, then runs the main Transformer over that short sequence of row-summaries instead of the full grid.
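A rough sketch of the shape change that row compression buys is below. The source does not say how TabFM computes its row vectors, so the projection here is an assumption: a random matrix `w` stands in for whatever learned compressor the model uses, and `compress_rows` is a hypothetical name.

```python
import numpy as np

rng = np.random.default_rng(0)
n_rows, n_cols, d_cell, d_row = 200, 20, 8, 32

# Assume each cell has already been embedded as a d_cell-dimensional vector.
cells = rng.normal(size=(n_rows, n_cols, d_cell))

# Random stand-in for a learned projection (an assumption, not TabFM's weights).
w = rng.normal(size=(n_cols * d_cell, d_row)) / np.sqrt(n_cols * d_cell)

def compress_rows(cells, w):
    """Flatten each row's cells and project them to one dense vector per row."""
    return cells.reshape(cells.shape[0], -1) @ w

row_vectors = compress_rows(cells, w)                 # one vector per row
scores = row_vectors @ row_vectors.T / np.sqrt(d_row)  # row-to-row attention scores

print(row_vectors.shape, scores.shape)
```

The attention score matrix is now rows by rows rather than cells by cells, which is where the saving comes from.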

Watch the numbers on a modest table: 1,000 example rows and 100 columns (illustrative). Flatten it and you get 1,000 × 100 = 100,000 cells in one prompt; attention scales with the square, so that is roughly 100,000² = 10 billion pairwise comparisons. Compress each row to one vector first and the expensive sequence the Transformer reasons over is just the 1,000 rows — about 1,000² = 1 million comparisons, a ~10,000× cut at that stage (illustrative). Same table, a fraction of the compute — which is what makes "read the whole dataset as a prompt" affordable in the first place.
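The arithmetic above is easy to check, and writing it out shows a useful pattern: under the same simplification (attention cost taken as sequence length squared), the saving at this stage is the number of columns squared. The helper name `pairwise_comparisons` is just for illustration.

```python
def pairwise_comparisons(sequence_length):
    """Attention compares every position with every other: length squared."""
    return sequence_length ** 2

n_rows, n_cols = 1_000, 100

flat = pairwise_comparisons(n_rows * n_cols)   # every cell is a token
compressed = pairwise_comparisons(n_rows)      # every row is a token
ratio = flat // compressed

print(f"{flat:,} vs {compressed:,} comparisons, {ratio:,}x fewer")
```

Since (rows × columns)² divided by rows² is columns², a wider table benefits more from compression than a longer one.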

One more piece explains how a model can be good at a table it has never seen: it never trained on your data, but it trained on the shape of data. TabFM pretrains on hundreds of millions of synthetic tables generated by structural causal models — little cause-and-effect recipes that spin up realistic feature relationships. Meeting millions of these, the model learns the general grammar of "features predict a target," which is why a fresh spreadsheet is just one more example of a pattern it already understands. On the TabArena benchmark — 38 classification and 13 regression datasets, sizes from 700 to 150,000 samples, with a 32-way ensemble for its strongest configuration — that pretrained grammar is what lets a single forward pass compete without a single gradient step on the target table.
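For a feel of what a "cause-and-effect recipe" looks like, here is a minimal sketch of sampling one synthetic table from a random structural causal model. The generator TabFM actually uses is not described in the source; the random graph, the `tanh` link, the Gaussian noise, and the name `sample_scm_table` are all assumptions chosen for illustration.

```python
import numpy as np

def sample_scm_table(n_rows, n_features, rng):
    """Sample one synthetic classification table from a random SCM (illustrative)."""
    n_nodes = n_features + 1  # the last node becomes the target

    # Random cause -> effect graph: node j may depend only on nodes i < j,
    # which guarantees there are no cycles.
    weights = rng.normal(size=(n_nodes, n_nodes))
    edges = rng.random((n_nodes, n_nodes)) < 0.5
    weights = np.triu(weights * edges, k=1)

    values = np.zeros((n_rows, n_nodes))
    for j in range(n_nodes):
        # Each variable is a nonlinear function of its causes plus its own noise.
        parents = values[:, :j] @ weights[:j, j]
        values[:, j] = np.tanh(parents) + rng.normal(size=n_rows)

    features = values[:, :-1]
    # Binarize the last node at its median to get a balanced class label.
    target = (values[:, -1] > np.median(values[:, -1])).astype(int)
    return features, target

# Each seed yields a different recipe, hence a different table.
tables = [sample_scm_table(200, 5, np.random.default_rng(seed)) for seed in range(3)]
features, target = tables[0]
print(features.shape, target[:10])
```

Pretraining repeats this idea at a very large scale, drawing a fresh recipe for every table so the model sees many different ways that features can determine a target.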


Related explainers

ViQ — text-aligned visual tokens — the same move for images: reshape a non-text modality into the token/prompt paradigm a Transformer already understands. TabFM does it for tables.
MiniMax M3 — MSA block-sparse attention — another rethink of what attention runs over; TabFM's row-and-column attention is a different answer to "attention across which axis?".
Gemma 4 12B — encoder-free multimodal — folding a new modality straight into one model, no bolted-on encoder — the same spirit as reading a whole table with one pretrained network.

