The news. On June 30, 2026, Google Research introduced TabFM, a foundation model for tabular classification and regression that handles previously unseen tables in a single forward pass. It ships on Hugging Face and GitHub, and Google says it will land in BigQuery behind an AI.PREDICT SQL command. Google evaluated it on TabArena, across dataset sizes from 700 to 150,000 samples.
Picture a fill-in doctor who starts a shift by reading the ward's whole patient log. Each past row is a solved case — symptoms in one set of columns, the confirmed diagnosis in another. At the bottom sits a new patient with the diagnosis column blank. The doctor does not go back to medical school for this ward; they predict the blank by reading the log they were just handed. That is the whole idea behind TabFM: as Google puts it, the model "takes the entire dataset — comprising both the historical training examples and the target testing rows — as a single unified prompt." This is in-context learning — the same trick where a language model answers better after you paste a few examples into the prompt — applied to a spreadsheet.
That reframing matters because it flips the usual tabular workflow. Normally a new table means a new model: you fit an XGBoost model or train a small neural net on that specific spreadsheet, then throw it away for the next one. TabFM does no per-table training at all — one pretrained model reads the table and predicts, so the "training run" disappears into a single forward pass.
| Aspect | Train a model per table (XGBoost, bespoke net) | TabFM — in-context [Google] |
|---|---|---|
| A new table | fit a fresh model — its own training run | one forward pass, no training |
| Learns from | gradient updates on that one table | the table itself, read as a prompt |
| Reuse | one model serves one table | one model, previously unseen tables |
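To make the contrast in the table concrete, here is a minimal sketch of what "no per-table training" looks like as an interface. It is not TabFM: the function `predict_in_context`, its `temperature` parameter, and the ward-log values are all hypothetical, and the softmax-weighted vote is a simple stand-in for what a pretrained Transformer would do. The point is only that nothing is fitted or stored between tables.

```python
import math


def predict_in_context(context_rows, context_labels, query_row, temperature=1.0):
    """Predict a label for query_row using only the rows passed in as context.

    Toy stand-in for in-context learning (an assumption, not TabFM's method):
    each context row gets a softmax weight from its negative squared distance
    to the query, and the labels vote with those weights.
    """
    # Similarity score per context row: closer rows score higher.
    scores = [
        -sum((a - b) ** 2 for a, b in zip(row, query_row)) / temperature
        for row in context_rows
    ]
    # Numerically stable softmax over the context rows.
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    # Weighted vote per class label.
    votes = {}
    for w, label in zip(weights, context_labels):
        votes[label] = votes.get(label, 0.0) + w / total
    return max(votes, key=votes.get), votes


# Hypothetical ward log: (body temperature in C, heart rate) -> diagnosis.
log_rows = [(36.8, 70), (37.0, 72), (39.2, 110), (39.5, 115)]
log_labels = ["healthy", "healthy", "fever", "fever"]
label, votes = predict_in_context(log_rows, log_labels, (39.0, 108))
print(label)  # fever

# A completely different table goes through the same function, with no refit.
size_rows = [(1.0,), (2.0,), (10.0,), (11.0,)]
size_labels = ["small", "small", "large", "large"]
size_label, _ = predict_in_context(size_rows, size_labels, (9.5,))
print(size_label)  # large
```

The same function serves both tables because all table-specific information arrives in the arguments, which is the property the right-hand column of the table describes.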
But a table is not a sentence, and that raises a real question: in what order do you read it? A log has no natural first or last patient, and two directions carry meaning at once: rows are examples, columns are features. So TabFM attends both ways in alternating passes: row attention compares the new case against similar past ones (which patients look like this one?), and column attention weighs which fields move together (which vitals track the diagnosis?). Both passes use the same mechanism by which self-attention lets a token look across a whole sequence, only here it runs across examples and across features rather than across words from left to right.
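The alternating pattern can be sketched in a few lines. This is an illustration under stated assumptions, not Google's architecture: the attention here has no learned projections (queries, keys, and values are the cell embeddings themselves), the order of the two passes and the layer count are arbitrary, and the function names are hypothetical. Following the wording above, `row_attention` mixes information across examples and `column_attention` mixes it across features.

```python
import math


def softmax(xs):
    top = max(xs)
    exps = [math.exp(x - top) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(vectors):
    """Plain dot-product self-attention over a list of d-dim vectors.

    Queries, keys and values are the vectors themselves (no learned
    projections), which keeps the sketch short.
    """
    d = len(vectors[0])
    out = []
    for q in vectors:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in vectors]
        weights = softmax(scores)
        out.append([sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(d)])
    return out


def column_attention(grid):
    """Within each row, every feature cell attends to the other features."""
    return [self_attention(row) for row in grid]


def row_attention(grid):
    """Within each column, every example attends to the other examples."""
    n_rows, n_cols = len(grid), len(grid[0])
    columns = [self_attention([grid[r][c] for r in range(n_rows)]) for c in range(n_cols)]
    # Transpose back to grid[row][col].
    return [[columns[c][r] for c in range(n_cols)] for r in range(n_rows)]


def alternating_block(grid, n_layers=2):
    """Alternate the two passes; order and depth are assumptions."""
    for _ in range(n_layers):
        grid = column_attention(grid)
        grid = row_attention(grid)
    return grid


# 3 example rows x 2 feature columns, each cell a 2-dim embedding.
grid = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.9, 0.1], [0.1, 0.9]],
    [[0.0, 1.0], [1.0, 0.0]],
]
mixed = alternating_block(grid)
print(len(mixed), len(mixed[0]), len(mixed[0][0]))  # shape is preserved: 3 2 2
```

Each pass only ever attends along one axis, so neither needs an ordering over the whole grid, which is how the "in what order do you read it?" question goes away.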
The catch is compute. If you flatten every cell into one long prompt, attention has to compare every cell with every other, and that bill explodes. TabFM's fix is row compression: it first boils each row down to a single dense vector, then runs the main Transformer over that short sequence of row-summaries instead of the full grid.
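A minimal sketch of the idea, assuming mean pooling as the compression step. The release describes boiling each row down to one dense vector but the pooling rule here is a placeholder chosen for brevity, and `compress_rows` is a hypothetical name; a real model would presumably learn this step.

```python
def compress_rows(grid):
    """Mean-pool each row's cell embeddings into one vector per row.

    Mean pooling is an assumption made for this sketch; it stands in for
    whatever learned compression the model actually uses.
    """
    compressed = []
    for row in grid:
        d = len(row[0])
        compressed.append([sum(cell[i] for cell in row) / len(row) for i in range(d)])
    return compressed


# 3 rows x 4 columns of 2-dim cell embeddings (made-up values).
grid = [[[float(r + c), float(r * c)] for c in range(4)] for r in range(3)]

flat_length = sum(len(row) for row in grid)  # sequence length if every cell is a token
row_vectors = compress_rows(grid)            # sequence length after compression

print(flat_length, len(row_vectors))  # 12 3
```

The downstream Transformer then sees a sequence as long as the number of rows, not the number of cells.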
Watch the numbers on a modest table: 1,000 example rows and 100 columns (illustrative). Flatten it and you get 1,000 × 100 = 100,000 cells in one prompt; attention cost scales with the square of the sequence length, so that is roughly 100,000² = 10 billion pairwise comparisons. Compress each row to one vector first and the expensive sequence the Transformer reasons over is just the 1,000 rows: about 1,000² = 1 million comparisons, a ~10,000× cut at that stage. Same table, a fraction of the compute, which is what makes "read the whole dataset as a prompt" affordable in the first place.
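The arithmetic is easy to check, and to rerun for other table shapes. The helper name `pairwise_comparisons` is made up for this sketch, and it counts only the quadratic attention term, ignoring constants and the cost of the compression step itself.

```python
def pairwise_comparisons(sequence_length):
    """Quadratic attention cost: every position compared with every position."""
    return sequence_length ** 2


n_rows, n_cols = 1_000, 100  # the illustrative table from the text

flat = pairwise_comparisons(n_rows * n_cols)  # one token per cell
compressed = pairwise_comparisons(n_rows)     # one token per row
reduction = flat // compressed

print(flat, compressed, reduction)  # 10000000000 1000000 10000
```

Note that the reduction factor works out to the number of columns squared (100² = 10,000), so under this counting the saving grows with table width.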
One more piece explains how a model can be good at a table it has never seen: it never trained on your data, but it trained on the shape of data. TabFM pretrains on hundreds of millions of synthetic tables generated by structural causal models, little cause-and-effect recipes that spin up realistic feature relationships. After seeing that many, the model learns the general grammar of "features predict a target," which is why a fresh spreadsheet is just one more example of a pattern it already understands. On the TabArena benchmark (38 classification and 13 regression datasets, sizes from 700 to 150,000 samples, with a 32-way ensemble for its strongest configuration), that pretrained grammar is what lets a single forward pass compete without a single gradient step on the target table.
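To show what "a table generated by a structural causal model" can mean, here is a toy generator. Everything specific in it is an assumption for illustration: the random DAG, the `tanh` mechanism, the noise scale, the binary threshold, and the name `sample_scm_table` are not taken from the release, which does not describe the generator at this level of detail.

```python
import math
import random


def sample_scm_table(n_rows, n_features, seed=0):
    """Draw one synthetic table from a small random structural causal model.

    Nodes 0..n_features-1 become feature columns; the last node becomes the
    target. Each node is a squashed weighted sum of its parents plus noise.
    """
    rng = random.Random(seed)
    n_nodes = n_features + 1
    target = n_nodes - 1
    # Random DAG: node j may only depend on nodes with a smaller index.
    parents = {j: [i for i in range(j) if rng.random() < 0.5] for j in range(n_nodes)}
    if not parents[target]:
        # Make sure the target is caused by at least one feature.
        parents[target] = [rng.randrange(n_features)]
    weights = {(i, j): rng.uniform(-2, 2) for j in parents for i in parents[j]}

    features, labels = [], []
    for _ in range(n_rows):
        values = []
        for j in range(n_nodes):
            signal = sum(weights[(i, j)] * values[i] for i in parents[j])
            values.append(math.tanh(signal) + rng.gauss(0, 0.5))
        features.append(values[:-1])
        labels.append(1 if values[-1] > 0 else 0)  # binary classification target
    return features, labels, parents


X, y, parents = sample_scm_table(n_rows=5, n_features=3, seed=42)
print(len(X), len(X[0]), len(y))  # 5 3 5
```

Changing the seed changes the causal graph as well as the rows, so each call yields a table with a different set of feature relationships, which is the variety a pretraining corpus of this kind relies on.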
Related explainers
- ViQ — text-aligned visual tokens — the same move for images: reshape a non-text modality into the token/prompt paradigm a Transformer already understands. TabFM does it for tables.
- MiniMax M3 — MSA block-sparse attention — another rethink of what attention runs over; TabFM's row-and-column attention is a different answer to "attention across which axis?".
- Gemma 4 12B — encoder-free multimodal — folding a new modality straight into one model, no bolted-on encoder — the same spirit as reading a whole table with one pretrained network.