
How the Rise of Tabular Foundation Models Is Reshaping Data Science



Recent advances in AI—ranging from systems capable of holding coherent conversations to those generating realistic video sequences—are largely attributable to artificial neural networks (ANNs). These achievements have been made possible by algorithmic breakthroughs and architectural innovations developed over the past fifteen years, and more recently by the emergence of large-scale computing infrastructures capable of training such networks on internet-scale datasets.

The main strength of this approach to machine learning, commonly referred to as deep learning, lies in its ability to automatically learn representations of complex data types—such as images or text—without relying on handcrafted features or domain-specific modeling. In doing so, deep learning has significantly extended the reach of traditional statistical methods, which were originally designed to analyze structured data organized in tables, such as those found in spreadsheets or relational databases.

Figure 1: Until recently, neural networks were poorly suited to tabular data. [Image by author]

Given, on the one hand, the remarkable effectiveness of deep learning on complex data, and on the other, the immense economic value of tabular data—which still represents the core of the informational assets of many organizations—it is only natural to ask whether deep learning techniques can be successfully applied to such structured data. After all, if a model can tackle the hardest problems, why wouldn’t it excel at the easier ones?

Paradoxically, deep learning has long struggled with tabular data [8]. To understand why, it is useful to recall that its success hinges on the ability to uncover grammatical, semantic, or visual patterns from massive volumes of data. Put simply, the meaning of a word emerges from the consistency of the linguistic contexts in which it appears; likewise, a visual feature becomes recognizable through its recurrence across many images. In both cases, it is the internal structure and coherence of the data that enable deep learning models to generalize and transfer knowledge across different samples—texts or images—that share underlying regularities.

The situation is fundamentally different when it comes to tabular data, where each row typically corresponds to an observation involving multiple variables. Think, for example, of predicting a person’s weight based on their height, age, and gender, or estimating a household’s electricity consumption (in kWh) based on floor area, insulation quality, and outdoor temperature. A key point is that the value of a cell is only meaningful within the specific context of the table it belongs to. The same number might represent a person’s weight (in kilograms) in one dataset, and the floor area (in square meters) of a studio apartment in another. Under such conditions, it is hard to see how a predictive model could transfer knowledge from one table to another—the semantics are entirely dependent on context.

Tabular structures are thus highly heterogeneous, and in practice there exists an infinite variety of them to capture the diversity of real-world phenomena—ranging from financial transactions to galaxy structures or income disparities within urban areas.

This diversity comes at a cost: each tabular dataset typically requires its own dedicated predictive model, which cannot be reused elsewhere. 

To handle such data, data scientists most often rely on a class of models based on decision trees [7]. Their precise mechanics need not concern us here; what matters is that they are remarkably fast at inference, often producing predictions in under a millisecond. Unfortunately, like all classical machine learning algorithms, they must be retrained from scratch for each new table—a process that can take hours. Additional drawbacks include unreliable uncertainty estimation, limited interpretability, and poor integration with unstructured data—precisely the kind of data where neural networks shine.
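For concreteness, this classical workflow looks roughly like the following sketch, shown with XGBoost's scikit-learn-style interface on a public dataset (the dataset and hyperparameters are arbitrary choices for illustration):

```python
# A sketch of the classical workflow: a gradient-boosted tree model trained from
# scratch on one specific table (dataset and hyperparameters are arbitrary examples).
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

X, y = load_breast_cancer(return_X_y=True)            # one particular table
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = XGBClassifier(n_estimators=300, max_depth=4)  # hyperparameters usually tuned per dataset
model.fit(X_train, y_train)                           # retraining is required for every new table
predictions = model.predict(X_test)                   # inference itself is nearly instantaneous
```

The model trained here is tied to this one table: its trees encode the specific columns and value ranges it was fit on and cannot be reused on a different dataset.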

The idea of building universal predictive models—similar to large language models (LLMs)—is clearly appealing: once pretrained, such models could be applied directly to any tabular dataset, without additional training or fine-tuning. Framed this way, the idea may seem ambitious, if not entirely unrealistic. And yet, this is precisely what Tabular Foundation Models (TFMs), developed by several research groups over the past year [2–4], have begun to achieve—with surprising success.

The sections that follow highlight some of the key innovations behind these models and compare them to existing techniques. More importantly, they aim to spark curiosity about a development that could soon reshape the landscape of data science.

What We’ve Learned from LLMs

To put it simply, a large language model (LLM) is a machine learning model trained to predict the next word in a sequence of text. One of the most striking features of these systems is that, once trained on massive text corpora, they exhibit the ability to perform a wide range of linguistic and reasoning tasks—even those they were never explicitly trained for. A particularly compelling example of this capability is their success at solving problems relying solely on a short list of input–output pairs provided in the prompt. For instance, to perform a translation task, it often suffices to supply a few translation examples.
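To make the idea concrete, the toy sketch below assembles such a few-shot prompt by hand (the translation pairs are invented for illustration, and no model is actually called):

```python
# A toy illustration of in-context learning: the "training data" lives entirely in
# the prompt (the translation pairs are invented; no model is called here).
examples = [("cheese", "fromage"), ("bread", "pain")]
query = "apple"

prompt = "Translate English to French.\n"
prompt += "".join(f"English: {en} -> French: {fr}\n" for en, fr in examples)
prompt += f"English: {query} -> French:"
print(prompt)  # an LLM completing this prompt would typically answer "pomme"
```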

This behavior is known as in-context learning (ICL). In this setting, learning and prediction occur on the fly, without any additional parameter updates or fine-tuning. This phenomenon—initially unexpected and almost miraculous in nature—is central to the success of generative AI. Recently, several research groups have proposed adapting the ICL mechanism to build Tabular Foundation Models (TFMs), designed to play for tabular data a role analogous to that of LLMs for text.

Conceptually, the construction of a TFM remains relatively straightforward. The first step involves generating a very large collection of synthetic tabular datasets with diverse structures and varying sizes—both in terms of rows (observations) and columns (features or covariates). In the second step, a single model—the foundation model proper—is trained to predict one column from all others within each table. In this framework, the table itself serves as a predictive context, analogous to the prompt examples used by an LLM in ICL mode.
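For readers who prefer code to prose, here is a deliberately simplified sketch of this pretraining scheme. `ToyTFM` is a hypothetical stand-in, not the architecture of TabPFN or TabICL: each row of a table becomes one token, and the model learns to predict the targets of the query rows from the context rows of the same table.

```python
# A deliberately simplified sketch of ICL-style pretraining on synthetic tables.
# ToyTFM is an illustrative stand-in, not the architecture of TabPFN or TabICL.
import torch
import torch.nn as nn

class ToyTFM(nn.Module):
    """Each row of a table becomes one token; query rows carry an 'unknown target' marker."""
    def __init__(self, n_features, n_classes, d_model=64):
        super().__init__()
        self.embed_x = nn.Linear(n_features, d_model)
        self.embed_y = nn.Embedding(n_classes + 1, d_model)  # last index = unknown target
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, n_classes)

    def forward(self, ctx_X, ctx_y, qry_X):
        unknown = self.embed_y.num_embeddings - 1
        marker = torch.full((qry_X.shape[0],), unknown, dtype=torch.long)
        tokens = torch.cat([self.embed_x(ctx_X) + self.embed_y(ctx_y),
                            self.embed_x(qry_X) + self.embed_y(marker)])
        out = self.encoder(tokens.unsqueeze(0)).squeeze(0)    # rows attend to one another
        return self.head(out[ctx_X.shape[0]:])                # logits for the query rows only

def pretraining_step(model, optimizer, X, y, n_context):
    """One step: predict the target column of the query rows from the context rows."""
    logits = model(X[:n_context], y[:n_context], X[n_context:])
    loss = nn.functional.cross_entropy(logits, y[n_context:])
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# One step on a (here purely random) synthetic table with 8 features and 2 classes:
X, y = torch.randn(128, 8), torch.randint(0, 2, (128,))
model = ToyTFM(n_features=8, n_classes=2)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
print(pretraining_step(model, optimizer, X, y, n_context=96))
```

In the real systems, this step is repeated over millions of synthetic tables so that the network internalizes a general-purpose prediction strategy rather than any single dataset.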

The use of synthetic data offers several advantages. First, it avoids the legal risks associated with copyright infringement or privacy violations that currently complicate the training of LLMs. Second, it allows prior knowledge—an inductive bias—to be explicitly injected into the training corpus. A particularly effective strategy involves generating tabular data using causal models. Without delving into technical details, these models aim to simulate the underlying mechanisms that could plausibly give rise to the wide variety of data observed in the real world—whether physical, economic, or otherwise. In recent TFMs such as TabPFN-v2 and TabICL [3,4], tens of millions of synthetic tables have been generated in this way, each derived from a distinct causal model. These models are sampled randomly, but with a preference for simplicity, following Occam’s Razor—the principle that among competing explanations, the simplest one consistent with the data should be favored.
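To give a flavour of what such a causal prior might look like, here is a toy generator that samples a random directed acyclic graph over the columns and propagates noise through simple structural equations. The actual priors behind TabPFN-v2 and TabICL are far richer; this sketch only illustrates the principle:

```python
# A toy "causal prior": each column is generated from a few randomly chosen earlier
# columns through a random structural equation. Purely illustrative; the real priors
# behind TabPFN-v2 and TabICL are far richer.
import numpy as np

def sample_synthetic_table(n_rows=256, n_cols=6, seed=None):
    rng = np.random.default_rng(seed)
    X = np.zeros((n_rows, n_cols))
    for j in range(n_cols):
        parents = [k for k in range(j) if rng.random() < 0.4]  # random DAG: edges from earlier columns only
        noise = rng.normal(size=n_rows)
        if not parents:
            X[:, j] = noise                                    # root node: pure noise
        else:
            weights = rng.normal(size=len(parents))
            f = [np.tanh, np.sin, lambda z: z][rng.integers(3)]  # a simple random nonlinearity
            X[:, j] = f(X[:, parents] @ weights) + 0.1 * noise
    target_col = rng.integers(n_cols)                          # one column becomes the prediction target
    y = (X[:, target_col] > np.median(X[:, target_col])).astype(int)  # binarised for classification
    return np.delete(X, target_col, axis=1), y

features, target = sample_synthetic_table(seed=0)
print(features.shape, target[:10])
```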

TFMs are all implemented using neural networks. While their architectural details vary from one implementation to another, they all incorporate one or more Transformer-based modules. This design choice can be explained, in broad terms, by the fact that Transformers rely on a mechanism known as attention, which enables the model to contextualize each piece of information. Just as attention allows a word to be interpreted considering its surrounding text, a suitably designed attention mechanism can contextualize the value of a cell within a table. Readers interested in exploring this topic—which is both technically rich and conceptually fascinating—are encouraged to consult references [2–4].

Figures 2 and 3 compare the training and inference workflows of traditional models with those of TFMs. Classical models such as XGBoost [7] must be retrained from scratch for each new table. They learn to predict a target variable y = f(x) from input features x; training, including hyperparameter search, can take hours, though inference is nearly instantaneous.

TFMs, by contrast, require a more expensive initial pretraining phase—on the order of a few dozen GPU-days. This cost is generally borne by the model provider but remains within reach for many organizations, unlike the prohibitive scale often associated with LLMs. Once pretrained, TFMs unify ICL-style learning and inference into a single pass: the table D on which predictions are to be made serves directly as context for the test inputs x. The TFM then predicts targets via a mapping y = f(x, D), where the table D plays a role analogous to the list of examples provided in an LLM prompt.

Figure 2: Training a conventional machine learning model and making predictions on a table. [Image by author]
Figure 3: Training a tabular foundation model and performing universal predictions. [Image by author]

To summarize the discussion in a single sentence:

TFMs are designed to learn a predictive model on the fly for tabular data, without requiring any task-specific training or fine-tuning.

Blazing Performance

Key Figures

The table below (Figure 4) provides indicative figures for three predictive models across several key aspects: the pretraining cost of a TFM, the ICL-style adaptation time on a new table, inference latency, and the maximum supported table size. The models compared are TabPFN-v2, a TFM developed at Prior Labs by Frank Hutter’s team; TabICL, a TFM developed at INRIA by Gaël Varoquaux’s group[1]; and XGBoost, a classical algorithm widely regarded as one of the strongest performers on tabular data.

Figure 4: A performance comparison between two TFMs and a classical algorithm. [Image by author]

These figures should be interpreted as rough estimates, and they are likely to evolve quickly as implementations continue to improve. For a detailed analysis, readers are encouraged to consult the original publications [2–4].

Beyond these quantitative aspects, TFMs offer several additional advantages over conventional approaches. The most notable are outlined below.

TFMs Are Well-Calibrated

A well-known limitation of classical models is their poor calibration—that is, the probabilities they assign to their predictions often fail to reflect the true empirical frequencies. In contrast, TFMs are well-calibrated by design, for reasons that are beyond the scope of this overview but that stem from their implicitly Bayesian nature [1].
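Calibration is easy to inspect empirically. The self-contained sketch below uses scikit-learn's `calibration_curve`, with logistic regression as an arbitrary stand-in; any classifier exposing `predict_proba`, including a TFM, can be plugged in:

```python
# A self-contained check of calibration with scikit-learn's calibration_curve
# (logistic regression is only a stand-in; any model exposing predict_proba works).
from sklearn.calibration import calibration_curve
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
proba = model.predict_proba(X_test)[:, 1]             # predicted probability of the positive class

frac_pos, mean_pred = calibration_curve(y_test, proba, n_bins=10)
for observed, predicted in zip(frac_pos, mean_pred):  # well calibrated <=> observed close to predicted per bin
    print(f"predicted ~ {predicted:.2f}   observed ~ {observed:.2f}")
```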

Figure 5: Calibration comparison across predictive models. Darker shades indicate higher confidence levels. TabPFN clearly produces the most reasonable confidence estimates. [Image adapted from [2], licensed under CC BY 4.0]

Figure 5 compares the confidence levels predicted by TFMs with those produced by classical models such as logistic regression and decision trees. The latter tend to assign overly confident predictions in regions where no data is observed and often exhibit linear artifacts that bear no relation to the underlying distribution. In contrast, the predictions from TabPFN appear to be significantly better calibrated.

TFMs Are Robust

The synthetic data used to pretrain TFMs—millions of causal structures—can be carefully designed to make the models highly robust to outliers, missing values, or non-informative features. By exposing the model to such scenarios during training, it learns to recognize and handle them appropriately, as illustrated in Figure 6.

Figure 6: Robustness of TFMs to missing data, non-informative features, and outliers. [Image adapted from [3], licensed under CC BY 4.0]

TFMs Require Minimal Hyperparameter Tuning

One final advantage of TFMs is that they require very little hyperparameter tuning. In fact, they often outperform heavily optimized classical algorithms even when used with default settings, as illustrated in Figure 7.

Figure 7: Comparative performance of a TFM versus other algorithms, both in default and fine-tuned settings. [Image adapted from [3], licensed under CC BY 4.0]

To conclude, it is worth noting that ongoing research on TFMs suggests they also hold promise for improved explainability [3], fairness in prediction [5], and causal inference [6].

Every R&D Team Has Its Own Secret Sauce!

There is growing consensus that TFMs promise not just incremental improvements, but a fundamental shift in the tools and methods of data science. As far as one can tell, the field may gradually shift away from a model-centric paradigm—focused on designing and optimizing predictive models—toward a more data-centric approach. In this new setting, the role of a data scientist in industry will no longer be to build a predictive model from scratch, but rather to assemble a representative dataset that conditions a pretrained TFM.

Figure 8: A fierce competition is underway between public and private labs to develop high-performing TFMs. [Image by author]

It is also conceivable that new methods for exploratory data analysis will emerge, enabled by the speed at which TFMs can now build predictive models on novel datasets and by their applicability to time series data [9].

These prospects have not gone unnoticed by startups and academic labs alike, which are now competing to develop increasingly powerful TFMs. The two key ingredients in this race—the more or less “secret sauce” behind each approach—are, on the one hand, the strategy used to generate synthetic data, and on the other, the neural network architecture that implements the TFM.

Here are two entry points for discovering and exploring these new tools; a minimal usage sketch follows the list:

  1. TabPFN (Prior Labs)
    A local Python library: tabpfn provides scikit-learn–compatible classes (fit/predict). Open access under an Apache 2.0–style license with attribution requirement.
  2. TabICL (Inria Soda)
    A local Python library: tabicl (pretrained on synthetic tabular datasets; supports classification and ICL). Open access under a BSD-3-Clause license.

Happy exploring!

  1. Müller, S., Hollmann, N., Arango, S. P., Grabocka, J., & Hutter, F. (2021). Transformers can do Bayesian inference. arXiv preprint arXiv:2112.10510, published at ICLR 2022.
  2. Hollmann, N., Müller, S., Eggensperger, K., & Hutter, F. (2022). TabPFN: A transformer that solves small tabular classification problems in a second. arXiv preprint arXiv:2207.01848, published at ICLR 2023.
  3. Hollmann, N., Müller, S., Purucker, L., Krishnakumar, A., Körfer, M., Hoo, S. B., … & Hutter, F. (2025). Accurate predictions on small data with a tabular foundation model. Nature, 637(8045), 319-326.
  4. Qu, J., Holzmüller, D., Varoquaux, G., & Le Morvan, M. (2025). TabICL: A tabular foundation model for in-context learning on large data. arXiv preprint arXiv:2502.05564, published at ICML 2025.
  5. Robertson, J., Hollmann, N., Awad, N., & Hutter, F. (2024). FairPFN: Transformers can do counterfactual fairness. arXiv preprint arXiv:2407.05732, published at ICML 2025.
  6. Ma, Y., Frauen, D., Javurek, E., & Feuerriegel, S. (2025). Foundation models for causal inference via prior-data fitted networks. arXiv preprint arXiv:2506.10914.
  7. Chen, T., & Guestrin, C. (2016, August). XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 785-794).
  8. Grinsztajn, L., Oyallon, E., & Varoquaux, G. (2022). Why do tree-based models still outperform deep learning on typical tabular data? Advances in Neural Information Processing Systems, 35, 507-520.
  9. Liang, Y., Wen, H., Nie, Y., Jiang, Y., Jin, M., Song, D., … & Wen, Q. (2024, August). Foundation models for time series analysis: A tutorial and survey. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (pp. 6555-6565).

[1] Gaël Varoquaux is one of the original architects of the Scikit-learn API. He is also co-founder and scientific advisor at the startup Probabl.
