Nature volume 637, pages 319–326 (2025)

Accurate predictions on small data with a tabular foundation model

Submitted by Style Pass, 2025-01-09 10:00:04


Tabular data, spreadsheets organized in rows and columns, are ubiquitous across scientific fields, from biomedicine to particle physics to economics and climate science1,2. The fundamental prediction task of filling in missing values of a label column based on the rest of the columns is essential for various applications as diverse as biomedical risk models, drug discovery and materials science. Although deep learning has revolutionized learning from raw data and led to numerous high-profile success stories3,4,5, gradient-boosted decision trees6,7,8,9 have dominated tabular data for the past 20 years. Here we present the Tabular Prior-data Fitted Network (TabPFN), a tabular foundation model that outperforms all previous methods on datasets with up to 10,000 samples by a wide margin, using substantially less training time. In 2.8 s, TabPFN outperforms an ensemble of the strongest baselines tuned for 4 h in a classification setting. As a generative transformer-based foundation model, this model also allows fine-tuning, data generation, density estimation and learning reusable embeddings. TabPFN is a learning algorithm that is itself learned across millions of synthetic datasets, demonstrating the power of this approach for algorithm development. By improving modelling abilities across diverse fields, TabPFN has the potential to accelerate scientific discovery and enhance important decision-making in various domains.
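The prediction task described above, filling in missing values of a label column from the remaining columns, can be made concrete with a toy sketch. The code below is not TabPFN (and uses none of its API); it is a minimal, hypothetical 1-nearest-neighbour baseline, written in plain Python, that fills missing labels from the most similar fully labelled row:

```python
def fill_label_column(rows, label_idx):
    """Fill missing label values (None) in column `label_idx`.

    rows: list of lists of numbers; entries in the label column may be None.
    This is a toy 1-nearest-neighbour illustration of the tabular
    prediction task, not TabPFN itself.
    """
    # Rows whose label is present serve as the "training" set.
    complete = [r for r in rows if r[label_idx] is not None]
    filled = []
    for r in rows:
        if r[label_idx] is None:
            feats = [v for i, v in enumerate(r) if i != label_idx]

            def dist(c):
                # Squared Euclidean distance over the feature columns.
                cf = [v for i, v in enumerate(c) if i != label_idx]
                return sum((a - b) ** 2 for a, b in zip(feats, cf))

            nearest = min(complete, key=dist)
            r = r[:label_idx] + [nearest[label_idx]] + r[label_idx + 1:]
        filled.append(r)
    return filled
```

For example, a row with features close to a labelled row inherits that row's label; TabPFN replaces this hand-designed similarity rule with a transformer pre-trained on millions of synthetic datasets.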

Throughout the history of artificial intelligence, manually created algorithmic components have been replaced with better-performing end-to-end learned ones. Hand-designed features in computer vision, such as SIFT (Scale Invariant Feature Transform)10 and HOG (Histogram of Oriented Gradients)11, have been replaced by learned convolutions; grammar-based approaches in natural language processing have been replaced by learned transformers12; and the design of customized opening and end-game libraries in game playing has been superseded by end-to-end learned strategies3,13. Here we extend this end-to-end learning to the ubiquitous domain of tabular data.
