This repository provides a curated pool of datasets designed for evaluating models that combine structured tabular features and unstructured text fields. These datasets support supervised learning tasks, including regression and binary/multiclass classification.
This work is part of our workshop paper:
📄 Workshop Paper on OpenReview
One of the datasets (California Houses) requires accepting the terms of a Kaggle competition before it can be downloaded correctly; we plan to remove this requirement soon to ease the download process. Until then, please accept the competition terms to use the dataset, or remove it from the download config in dataset_configs.py. The competition's link is HERE.
Since our work on the benchmark, Kaggle has updated its download functionality for larger .csv files. For the benchmark's automatic download and file renaming to work correctly, please downgrade your Kaggle installation to 1.6.14 (tested to fix the issue):
pip install kaggle==1.6.14
TextTabBench/
│
├── configs/ # Configuration scripts for dataset metadata
├── datasets_notebooks/ # Jupyter notebooks to download and explore each dataset
│ ├── datasets_files/ # Raw and preprocessed data (created when a notebook runs)
│ ├── paper_datasets/ # Datasets included in the workshop paper
│ │ ├── classification/ # Classification tasks
│ │ └── regression/ # Regression tasks
│ │
│ ├── extra_datasets/ # Additional high-quality datasets not included in the paper
│ │ └── ...
│ └── other_datasets/ # Other datasets worth mentioning
│ └── ...
│
├── examples/ # Sample workflows and modeling pipelines
│ └── ...
│
├── src/ # Source code for utilities and dataset processing
│ ├── dataloader_functions/ # Helper functions
│ └── download_datasets/ # Script to download the datasets
│
├── requirements.txt # Python dependencies
└── README.md # You're here!
These datasets were selected and evaluated in our workshop paper. They were chosen based on:
- Balanced and meaningful text features
- Real-world tabular context
- Predictive signal from both structured and textual features
They are split into classification/ and regression/ folders. Each dataset is provided as a Jupyter notebook with all required steps for inspection and use.
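The datasets are meant to feed models that consume tabular and text features jointly. As a minimal sketch of such a pipeline, the toy example below vectorizes a text column with TF-IDF and passes a numeric column through alongside it; the column names ("description", "rooms", "price_high") are illustrative, not taken from any benchmark dataset.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy data with one text feature, one numeric feature, and a binary target
df = pd.DataFrame({
    "description": ["sunny flat near the park", "noisy room, no windows",
                    "spacious loft with a view", "damp basement studio"],
    "rooms": [3, 1, 4, 1],
    "price_high": [1, 0, 1, 0],
})

# TF-IDF on the text column (note: a single column name, not a list,
# because TfidfVectorizer expects 1-D input); numerics pass through untouched
pre = ColumnTransformer([
    ("text", TfidfVectorizer(), "description"),
    ("num", "passthrough", ["rooms"]),
])
model = Pipeline([("pre", pre), ("clf", LogisticRegression())])
model.fit(df[["description", "rooms"]], df["price_high"])
print(model.predict(df[["description", "rooms"]]))
```

Any table-with-text solution can be slotted in the same way: the notebooks leave the data in a frame with clearly marked text and target columns.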
These are high-quality datasets that did not make it into our benchmark at the time of writing / running our evaluations.
These are good-quality datasets that did not make it into the final benchmark for reasons such as domain overlap or weaker/unclear text signals. Still, they are useful for experimentation and ablation studies.
-> This section is yet to be added. Its main focus will be the CARTE and AutoML for Tabular with Text benchmark datasets.
Some datasets that narrowly missed the benchmark threshold may be added later under custom folders such as:
near_threshold_datasets/
text_too_short/
exploratory_datasets/
These are still useful under specific research contexts.
To explore and use the datasets in this repository, follow these steps:
git clone https://github.com/your-username/TextTabBench.git
cd TextTabBench
pip install -r requirements.txt
Open any dataset notebook to go through the download and the necessary preprocessing. You can also view the data at different stages of processing. Finally, there is space for further analysis of the final data, which is then ready to be used by any table-with-text solution pipeline.
Run
python src/download_datasets/download_datasets.py --task <reg/clf> --selection <default/extra/other/_specific_name_ (can list multiple)>
to download either a single dataset or a whole subset of them. It saves both the raw and the processed version of the data to the datasets_files folder.
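The snippet below sketches how a CLI with the flags described above could be declared with argparse; the exact flag names and behavior of src/download_datasets/download_datasets.py are assumptions based on this README, not a copy of the script.

```python
import argparse

# Hypothetical reconstruction of the download script's interface
parser = argparse.ArgumentParser(description="Download TextTabBench datasets")
parser.add_argument("--task", choices=["reg", "clf"], required=True,
                    help="regression or classification datasets")
parser.add_argument("--selection", nargs="+", default=["default"],
                    help="default/extra/other, or one or more dataset names")

# Simulate: python download_datasets.py --task clf --selection default extra
args = parser.parse_args(["--task", "clf", "--selection", "default", "extra"])
print(args.task, args.selection)  # clf ['default', 'extra']
```

Because --selection uses nargs="+", multiple dataset names can be listed after a single flag, matching the "(can list multiple)" note above.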
If a single dataset fails to download, open its notebook, toggle force_download to False to force a fresh download (otherwise the loader may get stuck on corrupted files), and run the download cell again.
We welcome contributions to this dataset pool!
If you'd like to add a new dataset or improve existing ones, please follow these guidelines:
- ✅ Provide a short description of the dataset
- 🏷️ Clearly identify:
- The target column
- The text feature(s)
- 📓 Include a Jupyter notebook following the structure of existing examples
- 💬 Briefly explain why the dataset is valuable for evaluating text-tabular models
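A new dataset entry boils down to a small piece of metadata. The dictionary below is a hypothetical illustration of what such an entry might contain; the key names and values are examples, not the actual schema of configs/dataset_configs.py.

```python
# Hypothetical metadata for a contributed dataset (illustrative only)
new_dataset = {
    "name": "airbnb_listings",                 # short dataset identifier
    "task": "clf",                             # "reg" or "clf"
    "target_col": "is_superhost",              # the target column
    "text_cols": ["listing_description",       # the text feature(s)
                  "host_about"],
    "why": "Long free-text descriptions carry signal beyond the tabular fields.",
}

assert new_dataset["task"] in {"reg", "clf"}
print(new_dataset["name"], "->", new_dataset["target_col"])
```

Together with a notebook that mirrors the existing examples, this is all the information a pull request needs to carry.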
To contribute:
- Fork the repository
- Create a new branch
- Add your dataset and notebook
- Submit a pull request
If you use this dataset pool in your research or build on our benchmark, please cite the following workshop paper:
Towards Benchmarking Foundation Models for Tabular Data With Text
OpenReview, 2025 https://openreview.net/pdf?id=yrmoQG9NAV
BibTeX:
@inproceedings{TextTabBench2025,
title={Towards Benchmarking Foundation Models for Tabular Data With Text},
author={Mraz and Das and Gupta and others},
booktitle={ICML 2025 Workshop on Foundation Models for Structured Data (FMSD)},
year={2025},
url={https://openreview.net/pdf?id=yrmoQG9NAV}
}