Dataset benchmark for tabular data with textual features

mrazmartin/TextTabBench


🧾 TextTabBench: A Dataset Pool for Text-Tabular Learning

This repository provides a curated pool of datasets designed for evaluating models that combine structured tabular features and unstructured text fields. These datasets support supervised learning tasks, including regression and binary/multiclass classification.

This work is part of our workshop paper:

📄 Workshop Paper on OpenReview


IMPORTANT INFO

One of the datasets (California Houses) requires accepting a Kaggle competition's terms to download correctly; we will try to remove this requirement soon to ease the download process. Until then, please accept the competition terms to use the dataset, or remove it from the download config in dataset_configs.py. The competition's link is HERE.

Since our work on the benchmark, Kaggle has updated its download functionality for larger .csv files. For the benchmark's automatic download and file renaming to work correctly, please downgrade your kaggle installation to 1.6.14 (tested to fix the issue):

pip install kaggle==1.6.14
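Before running the downloader, you can verify the pin programmatically. The snippet below is an illustrative helper, not part of the repository; only the `1.6.14` version string comes from this README.

```python
# Illustrative sketch (not part of the repository): check that the installed
# kaggle package matches the version pinned above.
from importlib.metadata import PackageNotFoundError, version

def parse_version(v: str) -> tuple:
    """Turn a version string like "1.6.14" into (1, 6, 14) for comparison."""
    return tuple(int(part) for part in v.split(".") if part.isdigit())

def kaggle_is_pinned(required: str = "1.6.14") -> bool:
    """Return True only if kaggle is installed at exactly the required version."""
    try:
        return parse_version(version("kaggle")) == parse_version(required)
    except PackageNotFoundError:
        return False  # kaggle is not installed at all
```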

📁 Repository Structure

TextTabBench/
│
├── configs/                     # Configuration scripts for dataset metadata
├── datasets_notebooks/          # Jupyter notebooks to download and explore each dataset
│    ├── datasets_files/              # Raw and preprocessed data (created when a notebook runs)
│    ├── paper_datasets/              # Datasets included in the workshop paper
│    │   ├── classification/          # Classification tasks
│    │   └── regression/              # Regression tasks
│    │
│    ├── extra_datasets/              # Additional high-quality datasets not included in the paper
│    │   └── ... 
│    └── other_datasets/              # Other datasets worth mentioning
│        └── ... 
│
├── examples/                    # Sample workflows and modeling pipelines
│    └── ... 
│    
├── src/                         # Source code for utilities and dataset processing
│   ├── dataloader_functions/    # Helper functions
│   └── download_datasets/       # Script to download the datasets
│
├── requirements.txt             # Python dependencies
└── README.md                    # You're here!

📚 Dataset Groups

paper_datasets/ (Main Benchmark)

These datasets were selected and evaluated in our workshop paper. They were chosen based on:

  • Balanced and meaningful text features
  • Real-world tabular context
  • Predictive signal from both structured and textual features

They are split into classification/ and regression/ folders. Each dataset is provided as a Jupyter notebook with all required steps for inspection and use.

extra_datasets/ (Second Wave Additions)

These are high-quality datasets that did not make it into our benchmark at the time of writing / running our evaluations.

other_datasets/ (Valuable Additions)

These are good-quality datasets that did not make the final benchmark for reasons such as domain overlap or weak/unclear text signal. Still, they are useful for experimentation and ablation studies.

-> This section is yet to be added. The main focus will be the CARTE and AutoML for Tabular with Text benchmark datasets.

Future Additions

Some datasets that narrowly missed the benchmark threshold may be added later under custom folders such as:

  • near_threshold_datasets/
  • text_too_short/
  • exploratory_datasets/

These are still useful under specific research contexts.


🚀 Getting Started

To explore and use the datasets in this repository, follow these steps:

1. Clone the Repository

git clone https://github.com/mrazmartin/TextTabBench.git
cd TextTabBench

2. Set Up the Environment

pip install -r requirements.txt

3. Explore the Datasets

Open any dataset notebook to go through the download and the necessary preprocessing. You can also view the data at different stages of processing. Finally, there is space for further analysis of the final data, which is ready to be used by any table-with-text solution pipeline.
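As a rough illustration of what "ready to be used" means for a table-with-text pipeline, the sketch below splits a processed table's columns into free-text and tabular features so each can be routed to its own encoder. The threshold and column-type heuristic are illustrative assumptions, not the repository's API.

```python
# Hypothetical sketch: separate free-text columns from tabular ones in a
# processed dataset. High-cardinality string columns are treated as text.
import pandas as pd

def split_text_tabular(df: pd.DataFrame, max_categories: int = 20):
    """Return (text_columns, tabular_columns) for the given DataFrame."""
    text_cols, tab_cols = [], []
    for col in df.columns:
        # string-typed columns with many distinct values look like free text
        if df[col].dtype == object and df[col].nunique() > max_categories:
            text_cols.append(col)
        else:
            tab_cols.append(col)
    return text_cols, tab_cols
```

A typical call would load one of the processed files from the datasets_files folder with `pd.read_csv(...)` and pass the resulting DataFrame to this function.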

4. Download the Datasets

Run

python src/download_datasets/download_datasets.py --task <reg/clf> --selection <default/extra/other/_specific_name_ (can list multiple)>

to download either a single dataset or a whole subset of them. The script saves both the raw and processed versions of the data to the datasets_files folder.
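The `--selection` semantics above (a named subset or one or more specific dataset names) can be sketched as follows. The group and dataset names in this snippet are invented for illustration; the real mapping lives in the repository's config scripts.

```python
# Hypothetical sketch of resolving --task / --selection into dataset names.
# A selection entry is either a named subset (e.g. "default", "extra") or
# an individual dataset name. All names below are placeholders.
EXAMPLE_GROUPS = {
    "clf": {"default": ["dataset_a", "dataset_b"], "extra": ["dataset_c"]},
    "reg": {"default": ["dataset_d"], "extra": []},
}

def resolve_selection(task, selections, groups=EXAMPLE_GROUPS):
    """Expand named subsets and pass through specific dataset names."""
    names = []
    for sel in selections:
        if sel in groups[task]:          # named subset such as "default"
            names.extend(groups[task][sel])
        else:                            # otherwise a specific dataset name
            names.append(sel)
    return names
```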

Issues:

When a single dataset fails to download, open its notebook, set force_download to True to force a fresh download (otherwise it may get stuck on corrupted files), and run the download cell again.


🤝 Contributing

We welcome contributions to this dataset pool!

If you'd like to add a new dataset or improve existing ones, please follow these guidelines:

  • ✅ Provide a short description of the dataset
  • 🏷️ Clearly identify:
    • The target column
    • The text feature(s)
  • 📓 Include a Jupyter notebook following the structure of existing examples
  • 💬 Briefly explain why the dataset is valuable for evaluating text-tabular models

To contribute:

  1. Fork the repository
  2. Create a new branch
  3. Add your dataset and notebook
  4. Submit a pull request

📬 Citation

If you use this dataset pool in your research or build on our benchmark, please cite the following workshop paper:

Towards Benchmarking Foundation Models for Tabular Data With Text
OpenReview, 2025 https://openreview.net/pdf?id=yrmoQG9NAV

BibTeX:

@inproceedings{TextTabBench2025,
  title={Towards Benchmarking Foundation Models for Tabular Data With Text},
  author={Mraz and Das and Gupta and others},
  booktitle={ICML 2025 Workshop on Foundation Models for Structured Data (FMSD)},
  year={2025},
  url={https://openreview.net/pdf?id=yrmoQG9NAV}
}
