This repository contains resources developed within the following paper:
A. Dargahi Nobari and D. Rafiei. "TabulaX: Leveraging Large Language Models for Multi-Class Table Transformations", VLDB, 2025
You may check the paper (PDF) for more information.
Several libraries are used in the project. You can use the provided environment.yml file to create the conda environment for the project or use requirements.txt to install the dependencies with pip.
If you prefer not to use the environment file, the environment can be set up with the following commands.
conda create -n tgen python=3.11
conda activate tgen
conda install pip
pip install nltk==3.8.1
pip install numpy==1.26.4
pip install openai==1.35.7
pip install scikit-learn==1.5.1
pip install scipy==1.13.1
pip install tqdm==4.66.4 # if you will use tqdm
pip install pandas==2.2.3
# requirements.txt created by: pip list --format=freeze > requirements.txt
# environment.yml created by: conda env export | grep -v "^prefix: " > environment.yml
The repo contains two main directories: data and src.
All datasets are included in this repo in the Datasets directory. Each dataset contains several tables (each in its own folder), and each table contains source.csv, target.csv, and ground truth.csv. The available datasets are:
- AutoJoin: The Web Tables (WT) dataset.
- FlashFill: The Spreadsheet (SS) dataset.
- ALL_TDE: The Table Transformation (TT) dataset.
- DataXFormer: The Knowledge Based Web Tables (KBWT) dataset.
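For example, a single table can be loaded with pandas roughly as follows (a minimal sketch; the dataset and table names below are placeholders, and the file names follow the layout described above).

import pandas as pd
from pathlib import Path

# Placeholder table folder; point this to an actual <dataset>/<table> folder in the repo.
table_dir = Path("Datasets/FlashFill/example_table")

source = pd.read_csv(table_dir / "source.csv")              # source column(s) to transform
target = pd.read_csv(table_dir / "target.csv")              # target column(s) to join against
ground_truth = pd.read_csv(table_dir / "ground truth.csv")  # expected source-to-target mapping

print(source.head(), target.head(), ground_truth.head(), sep="\n\n")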
The source files are located in src/LLM_pipeline directory.
If you are using GPT models, make sure your OpenAI API key is stored in the openai.key file inside this directory.
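As a quick sanity check, the key can be read from that file and used to create a client roughly like this (a minimal sketch, assuming the working directory is src/LLM_pipeline).

from pathlib import Path
from openai import OpenAI

# Read the API key from the openai.key file in this directory.
api_key = Path("openai.key").read_text().strip()
client = OpenAI(api_key=api_key)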
If you are using LLaMA 3 models, you need to use vLLM to serve the model locally.
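For example, once the model is served through vLLM's OpenAI-compatible server, it can be queried with the same OpenAI client; the host, port, and model identifier below are assumptions for illustration, not values taken from this repository.

from openai import OpenAI

# vLLM exposes an OpenAI-compatible endpoint; host, port, and model name are assumptions.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Say hello."}],
)
print(response.choices[0].message.content)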
To run the code, you only need run_pipeline.py (see the Running the pipeline section). To run the input classifier, use the *_classifier.py scripts in the classifier directory.
This directory contains three sub-directories.
The resources required for automatic input table classification.
- prompts/*: Prompt templates for each classification model.
- classifierutil.py: The library of general functions required for classification.
- DFX_classes.csv: Ground-truth classes for the KBWT dataset.
- TDE_classes.csv: Ground-truth classes for the TT dataset.
- report_metrics.py: Helper functions that report the classification performance at the end of classification.
- gpt_classifier.py: Run this file to use GPT models for input classification. Some values may be edited inside the file (see the sketch after this list):
  - MODEL_NAME: The name and version of the GPT model. The default is "gpt-4o-2024-05-13".
  - PROMPT_CACHE_PATH: The cache directory for model prompts. Make sure the directory exists. The default value is BASE_PATH / "cache/classifier_prompts".
  - ALL_CLASSES_JSON: The path of the output file with predicted classes.
  - EXAMPLE_SIZE: Number of examples provided to facilitate the classification. The default value is 5.
  - DS_PATHS: A list of paths to the datasets to be classified.
- llama_classifier.py: Run this file to use LLaMA 3 models for input classification. Some values may be edited inside the file (see the sketch after this list):
  - MODEL_NAME: The name and version of the LLaMA model. The default is "llama3.1-8b".
  - PROMPT_CACHE_PATH: The cache directory for model prompts. Make sure the directory exists. The default value is BASE_PATH / "cache/classifier_prompts".
  - ALL_CLASSES_JSON: The path of the output file with predicted classes.
  - EXAMPLE_SIZE: Number of examples provided to facilitate the classification. The default value is 5.
  - DS_PATHS: A list of paths to the datasets to be classified.
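As a rough illustration only (not the actual file contents), the editable constants at the top of gpt_classifier.py and llama_classifier.py have approximately the following shape; every path below is a placeholder.

from pathlib import Path

BASE_PATH = Path(".")                                           # placeholder project root
MODEL_NAME = "gpt-4o-2024-05-13"                                # or "llama3.1-8b" in llama_classifier.py
PROMPT_CACHE_PATH = BASE_PATH / "cache/classifier_prompts"      # must already exist
ALL_CLASSES_JSON = BASE_PATH / "output/predicted_classes.json"  # placeholder output path
EXAMPLE_SIZE = 5                                                # examples shown to the model
DS_PATHS = [BASE_PATH / "Datasets/FlashFill"]                   # placeholder list of dataset paths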
The libraries containing the transformation functions. Functions and modules in this directory are imported by the other scripts and are not executable.
- prompts/*: Prompt templates for each LLM, given each transformation class.
- basic.py: The functions and components to generate output by directly prompting an LLM.
- algorithmic.py: The functions and components to generate transformations for algorithmic inputs.
- general.py: The functions and components to generate output for general-class inputs.
- numeric.py: The functions and components to generate transformations for numeric inputs.
- string.py: The functions and components to generate transformations for string inputs.
The miscellaneous helper libraries. Functions and modules in this directory are imported by the other scripts and are not executable.
- dataset.py: The functions and components to load and sample the data.
- JoinEval.py: The functions to evaluate and report metrics on table joins.
To run the transformation pipeline, run the run_pipeline.py script.
Some values may be edited inside the file.
- ED_CACHE_PATH: The cache file path for edit-distance values. Make sure the directory exists. The default value is BASE_PATH / "cache/edit_distance/ed.pkl".
- MODEL_NAME: The LLM used for transformation generation. Supported models are "gpt-4o-2024-05-13", "gpt-4o-mini-2024-07-18", and "llama3.1-8b".
- BASIC_PROMPT: If set to True, the code simply prompts the LLM instead of running the framework. This is only used for the baselines and should otherwise be set to False.
- EXAMPLE_SIZE: Number of examples provided to facilitate the transformation. The default value is 5.
- MATCHING_TYPE: The matching strategy for joining tables. Supported values are ["edit_dist", "exact"].
- CLASSIFICATION_TYPE: The classification approach for joining tables. Supported values are ["golden", "gpt_classifier"]; "golden" uses the ground-truth classes.
- DS_PATH: The path of the dataset directory.
- OUTPUT_DIR: The path to store the output files and performance report.
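As an illustration only (not the actual file contents), a configuration at the top of run_pipeline.py might look like the following; the paths are placeholders.

from pathlib import Path

BASE_PATH = Path(".")                                      # placeholder project root
ED_CACHE_PATH = BASE_PATH / "cache/edit_distance/ed.pkl"   # cache directory must already exist
MODEL_NAME = "gpt-4o-2024-05-13"                           # or "gpt-4o-mini-2024-07-18" / "llama3.1-8b"
BASIC_PROMPT = False                                       # True only for the basic-prompting baseline
EXAMPLE_SIZE = 5                                           # examples shown to the model
MATCHING_TYPE = "edit_dist"                                # or "exact"
CLASSIFICATION_TYPE = "golden"                             # or "gpt_classifier"
DS_PATH = BASE_PATH / "Datasets/FlashFill"                 # placeholder dataset path
OUTPUT_DIR = BASE_PATH / "output"                          # placeholder output directory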
Please cite the paper if you use the code in this repository.
@article{tabulax,
author = {Nobari, Arash Dargahi and Rafiei, Davood},
title = {TabulaX: Leveraging Large Language Models for Multi-Class Table Transformations},
year = {2025},
issue_date = {July 2025},
publisher = {VLDB Endowment},
volume = {18},
number = {11},
issn = {2150-8097},
url = {https://doi.org/10.14778/3749646.3749657},
doi = {10.14778/3749646.3749657},
abstract = {The integration of tabular data from diverse sources is often hindered by inconsistencies in formatting and representation, posing significant challenges for data analysts and personal digital assistants. Existing methods for automating tabular data transformations are limited in scope, often focusing on specific types of transformations or lacking interpretability. In this paper, we introduce TabulaX, a novel framework that leverages Large Language Models (LLMs) for multi-class column-level tabular transformations. TabulaX first classifies input columns into four transformation types—string-based, numerical, algorithmic, and general—and then applies tailored methods to generate human-interpretable transformation functions, such as numeric formulas or programming code. This approach enhances transparency and allows users to understand and modify the mappings. Through extensive experiments on real-world datasets from various domains, we demonstrate that TabulaX outperforms existing state-of-the-art approaches in terms of accuracy, supports a broader class of transformations, and generates interpretable transformations that can be efficiently applied.},
journal = {Proc. VLDB Endow.},
month = sep,
pages = {3826–3839},
numpages = {14},
keywords = {large language models, heterogeneous table join, data integration, data transformation, data cleaning and transformation}
}