This repository contains the source code for the experiments conducted in the AISTATS 2024 paper "From Data Imputation to Data Cleaning - Automated Cleaning of Tabular Data Improves Downstream Predictive Performance".
First, use load_corrupt_and_test_datasets.ipynb to download and corrupt the datasets and to set up the expected structure of the data directory.
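At its core, the corruption step amounts to perturbing a fraction of the cells. A minimal sketch of one such scheme (random missing cells; the notebook's actual error types and logic may differ) on a pandas DataFrame:

    import numpy as np
    import pandas as pd

    def corrupt(df: pd.DataFrame, error_fraction: float, seed: int = 0) -> pd.DataFrame:
        # Replace a random fraction of cells with missing values.
        # This is one simple corruption; the notebook may inject other error types.
        rng = np.random.default_rng(seed)
        mask = rng.random(df.shape) < error_fraction
        return df.mask(mask)

    clean = pd.DataFrame({"a": range(100), "b": np.random.rand(100)})
    dirty = corrupt(clean, error_fraction=0.05)  # one of the error fractions used below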
run_experiment.py implements a simple CLI script (run-experiment) that makes it easy to run experiments.
Conformal Data Cleaning:
run-experiment \
    --task_id "42493" \
    --error_fractions "0.01" "0.05" "0.1" "0.3" "0.5" \
    --num_repetitions "3" \
    --results_path "/conformal-data-cleaning/results/final-experiments" \
    --models_path "/conformal-data-cleaning/models/final-experiments" \
    --how_many_hpo_trials "50" \
    experiment \
    --confidence_level "0.999"

ML Baseline:
run-experiment \
    --task_id "42493" \
    --error_fractions "0.01" "0.05" "0.1" "0.3" "0.5" \
    --num_repetitions "3" \
    --results_path "/conformal-data-cleaning/results/final-experiments" \
    --models_path "/conformal-data-cleaning/models/final-experiments" \
    --how_many_hpo_trials "50" \
    baseline \
    --method "AutoGluon" \
    --method_hyperparameter "0.999"

PyOD Baseline (not included in the paper):
run-experiment \
    --task_id "42493" \
    --error_fractions "0.01" "0.05" "0.1" "0.3" "0.5" \
    --num_repetitions "3" \
    --results_path "/conformal-data-cleaning/results/final-experiments" \
    --models_path "/conformal-data-cleaning/models/final-experiments" \
    --how_many_hpo_trials "50" \
    baseline \
    --method "PyodECOD" \
    --method_hyperparameter "0.3"

For Garf, please use main.py:
python main.py \
    --task_id "42493" \
    --error_fractions "0.01" "0.05" "0.1" "0.3" "0.5" \
    --num_repetitions "3" \
    --results_path "/conformal-data-cleaning/results/final-experiments" \
    --models_path "/conformal-data-cleaning/models/final-experiments"

We ran our experiments on Kubernetes using Helm. Please check out the helm charts and adjust the image and imagePullSecrets settings in the values.yaml files to match your setup.
Some read-write-many volumes are necessary to store the experiment results. Please check out the infrastructure/k8s directory (and don't forget to set up the data directory as described above).
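As an illustration, a read-write-many claim for the results could look like this; name, storage class, and size are placeholders, and the actual manifests live in infrastructure/k8s:

    # Placeholder manifest: adapt name, storageClassName, and size to your cluster.
    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: conformal-data-cleaning-results
    spec:
      accessModes:
        - ReadWriteMany
      storageClassName: nfs-client  # any storage class that supports ReadWriteMany
      resources:
        requests:
          storage: 50Gi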
Running make docker builds and pushes the necessary Docker images, and make helm-install uses deploy_experiments.py to start our experimental setup.
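In short, the deployment boils down to two make targets:

    make docker        # build and push the Docker images
    make helm-install  # deploy the experiments via deploy_experiments.py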
notebooks/evaluation contains the notebooks we use to evaluate the results; 5_plotting.ipynb produces the plots shown in the paper.