This repository contains the source code for the experiments conducted in the AISTATS 2024 paper "From Data Imputation to Data Cleaning - Automated Cleaning of Tabular Data Improves Downstream Predictive Performance".
First, use load_corrupt_and_test_datasets.ipynb to download and corrupt the datasets and to set up the expected structure of the data directory.
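At its core, the corruption step amounts to perturbing a fraction of the cells. A minimal sketch of one such scheme (random missing cells; the notebook's actual error types and logic may differ) on a pandas DataFrame:

    import numpy as np
    import pandas as pd

    def corrupt(df: pd.DataFrame, error_fraction: float, seed: int = 0) -> pd.DataFrame:
        # Replace a random fraction of cells with missing values.
        # This is one simple corruption; the notebook may inject other error types.
        rng = np.random.default_rng(seed)
        mask = rng.random(df.shape) < error_fraction
        return df.mask(mask)

    clean = pd.DataFrame({"a": range(100), "b": np.random.rand(100)})
    dirty = corrupt(clean, error_fraction=0.05)  # one of the error fractions used below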
run_experiment.py implements a simple CLI script (run-experiment) that makes it easy to run experiments.
Conformal Data Cleaning:
run-experiment \
    --task_id "42493" \
    --error_fractions "0.01" "0.05" "0.1" "0.3" "0.5" \
    --num_repetitions "3" \
    --results_path "/conformal-data-cleaning/results/final-experiments" \
    --models_path "/conformal-data-cleaning/models/final-experiments" \
    --how_many_hpo_trials "50" \
    experiment \
    --confidence_level "0.999"

ML Baseline:
run-experiment \
    --task_id "42493" \
    --error_fractions "0.01" "0.05" "0.1" "0.3" "0.5" \
    --num_repetitions "3" \
    --results_path "/conformal-data-cleaning/results/final-experiments" \
    --models_path "/conformal-data-cleaning/models/final-experiments" \
    --how_many_hpo_trials "50" \
    baseline \
    --method "AutoGluon" \
    --method_hyperparameter "0.999"

PyOD Baseline (not included in the paper):
run-experiment \
    --task_id "42493" \
    --error_fractions "0.01" "0.05" "0.1" "0.3" "0.5" \
    --num_repetitions "3" \
    --results_path "/conformal-data-cleaning/results/final-experiments" \
    --models_path "/conformal-data-cleaning/models/final-experiments" \
    --how_many_hpo_trials "50" \
    baseline \
    --method "PyodECOD" \
    --method_hyperparameter "0.3"

For Garf, please use main.py:
python main.py \
    --task_id "42493" \
    --error_fractions "0.01" "0.05" "0.1" "0.3" "0.5" \
    --num_repetitions "3" \
    --results_path "/conformal-data-cleaning/results/final-experiments" \
    --models_path "/conformal-data-cleaning/models/final-experiments"

We ran our experiments on Kubernetes using Helm. Please check out the helm charts and adjust the image and imagePullSecrets settings in the values.yaml files to match your setup.
Some read-write-many volumes are necessary to store the experiment results. Please check out the infrastructure/k8s directory (and don't forget to set up the data directory as described above).
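As an illustration, a read-write-many claim for the results could look like this; name, storage class, and size are placeholders, and the actual manifests live in infrastructure/k8s:

    # Placeholder manifest: adapt name, storageClassName, and size to your cluster.
    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: conformal-data-cleaning-results
    spec:
      accessModes:
        - ReadWriteMany
      storageClassName: nfs-client  # any storage class that supports ReadWriteMany
      resources:
        requests:
          storage: 50Gi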
Running make docker builds and pushes the necessary Docker images, and make helm-install uses deploy_experiments.py to start our experimental setup.
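In short, the deployment boils down to two make targets:

    make docker        # build and push the Docker images
    make helm-install  # deploy the experiments via deploy_experiments.py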
notebooks/evaluation contains the notebooks we use to evaluate the results; 5_plotting.ipynb produces the plots shown in the paper.