DP Tabular Data Synthesis Benchmark

This is a benchmark for differentially private (DP) tabular data synthesis. All code needed for the paper is included in this repository.

Introduction

This benchmark is based on the following algorithms.

Algorithm  Paper
AIM AIM: An Adaptive and Iterative Mechanism for Differentially Private Synthetic Data
DP-MERF DP-MERF: Differentially Private Mean Embeddings with Random Features for Practical Privacy-Preserving Data Generation
GEM Iterative Methods for Private Synthetic Data: Unifying Framework and New Methods
Private-GSD Generating Private Synthetic Data with Genetic Algorithms
PrivMRF Data Synthesis via Differentially Private Markov Random Fields
PrivSyn PrivSyn: Differentially Private Data Synthesis
RAP++ Private Synthetic Data for Multitask Learning and Marginal Queries
TabDDPM TabDDPM: Modelling Tabular Data with Diffusion Models

Quick Start

Hyper-parameter Introduction

The code for running experiments is in main.py. The hyper-parameters are described below; a full example invocation is shown after the list.

  • method: the synthesis method to run.
  • dataset: name of the dataset.
  • device: the device used for running the algorithms.
  • epsilon: DP privacy budget, which must be supplied when running the code.
  • --delta: DP parameter, set to $10^{-5}$ by default.
  • --num_preprocess: preprocessing method for numerical attributes, set to uniform binning by default.
  • --rare_threshold: rare-category threshold used when preprocessing categorical attributes, set to 0.2% (i.e., 0.002) by default.
  • --sample_device: device used for sampling data; by default the same as the running device.
  • --test: hyper-parameter used for testing and debugging.
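
For reference, a full invocation that sets every optional argument explicitly might look like the following (the values are only illustrative, and the argument order follows the list above):

python main.py aim bank cuda:0 1.0 --delta 1e-5 --num_preprocess uniform_kbins --rare_threshold 0.002 --sample_device cuda:0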

Preparation

The packages required for the environment are listed in requirement.txt. First, make sure the datasets are placed in the correct folder (in the following examples, the folder is data/bank, and the necessary dataset is already provided). The evaluation model in this repository is already tuned, so no further action is needed; otherwise, tune the evaluation model with the following command before any further step.

python evaluator/tune_eval_model.py bank mlp cv cuda:0
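
Setting up the environment itself typically amounts to installing the packages from the requirement file, for example with pip:

pip install -r requirement.txt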

Overall Evaluation

After activating your environment, run the following command for an overall evaluation. In our paper, we set num_preprocess to "uniform_kbins" by default (except for DP-MERF and TabDDPM) and rare_threshold to 0.002 for the overall evaluation.

python main.py aim bank cuda:0 1.0 --num_preprocess uniform_kbins --rare_threshold 0.002
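
Since epsilon is a positional argument, sweeping several privacy budgets simply means repeating the command with different values, for example:

python main.py aim bank cuda:0 0.5 --num_preprocess uniform_kbins --rare_threshold 0.002
python main.py aim bank cuda:0 2.0 --num_preprocess uniform_kbins --rare_threshold 0.002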

Preprocessing Investigation

If you want to try other preprocessing methods or other preprocessing hyper-parameter settings, modify the preprocessing hyper-parameters as in the following example:

python main.py aim bank cuda:0 1.0 --num_preprocess privtree --rare_threshold 0.01

Module Comparison

In the experiment section of the paper, we compare different modules by comparing the performance of different reconstructed algorithms. Each reconstructed algorithm is given a new name, which can be passed as method. For example, to combine the PrivSyn selector with the generative-network synthesizer, run

python main.py gem_syn bank cuda:0 1.0 --num_preprocess uniform_kbins --rare_threshold 0.002

Results Collection

The evaluation code is in evaluator/eval_seeds.py. By default, we generate data 5 times and run the evaluation after each generation; the reported results are the average over all evaluations. All results are collected in JSON format and saved in the folder exp/{name of dataset}/{name of method}, which can be used for further analysis.
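
As a rough sketch of how the saved results could be post-processed, the snippet below loads every JSON file under one such folder and prints its numeric entries. The exp/bank/aim path and the flat key-value layout of the JSON files are assumptions for illustration, not guaranteed by the repository.

import json
from pathlib import Path

# Assumed results folder for dataset "bank" and method "aim".
result_dir = Path("exp/bank/aim")

for path in sorted(result_dir.glob("*.json")):
    with open(path) as f:
        results = json.load(f)
    print(f"--- {path.name} ---")
    # Print only top-level numeric entries; nested structures are skipped in this sketch.
    for key, value in results.items():
        if isinstance(value, (int, float)):
            print(f"{key}: {value:.4f}")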

Acknowledgements

Part of the code is adapted from AIM, DP-MERF, GEM, Private-GSD, PrivMRF, PrivSyn, RAP++, and TabDDPM. We sincerely thank the authors for their contributions to the community.
