
DataSynthesis131. Overview


A library to generate synthetic tabular data and evaluate the quality of the generated data.

The library was developed as part of the final qualification work at the ITIS Institute, Kazan Federal University.

The main workflow of the library

(image: main workflow diagram)

Library components

(image: library components diagram)

🎲 Synthesizer

A class to generate synthetic tabular data. It builds on the open-source Synthetic Data Vault (SDV) Python library, which is widely regarded as top-tier for this purpose: it produces realistic, statistically faithful data, scales to large datasets, and supports table metadata configuration to specify the type and format of the generated columns. Its user-friendly APIs and comprehensive documentation make it easy to integrate and use. We compared four SDV synthesizers on different types of datasets (see the Models comparison section).

As a result, we developed an algorithm that chooses the best generation method based on the characteristics of the input dataset. This algorithm is the core of the synthesizer.

(image) Algorithm to choose the best generation method depending on the dataset

📐 QualityEvaluator

A class to evaluate the quality of synthetic data. It compares the real and generated datasets and provides quality metrics and graphs.

Metrics

In the modern context, the quality of synthetic data is evaluated based on three primary aspects:

  1. Fidelity
  2. Utility
  3. Privacy

Which aspect matters most depends on the application: some require high fidelity, while others prioritize utility or privacy. There is no universal metric for evaluating synthetic data. To keep our focus concise, we adopt fidelity as the primary aspect for assessing the quality of the synthetic data.

We use three key metrics to evaluate the quality of the generated data:

  • Propensity score (pMSE) measures the distinguishability between the real and the generated synthetic data.

$$ pMSE\quad=\quad\frac{1}{N}\sum_{n}\left(\widehat{p}_{n}-0.5\right)^{2} $$

  • Log-Cluster metric indicates how easily real and synthetic data can be clustered and, therefore, distinguished from each other.

$$ U_c(X_R, X_S) = \log\left( \frac{1}{G} \sum_{j=1}^G \left[ \frac{n_j^R}{n_j} - c \right]^2 \right), \quad c = \frac{n^R}{n^R + n^S} $$

  • Prediction accuracy is used to measure how well a machine learning model trained on synthetic data performs on classification tasks using real data. We classify using a Random Forest Classifier, which has proven to be an effective classifier for many complex classification problems.
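The three metrics above can be sketched in plain Python. The helper names below are illustrative, not the library's API: the propensity scores and cluster assignments are assumed to come from a classifier and a clustering algorithm run on the pooled real + synthetic data, and a tiny nearest-centroid classifier stands in for the Random Forest in the TSTR sketch.

```python
import math

def pmse(propensity_scores):
    """pMSE: mean squared deviation of the propensity scores from 0.5.
    Each score is a classifier's predicted probability that a record in
    the pooled (real + synthetic) dataset is real; 0 is the ideal value."""
    return sum((p - 0.5) ** 2 for p in propensity_scores) / len(propensity_scores)

def log_cluster(cluster_ids, is_real):
    """Log-cluster metric U_c: after clustering the pooled data, measure how
    far each cluster's share of real records deviates from the overall share
    c = n_R / (n_R + n_S). More negative values mean better mixing."""
    c = sum(is_real) / len(is_real)
    groups = set(cluster_ids)
    total = 0.0
    for g in groups:
        flags = [r for cid, r in zip(cluster_ids, is_real) if cid == g]
        total += (sum(flags) / len(flags) - c) ** 2
    return math.log(total / len(groups))

def tstr_accuracy(X_synth, y_synth, X_real, y_real):
    """Train-on-Synthetic, Test-on-Real accuracy. A nearest-centroid
    classifier stands in here for the library's Random Forest."""
    centroids = {}
    for label in set(y_synth):
        rows = [x for x, lbl in zip(X_synth, y_synth) if lbl == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    dist = lambda a, b: sum((u - v) ** 2 for u, v in zip(a, b))
    preds = [min(centroids, key=lambda lbl: dist(x, centroids[lbl])) for x in X_real]
    return sum(p == t for p, t in zip(preds, y_real)) / len(y_real)

# Indistinguishable data: every propensity score is 0.5 -> pMSE = 0
print(pmse([0.5, 0.5, 0.5, 0.5]))            # 0.0
# Two clusters with real shares 0.5 and 0.25 while c = 3/8 -> log(1/64)
print(log_cluster([0, 0, 0, 0, 1, 1, 1, 1],
                  [1, 1, 0, 0, 1, 0, 0, 0]))  # about -4.159
# Synthetic data that mirrors the real class structure -> accuracy 1.0
print(tstr_accuracy([[0.1], [0.2], [0.9], [1.0]], [0, 0, 1, 1],
                    [[0.0], [0.3], [0.8], [1.1]], [0, 0, 1, 1]))
```

Note how the formulas are used directly: pMSE averages the squared deviation of each score from 0.5, and the log-cluster metric compares each cluster's fraction of real records against the global fraction c.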

We chose the propensity score and log-cluster metrics over other candidates, such as KL divergence, because they are more intuitive to understand and use. They can be easily applied and interpreted by practitioners who may not be familiar with the complex mathematics and statistics behind synthetic data generation, and their straightforward implementation should encourage future researchers to build on this work. We also selected prediction accuracy because it reflects how effectively machine learning models perform when trained on the generated synthetic data.

Models comparison

Based on the evaluation metrics, we compared generative methods for tabular data across different types of datasets. Datasets are categorized as follows.

By their balance:

  • Balanced: every class share lies in the range [0.9 · 100/n; 1.1 · 100/n] percent, where n is the number of classes of the target variable.
  • Weakly balanced: every class share lies in the range [0.6 · 100/n; 1.4 · 100/n] percent.
  • Not balanced: anything outside the above two ranges. If a dataset has no target variable, we apply the K-Means clustering algorithm and analyze the distribution of cluster sizes instead.
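These balance rules translate directly into code. The sketch below is illustrative (the function name is hypothetical, not the library's API) and takes the per-class counts of the target variable.

```python
def balance_category(class_counts):
    """Classify dataset balance from target-class counts, using the ranges
    above: each class share (in percent) is compared against 100/n, where
    n is the number of classes. Illustrative helper, not the library's API."""
    n = len(class_counts)
    total = sum(class_counts)
    shares = [100.0 * c / total for c in class_counts]
    expected = 100.0 / n  # share of a perfectly balanced class
    if all(0.9 * expected <= s <= 1.1 * expected for s in shares):
        return "balanced"
    if all(0.6 * expected <= s <= 1.4 * expected for s in shares):
        return "weakly balanced"
    return "not balanced"

print(balance_category([50, 50]))  # balanced: both shares inside [45; 55]
print(balance_category([65, 35]))  # weakly balanced: inside [30; 70]
print(balance_category([90, 10]))  # not balanced: 90% is outside [30; 70]
```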

By their data type:

  • Numerical: if more than 60% of the features are numerical
  • Categorical: if more than 60% of the features are categorical
  • Mixed: otherwise
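The data-type rule is a simple threshold on the feature-type shares. A minimal sketch (hypothetical helper name, not the library's API):

```python
def data_type_category(n_numerical, n_categorical):
    """Classify a dataset by feature-type share: more than 60% numerical
    features -> numerical, more than 60% categorical -> categorical,
    otherwise mixed. Illustrative helper, not the library's API."""
    total = n_numerical + n_categorical
    if n_numerical / total > 0.6:
        return "numerical"
    if n_categorical / total > 0.6:
        return "categorical"
    return "mixed"

print(data_type_category(7, 3))  # numerical: 70% of features are numerical
print(data_type_category(2, 8))  # categorical: 80% are categorical
print(data_type_category(5, 5))  # mixed: neither type exceeds 60%
```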

Other factors to consider:

  • Dimensionality: Number of features.
  • Size: Number of rows.

Dataset selection

To minimize systematic bias due to differences in datasets of the same type, careful attention was given to selecting datasets with similar characteristics. For all types, datasets were chosen with the same level of balance in terms of the output class.

In datasets with a large number of instances, smaller samples were selected while ensuring the preservation of the required characteristics.
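One standard way to subsample while preserving the class balance is stratified sampling: draw the same fraction from every class. The helper below is a hypothetical illustration of that idea, not the procedure the authors necessarily used.

```python
import random

def stratified_sample(rows, labels, frac, seed=0):
    """Draw the same fraction `frac` from every class so the original
    class balance is preserved. Hypothetical helper for illustration."""
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)
    sample = []
    for members in by_class.values():
        k = max(1, round(frac * len(members)))
        sample.extend(rng.sample(members, k))
    return sample

data = list(range(100))
labels = [0] * 70 + [1] * 30   # a 70/30 weakly-balanced split
subset = stratified_sample(data, labels, frac=0.1)
print(len(subset))  # 10 rows: 7 from class 0, 3 from class 1
```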

For consistency, we generated synthetic datasets that match the size of the original datasets exactly.

No preprocessing operations on real data are performed before creating synthetic data. Dankar and Ibrahim (2021) noted that synthetic data with higher utility, that is, synthetic data with a higher propensity score, are generated when raw data are used for synthesis.

All results are collected in a Google spreadsheet. (image: example results for numerical weakly balanced datasets)

How to use?

Installation

!pip install -qqq DataSynthesis131
from DataSynthesis131 import QualityEvaluator, Synthesizer, GenerationParams

Please see documentation here to get more information and examples.

Demo

A Google Colab notebook demonstrates the library in action.

Live demo

Contacts

If you notice any inaccuracies, please open an issue here.

You can also reach me on Telegram.
