A library for generating synthetic tabular data and evaluating the quality of the generated data.
The library was developed as part of the final qualification work at the ITIS Institute, Kazan Federal University.
A class to generate synthetic tabular data. We use the open-source Synthetic Data Vault (SDV) Python library, which is widely regarded as a leading tool for this purpose. It is designed to produce realistic data with high statistical fidelity and scales to large datasets. Additionally, SDV supports table metadata configuration to specify the type and format of generated data. Its user-friendly APIs and comprehensive documentation make it easy to integrate and use.

We compared four SDV synthesizers on different types of datasets (see the Model comparison section):
- Gaussian Copula Synthesizer (GC)
- Conditional Tabular GAN Synthesizer (CTGAN)
- Copula GAN Synthesizer (CGAN)
- Table Variational Autoencoder (TVAE)
As a result, we have developed an algorithm for choosing the best generation method depending on the characteristics of the input dataset. This is the core of the synthesizer.
[Figure: Algorithm for choosing the best generation method depending on the dataset]
A class to evaluate the quality of synthetic data. It compares real and generated datasets and provides quality metrics and graphs.
In the modern context, the quality of synthetic data is evaluated based on three primary aspects:
- Fidelity
- Utility
- Privacy
How good the generated data must be in each of these aspects depends on the purpose. Some applications require high fidelity, while others may prioritize utility or privacy. There is no universal metric for evaluating synthetic data. To keep the scope focused, we adopt fidelity as the primary aspect for assessing the quality of the synthetic data.
We use 3 key metrics to evaluate the quality of the generated data:
- Propensity score: measures how distinguishable the generated synthetic data are from the real data.
- Log-cluster metric: indicates how easily real and synthetic data can be clustered and, therefore, distinguished from each other.
- Prediction accuracy: measures how well a machine learning model trained on synthetic data performs on classification tasks using real data. We classify using a Random Forest classifier, which has proven effective on many complex classification problems.
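
As an illustration of the first two metrics, here is a minimal sketch based on their commonly used textbook definitions; it is not necessarily how this library computes them. The function names `pmse` and `log_cluster` are hypothetical. The sketch assumes that the propensity scores (predicted probabilities of a row being synthetic) and the cluster assignments have already been produced by a classifier and a clustering algorithm run on the combined real and synthetic data.

```python
import math
from collections import Counter


def pmse(propensity_scores, synthetic_share=0.5):
    """Propensity mean squared error (hypothetical helper).

    propensity_scores: predicted probability that each row of the combined
    dataset is synthetic. synthetic_share: fraction of synthetic rows in the
    combined dataset (0.5 when both datasets have the same size).
    Values near 0 mean the classifier cannot tell the datasets apart.
    """
    scores = list(propensity_scores)
    if not scores:
        raise ValueError("propensity_scores must not be empty")
    return sum((p - synthetic_share) ** 2 for p in scores) / len(scores)


def log_cluster(cluster_ids, is_real):
    """Log-cluster metric (hypothetical helper).

    cluster_ids: cluster label of each row of the combined dataset.
    is_real: True for real rows, False for synthetic rows.
    Computes log(mean over clusters of (real share in cluster - overall real share)^2).
    More negative values mean real and synthetic rows mix more evenly.
    """
    cluster_ids = list(cluster_ids)
    is_real = list(is_real)
    if not cluster_ids or len(cluster_ids) != len(is_real):
        raise ValueError("inputs must be non-empty and of equal length")
    overall_real_share = sum(is_real) / len(is_real)
    sizes = Counter(cluster_ids)
    real_counts = Counter(cid for cid, real in zip(cluster_ids, is_real) if real)
    mean_sq_dev = sum(
        (real_counts[cid] / size - overall_real_share) ** 2
        for cid, size in sizes.items()
    ) / len(sizes)
    # A perfectly even mix gives a deviation of 0, whose logarithm is -infinity.
    return math.log(mean_sq_dev) if mean_sq_dev > 0 else float("-inf")
```

For both functions, lower values indicate synthetic data that are harder to distinguish from the real data.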
Additionally, we use:
- Quality Report from SDMetrics to evaluate data fidelity.
- Table evaluator by Baukebrenninkmeijer.
We chose the propensity score and log cluster metrics over other evaluated metrics, such as KL Divergence, because they are more intuitive to understand and use. These metrics can be easily utilized and interpreted by practitioners who may not be familiar with the complex mathematics and statistics behind synthetic data generation. Additionally, their straightforward implementation will encourage future researchers to expand upon this work. We also selected prediction accuracy as an evaluation metric because it reflects how effectively machine learning models perform using the generated synthetic data.
Based on the evaluation metrics, we compared generative methods for tabular data across different types of datasets. Datasets can be categorized as follows.
By their balance:
- Balanced: class shares (in percent) fall within the range [0.9 * 100/n; 1.1 * 100/n], where n is the number of classes of the target variable.
- Weakly balanced: class shares (in percent) fall within the range [0.6 * 100/n; 1.4 * 100/n], where n is the number of classes of the target variable.
- Not balanced: anything outside the above two ranges.

If a dataset does not have a target variable, we apply the K-Means clustering algorithm and analyze the distribution of clusters.
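
The balance rule can be sketched as follows. This is an illustrative reading of the ranges above, not the library's own code: the function name `balance_category` is hypothetical, and the sketch assumes that every class share must fall inside a range for the dataset to qualify.

```python
from collections import Counter


def balance_category(labels):
    """Classify a label column as 'balanced', 'weakly balanced' or 'not balanced'.

    Assumption: all class shares (in percent) must lie within the range.
    """
    counts = Counter(labels)
    if not counts:
        raise ValueError("labels must not be empty")
    n = len(counts)               # number of classes
    total = sum(counts.values())  # number of rows
    ideal = 100.0 / n             # share of each class under perfect balance
    shares = [100.0 * count / total for count in counts.values()]

    def all_within(low, high):
        return all(low * ideal <= share <= high * ideal for share in shares)

    if all_within(0.9, 1.1):
        return "balanced"
    if all_within(0.6, 1.4):
        return "weakly balanced"
    return "not balanced"


print(balance_category(["yes"] * 65 + ["no"] * 35))
```

For a dataset without a target variable, the same function could be applied to K-Means cluster labels in place of class labels.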
By their data type:
- Numerical: if numerical features make up more than 60% of all features
- Categorical: if categorical features make up more than 60% of all features
- Mixed: otherwise
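
A minimal sketch of this rule, assuming every feature is counted as either numerical or categorical (the function name `data_type_category` and its parameters are hypothetical):

```python
def data_type_category(n_numerical, n_categorical, threshold=0.6):
    """Classify a dataset as 'numerical', 'categorical' or 'mixed'.

    A type wins only if its features make up strictly more than
    `threshold` (60% by default) of all features.
    """
    total = n_numerical + n_categorical
    if total <= 0:
        raise ValueError("the dataset must have at least one feature")
    if n_numerical / total > threshold:
        return "numerical"
    if n_categorical / total > threshold:
        return "categorical"
    return "mixed"


print(data_type_category(n_numerical=7, n_categorical=3))
```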
Other factors to consider:
- Dimensionality: Number of features.
- Size: Number of rows.
Dataset selection.
To minimize systematic bias due to differences in datasets of the same type, careful attention was given to selecting datasets with similar characteristics. For all types, datasets were chosen with the same level of balance in terms of the output class.
In datasets with a large number of instances, smaller samples were selected while ensuring the preservation of the required characteristics.
For consistency, we generated synthetic datasets that match the size of the original datasets exactly.
No preprocessing operations on real data are performed before creating synthetic data. Dankar and Ibrahim (2021) noted that synthetic data with higher utility, that is, synthetic data with a higher propensity score, are generated when raw data are used for synthesis.
All results are available in a Google spreadsheet. An example for numerical, weakly balanced datasets:

!pip install -qqq DataSynthesis131
from DataSynthesis131 import QualityEvaluator, Synthesizer, GenerationParams

Please see the documentation here for more information and examples.
A Google Colab notebook demonstrates the library in action.
If you notice any inaccuracies, please open an issue here.
You can also find me on Telegram.


