vijeyavarshini/synthetic_tabular_data_generator

Generative AI for Tabular Data: Adult Census Income

This project demonstrates how to generate high-quality synthetic tabular data using a Tabular Variational Autoencoder (TVAE) from the Synthetic Data Vault (SDV) library.

Project Rubric & Details

Problem Chosen

Synthetic Tabular Data Generation.

Problem Description

In many machine learning applications, acquiring large, high-quality datasets is difficult due to privacy concerns (e.g., healthcare, finance) or simply a lack of data. This project focuses on generating high-fidelity synthetic tabular data that retains the statistical properties and correlations of the original dataset without exposing sensitive information.

Architecture

The core generative model used is a Tabular Variational Autoencoder (TVAE), provided by the Synthetic Data Vault (SDV) library. TVAE adapts the standard Variational Autoencoder architecture specifically for mixed-type tabular data (categorical and continuous variables).

```mermaid
graph TD
    A[Real Tabular Data<br/>Adult Census] --> B[SDV SingleTableMetadata<br/>Data Preprocessing]
    B --> C[TVAE Synthesizer<br/>Tabular Variational Autoencoder]

    subgraph TVAE Model
        C --> D[Encoder<br/>Learn Latent Space]
        D --> E[Latent Representation]
        E --> F[Decoder<br/>Reconstruct Distribution]
    end

    F --> G[Synthetic Tabular Data]

    A --> H[Machine Learning Evaluation<br/>Random Forest]
    G --> H
    H --> I[Predictive Utility<br/>Accuracy Comparison]
```

Dataset

The project utilizes the widely used Adult Census Income dataset.

  • Original Features: 15 columns including age, workclass, education, marital-status, occupation, race, sex, capital-gain/loss, hours-per-week, native-country, and income.
  • Cleaned Dataset Shape: 30,162 rows and 15 columns.

Preprocessing

  1. Missing Value Handling: Replaced ? values with standard NaN and dropped rows containing any missing values.
  2. Metadata Extraction: Used SDV's SingleTableMetadata to automatically detect column types (numerical vs. categorical).
  3. Encoding for Evaluation: Categorical columns were encoded using LabelEncoder before passing the data to the downstream Random Forest classifier for ML evaluation.
  4. Data Splitting: Split data into training (80%, 24,129 rows) and testing (20%, 6,033 rows) sets.
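The cleaning, encoding, and splitting steps above can be sketched as follows. The tiny inline frame is an illustrative stand-in for the full Adult CSV, not the project's actual data; column names and the `?` sentinel follow the dataset's conventions.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Tiny illustrative stand-in for the Adult Census CSV.
df = pd.DataFrame({
    "age": [39, 50, 38, 53, 28, 37],
    "workclass": ["State-gov", "?", "Private", "Private", "Private", "Private"],
    "income": ["<=50K", "<=50K", "<=50K", "<=50K", ">50K", ">50K"],
})

# 1. Replace the '?' sentinel with NaN and drop incomplete rows.
df = df.replace("?", np.nan).dropna().reset_index(drop=True)

# 2. (SDV metadata detection runs on this cleaned frame; see the model section.)

# 3. Label-encode categorical columns for the downstream Random Forest.
encoded = df.copy()
for col in encoded.select_dtypes(include="object").columns:
    encoded[col] = LabelEncoder().fit_transform(encoded[col])

# 4. 80/20 train/test split with a fixed seed for reproducibility.
train, test = train_test_split(encoded, test_size=0.2, random_state=42)
```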

Experiment Design

  1. Train the TVAE model purely on the real training set.
  2. Generate a synthetic dataset identical in size to the real training set (24,129 rows).
  3. Train a Machine Learning classifier (Random Forest) on the real training data.
  4. Train another instance of the classifier on the synthetic training data.
  5. Compare the predictive accuracy of both models on a shared hold-out test set to evaluate Machine Learning Efficacy.
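A minimal sketch of this efficacy protocol: train identical classifiers on the real and synthetic training sets and score both on the same held-out real test set. Here `make_classification` stands in for the encoded Adult data and a noise-perturbed copy stands in for TVAE output; both are placeholders, not the project's actual data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Stand-in for the encoded real dataset.
X, y = make_classification(n_samples=2000, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Stand-in for TVAE output: a perturbed copy of the real training set.
rng = np.random.default_rng(0)
X_synth = X_train + rng.normal(scale=0.1, size=X_train.shape)
y_synth = y_train

# One classifier per data source, evaluated on the shared real test set.
clf_real = RandomForestClassifier(random_state=42).fit(X_train, y_train)
clf_synth = RandomForestClassifier(random_state=42).fit(X_synth, y_synth)

acc_real = accuracy_score(y_test, clf_real.predict(X_test))
acc_synth = accuracy_score(y_test, clf_synth.predict(X_test))
print(f"real: {acc_real:.3f}  synthetic: {acc_synth:.3f}")
```

The gap between the two accuracies is the efficacy metric: the smaller it is, the better the synthetic data preserves the feature-label relationships.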

Model Implementation

  • Model: TVAESynthesizer
  • Epochs: 300
  • Batch Size: 500
  • Embedding Dimension: 128
  • Export: The trained model is saved as trained_tvae_model.pkl.

Other Implementation

  • Data Visualization: Plotted comparative histograms and KDE distributions for real vs. synthetic columns (e.g., Age) using matplotlib and seaborn.
  • Evaluation Classifier: RandomForestClassifier from scikit-learn initialized with a fixed random_state.
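The comparative distribution plot can be sketched with matplotlib alone; the random draws below stand in for the real and synthetic `age` columns, and the project additionally overlays seaborn KDE curves on the same axes.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(42)
real_age = rng.normal(38, 13, 5000).clip(17, 90)   # stand-in for real 'age'
synth_age = rng.normal(39, 14, 5000).clip(17, 90)  # stand-in for TVAE 'age'

fig, ax = plt.subplots(figsize=(8, 4))
ax.hist(real_age, bins=40, alpha=0.5, density=True, label="Real")
ax.hist(synth_age, bins=40, alpha=0.5, density=True, label="Synthetic")
# With seaborn installed: sns.kdeplot(real_age, ax=ax) for the KDE overlay.
ax.set_xlabel("age")
ax.set_ylabel("density")
ax.set_title("Real vs. synthetic age distribution")
ax.legend()
fig.savefig("age_comparison.png", dpi=150)
```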

Results Quality

Visualizing the distributions (e.g., the age column) shows that the synthetic data closely mirrors the real data. The overlapping KDE plots demonstrate that the TVAE captures the shape of continuous distributions well.

Results Analysis

The synthetic data proves to be highly useful for downstream tasks. It maintains the inherent relationships between features, allowing a classifier trained entirely on "fake" data to learn rules applicable to real-world, unseen data.

Quantitative Analysis

Machine Learning Efficacy (Downstream Classification on Test Set):

  • Real Data Model Accuracy: 85.10%
  • Synthetic Data Model Accuracy: 82.58%

Insights

The TVAE model is highly capable of tabular data synthesis. Downstream ML utility remains robust, with accuracy dropping only about 2.5 percentage points (85.10% vs. 82.58%) relative to the model trained purely on real data. This confirms that synthetic data can effectively proxy real data in privacy-sensitive model development.

Workflow Evolution

  1. Data Loading & Preprocessing
  2. Metadata Extraction
  3. TVAE Training
  4. Data Synthesis
  5. Visual Comparison
  6. ML Utility Evaluation

Limitations

  • TVAEs can struggle to perfectly recreate extremely imbalanced or highly skewed numerical variables.
  • Very high-cardinality categorical columns inflate the encoded representation, slowing down training and potentially degrading quality.
  • Variational Autoencoders tend to produce slightly "smoothed" or blurred distributions compared to GAN-based models.

Future Scope

  • Experimenting with other SDV models like CTGAN (Conditional Tabular GAN) to compare synthesis quality.
  • Hyperparameter tuning for the TVAE model (adjusting layers, dimensions, and batch size).
  • Evaluating ML utility across a wider range of algorithms (e.g., XGBoost, Gradient Boosting, Deep Neural Networks).

Standalone Application

This Jupyter Notebook workflow can easily be adapted into a standalone Python script, an automated ETL pipeline step, or a Streamlit web application where users can upload sensitive CSV files and download anonymized, synthetic representations instantly.

Tech Used

  • Python 3
  • SDV (Synthetic Data Vault) (TVAESynthesizer, SingleTableMetadata)
  • pandas & numpy (Data Manipulation)
  • scikit-learn (LabelEncoder, RandomForestClassifier, Train-Test Split, Metrics)
  • matplotlib & seaborn (Data Visualization)

About

Generative AI based Synthetic Data Generation using TVAE
