vijeyavarshini/synthetic_tabular_data_generator

Generative AI for Tabular Data: Adult Census Income

This project demonstrates how to generate high-quality synthetic tabular data using a Tabular Variational Autoencoder (TVAE) from the Synthetic Data Vault (SDV) library.

Project Rubric & Details

Problem Chosen

Synthetic Tabular Data Generation.

Problem Description

In many machine learning applications, acquiring large, high-quality datasets is difficult due to privacy concerns (e.g., healthcare, finance) or simply a lack of data. This project focuses on generating high-fidelity synthetic tabular data that retains the statistical properties and correlations of the original dataset without exposing sensitive information.

Architecture

The core generative model used is a Tabular Variational Autoencoder (TVAE), provided by the Synthetic Data Vault (SDV) library. TVAE adapts the standard Variational Autoencoder architecture specifically for mixed-type tabular data (categorical and continuous variables).

```mermaid
graph TD
    A[Real Tabular Data<br/>Adult Census] --> B[SDV SingleTableMetadata<br/>Data Preprocessing]
    B --> C[TVAE Synthesizer<br/>Tabular Variational Autoencoder]

    subgraph TVAE Model
        C --> D[Encoder<br/>Learn Latent Space]
        D --> E[Latent Representation]
        E --> F[Decoder<br/>Reconstruct Distribution]
    end

    F --> G[Synthetic Tabular Data]

    A --> H[Machine Learning Evaluation<br/>Random Forest]
    G --> H
    H --> I[Predictive Utility<br/>Accuracy Comparison]
```

Dataset

The project utilizes the widely used Adult Census Income dataset.

  • Original Features: 15 columns including age, workclass, education, marital-status, occupation, race, sex, capital-gain/loss, hours-per-week, native-country, and income.
  • Cleaned Dataset Shape: 30,162 rows and 15 columns.

Preprocessing

  1. Missing Value Handling: Replaced ? values with standard NaN and dropped rows containing any missing values.
  2. Metadata Extraction: Used SDV's SingleTableMetadata to automatically detect column types (numerical vs. categorical).
  3. Encoding for Evaluation: Categorical columns were encoded using LabelEncoder before passing the data to the downstream Random Forest classifier for ML evaluation.
  4. Data Splitting: Split data into training (80%, 24,129 rows) and testing (20%, 6,033 rows) sets.
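The cleaning, encoding, and splitting steps above can be sketched as follows. The tiny inline frame is an illustrative stand-in for the full Adult CSV, not the project's actual data; column names and the `?` sentinel follow the dataset's conventions.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Tiny illustrative stand-in for the Adult Census CSV.
df = pd.DataFrame({
    "age": [39, 50, 38, 53, 28, 37],
    "workclass": ["State-gov", "?", "Private", "Private", "Private", "Private"],
    "income": ["<=50K", "<=50K", "<=50K", "<=50K", ">50K", ">50K"],
})

# 1. Replace the '?' sentinel with NaN and drop incomplete rows.
df = df.replace("?", np.nan).dropna().reset_index(drop=True)

# 2. (SDV metadata detection runs on this cleaned frame; see the model section.)

# 3. Label-encode categorical columns for the downstream Random Forest.
encoded = df.copy()
for col in encoded.select_dtypes(include="object").columns:
    encoded[col] = LabelEncoder().fit_transform(encoded[col])

# 4. 80/20 train/test split with a fixed seed for reproducibility.
train, test = train_test_split(encoded, test_size=0.2, random_state=42)
```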

Experiment Design

  1. Train the TVAE model purely on the real training set.
  2. Generate a synthetic dataset identical in size to the real training set (24,129 rows).
  3. Train a Machine Learning classifier (Random Forest) on the real training data.
  4. Train another instance of the classifier on the synthetic training data.
  5. Compare the predictive accuracy of both models on a shared hold-out test set to evaluate Machine Learning Efficacy.
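A minimal sketch of this efficacy protocol: train identical classifiers on the real and synthetic training sets and score both on the same held-out real test set. Here `make_classification` stands in for the encoded Adult data and a noise-perturbed copy stands in for TVAE output; both are placeholders, not the project's actual data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Stand-in for the encoded real dataset.
X, y = make_classification(n_samples=2000, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Stand-in for TVAE output: a perturbed copy of the real training set.
rng = np.random.default_rng(0)
X_synth = X_train + rng.normal(scale=0.1, size=X_train.shape)
y_synth = y_train

# One classifier per data source, evaluated on the shared real test set.
clf_real = RandomForestClassifier(random_state=42).fit(X_train, y_train)
clf_synth = RandomForestClassifier(random_state=42).fit(X_synth, y_synth)

acc_real = accuracy_score(y_test, clf_real.predict(X_test))
acc_synth = accuracy_score(y_test, clf_synth.predict(X_test))
print(f"real: {acc_real:.3f}  synthetic: {acc_synth:.3f}")
```

The gap between the two accuracies is the efficacy metric: the smaller it is, the better the synthetic data preserves the feature-label relationships.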

Model Implementation

  • Model: TVAESynthesizer
  • Epochs: 300
  • Batch Size: 500
  • Embedding Dimension: 128
  • Export: The trained model is saved as trained_tvae_model.pkl.

Other Implementation

  • Data Visualization: Plotted comparative histograms and KDE distributions for real vs. synthetic columns (e.g., Age) using matplotlib and seaborn.
  • Evaluation Classifier: RandomForestClassifier from scikit-learn initialized with a fixed random_state.
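The comparative distribution plot can be sketched with matplotlib alone; the random draws below stand in for the real and synthetic `age` columns, and the project additionally overlays seaborn KDE curves on the same axes.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(42)
real_age = rng.normal(38, 13, 5000).clip(17, 90)   # stand-in for real 'age'
synth_age = rng.normal(39, 14, 5000).clip(17, 90)  # stand-in for TVAE 'age'

fig, ax = plt.subplots(figsize=(8, 4))
ax.hist(real_age, bins=40, alpha=0.5, density=True, label="Real")
ax.hist(synth_age, bins=40, alpha=0.5, density=True, label="Synthetic")
# With seaborn installed: sns.kdeplot(real_age, ax=ax) for the KDE overlay.
ax.set_xlabel("age")
ax.set_ylabel("density")
ax.set_title("Real vs. synthetic age distribution")
ax.legend()
fig.savefig("age_comparison.png", dpi=150)
```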

Results Quality

Visualizing the distributions (e.g., the age column) shows that the synthetic data closely mirrors the real data. The overlapping KDE plots demonstrate that the TVAE captures the shape of continuous distributions well.

Results Analysis

The synthetic data proves to be highly useful for downstream tasks. It maintains the inherent relationships between features, allowing a classifier trained entirely on "fake" data to learn rules applicable to real-world, unseen data.

Quantitative Analysis

Machine Learning Efficacy (Downstream Classification on Test Set):

  • Real Data Model Accuracy: 85.10%
  • Synthetic Data Model Accuracy: 82.58%

Insights

The TVAE model is highly capable of tabular data synthesis. Downstream ML utility remains robust, with accuracy dropping only about 2.5 percentage points (85.10% vs. 82.58%) relative to the model trained purely on real data. This confirms that synthetic data can effectively proxy real data in privacy-sensitive model development.

Workflow Evolution

  1. Data Loading & Preprocessing
  2. Metadata Extraction
  3. TVAE Training
  4. Data Synthesis
  5. Visual Comparison
  6. ML Utility Evaluation

Limitations

  • TVAEs can struggle to perfectly recreate extremely imbalanced or highly skewed numerical variables.
  • Very high-cardinality categorical columns inflate the encoded representation, slowing down training and potentially degrading quality.
  • Variational Autoencoders tend to produce slightly "smoothed" or blurred distributions compared to GAN-based models.

Future Scope

  • Experimenting with other SDV models like CTGAN (Conditional Tabular GAN) to compare synthesis quality.
  • Hyperparameter tuning for the TVAE model (adjusting layers, dimensions, and batch size).
  • Evaluating ML utility across a wider range of algorithms (e.g., XGBoost, Gradient Boosting, Deep Neural Networks).

Standalone Application

This Jupyter Notebook workflow can easily be adapted into a standalone Python script, an automated ETL pipeline step, or a Streamlit web application where users can upload sensitive CSV files and download anonymized, synthetic representations instantly.

Tech Used

  • Python 3
  • SDV (Synthetic Data Vault) (TVAESynthesizer, SingleTableMetadata)
  • pandas & numpy (Data Manipulation)
  • scikit-learn (LabelEncoder, RandomForestClassifier, Train-Test Split, Metrics)
  • matplotlib & seaborn (Data Visualization)

About

Generative AI based Synthetic Data Generation using TVAE
