Create Siddharth_TabDDPM #88
Open
Added the completed TabDDPM model
Implemented TabDDPM training pipeline for synthetic tabular data generation
Built an end-to-end diffusion pipeline following Kotelnikov et al.’s architecture (arXiv:2209.15421v2), including scheduler, sampler, and noise-schedule modules.
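The scheduler and sampler modules themselves are not shown in this PR description; as a minimal sketch of the DDPM building blocks they implement, here is a linear beta schedule and the closed-form forward noising step q(x_t | x_0) for the numerical columns (function names are hypothetical, not taken from the repo):

```python
import numpy as np

def linear_beta_schedule(timesteps: int, beta_start: float = 1e-4,
                         beta_end: float = 0.02) -> np.ndarray:
    """Linear variance schedule beta_1..beta_T, standard in the DDPM family."""
    return np.linspace(beta_start, beta_end, timesteps)

def q_sample(x0: np.ndarray, t: int, alphas_cumprod: np.ndarray,
             rng: np.random.Generator) -> np.ndarray:
    """Forward process: noise x_0 to step t in one shot using alpha-bar_t."""
    a_bar = alphas_cumprod[t]
    noise = rng.standard_normal(x0.shape)
    return np.sqrt(a_bar) * x0 + np.sqrt(1.0 - a_bar) * noise

betas = linear_beta_schedule(1000)
alphas_cumprod = np.cumprod(1.0 - betas)   # alpha-bar_t, decreasing toward 0

rng = np.random.default_rng(0)
x0 = rng.standard_normal((4, 8))           # small batch of numerical features
x_t = q_sample(x0, t=999, alphas_cumprod=alphas_cumprod, rng=rng)
```

At t = T the signal coefficient is nearly zero, so x_T is approximately pure Gaussian noise, which is what the reverse sampler starts from.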
Completed experiments using two evaluation protocols
50/50 real–synthetic split with 2-fold cross-validation, repeated 3 times per dataset
70/30 train–test split matching the paper’s original setup
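The 2-fold-by-3-repeats protocol above maps directly onto scikit-learn's repeated CV splitter; a sketch, with a synthetic stand-in dataset and a single classifier in place of the full evaluation suite:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import RepeatedStratifiedKFold

# Hypothetical stand-in: in the pipeline this is the combined real/synthetic table.
X, y = make_classification(n_samples=400, n_features=10, random_state=0)

# 2 folds x 3 repeats -> 6 train/test evaluations per dataset.
cv = RepeatedStratifiedKFold(n_splits=2, n_repeats=3, random_state=0)
scores = []
for train_idx, test_idx in cv.split(X, y):
    clf = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    scores.append(f1_score(y[test_idx], clf.predict(X[test_idx])))

mean_f1 = float(np.mean(scores))
```

Averaging over the six runs is what produces the per-dataset numbers that later get aggregated into the final report.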
Integrated comprehensive performance metrics
TSTR accuracy (MLP, Logistic Regression, XGBoost, Random Forest)
Jensen–Shannon Divergence (JSD)
Wasserstein Distance (WD)
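Both divergence metrics are available in SciPy; a sketch of how they can be computed for one numerical column, with random stand-ins for the real and synthetic data (the actual pipeline code may bin or weight differently):

```python
import numpy as np
from scipy.spatial.distance import jensenshannon
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)
real = rng.normal(0.0, 1.0, 5000)     # stand-in for a real column
synth = rng.normal(0.1, 1.1, 5000)    # stand-in for its synthetic counterpart

# Wasserstein distance works directly on the empirical samples.
wd = wasserstein_distance(real, synth)

# JSD needs discrete distributions: histogram both columns on shared bins.
bins = np.histogram_bin_edges(np.concatenate([real, synth]), bins=50)
p, _ = np.histogram(real, bins=bins, density=True)
q, _ = np.histogram(synth, bins=bins, density=True)
jsd = jensenshannon(p, q) ** 2        # scipy returns the JS *distance*; square it
```

Note the squaring: `scipy.spatial.distance.jensenshannon` returns the square root of the divergence, so reporting JSD requires squaring the result.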
Developed class-injection logic for missing labels
Automatically detect underrepresented classes in synthetic outputs and inject real samples to ensure compatibility with XGBoost and other downstream models.
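The injection step can be sketched as follows, assuming pandas tables and a shared label column (function and parameter names are illustrative, not the repo's):

```python
import pandas as pd

def inject_missing_classes(real: pd.DataFrame, synth: pd.DataFrame,
                           label_col: str, n_inject: int = 5) -> pd.DataFrame:
    """If the sampler dropped a class entirely, copy a few real rows of that
    class into the synthetic table so label encoders and XGBoost see all classes."""
    missing = set(real[label_col].unique()) - set(synth[label_col].unique())
    parts = [synth]
    for cls in missing:
        donors = real[real[label_col] == cls]
        parts.append(donors.sample(min(n_inject, len(donors)), random_state=0))
    return pd.concat(parts, ignore_index=True)

real = pd.DataFrame({"x": range(6), "y": [0, 0, 1, 1, 2, 2]})
synth = pd.DataFrame({"x": range(4), "y": [0, 1, 0, 1]})  # class 2 missing
fixed = inject_missing_classes(real, synth, "y")
```

Without this guard, XGBoost raises an error when the training labels it was fit on include a class that never appears in the synthetic sample.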
Added dynamic epoch configuration
100 epochs for small / medium datasets
150 epochs for large datasets
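As a sketch, the size-based epoch budget reduces to a one-line rule; note the row-count cutoff below is an assumed threshold, since the PR does not state where "large" begins:

```python
def epochs_for(n_rows: int, large_threshold: int = 50_000) -> int:
    # 100 epochs for small/medium datasets, 150 for large ones.
    # The 50k-row cutoff is an assumption, not stated in the PR.
    return 150 if n_rows >= large_threshold else 100
```

Keeping the threshold as a parameter lets each dataset config override it without touching the training loop.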
Aligned hyperparameters and benchmarking with the paper
Matched the learning-rate schedule, batch size, and model depth exactly to those in 2209.15421v2 to validate reproducibility.
Surpassed published benchmarks
Achieved an F1-score of 0.80 on the UCI Adult dataset versus the paper’s 0.795 benchmark.
Containerized the full pipeline with Docker
Created Dockerfile and Compose scripts to encapsulate all dependencies for environment-independent execution.
Modularized dataset handling and preprocessing scripts
Encapsulated loading, cleaning, encoding and splitting logic into reusable modules for rapid onboarding of new datasets.
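A minimal sketch of what such a reusable module boils down to, assuming pandas and scikit-learn (the helper name and its signature are hypothetical):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

def load_and_split(df: pd.DataFrame, label_col: str,
                   test_size: float = 0.3, seed: int = 0):
    """Clean, encode, and split one dataset: drop missing rows,
    one-hot encode categoricals, then stratified train/test split."""
    df = df.dropna().copy()
    cat_cols = df.select_dtypes(include="object").columns.drop(
        label_col, errors="ignore")
    df = pd.get_dummies(df, columns=list(cat_cols))
    X = df.drop(columns=[label_col])
    y = df[label_col]
    return train_test_split(X, y, test_size=test_size,
                            random_state=seed, stratify=y)

demo = pd.DataFrame({"age": range(10), "job": ["a", "b"] * 5,
                     "income": [0, 1] * 5})
X_tr, X_te, y_tr, y_te = load_and_split(demo, "income")
```

Because each new dataset only needs its own loading step, everything downstream (encoding, splitting) comes for free from the shared module.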
Managed experiments via GitHub
Employed feature branches, structured commits and CI-driven validation to track code, hyperparameters and results.
Automated final result aggregation
Wrote scripts to compile averaged evaluation metrics and divergence scores across repeats into a consolidated CSV report.
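The aggregation step amounts to a groupby-mean over the per-repeat results; a sketch with hypothetical per-run numbers (the real scripts read these from the experiment outputs):

```python
import pandas as pd

# Hypothetical per-run results; in the pipeline these come from each CV repeat.
runs = pd.DataFrame({
    "dataset": ["adult", "adult", "adult", "churn", "churn", "churn"],
    "f1":      [0.79, 0.80, 0.81, 0.70, 0.72, 0.71],
    "jsd":     [0.03, 0.04, 0.03, 0.06, 0.05, 0.05],
    "wd":      [0.10, 0.12, 0.11, 0.20, 0.19, 0.21],
})

# Average every metric over the repeats and write one consolidated report.
report = runs.groupby("dataset", as_index=False).mean(numeric_only=True)
report.to_csv("final_report.csv", index=False)
```

One row per dataset with averaged F1, JSD, and WD is exactly the shape the benchmark comparison tables need.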