ML Course Project: Diabetes Readmission Prediction

Overview

The goal of this project is to predict the readmission of diabetic patients using data from the Diabetes 130-US Hospitals dataset spanning the years 1999-2008. This task is framed as a binary classification problem. The dataset was obtained from the UC Irvine Machine Learning Repository.
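
As a minimal sketch of the binary framing, the snippet below collapses the dataset's three-valued `readmitted` column (`<30`, `>30`, `NO`) into a 0/1 label. Treating only readmission within 30 days as the positive class is an assumption made for illustration, `to_binary_label` and `load_labels` are hypothetical helpers, and the inline sample rows are made up. The mapping the project actually uses may differ.

```python
import csv
import io

# Tiny inline sample standing in for data/diabetic_data.csv (rows are made up).
SAMPLE_CSV = """encounter_id,readmitted
1,NO
2,<30
3,>30
4,<30
"""


def to_binary_label(readmitted):
    """Map the raw `readmitted` value to 0/1.

    Assumption: only readmission within 30 days ("<30") counts as positive.
    """
    return 1 if readmitted.strip() == "<30" else 0


def load_labels(csv_text):
    """Read the `readmitted` column from CSV text and binarize it."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [to_binary_label(row["readmitted"]) for row in reader]


labels = load_labels(SAMPLE_CSV)
print(labels)
```

If both `<30` and `>30` should count as readmissions, only the condition inside `to_binary_label` needs to change.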

The project includes data preprocessing, feature engineering, and the application of tree-based machine learning models (Random Forest, XGBoost, and LightGBM) for prediction. Details of the feature engineering are documented in Data_preparation/Feature_engineering.xlsx.

For a full project report and summary of results, refer to Project_Summary.docx.

Installation

Note: this project was not developed to be installed and run, but rather to be examined and checked for its workflow.

To set up and run the project environment:

Clone the repository.
Install required Python packages:

conda env create -f environment.yml

Usage

Execute the main script to start the project:

python pyScripts/MainScript.py

By running the main script, the following steps will be executed:

Data preprocessing
Feature engineering
Pipeline execution

File Structure

ML_Project/
├── Data_preparation
│   └── Feature_engineering.xlsx   # Describes the feature engineering process
├── EDA
│   └── EDA.ipynb                  # Exploratory Data Analysis notebook
├── PDFs
│   ├── AMLLS Project Description.pdf # Description documents provided by course hosts
│   └── AMLLS Project.pdf             # Description documents provided by course hosts
├── PO
│   ├── final_step_workflow.pptx      # Descriptive presentation for the collaborators 
│   └── workflow.pptx                 # Full workflow
├── Project_Summary.docx              # Full written description of this project
├── README.md                         # README file
├── bsub                              # Bsub - bash scripting directory for the use of wexac user only
|
├── data
│   ├── IDS_mapping.csv                                     #This file maps the ID codes used in the dataset (e.g. admission type, discharge disposition, admission source) to their descriptions.
│   ├── LGBM_top10_features.png                             #This file is an image showing the top 10 features selected by the LightGBM model.
│   ├── copula_train_set_300_epochs_4_numeric.csv           #This file contains a training dataset generated using the copula method with 300 epochs and 4 numeric variables.
│   ├── diabetes+130-us+hospitals+for+years+1999-2008.zip   #original data zip
│   ├── diabetic_data.csv                                   #This file contains the main dataset for the ML project, which includes information about diabetic patients.
│   ├── results_importance_lgbm.csv                         #This file contains the feature importance scores generated by the LightGBM model.
│   ├── results_importance_transposed_lgbm.csv              #This file contains the transposed feature importance scores generated by the LightGBM model.
│   ├── results_lgbm.csv                                    #This file contains the results of the ML model trained using LightGBM.
│   └── score_table.csv                                     #This file contains the classifier scores on the original, SMOTE, ctGAN, and copulaGAN datasets.
|
├── environment.yml
└── pyScripts
    ├── AddRootDirectoriesToSysPath.py
    ├── BarModels                        # BarModels - Bar Cohen's directory
    │   ├── RF_Main_Run_BestParams.py    # This script runs the best parameters for the random forest model.
    │   ├── RF_Main_Run_FullScript.py    # This script runs the full random forest script, including a grid search for the best parameters.
    │   ├── Rendom_forest_BC.py          # This script holds the classes used in the main script.
    │   ├── X_test_df.csv                # Final train and test sets used for training the model.
    │   ├── X_train_df.csv               # Final train and test sets used for training the model.
    │   ├── __main__.py
    │   ├── bsub                       # Bsub - bash scripting directory for the use of wexac user only
    │   ├── logs                       # Logs - directory for providing the real score of the model
    │   ├── personalClass              # PersonalClass - directory for the classes used during the project
    │   ├── results                    # Results - directory for the results of the model
    │   │   ├── Thumbs.db
    │   │   ├── feature_importance_plot.png          # Feature importance plot for the random forest model.
    │   │   ├── feature_importance_table.csv         # Feature importance table for the random forest model.
    │   │   └── prediction_table.csv                 # Prediction table for the random forest model.
    │   ├── y_test.csv
    │   └── y_train.csv
    ├── DefPipeLineClasses.py
    ├── GuyTrain
    │   ├── X_test_df.csv                   # Final train and test sets used for training the xgboost model.   
    │   ├── X_test_np.npy                   # Final train and test sets used for training the xgboost model.   
    │   ├── X_train_df.csv                  # Final train and test sets used for training the xgboost model.
    │   ├── X_train_np.npy                  # Final train and test sets used for training the xgboost model.
    │   ├── __init__.py                     # 
    │   ├── diabetic_data.csv               # Full dataset
    │   ├── feature_importance_script.py    # Generates feature importance plot for the xgboost model post-tuning.                   
    │   ├── xgboost_gridcv.py               # Hyperparameter tuning for xgboost using grid search cross-validation.       
    │   ├── xgboost_optuna.py               # Hyperparameter tuning for xgboost using Optuna.  
    │   ├── xgboost_train_grid_gpu.ipynb    # Hyperparameter tuning for xgboost using grid search cross-validation.                   
    │   ├── y_test.csv                      # Final train and test sets used for training the xgboost model.
    │   ├── y_test.npy                      # Final train and test sets used for training the xgboost model.
    │   ├── y_train.csv                     # Final train and test sets used for training the xgboost model.
    │   └── y_train.npy                     # Final train and test sets used for training the xgboost model.
    ├── LGBM.py                             # Saar's main script (trains the LightGBM model)
    ├── RunPipe.py                          # python file for running the pipeline
    ├── __init__.py                         
    |
    ├── classes
    │   ├── ConditionalTransformer.py
    │   ├── CopulaGenerator.py
    │   ├── SeeTheData.py
    │   ├── __init__.py
    │   └── evaluation_classes.py
    ├── deadendscript
    │   ├── __init__.py
    │   ├── disease_ids_conds.py            #This script contains a function used for recategorizing the data in the dataset, and stores long lists used to generate the dataset.
    │   ├── feature_importance_rnf_clf.py   #This script was used to test the feature importance on the CopulaGAN-generated balanced train set with 15 different seeds.
    │   ├── generate_n_test_df.py           #
    │   └── graphs
    │       ├── Thumbs.db
    │       └── feature_importance_15_seeds_mean.png
    ├── featureImportanceDir
    │   ├── Feature_ImportanceClass.py
    │   ├── LGBM_feature_importance.png
    │   ├── Random Forest_feature_importance.png
    │   ├── Thumbs.db
    │   ├── XGBoost_feature_importance.png
    │   ├── X_train.csv
    │   ├── X_train_np.npy
    │   ├── __init__.py
    │   ├── feature_names.txt
    │   ├── temp_mainBelongToBCGonnabedeletWhenDone.py
    │   └── y_train.csv
    ├── main.py
    ├── prepare_data.py
    └── preprocessing_pipe.py

Contact

Saar Ezagouri - saare@weizmann.ac.il
Guy Ilan - guy.ilan@weizmann.ac.il
Bar Cohen - bar.cohen@weizmann.ac.il

Project Link: https://github.com/brchn6/ML_Project.git

Workflow Checklist:

WORKFLOW:

1. The data is uploaded and preprocessed in the Data_preparation directory.

2. EDA is conducted in the EDA/EDA.ipynb notebook.

3. Feature engineering details are documented in the Data_preparation/Feature_engineering.xlsx file.

4. The main script is run in the pyScripts directory, producing the train/test split and the GAN-generated data.

5. Each team member runs their model on the same data and compares the results.

6. The final results will be compared and the best model will be chosen.
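
Step 4 produces a train/test split that all team members then share. The sketch below shows one way such a split could be made reproducible and class-balanced across both parts. Stratifying by label, the 20% test fraction, the fixed seed, and the helper name `stratified_split` are all assumptions for illustration. The project's own split logic lives in the pyScripts code and may differ.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx) so each class keeps roughly the same share in both parts."""
    rng = random.Random(seed)  # fixed seed so every team member gets the same split
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy imbalanced labels: 90 negatives, 10 positives.
labels = [0] * 90 + [1] * 10
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
```

Saving the resulting index lists (or the split files themselves) is what lets each model be compared on identical data in steps 5 and 6.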

WORKFLOW DETAILS:

Data upload and initial preprocessing occur within the Data_preparation/ directory. The data folder contains all the CSV files given to us at the beginning of the project as well as the final training set used for training the model:
1. diabetic_data.csv
2. IDS_mapping.csv
3. copula_train_set_300_epochs_4_numeric.csv
EDA is conducted in EDA/EDA.ipynb, providing insights necessary for model building.
Feature engineering details are documented in Data_preparation/Feature_engineering.xlsx.

pyScripts folder:

GuyTrain:

1. [X_test_df, X_train_df, y_test_df, y_train_df] - the final train and test sets used for training the model, 
customized to fit the xgboost model.

2. xgboost_gridcv.py - This script was used to hyperparameter-tune the xgboost model using grid search cross-validation.

3. xgboost_optuna.py - This script was used to hyperparameter-tune the xgboost model using Optuna.

4. feature_importance_script.py - contains the functions used in order to generate
the feature importance plot of the xgboost model after hyperparameter tuning.
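
The grid search with cross-validation mentioned in item 2 can be pictured with the following library-free sketch: every combination in a parameter grid is scored by k-fold cross-validation, and the combination with the best mean score wins. The toy `ThresholdClassifier`, the grid values, the contiguous folds, and the accuracy metric are assumptions for illustration. The project script tunes XGBoost, whose parameters and scoring are not reproduced here.

```python
from itertools import product


class ThresholdClassifier:
    """Toy stand-in model: predicts 1 when x * scale >= threshold."""

    def __init__(self, threshold, scale):
        self.threshold = threshold
        self.scale = scale

    def fit(self, X, y):
        return self  # nothing to learn in this toy model

    def predict(self, X):
        return [int(x * self.scale >= self.threshold) for x in X]


def k_fold_indices(n, k):
    """Yield (train_idx, val_idx) pairs for k contiguous folds."""
    fold = n // k
    for i in range(k):
        start = i * fold
        stop = n if i == k - 1 else start + fold
        val = list(range(start, stop))
        train = [j for j in range(n) if j < start or j >= stop]
        yield train, val


def grid_search_cv(X, y, grid, k=3):
    """Score every parameter combination by mean k-fold accuracy."""
    best_params, best_score = None, float("-inf")
    names = list(grid)
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        fold_scores = []
        for train, val in k_fold_indices(len(X), k):
            model = ThresholdClassifier(**params)
            model.fit([X[i] for i in train], [y[i] for i in train])
            preds = model.predict([X[i] for i in val])
            correct = sum(int(p == y[i]) for p, i in zip(preds, val))
            fold_scores.append(correct / len(val))
        score = sum(fold_scores) / len(fold_scores)
        if score > best_score:  # strict comparison keeps the first best combination
            best_params, best_score = params, score
    return best_params, best_score


X = [1, 6, 2, 7, 3, 8, 4, 9, 0]
y = [0, 1, 0, 1, 0, 1, 0, 1, 0]
best_params, best_score = grid_search_cv(X, y, {"threshold": [2, 5, 8], "scale": [1, 2]})
print(best_params, best_score)
```

The cost grows with the product of all grid sizes times the number of folds, which is one common reason to also try a sampling-based tuner such as Optuna (item 3).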

BarModels:

1. __main__.py is the main Python script that runs the classes built in Rendom_forest_BC.py.
2. Rendom_forest_BC.py holds the classes used in the main script.
3. RF_Main_Run_BestParams.py runs the random forest model with the best parameters.
4. RF_Main_Run_FullScript.py runs the full random forest script, including a grid search for the best parameters.
5. bsub is a directory that holds the bash scripts that run the Python scripts on the cluster.
6. logs is a directory that holds the logs of the scripts that run on the cluster.
7. personalClass is a directory that holds the classes used in the main script.
8. results is a directory that holds the results of the model.
9. X_test_df.csv, X_train_df.csv, y_test.csv, and y_train.csv are the final train and test sets used for training the model.

Saar main script:

1. LGBM.py - This script is used to train the LightGBM model on the dataset.

deadendscript:

1. disease_ids_conds.py - This script contains a function used for recategorizing the data in the dataset, 
as well as to store long lists used to generate the dataset.

2. feature_importance_rnd_clf.py - This script is used to generate the feature importance plot of the dataset using the random forest classifier.

3. generate_n_test_df.py - This script is used to generate the balanced datasets and evaluate the performance of the classifiers on the different datasets.
The classes CopulaGenerator, ConditionalTransformer, and ClassifierEvaluator are used in this script.
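
To make "balanced dataset" concrete, the sketch below counts how many extra rows each class needs to match the largest class and then fills the gap. Note that it uses plain random resampling of existing minority rows as a stand-in. It does not implement ctGAN or copulaGAN, and the helper names `rows_needed_for_balance` and `balance_by_resampling` are hypothetical.

```python
import random
from collections import Counter


def rows_needed_for_balance(labels):
    """Return {label: number of extra rows needed to match the largest class}."""
    counts = Counter(labels)
    target = max(counts.values())
    return {label: target - n for label, n in counts.items()}


def balance_by_resampling(rows, labels, seed=0):
    """Stand-in for a synthetic generator: duplicates randomly chosen minority rows."""
    rng = random.Random(seed)
    out_rows, out_labels = list(rows), list(labels)
    for label, missing in rows_needed_for_balance(labels).items():
        pool = [row for row, lab in zip(rows, labels) if lab == label]
        for _ in range(missing):
            out_rows.append(rng.choice(pool))
            out_labels.append(label)
    return out_rows, out_labels


# Toy data: 8 rows of class 0 and 2 rows of class 1.
rows = [[i] for i in range(10)]
labels = [0] * 8 + [1] * 2
balanced_rows, balanced_labels = balance_by_resampling(rows, labels)
print(Counter(balanced_labels))
```

A generative model would replace the `rng.choice(pool)` step with newly sampled rows. Either way, balancing should be applied to the training set only, so that the test set keeps the original class distribution.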

featureImportanceDir:

1. Feature_ImportanceClass.py - This class is used to generate the feature importance plot of the dataset using the random forest classifier.
2. LGBM_feature_importance.png - This image shows the feature importance scores generated by the LightGBM model.
3. Random Forest_feature_importance.png - This image shows the feature importance scores generated by the Random Forest model.
4. XGBoost_feature_importance.png - This image shows the feature importance scores generated by the XGBoost model.

classes:

1. ConditionalTransformer.py - This class is used to conditionally apply a transformer to the data. 
If the condition is set to True, the transformer is applied to the data, else the data is returned as is.
This class was used in the pipeline in order to make the GAN-based generation step conditional.
2. SeeTheData.py - This class is used to visualize the data in the dataset.
3. evaluation_classes.py - This class is used to evaluate and compare the performance of different classifiers on the original, SMOTE, ctGAN, and copulaGAN datasets. It also generates the score_table.csv file.
4. CopulaGenerator.py - This class is used to generate synthetic data in order to balance the dataset. Both ctGAN and copulaGAN can be used to generate and balance the dataset. It was used in order to create the final balanced dataset.


Thank you for reading.
