GitHub - elayer/Fetal-Health-Classifier-Project: This project aims to predict the health condition of fetal babies being normal, suspect, or pathological (potential for having a disease or some unhealthy condition).

Fetal Health Condition Classifier - Overview:

Created a model that classifies cardiotocography (CTG) exam results for fetuses in development as normal, suspect (possibly pathological), or pathological.
Engineered new features using Linear Discriminant Analysis and KMeans Clustering. I also performed PCA to explore and visualize class separability further, but chose not to include the components as features because of the number of components and the risk of overfitting (I may build new models with them if I come back to this project).
Began model building with Support Vector Machine, K-Nearest Neighbors, Logistic Regression, Random Forest Classifier, and AdaBoost Classifier. I then used Optuna for hyperparameter optimization with XGBoost Classifier and CatBoost Classifier.
Created an API for potential clients using Flask (picture of an example input of data included).
UPDATE: I recently uploaded results obtained with artificial data generated by SMOTE. Using this algorithm greatly enhanced the evaluation metrics, but in a practical setting we have to keep in mind that this is artificial data and may not be fully representative of real-world data. The current Flask API still uses the CatBoost model without SMOTE.

The pickled CatBoost Classifier trained on the SMOTE-balanced data is too large to upload to GitHub (44MB). Therefore, that model will have to remain local for the time being. If you would like access to the model file, feel free to contact me.

Code and Resources Used:

Python Version: 3.8.5

Packages: numpy, pandas, scipy, matplotlib, seaborn, sklearn, xgboost, catboost, optuna. The sklearn modules used include, but are not limited to: PCA, LDA, KMeans, Random Forest Classifier, SVM, Logistic Regression, and various preprocessing and model selection utilities.

Web Framework Requirements Command: pip install -r requirements.txt

References:

Various project structure and process elements were learned from Ken Jee's YouTube series: https://www.youtube.com/watch?v=MpF9HENQjDo&list=PL2zq7klxX5ASFejJj80ob9ZAnBHdz5O1t
Wikipedia article concerning cardiotocography exams and information about them: https://en.wikipedia.org/wiki/Cardiotocography#:~:text=Cardiotocography%20(CTG)%20is%20a%20technique,monitoring%20is%20called%20a%20cardiotocograph
Disclaimer regarding the data: I did NOT scrape or collect this data myself originally. The data used for this project was obtained from Kaggle (source in the following link): https://www.kaggle.com/datasets/andrewmvd/fetal-health-classification

Data Cleaning

Since the data was already thoroughly cleaned when I obtained the dataset, very little cleaning was carried out in this project. I did perform some outlier detection, but chose to keep the existing outliers, as they could be important to the analysis given the size of the dataset and the integrity of how the data was collected.
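The README does not say which outlier detection method was used. As one possible illustration, here is a minimal sketch of an interquartile-range (IQR) check that flags outliers without dropping them. The IQR rule, the helper name `iqr_outlier_mask`, the column names, and the synthetic data are all assumptions made for this example, not taken from the notebooks.

```python
import numpy as np
import pandas as pd


def iqr_outlier_mask(df: pd.DataFrame, k: float = 1.5) -> pd.DataFrame:
    """Mark values outside [Q1 - k*IQR, Q3 + k*IQR], column by column."""
    q1 = df.quantile(0.25)
    q3 = df.quantile(0.75)
    iqr = q3 - q1
    return (df < q1 - k * iqr) | (df > q3 + k * iqr)


# Synthetic stand-in for the CTG table (hypothetical column names).
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "baseline_value": rng.normal(133, 10, 200),
    "prolonged_decelerations": rng.normal(0.0, 0.001, 200),
})
demo.loc[0, "baseline_value"] = 400.0  # planted extreme value

mask = iqr_outlier_mask(demo)
outlier_counts = mask.sum()  # flagged values per column; rows are kept
```

Flagging rather than filtering matches the choice described above: the rows stay in the dataset, and the counts only show where the extreme values are.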

EDA

Some key findings, including an example of a test datapoint being classified using Flask, are included below. I noticed that pathological records with higher prolonged decelerations had a consistent heart rate baseline. In the pairplot, higher values of prolonged decelerations (far right column) generally come with lower values for accelerations and uterine contractions than in normal records.

I also include the separation identified by PCA and how these distinctions in the cardiotocography exam metrics can be used to identify whether a datapoint is normal or pathological. In addition, there is also the application of LDA which results in a 2D visual where the three class distinctions can be seen between the linear discriminants.

Update: I include a graph of the ROC curves, with AUC scores, for the CatBoost Classifier trained on the SMOTE data.

Target labels can be translated as follows (according to the data source, each record was labelled by three obstetricians):

1.0 -> Normal

2.0 -> Suspect

3.0 -> Pathological

Model Building

Before building any models, I included the linear discriminants from my LDA application as well as clusters created from applying KMeans Clustering to the dataset as new features. I then scaled the data using MinMaxScaler for the Support Vector Machine implementation, and StandardScaler for all other models attempted.
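The feature engineering step above could be sketched roughly as follows. This is an illustration under assumptions, not the notebook code: the data is a synthetic stand-in generated with `make_classification`, the number of clusters (3) is a guess, and fitting LDA, KMeans, and the scaler on the training split only is my choice to avoid leaking test information.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced 3-class stand-in for the CTG data.
X, y = make_classification(
    n_samples=600, n_features=10, n_informative=6, n_classes=3,
    weights=[0.78, 0.14, 0.08], random_state=0,
)
y = y + 1  # relabel 0/1/2 as 1/2/3 to mirror the target encoding

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# With three classes, LDA yields at most two linear discriminants.
lda = LinearDiscriminantAnalysis(n_components=2).fit(X_train, y_train)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_train)


def add_features(X_part):
    """Append the two discriminants and the cluster id as extra columns."""
    return np.column_stack(
        [X_part, lda.transform(X_part), kmeans.predict(X_part)]
    )


X_train_fe = add_features(X_train)
X_test_fe = add_features(X_test)

# StandardScaler shown here; MinMaxScaler would be swapped in for the SVM.
scaler = StandardScaler().fit(X_train_fe)
X_train_scaled = scaler.transform(X_train_fe)
X_test_scaled = scaler.transform(X_test_fe)
```

The cluster id is appended as a plain numeric column for brevity; one-hot encoding it would be a reasonable alternative, since cluster numbers carry no order.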

I began model testing with the Support Vector Machine, since we are interested in finding optimal class separability, and then tried K-Nearest Neighbors and Logistic Regression for comparison. Oddly enough, the models performed better without stratification, but in practice it is likely more appropriate to stratify the testing data, since there were more normal records than pathological records.
This led me to try Random Forest and AdaBoost Classifier together with StratifiedKFold, so that each fold keeps the same class proportions during training. The Random Forest Classifier performed better on all folds.
I then concluded by using Optuna with the XGBoost and CatBoost Classifiers. As expected, the CatBoost Classifier yielded the best results out of all the models attempted.
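The fold-by-fold comparison of Random Forest and AdaBoost can be sketched as below. The dataset is a synthetic stand-in and the model settings are defaults chosen for the example, so the scores will not match the ones reported in this README.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold

# Synthetic, imbalanced 3-class stand-in for the CTG data.
X, y = make_classification(
    n_samples=600, n_features=10, n_informative=6, n_classes=3,
    weights=[0.78, 0.14, 0.08], random_state=0,
)

# Each fold keeps roughly the same class proportions as the full dataset.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
models = {
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "adaboost": AdaBoostClassifier(random_state=0),
}

fold_scores = {name: [] for name in models}
for train_idx, val_idx in skf.split(X, y):
    for name, model in models.items():
        model.fit(X[train_idx], y[train_idx])
        pred = model.predict(X[val_idx])
        fold_scores[name].append(
            f1_score(y[val_idx], pred, average="weighted")
        )
```

Comparing `fold_scores["random_forest"]` and `fold_scores["adaboost"]` entry by entry shows whether one model wins on every fold or only on average.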

(As a potential point to try in the future, I wonder how well the models would perform if including some newly sampled data to balance the records on the class targets. SMOTE is a potential method to perform this).

Model Performance

The Random Forest and CatBoost classifier models had the two best performances, with CatBoost being the top model. These models performed a little better than the Support Vector Machine, K-Nearest Neighbors, and Logistic Regression models attempted earlier. Below are the recorded weighted F1 scores and accuracies for each model:

Support Vector Machine F1 Score: 91%, Accuracy: 91.54% --- with SMOTE F1 Score: 96%, Accuracy: 96.46%
K-Nearest Neighbors F1 Score: 92%, Accuracy: 92.11% --- with SMOTE F1 Score: 98%, Accuracy: 97.91%
Logistic Regression F1 Score: 91%, Accuracy: 90.60% --- with SMOTE F1 Score: 87%, Accuracy: 86.71%
Random Forest Classifier F1 Score: 94.40%, Accuracy: 94.40% --- with SMOTE F1 Score: 97.82%, Accuracy: 97.82%
AdaBoost Classifier F1 Score: 84.24%, Accuracy: 84.24% --- with SMOTE F1 Score: 85.80%, Accuracy: 85.80%
XGBoost Classifier F1 Score: 92.24%, Accuracy: 92% --- with SMOTE F1 Score: 97.48%, Accuracy: 97%
CatBoost Classifier F1 Score: 95.06%, Accuracy: 95.06% --- with SMOTE F1 Score: 98.99%, Accuracy: 98.99%
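For reference, here is how the two metrics are computed with scikit-learn on a tiny hand-made example (the labels below are invented for illustration, not taken from the project). The weighted F1 score averages the per-class F1 scores, weighting each class by its number of true records.

```python
from sklearn.metrics import accuracy_score, f1_score

# Six invented records: four normal (1), one suspect (2), one pathological (3).
y_true = [1, 1, 1, 1, 2, 3]
y_pred = [1, 1, 1, 2, 2, 3]  # one normal record misclassified as suspect

accuracy = accuracy_score(y_true, y_pred)                 # 5 of 6 correct
weighted_f1 = f1_score(y_true, y_pred, average="weighted")

# Per-class F1: class 1 = 6/7, class 2 = 2/3, class 3 = 1.
# Weighted by support (4, 1, 1): (4 * 6/7 + 2/3 + 1) / 6, about 0.849.
```

Because the weights follow class frequency, a model can score well on this metric while still doing poorly on the rare pathological class, so per-class scores are worth checking as well.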

I used Optuna with XGBoost and CatBoost to build an optimized model, since these algorithms expose a large number of hyperparameters to tune.

Productionization

Lastly, I created a Flask API hosted on a local webserver. For this step I primarily followed the productionization step from the YouTube tutorial series found in the references above. Given cardiotocography exam results, this endpoint returns a prediction of whether that pregnancy has normal health conditions or a pathological concern.
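A minimal sketch of what such an endpoint might look like. The route name `/predict`, the JSON key `input`, the five-feature stand-in model, and the error handling are all hypothetical; the actual code lives in the FlaskAPI folder and may differ.

```python
import numpy as np
from flask import Flask, jsonify, request
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

LABELS = {1: "Normal", 2: "Suspect", 3: "Pathological"}

# Stand-in model on synthetic data; the real API loads a pickled CatBoost model.
X, y = make_classification(
    n_samples=300, n_features=5, n_informative=4, n_redundant=0,
    n_classes=3, random_state=0,
)
model = LogisticRegression(max_iter=1000).fit(X, y + 1)

app = Flask(__name__)


@app.route("/predict", methods=["POST"])
def predict():
    """Classify one record sent as {"input": [f1, f2, ...]}."""
    payload = request.get_json(silent=True) or {}
    features = payload.get("input")
    if not isinstance(features, list) or len(features) != X.shape[1]:
        msg = f"expected a list of {X.shape[1]} numbers under 'input'"
        return jsonify(error=msg), 400
    row = np.asarray(features, dtype=float).reshape(1, -1)
    code = int(model.predict(row)[0])
    return jsonify(prediction=code, label=LABELS[code])


# Exercise the endpoint in-process, without starting a server.
with app.test_client() as client:
    ok = client.post("/predict", json={"input": X[0].tolist()})
    bad = client.post("/predict", json={"input": [1.0]})
```

Returning the text label alongside the numeric code saves the client from having to know the 1.0/2.0/3.0 encoding.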

UPDATE: Below is an example prediction using the Flask API:

Future Improvements

In the case we do not wish to use artificial data, we can simply remove the SMOTE implementation and stratify the dependent variable when splitting the data into train and test sets. From here, we can run the same models knowing the data is genuine.
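As a small illustration of the stratified split, using invented class counts rather than the real dataset: passing the target to the `stratify` argument of `train_test_split` keeps the class proportions the same in both splits.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Invented imbalanced target: 780 normal, 140 suspect, 80 pathological.
y = np.array([1] * 780 + [2] * 140 + [3] * 80)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder feature column

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Share of each class in the two splits; both should mirror 78% / 14% / 8%.
train_share = {c: float(np.mean(y_train == c)) for c in (1, 2, 3)}
test_share = {c: float(np.mean(y_test == c)) for c in (1, 2, 3)}
```

Without `stratify`, a small test set can end up with very few pathological records, which makes the scores for that class unstable.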

5/27/2022: Applying LDA using scaled data instead of unscaled data.

