qtracie/diabetes-prediction

Diabetes Classification: Predictive Analytics for Employee Health 🏥

This project analyzes health data to build a predictive model that identifies individuals at risk of diabetes.
Tech: Python, pandas, scikit-learn, matplotlib, seaborn.
Skills: Data Cleaning & Preprocessing; Exploratory Data Analysis (EDA) & Visualization; Handling Class Imbalance (SMOTE Oversampling); Model Training & Evaluation.

📌 Project Overview

Type 2 Diabetes Mellitus (DM2) imposes a large economic burden in the U.S., estimated at $404B in total annual costs as of 2017.
This project utilizes the CDC Health Indicators Dataset to build a predictive model aimed at the early identification of individuals at risk of undiagnosed diabetes.

The primary objective is to support employers in targeting prevention resources to reduce long-term medical costs and improve workforce health outcomes.

📊 Dataset Description

  • Source: CDC Diabetes Health Indicators via Kaggle.
  • Size: 253,680 rows and 22 columns.
  • Target Variable: Diabetes_binary (0 = No, 1 = Yes).
  • Class Distribution: The dataset is highly imbalanced, with only 13.9% (35,346 cases) of respondents diagnosed with diabetes.

Target Distribution

Figure 1: Distribution of Diagnosed Diabetes showing significant class imbalance (13.9% positive cases).

🔍 Key Insights from Exploratory Data Analysis (EDA)

  • Clinical Predictors: Strong associations were found between diabetes and factors such as hypertension, hypercholesterolemia, high BMI, and sedentary lifestyles.
  • Physical vs. Mental Health: While both are correlated with diabetes, physical health proved to be a much stronger discriminator of diabetes risk than mental health.
  • Interaction Effects: Our analysis showed that the effect of mobility issues (DiffWalk) on diabetes risk depends heavily on the respondent's age.
  • Unexpected Findings: Heavy alcohol consumption appeared inversely related to diabetes risk in this specific dataset.

Risk Factors

Figure 2: Analysis of BMI and physical health as primary discriminators for diabetes risk.

⚙️ Model Strategy & Evaluation

  • Data Partition: 60% Train / 20% Validation / 20% Test.
  • Primary Metric: Recall (Class 1). In a healthcare screening context, a false negative (a missed diabetic case) is the costlier error, so we prioritized maximizing the detection of actual diabetic cases.
  • Handling Imbalance: Applied SMOTE (Oversampling) on the training set to balance the classes (~130,000 samples each), allowing the model to learn minority-class patterns effectively.
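The 60/20/20 partition can be sketched with two stratified calls to scikit-learn's `train_test_split`. This is a minimal illustration, not necessarily how the project script does it: the toy arrays, the `random_state` values, and the two-step approach are all assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-in for the real data (hypothetical): 10,000 rows, ~13.9% positives.
rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 5))
y = (rng.random(10_000) < 0.139).astype(int)

# Step 1: hold out 20% as the test set, stratified on the target.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=42
)
# Step 2: take 25% of the remaining 80% as validation (0.25 * 0.80 = 0.20),
# which leaves 60% of the original rows for training.
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, stratify=y_rest, random_state=42
)

for name, labels in [("train", y_train), ("val", y_val), ("test", y_test)]:
    print(f"{name}: n={len(labels)}, positive share={labels.mean():.3f}")
```

Stratifying keeps the positive share roughly equal across the three partitions. Oversampling would then be fitted on `X_train, y_train` only (for example with imbalanced-learn's `SMOTE().fit_resample`), so the validation and test sets keep the original class distribution. As a sanity check on the figures above: 253,680 − 35,346 = 218,334 negative cases, and 60% of those is about 131,000, which matches the "~130,000 samples each" after balancing.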

Final Model Performance

Figure 3: Confusion Matrix for the final Logistic Regression model with SMOTE.

Note: SMOTE (Synthetic Minority Over-sampling Technique) generates synthetic data points for the minority class to prevent the model from being biased toward the majority class.
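To make the idea of "synthetic data points" concrete, here is a deliberately simplified, standard-library sketch of the interpolation step behind SMOTE. It is not the imbalanced-learn implementation and not the project's code; the function name `smote_like`, the value of `k`, and the sample points are hypothetical.

```python
import math
import random

def smote_like(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between minority-class neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority-class neighbours of the base point (excluding itself).
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # random position along the segment base -> neighbour
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic

# Hypothetical minority-class samples with two numeric features.
minority = [(1.0, 2.0), (1.2, 1.9), (0.9, 2.3), (1.1, 2.1), (1.4, 2.2)]
new_points = smote_like(minority, n_new=10)
print(new_points[:3])
```

Each synthetic point lies on a line segment between two real minority samples, so the new points stay inside the region the minority class already occupies rather than duplicating existing rows.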

🏆 Best Model Performance

Model Comparison

Figure 4: Recall comparison across models; oversampling significantly improved detection.

The Logistic Regression Model with Oversampling was selected as the final model for screening purposes.

| Metric | Training Set | Validation Set | Test Set |
|---|---|---|---|
| Precision (Class 1) | 0.74 | 0.31 | 0.32 |
| Recall (Class 1) | 0.78 | 0.76 | 0.77 |
| F1-Score (Class 1) | 0.76 | 0.44 | 0.45 |
| Accuracy | 0.75 | 0.73 | 0.73 |
| AUC | 0.83 | 0.83 | 0.82 |

Note: Precision on held-out data is only ~0.32, meaning roughly one in three flagged individuals actually has diabetes. This trade-off is acceptable for a screening tool, where missing a sick person is more costly than re-testing a healthy one.
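The reported figures can be cross-checked with a little arithmetic. The sketch below recomputes F1 from the reported precision and recall, then backs out the false-positive rate implied by the test-set numbers. It assumes the test set has the dataset's overall 13.9% prevalence and that the training metrics were computed on the SMOTE-balanced training set; neither is stated explicitly in this README.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

def precision_at(prevalence, recall, fpr):
    """Precision implied by recall, false-positive rate and class prevalence."""
    tp = recall * prevalence
    fp = fpr * (1 - prevalence)
    return tp / (tp + fp)

# Reported (precision, recall) for class 1, taken from the table above.
reported = {"train": (0.74, 0.78), "validation": (0.31, 0.76), "test": (0.32, 0.77)}
for split, (p, r) in reported.items():
    print(f"{split}: F1 = {f1(p, r):.2f}")

# Back out the test-set false-positive rate from precision, recall and the
# assumed 13.9% prevalence, then ask what precision that same rate would
# give on a 50/50 balanced set.
prevalence, p_test, r_test = 0.139, 0.32, 0.77
fpr = r_test * prevalence * (1 - p_test) / (p_test * (1 - prevalence))
print(f"implied FPR: {fpr:.2f}")
print(f"precision at 50% prevalence: {precision_at(0.5, r_test, fpr):.2f}")
```

The recomputed F1 values (0.76, 0.44, 0.45) agree with the table. Under the stated assumptions the implied false-positive rate is about 0.26, which would give a precision of about 0.74 on a balanced set. That is consistent with the training-set precision, and suggests the gap between training and held-out precision mostly reflects the difference in class balance rather than overfitting.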

💡 Business Recommendations

  • Targeted Screening: Prioritize employees with High Blood Pressure, High Cholesterol, and High BMI for screening programs.
  • Wellness Investment: Subsidize nutrition coaching and meal plan support to address the lifestyle factors that most strongly predict diabetes.
  • Mobility Support: Provide ergonomic workstations and physical therapy, especially for the older workforce experiencing movement limitations.
  • Financial Support: Lower copays for screenings and offer free annual checks for lower-income groups to reduce barriers to care.

🛠️ How to Run This Project

Follow these steps to reproduce the analysis on your local machine:

1. Clone the Repository

git clone https://github.com/qtracie/diabetes-prediction.git
cd diabetes-prediction

2. Install Required Libraries

pip install -r requirements.txt

3. Execute the Script

python notebooks/diabetes_classification_analysis.py

4. View the Final Report

For a detailed business and technical summary, refer to the PDF in the reports/ directory:

reports/Group 2B - Diabetes Classification.pdf
