Diabetes multiclassification using different machine learning algorithms such as Logistic Regression, Decision Trees, Random Forest and LightGBM
Make sure Python is installed, then follow these steps to set up the environment and run the application:
- Clone the Repository:
git clone https://github.com/Sambonic/diabetes-multiclassification
cd diabetes-multiclassification
- Create a Python Virtual Environment:
python -m venv env
- Activate the Virtual Environment:
  - On Windows:
  env\Scripts\activate
  - On macOS and Linux:
  source env/bin/activate
- Ensure Pip is Up-to-Date:
python -m pip install --upgrade pip
- Install Dependencies:
pip install -r requirements.txt
- Import Diabetes Multiclassification as shown below.
To utilize this diabetes classification project:
- Run the notebook: Execute the Jupyter Notebook (diabetes_classification_ml.ipynb). The notebook will perform the following actions automatically:
  - Load the diabetes dataset.
  - Perform exploratory data analysis (EDA), including data type checking, statistical analysis, visualization of missing values and class distributions, and correlation analysis.
  - Handle missing values using mode imputation for categorical features and median imputation for numerical features. Evaluate different imputation methods.
  - Handle outliers in the 'BMI' feature using IQR.
  - Discretize the 'BMI' feature into meaningful categories.
  - Balance the dataset using undersampling and oversampling techniques (SMOTENC).
  - Perform feature selection using Chi-squared test and Random Forest feature importance.
  - Train several classification models (Random Forest, Decision Tree, LightGBM, Logistic Regression) with and without feature selection and hyperparameter tuning.
  - Evaluate model performance using various metrics (accuracy, precision, recall, F1-score) and visualize results using learning curves, confusion matrices, and ROC curves.
  - Compare the performance of the different models.
- Interpret results: The notebook generates visualizations and metrics that show how each model performs under different conditions (with/without feature selection, with/without hyperparameter tuning). Use these results to determine which model performs best for diabetes classification.
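The preprocessing steps listed above (mode/median imputation, IQR-based outlier handling for 'BMI', and BMI discretization) could look roughly like the sketch below. It is not the notebook's code: the toy data, the `Smoker` column, the helper names, the choice to clip outliers rather than drop them, and the BMI bin edges are all assumptions made for illustration.

```python
import pandas as pd


def impute(df, categorical, numerical):
    """Fill categorical gaps with the mode and numerical gaps with the median."""
    out = df.copy()
    for col in categorical:
        out[col] = out[col].fillna(out[col].mode().iloc[0])
    for col in numerical:
        out[col] = out[col].fillna(out[col].median())
    return out


def clip_iqr(series, k=1.5):
    """Clip values outside [Q1 - k*IQR, Q3 + k*IQR] (assumed strategy: clip, not drop)."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return series.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)


def bmi_category(series):
    """Bin BMI into four categories; the edges here are an assumption."""
    bins = [0, 18.5, 25, 30, float("inf")]
    labels = ["underweight", "normal", "overweight", "obese"]
    return pd.cut(series, bins=bins, labels=labels, right=False)


# Hypothetical toy data standing in for the real dataset.
df = pd.DataFrame({
    "BMI": [22.0, 27.5, None, 31.0, 95.0, 24.0],
    "Smoker": ["no", "yes", "no", None, "no", "yes"],
})

clean = impute(df, categorical=["Smoker"], numerical=["BMI"])
clean["BMI"] = clip_iqr(clean["BMI"])
clean["BMI_cat"] = bmi_category(clean["BMI"])
print(clean)
```

Clipping keeps every row, which matters when minority classes are already scarce; dropping outlier rows instead is an equally plausible reading of "handle outliers using IQR".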
- Diabetes Multi-classification: Predicts diabetes severity (no diabetes, pre-diabetes, diabetes) using machine learning.
- Data Preprocessing: Handles missing values using mean/mode imputation and outlier adjustments, and explores different imputation strategies (KNN, mean/median/mode).
- Data Balancing: Addresses class imbalance using undersampling of the majority class and oversampling of minority classes with SMOTENC (handling categorical features).
- Feature Selection: Employs Chi-squared test and Random Forest feature importance to select relevant features.
- Model Training: Trains and evaluates multiple classification models: Random Forest, Decision Tree, LightGBM, and Logistic Regression.
- Model Evaluation: Uses various metrics (accuracy, precision, recall, F1-score) and visualizes results with confusion matrices and ROC curves.
- Hyperparameter Tuning: Optimizes model hyperparameters using RandomizedSearchCV.
- Learning Curve Analysis: Plots learning curves to assess model bias and variance.
- Comparative Analysis: Compares model performance with and without feature selection and hyperparameter tuning across multiple algorithms.
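As a rough illustration of how feature selection, hyperparameter tuning, and evaluation can fit together, here is a minimal scikit-learn sketch that combines a Chi-squared selector with `RandomizedSearchCV` over a Random Forest. The synthetic three-class data, the number of selected features (`k=5`), the search ranges, and the `f1_macro` scoring are assumptions, not the notebook's actual settings.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.metrics import classification_report
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic three-class data standing in for the diabetes dataset (assumption).
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=5, n_redundant=0,
    n_classes=3, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

pipe = Pipeline([
    ("scale", MinMaxScaler()),           # chi2 needs non-negative inputs when fitting
    ("select", SelectKBest(chi2, k=5)),  # keep the 5 highest-scoring features
    ("model", RandomForestClassifier(random_state=0)),
])

# Hypothetical search space; the ranges are illustrative only.
param_distributions = {
    "model__n_estimators": randint(20, 60),
    "model__max_depth": randint(2, 8),
}

search = RandomizedSearchCV(
    pipe, param_distributions, n_iter=5, cv=3,
    scoring="f1_macro", random_state=0,
)
search.fit(X_train, y_train)

y_pred = search.predict(X_test)
print(search.best_params_)
print(classification_report(y_test, y_pred))  # precision, recall, F1 per class
```

Putting the selector inside the pipeline means it is refit on each cross-validation fold, so the held-out fold never influences which features are chosen.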