Diabetes Classification (XGBoost)

Lightweight notebook that cleans the UCI-style diabetes survey data and trains an XGBoost classifier to predict the Diabetes_binary flag.

What the notebook does

  • Load data from data/diabetes.csv and inspect missing values.
  • Clean: median-fill the single BMI NaN, drop the one row with a missing HeartDiseaseorAttack value, and mode-fill Fruits plus the other sparse nulls (see the cleaning and feature-prep sketch after this list).
  • Feature prep:
    • Bin BMI into categories, then scale BMI, Physical_Health, Mental_Health, and Age to the 0-1 range.
    • One-hot encode categorical columns: General_Health, Sex, Education, Income.
    • Split labels/features (Diabetes_binary target).
  • Modeling: train/test split (80/20), baseline XGBClassifier with tuned learning rate, depth, estimators, subsample, and early stopping (see the modeling sketch after this list).
  • Evaluation: accuracy, precision, recall, confusion matrix; optional GridSearchCV over learning rate, max depth, estimators, and column sampling.
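
A minimal sketch of the cleaning and feature-prep steps, assuming pandas and scikit-learn. The bin edges, fill order, and exact column handling are illustrative rather than the notebook's literal code.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Load the survey data and inspect missing values.
df = pd.read_csv("data/diabetes.csv")
print(df.isna().sum())

# Cleaning: median-fill BMI, drop the row missing HeartDiseaseorAttack,
# and mode-fill Fruits plus any remaining sparse nulls.
df["BMI"] = df["BMI"].fillna(df["BMI"].median())
df = df.dropna(subset=["HeartDiseaseorAttack"])
for col in df.columns[df.isna().any()]:
    df[col] = df[col].fillna(df[col].mode()[0])

# Feature prep: bin BMI (cut points here are illustrative), scale the
# numeric columns to 0-1, and one-hot encode the categorical columns.
df["BMI"] = pd.cut(df["BMI"], bins=[0, 18.5, 25, 30, float("inf")], labels=False)
numeric_cols = ["BMI", "Physical_Health", "Mental_Health", "Age"]
df[numeric_cols] = MinMaxScaler().fit_transform(df[numeric_cols])
df = pd.get_dummies(df, columns=["General_Health", "Sex", "Education", "Income"])

# Split labels from features (Diabetes_binary target).
y = df["Diabetes_binary"]
X = df.drop(columns=["Diabetes_binary"])
```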
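
Continuing from the sketch above, an illustrative version of the modeling and evaluation steps. The hyperparameter values are assumptions, and the constructor-level early_stopping_rounds argument assumes xgboost 1.6 or newer.

```python
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, confusion_matrix
from xgboost import XGBClassifier

# 80/20 train/test split on the prepared features.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Baseline classifier; these hyperparameter values are placeholders,
# not the notebook's exact settings.
model = XGBClassifier(
    learning_rate=0.1,
    max_depth=6,
    n_estimators=300,
    subsample=0.8,
    early_stopping_rounds=10,  # requires xgboost >= 1.6
    eval_metric="logloss",
)
model.fit(X_train, y_train, eval_set=[(X_test, y_test)], verbose=False)

# Evaluation: accuracy, precision, recall, and the confusion matrix.
pred = model.predict(X_test)
print("accuracy :", accuracy_score(y_test, pred))
print("precision:", precision_score(y_test, pred))
print("recall   :", recall_score(y_test, pred))
print(confusion_matrix(y_test, pred))
```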

How to run

  1. Ensure Python 3.9+ and install the dependencies:
    • pip install pandas numpy scikit-learn xgboost matplotlib
  2. Open and run the notebook classifier.ipynb top-to-bottom.
  3. Confirm data/diabetes.csv exists relative to the notebook (place it under ./data/); a quick path check is sketched below.
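
A quick first-cell check (hypothetical, not part of the repo) to confirm the dataset path resolves relative to the notebook:

```python
from pathlib import Path

# The notebook expects the dataset at ./data/diabetes.csv relative to itself.
data_path = Path("data") / "diabetes.csv"
assert data_path.exists(), f"Dataset not found at {data_path.resolve()}"
```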

Key files

  • classifier.ipynb: data cleaning, feature engineering, XGBoost training, evaluation, and hyperparameter search.
  • data/diabetes.csv: input dataset (not committed; provide locally).

Notes

  • Hyperparameter search uses ROC AUC via GridSearchCV; the best parameters are then reused to re-evaluate the train and test scores.
  • Confusion matrices are plotted for both the train and test sets via plot_confusion_matrix (a sketch of the search and plotting follows this list).
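
A hedged sketch of the grid search and the confusion-matrix plots, reusing the split from the modeling sketch above. The parameter grid is illustrative, and ConfusionMatrixDisplay is used here in place of plot_confusion_matrix, which recent scikit-learn releases no longer provide.

```python
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier

# Grid over learning rate, depth, estimators, and column sampling,
# scored by ROC AUC (the values below are placeholders).
param_grid = {
    "learning_rate": [0.05, 0.1, 0.2],
    "max_depth": [4, 6, 8],
    "n_estimators": [100, 300],
    "colsample_bytree": [0.7, 1.0],
}
search = GridSearchCV(
    XGBClassifier(eval_metric="logloss"),
    param_grid,
    scoring="roc_auc",
    cv=3,
    n_jobs=-1,
)
search.fit(X_train, y_train)
best = search.best_estimator_

# Re-evaluate with the best parameters and plot confusion matrices
# for both splits.
for name, (X_, y_) in {"train": (X_train, y_train), "test": (X_test, y_test)}.items():
    ConfusionMatrixDisplay.from_estimator(best, X_, y_)
    plt.title(f"{name} confusion matrix")
plt.show()
```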

About

A classifier for the diabetes dataset, implemented with XGBoost.
