Lightweight notebook that cleans the UCI-style diabetes survey data and trains an XGBoost classifier to predict the `Diabetes_binary` flag.
- Load data from `data/diabetes.csv` and inspect missing values.
- Clean: median-fill the single `BMI` NaN, drop the one row with a missing `HeartDiseaseorAttack` value, and mode-fill `Fruits` plus other sparse nulls.
- Feature prep:
  - Bin `BMI`, then scale `BMI`, `Physical_Health`, `Mental_Health`, and `Age` to 0-1.
  - One-hot encode the categorical columns: `General_Health`, `Sex`, `Education`, `Income`.
  - Split features and labels (`Diabetes_binary` is the target).
- Modeling: 80/20 train/test split, baseline `XGBClassifier` with tuned learning rate, depth, estimator count, subsample, and early stopping.
- Evaluation: accuracy, precision, recall, and confusion matrix; optional `GridSearchCV` over learning rate, max depth, estimator count, and column sampling.
- Ensure Python 3.9+ and install dependencies: `pip install pandas numpy scikit-learn xgboost matplotlib`.
- Open `classifier.ipynb` and run it top to bottom.
- Confirm `data/diabetes.csv` exists relative to the notebook (place it under `./data/`).
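A quick sanity check of the expected path (a convenience snippet, not part of the notebook) can be run before anything else:

```python
from pathlib import Path

# The notebook expects the CSV at ./data/diabetes.csv, relative to its own location
csv_path = Path("data") / "diabetes.csv"
print("found" if csv_path.exists() else "missing; place diabetes.csv under ./data/")
```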
- `classifier.ipynb`: data cleaning, feature engineering, XGBoost training, evaluation, and hyperparameter search.
- `data/diabetes.csv`: input dataset (not committed; provide it locally).
- Hyperparameter search scores candidates by ROC AUC via `GridSearchCV`; the best parameters are reused to re-evaluate train and test scores.
- Confusion matrices are plotted for both the train and test sets via `plot_confusion_matrix`.