Lightweight notebook that cleans the UCI-style diabetes survey data and trains an XGBoost classifier to predict the `Diabetes_binary` flag.
- Load data from `data/diabetes.csv` and inspect missing values.
- Clean: median-fill the single `BMI` NaN, drop the one row with a missing `HeartDiseaseorAttack` value, and mode-fill `Fruits` plus other sparse nulls.
- Feature prep:
  - Bin `BMI`, then scale `BMI`, `Physical_Health`, `Mental_Health`, and `Age` to 0-1.
  - One-hot encode the categorical columns: `General_Health`, `Sex`, `Education`, `Income`.
  - Split features and labels (`Diabetes_binary` is the target).
- Modeling: 80/20 train/test split, baseline `XGBClassifier` with tuned learning rate, depth, estimator count, subsample, and early stopping.
- Evaluation: accuracy, precision, recall, and confusion matrix; optional `GridSearchCV` over learning rate, max depth, estimator count, and column sampling.
- Ensure Python 3.9+ and install dependencies: `pip install pandas numpy scikit-learn xgboost matplotlib`.
- Open `classifier.ipynb` and run it top to bottom.
- Confirm `data/diabetes.csv` exists relative to the notebook (place it under `./data/`).
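A quick sanity check of the expected path (a convenience snippet, not part of the notebook) can be run before anything else:

```python
from pathlib import Path

# The notebook expects the CSV at ./data/diabetes.csv, relative to its own location
csv_path = Path("data") / "diabetes.csv"
print("found" if csv_path.exists() else "missing; place diabetes.csv under ./data/")
```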
- `classifier.ipynb`: data cleaning, feature engineering, XGBoost training, evaluation, and hyperparameter search.
- `data/diabetes.csv`: input dataset (not committed; provide it locally).
- Hyperparameter search scores candidates by ROC AUC via `GridSearchCV`; the best parameters are reused to re-evaluate train and test scores.
- Confusion matrices are plotted for both the train and test sets via `plot_confusion_matrix`.