This project applies supervised machine learning algorithms to predict whether a patient is diabetic based on diagnostic medical attributes. Using the Pima Indians Diabetes Dataset, we trained and evaluated multiple classification models to identify high-risk individuals and assist in early detection.
- Source: Kaggle (Pima Indians Diabetes Database)
- Features: Pregnancies, Glucose, Blood Pressure, Skin Thickness, Insulin, BMI, Diabetes Pedigree Function, Age
- Target: Outcome (0: Non-Diabetic, 1: Diabetic)
- ✅ Data cleaning & exploration
- ✅ Feature scaling with `StandardScaler`
- ✅ Model training: Logistic Regression, Random Forest, and SVM
- ✅ Model evaluation using:
  - Accuracy, Precision, Recall, F1-Score
  - Confusion Matrix & Classification Report
  - ROC Curve & AUC Score
- ✅ Single-patient prediction with real data simulation
- ✅ Clean, modular, and well-commented code
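
The scaling, training, and evaluation steps listed above can be sketched roughly as follows. This is a minimal illustration, not the project's actual notebook: the data is a synthetic stand-in generated with `make_classification` (768 rows and 8 features, matching the shape of the Pima dataset), and the split ratio, random seeds, and hyperparameters are all assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# ASSUMPTION: synthetic stand-in data, not the real Pima dataset.
X, y = make_classification(n_samples=768, n_features=8, n_informative=5,
                           random_state=42)

# Hold out 20% for testing (assumed ratio), keeping class proportions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Fit the scaler on the training split only, then apply it to both splits.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# The three model families named above; hyperparameters are illustrative.
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "SVM": SVC(probability=True, random_state=42),
}

results = {}
for name, model in models.items():
    model.fit(X_train_scaled, y_train)
    y_pred = model.predict(X_test_scaled)
    y_score = model.predict_proba(X_test_scaled)[:, 1]  # P(class 1)
    results[name] = {
        "accuracy": accuracy_score(y_test, y_pred),
        "precision": precision_score(y_test, y_pred),
        "recall": recall_score(y_test, y_pred),
        "f1": f1_score(y_test, y_pred),
        "roc_auc": roc_auc_score(y_test, y_score),
    }
    print(name, {k: round(v, 3) for k, v in results[name].items()})
```

Fitting the scaler on the training split alone keeps test-set statistics out of the preprocessing step, so the reported scores are not inflated by leakage.
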
| Model | Description |
|---|---|
| Logistic Regression | Interpretable baseline classifier |
| Random Forest | Ensemble method for robust predictions |
| Support Vector Machine (SVM) | Effective on small-to-medium datasets when features are scaled |
The Random Forest classifier showed the best performance with:
- Accuracy: ~85%
- ROC AUC Score: High discriminative power
- Balanced precision and recall, which is desirable for a screening task where both missed cases and false alarms carry a cost
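
To make the metrics above concrete, the snippet below derives precision, recall, and F1 from a confusion matrix. The labels are hypothetical and chosen only for illustration; they are not results from this project.

```python
from sklearn.metrics import classification_report, confusion_matrix

# HYPOTHETICAL labels for ten patients (0: Non-Diabetic, 1: Diabetic).
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 1, 0]

# For binary labels, ravel() yields the counts in the order TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # of predicted diabetics, how many are correct
recall = tp / (tp + fn)      # of actual diabetics, how many are found
f1 = 2 * precision * recall / (precision + recall)

print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f}")
print(classification_report(y_true, y_pred,
                            target_names=["Non-Diabetic", "Diabetic"]))
```

Here one diabetic patient is missed (a false negative) and one healthy patient is flagged (a false positive), so precision and recall both come out at 0.75.
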
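
Single-patient prediction example:

    import numpy as np

    # Feature order: Pregnancies, Glucose, Blood Pressure, Skin Thickness, Insulin, BMI, Diabetes Pedigree Function, Age
    sample = np.array([[6, 148, 72, 35, 0, 33.6, 0.627, 50]])
    sample_scaled = scaler.transform(sample)   # scaler: the StandardScaler fitted on the training data
    prediction = model.predict(sample_scaled)  # model: a trained classifier; 0: Non-Diabetic, 1: Diabetic

- Python (`NumPy`, `Pandas`, `Scikit-Learn`)
- `Matplotlib` & `Seaborn` for visualizations
- Jupyter Notebook / Google Colab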
- Hyperparameter tuning using GridSearchCV
- Model deployment with Streamlit or Flask
- Cross-validation and imputation for missing values
- Advanced models like XGBoost or LightGBM
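
Hyperparameter tuning, cross-validation, and imputation could be combined roughly as in the sketch below. It assumes synthetic stand-in data with values randomly blanked out as `NaN`, median imputation, a Random Forest, and a small illustrative parameter grid; none of these choices come from the project itself.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# ASSUMPTION: synthetic stand-in data with about 10% of values set to NaN.
X, y = make_classification(n_samples=300, n_features=8, n_informative=5,
                           random_state=0)
rng = np.random.default_rng(0)
X[rng.random(X.shape) < 0.1] = np.nan

# Imputation and scaling live inside the pipeline, so each CV fold
# learns them from its own training portion only.
pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
    ("clf", RandomForestClassifier(random_state=0)),
])

# Illustrative grid; the "clf__" prefix targets the classifier step.
param_grid = {
    "clf__n_estimators": [50, 100],
    "clf__max_depth": [None, 5],
}

search = GridSearchCV(pipeline, param_grid, cv=5, scoring="roc_auc")
search.fit(X, y)

print("Best parameters:", search.best_params_)
print("Best cross-validated ROC AUC:", round(search.best_score_, 3))
```

In the Pima data, zeros in columns such as Glucose, Blood Pressure, or BMI are commonly treated as missing measurements. Replacing them with `NaN` before a pipeline like this one is one possible approach, stated here as an assumption rather than as the project's method.
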