Author: Prashanna Raj Pandit
This repo contains all four major labs completed as part of the course "Machine Learning" (STAT-562). Each lab focuses on a different machine learning technique, giving hands-on experience with supervised and unsupervised learning in R. STAT-562 centers on applying classical statistical learning methods to real datasets. Across the four labs, we explore:
- 🔹 Data preprocessing & feature engineering
- 🔹 Classification models (k-NN, LDA, QDA, Naive Bayes)
- 🔹 Unsupervised learning (hierarchical & K-means clustering)
- 🔹 Ensemble methods (Bagging, Boosting, Random Forest)
- 🔹 Model evaluation using accuracy, ROC, confusion matrices, RMSE
- 🔹 Cross-validation & hyperparameter tuning using `caret`
Each project builds practical intuition and technical skills for applying statistical models to real-world data.
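The cross-validation and tuning workflow above can be sketched with `caret`. This is a minimal illustration, not lab code: it tunes k for k-NN on the built-in `iris` dataset as a stand-in for the actual lab data.

```r
# Sketch: 10-fold cross-validation with hyperparameter tuning via caret.
# iris is a stand-in dataset; the lab datasets differ.
library(caret)

set.seed(562)
ctrl <- trainControl(method = "cv", number = 10)

# Tune k over a small grid; caret keeps the k with the best
# cross-validated accuracy.
knn_fit <- train(Species ~ ., data = iris,
                 method = "knn",
                 preProcess = c("center", "scale"),  # k-NN is distance-based
                 trControl = ctrl,
                 tuneGrid = expand.grid(k = seq(3, 15, by = 2)))

knn_fit$bestTune  # the selected k
```

Centering and scaling matter here because k-NN distances are dominated by whichever feature has the largest raw scale.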
This project builds and evaluates multiple machine learning models to predict breast cancer status (Cancer vs. Control) from routine blood-based metabolic biomarkers and anthropometric measures, rather than imaging or genetic tests.
Models compared:
- Naive Bayes
- Linear Discriminant Analysis (LDA)
- k-NN (with tuned k)
- Random Forest
- Gradient Boosting
- Support Vector Machine (SVM)
- Deep Neural Network (DNN)
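A fair comparison of models like those above requires fitting each one on the same resampling folds. The sketch below shows one way to do that with `caret`; it uses `twoClassSim()` to simulate a two-class dataset as a placeholder for the actual biomarker data, and covers only three of the seven models.

```r
# Sketch: compare several models on identical cross-validation folds.
# twoClassSim() generates simulated data standing in for the real dataset.
library(caret)

set.seed(562)
df <- twoClassSim(200)  # outcome column is `Class`

ctrl <- trainControl(method = "cv", number = 5, classProbs = TRUE,
                     summaryFunction = twoClassSummary)

fits <- list(
  nb  = train(Class ~ ., data = df, method = "naive_bayes",
              trControl = ctrl, metric = "ROC"),
  lda = train(Class ~ ., data = df, method = "lda",
              trControl = ctrl, metric = "ROC"),
  rf  = train(Class ~ ., data = df, method = "rf",
              trControl = ctrl, metric = "ROC")
)

# set.seed before the first train() call keeps the folds aligned, so the
# resampled ROC/Sens/Spec distributions are directly comparable.
summary(resamples(fits))
```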
| Model | Accuracy | Sensitivity | Specificity | F1 Score | AUC | TP | TN | FP | FN |
|---|---|---|---|---|---|---|---|---|---|
| Naive Bayes | 0.70 | 0.77 | 0.60 | 0.74 | 0.70 | 10 | 6 | 4 | 3 |
| LDA | 0.78 | 0.85 | 0.70 | 0.82 | 0.80 | 11 | 7 | 3 | 2 |
| k-NN (tuned k) | 0.78 | 0.77 | 0.80 | 0.80 | 0.81 | 10 | 8 | 2 | 3 |
| Random Forest | 0.87 | 0.85 | 0.90 | 0.88 | 0.91 | 11 | 9 | 1 | 2 |
| Gradient Boosting | 0.83 | 0.77 | 0.90 | 0.83 | 0.89 | 10 | 9 | 1 | 3 |
| SVM | 0.78 | 0.69 | 0.90 | 0.78 | 0.85 | 9 | 9 | 1 | 4 |
| Deep NN | 0.74 | 0.77 | 0.70 | 0.77 | 0.75 | 10 | 7 | 3 | 3 |
Table 1. Test-set performance for each model. Accuracy, Sensitivity (TPR), Specificity (TNR), F1, and AUC are shown, along with confusion matrix counts (TP, TN, FP, FN).
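Each metric in Table 1 follows directly from the confusion-matrix counts. As a sanity check, the base-R snippet below recomputes the Random Forest row (TP=11, TN=9, FP=1, FN=2):

```r
# Derive the table's metrics from confusion-matrix counts.
metrics <- function(tp, tn, fp, fn) {
  c(accuracy    = (tp + tn) / (tp + tn + fp + fn),
    sensitivity = tp / (tp + fn),              # TPR
    specificity = tn / (tn + fp),              # TNR
    f1          = 2 * tp / (2 * tp + fp + fn))
}

# Random Forest row: matches Table 1
round(metrics(tp = 11, tn = 9, fp = 1, fn = 2), 2)
# accuracy 0.87, sensitivity 0.85, specificity 0.90, f1 0.88
```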