Figure: kNN grid search results on Spambase (left) and on Loan (right).
- Coursework project comparing classical classifiers across multiple tabular datasets (Spambase, Nursery, Loan, Breast Cancer).
- Focus on how preprocessing choices (encoding, scaling, transformations) and hyperparameters influence accuracy and F1.
- Evaluated decision trees, k-nearest neighbours, and logistic regression using consistent experimental pipelines.
- Spambase (UCI): 4,601 emails with 57 numeric features; target labels spam vs. ham.
- Nursery (UCI): 12,960 categorical records that require encoding; highly imbalanced target.
- Loan & Breast Cancer: additional tabular classification datasets used to broaden the comparison.
- Exploratory analysis to understand class balance, feature distributions, and preprocessing needs.
- Applied dataset-specific preprocessing (log transforms, scaling, one-hot and label encoding) to support different models; a pipeline sketch follows this list.
- Trained decision trees, kNN, and logistic regression with systematic grid/parameter sweeps.
- Compared holdout and cross-validation metrics (accuracy, F1) and tracked runtime considerations.
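A minimal sketch of such a pipeline, assuming a pandas DataFrame read from a hypothetical `data/spambase.csv` with a binary `label` column; the column handling and grid values are illustrative placeholders, not the coursework's exact configuration:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder, StandardScaler

# Hypothetical path; any tabular frame with a binary "label" column works.
df = pd.read_csv("data/spambase.csv")
X, y = df.drop(columns=["label"]), df["label"]

numeric_cols = X.select_dtypes(include="number").columns
categorical_cols = X.select_dtypes(exclude="number").columns

# log1p tames heavy-tailed nonnegative features (e.g. Spambase word
# frequencies); standardisation then puts features on a comparable
# scale, which matters for distance-based kNN.
numeric_pipe = Pipeline([
    ("log", FunctionTransformer(np.log1p)),
    ("scale", StandardScaler()),
])

preprocess = ColumnTransformer([
    ("num", numeric_pipe, numeric_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])

model = Pipeline([("prep", preprocess), ("knn", KNeighborsClassifier())])

param_grid = {
    "knn__n_neighbors": [1, 3, 5, 9, 15, 25],
    "knn__weights": ["uniform", "distance"],
}

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# 5-fold grid search tracking both accuracy and macro F1,
# then a holdout score as a sanity check.
search = GridSearchCV(
    model,
    param_grid,
    scoring={"accuracy": "accuracy", "f1": "f1_macro"},
    refit="accuracy",
    cv=5,
)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)
print("holdout accuracy:", search.score(X_test, y_test))
```

Keeping the preprocessing inside the pipeline ensures the transform statistics are refit on each CV split, so the reported scores are not leaked.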
- No single algorithm dominated; performance depended on dataset characteristics and preprocessing.
- For Spambase, kNN accuracy improved after log transformation and scaling of the skewed numeric features, combined with careful tuning of the neighbour count.
- Logistic regression benefited from appropriate regularisation (C, penalty type) and solver choice, especially on high-dimensional data.
- Class imbalance (e.g., Nursery) required label merging or class weighting to avoid majority bias; see the sketch after this list.
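A hedged sketch of the regularisation and imbalance handling mentioned above; `X` and `y` stand in for a preprocessed feature matrix and target, and the grids are illustrative rather than the project's exact settings:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# class_weight="balanced" reweights classes inversely to their frequency,
# countering majority bias on imbalanced targets such as Nursery.
logreg = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=5000, class_weight="balanced")),
])

param_grid = [
    # liblinear supports both l1 and l2 penalties.
    {"clf__solver": ["liblinear"], "clf__penalty": ["l1", "l2"],
     "clf__C": [0.01, 0.1, 1, 10]},
    # lbfgs is l2-only but scales better to high-dimensional data.
    {"clf__solver": ["lbfgs"], "clf__penalty": ["l2"],
     "clf__C": [0.01, 0.1, 1, 10]},
]

# Macro F1 weights every class equally, so minority-class
# performance is not hidden by overall accuracy.
search = GridSearchCV(logreg, param_grid, scoring="f1_macro", cv=5)
search.fit(X, y)  # X, y: placeholders for preprocessed features and labels
print(search.best_params_, round(search.best_score_, 3))
```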
- notebooks/: Jupyter notebooks for EDA and model training.
- data/: Compressed source datasets.
- docs/: Project instructions and reports.
- figs/: Result figures.

