Classification with Machine Learning

Left: kNN grid search results on Spambase. Right: kNN grid search results on Loan.

Project Overview

Coursework project comparing classical classifiers across multiple tabular datasets (Spambase, Nursery, Loan, Breast Cancer).
Focuses on how preprocessing choices (encoding, scaling, transformations) and hyperparameters influence accuracy and F1 score.
Evaluated decision trees, k-nearest neighbours, and logistic regression using consistent experimental pipelines.

Spambase (UCI): 4,601 emails with 57 numeric features; target labels spam vs. ham.
Nursery (UCI): 12,960 categorical records that require encoding; highly imbalanced target.
Loan & Breast Cancer: Additional classification datasets.
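The encoding step that categorical records need can be sketched as follows. This is a minimal illustration, assuming scikit-learn; the toy rows are made up for the example and are not taken from the project data. `OrdinalEncoder` stands in here for label-style integer encoding of features, since scikit-learn's `LabelEncoder` is intended for target labels.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Toy categorical rows (illustrative values only, not the real dataset).
rows = np.array([
    ["usual", "convenient", "recommended"],
    ["pretentious", "inconv", "priority"],
    ["usual", "inconv", "not_recom"],
    ["great_pret", "convenient", "priority"],
])

# One-hot encoding: one binary column per (feature, category) pair.
# Suits distance-based and linear models, which would otherwise read
# integer codes as magnitudes.
one_hot = OneHotEncoder(handle_unknown="ignore")
X_onehot = one_hot.fit_transform(rows).toarray()

# Integer encoding: one column per feature, categories mapped to 0..k-1.
# Often adequate for decision trees, which split on thresholds.
ordinal = OrdinalEncoder()
X_ordinal = ordinal.fit_transform(rows)

print(X_onehot.shape)
print(X_ordinal.shape)
```

One-hot encoding widens the feature matrix, so it trades memory and kNN runtime for an encoding that does not impose an artificial order on categories.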

Ran exploratory analysis to understand class balance, feature distributions, and preprocessing needs.
Applied dataset-specific preprocessing (log transforms, scaling, one-hot and label encoding) to support different models.
Trained decision trees, kNN, and logistic regression with systematic grid/parameter sweeps.
Compared holdout and cross-validation metrics (accuracy, F1) and tracked runtime considerations.
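The workflow above (preprocess, sweep hyperparameters with cross-validation, then check a holdout split) might look roughly like the sketch below. It assumes scikit-learn and uses synthetic heavy-tailed data as a stand-in for a real dataset; the grid values, step names, and split sizes are illustrative choices, not the project's actual settings.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in data: non-negative, heavy-tailed features.
X = rng.lognormal(mean=0.0, sigma=2.0, size=(300, 5))
y = (np.log1p(X[:, 0]) + np.log1p(X[:, 1]) > 2.0).astype(int)

# Holdout split, stratified so both splits keep the class ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Preprocessing lives inside the pipeline so it is re-fit on each
# cross-validation fold and never sees the validation data.
pipe = Pipeline([
    ("log", FunctionTransformer(np.log1p)),  # compress heavy tails
    ("scale", StandardScaler()),             # put features on one scale
    ("knn", KNeighborsClassifier()),
])

# Hypothetical grid; the values are examples only.
param_grid = {
    "knn__n_neighbors": [1, 3, 5, 9, 15],
    "knn__weights": ["uniform", "distance"],
}

search = GridSearchCV(
    pipe,
    param_grid,
    cv=5,
    scoring={"accuracy": "accuracy", "f1": "f1"},
    refit="f1",  # final model is chosen by mean cross-validated F1
)
search.fit(X_train, y_train)

print(search.best_params_)
print(search.score(X_test, y_test))  # F1 on the holdout split
```

Per-fold accuracy and F1 for every grid point are available in `search.cv_results_`, which is the kind of table a grid search plot would be drawn from.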

No single algorithm dominated; performance depended on dataset characteristics and preprocessing.
For Spambase, kNN accuracy improved after log-transforming the features and tuning the number of neighbours (k).
Logistic regression benefited from appropriate regularisation (C, penalty type) and solver choice, especially on high-dimensional data.
Class imbalance (e.g., Nursery) required label merging or weighting to avoid majority bias.
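The two imbalance remedies mentioned above, label merging and class weighting, can be sketched as follows. This is an illustration under stated assumptions: the helper `merge_rare_labels`, its `min_count` threshold, and the synthetic data are hypothetical and not taken from the project; `class_weight="balanced"` is scikit-learn's built-in reweighting option.

```python
from collections import Counter

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split


def merge_rare_labels(labels, min_count, merged_name="other"):
    """Hypothetical helper: fold classes with fewer than min_count rows into one label."""
    counts = Counter(labels.tolist())
    return np.array(
        [lab if counts[lab] >= min_count else merged_name for lab in labels.tolist()]
    )


# Label merging on a toy multi-class target with two very rare classes.
labels = np.array(["a"] * 60 + ["b"] * 35 + ["c"] * 3 + ["d"] * 2)
merged = merge_rare_labels(labels, min_count=5)
print(Counter(merged.tolist()))

# Class weighting on a synthetic imbalanced binary problem.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 4))
y = (X[:, 0] + 0.5 * rng.normal(size=400) > 1.3).astype(int)  # minority positive class

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

scores = {}
for class_weight in (None, "balanced"):
    # "balanced" weights each class inversely to its frequency in y_train.
    clf = LogisticRegression(class_weight=class_weight, max_iter=1000)
    clf.fit(X_train, y_train)
    scores[str(class_weight)] = f1_score(y_test, clf.predict(X_test))

print(scores)
```

Merging changes the prediction task itself, so it fits cases where the rare classes are too small to learn or evaluate; weighting keeps the original labels and shifts the decision boundary toward the minority class instead.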
