This project implements Support Vector Machines (SVMs) for both linear and non-linear classification. It also builds a spam email classifier using SVMs with text preprocessing.
SVMs find the optimal separating hyperplane that maximizes the margin between classes. Key components:
- Linear kernel: For linearly separable data
- Gaussian (RBF) kernel: $K(x_1, x_2) = \exp\left(-\frac{\lVert x_1 - x_2 \rVert^2}{2\sigma^2}\right)$, for non-linear boundaries
- C parameter: Controls the penalty for misclassification (analogous to $1/\lambda$)
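As a rough illustration of the two kernels listed above, here is a minimal Python sketch. It is not the repository's MATLAB code (`linearKernel.m`, `gaussianKernel.m`); the function names, argument order, and sample vectors are assumptions chosen for the example.

```python
import math

def linear_kernel(x1, x2):
    """Linear kernel: the dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x1, x2))

def gaussian_kernel(x1, x2, sigma):
    """Gaussian (RBF) kernel: exp(-||x1 - x2||^2 / (2 * sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x1, x2))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

# Illustrative vectors (assumed, not taken from the datasets).
x1 = [1.0, 2.0, 1.0]
x2 = [0.0, 4.0, -1.0]

print(linear_kernel(x1, x2))               # 7.0
print(gaussian_kernel(x1, x2, sigma=2.0))  # exp(-9/8), about 0.3247
```

The Gaussian kernel equals 1 when the two points coincide and decays toward 0 as they move apart; a smaller sigma makes that decay faster, which allows a more flexible decision boundary.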
Emails are preprocessed (lowercasing, URL normalization, stemming, etc.) and converted to feature vectors. A linear SVM is trained on these features to classify spam vs. non-spam.
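The preprocessing and feature extraction steps can be sketched roughly as follows. This is a simplified Python illustration, not the logic of `processEmail.m` or `emailFeatures.m`: the `httpaddr` placeholder token, the binary (present/absent) encoding, and the four-word vocabulary are assumptions, and stemming is omitted for brevity.

```python
import re

def preprocess(text):
    """Lowercase, replace URLs with a placeholder token, split into words."""
    text = text.lower()
    # Assumed normalization: every URL becomes the single token "httpaddr".
    text = re.sub(r"https?://\S+", "httpaddr", text)
    # Keep alphanumeric runs only; punctuation acts as a separator.
    return re.findall(r"[a-z0-9]+", text)

def email_features(tokens, vocab):
    """Binary vector: entry i is 1 if vocab[i] occurs in the email, else 0."""
    present = set(tokens)
    return [1 if word in present else 0 for word in vocab]

# Hypothetical toy vocabulary; the real list lives in vocab.txt.
vocab = ["click", "httpaddr", "meeting", "visit"]

tokens = preprocess("CLICK here and visit http://example.com/offer now!")
print(tokens)                         # ['click', 'here', 'and', 'visit', 'httpaddr', 'now']
print(email_features(tokens, vocab))  # [1, 1, 0, 1]
```

Normalizing URLs to one token lets the classifier learn from the presence of a link rather than from any specific address.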
| File | Description |
|---|---|
| `sample6.m` | Main script: SVM with linear and Gaussian kernels |
| `sample6_spam.m` | Main script: spam email classification |
| `svmTrain.m` | SVM training using SMO algorithm |
| `svmPredict.m` | SVM prediction |
| `gaussianKernel.m` | Gaussian (RBF) kernel function |
| `linearKernel.m` | Linear kernel function |
| `dataset3Params.m` | Cross-validation for C and sigma selection |
| `visualizeBoundary.m` | Plots non-linear decision boundary |
| `visualizeBoundaryLinear.m` | Plots linear decision boundary |
| `processEmail.m` | Email text preprocessing |
| `emailFeatures.m` | Converts word indices to feature vector |
| `getVocabList.m` | Loads the vocabulary list |
| `porterStemmer.m` | Porter stemming algorithm |
| `readFile.m` | Reads file contents |
| `ex6data[1-3].mat` | 2D classification datasets |
| `spamTrain.mat`, `spamTest.mat` | Spam classification datasets |
| `vocab.txt` | Vocabulary list (1899 words) |
| `emailSample[1-2].txt` | Sample legitimate emails |
| `spamSample[1-2].txt` | Sample spam emails |
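
The C and sigma selection listed for `dataset3Params.m` can be pictured as a grid search over candidate pairs, keeping the pair with the lowest validation error. The Python sketch below shows only the search loop. The candidate values are an assumption, and `toy_error` is a made-up stand-in for "train an SVM with these parameters and measure its error on held-out data".

```python
from itertools import product

def select_params(candidates, validation_error):
    """Return the (C, sigma) pair with the lowest validation error, and that error."""
    best, best_err = None, float("inf")
    for C, sigma in product(candidates, repeat=2):
        err = validation_error(C, sigma)
        if err < best_err:
            best, best_err = (C, sigma), err
    return best, best_err

def toy_error(C, sigma):
    """Hypothetical error surface with its minimum at C = 1.0, sigma = 0.1."""
    return (C - 1.0) ** 2 + (sigma - 0.1) ** 2

# Assumed candidate grid, spaced roughly by factors of three.
candidates = [0.01, 0.03, 0.1, 0.3, 1.0, 3.0, 10.0, 30.0]

best, err = select_params(candidates, toy_error)
print(best, err)  # (1.0, 0.1) 0.0
```

Scoring each pair on data that was not used for training is what keeps the search from simply rewarding the most overfit model.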
- Linear SVM: Correctly separates linearly separable data
- Gaussian SVM: Achieves non-linear decision boundaries for complex datasets
- Spam Classifier: Training accuracy: 99.85%, Test accuracy: 98.9%
- Top spam indicators: "our", "click", "remov", "guarante", "visit"
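
For a linear SVM, one common way to obtain a list of top indicators like this is to rank vocabulary words by their learned weights, since a large positive weight pushes an email toward the spam class. The Python sketch below illustrates that ranking under this assumption; the weights are made-up numbers and not output from this project.

```python
def top_indicators(weights, vocab, k=5):
    """Return the k vocabulary words with the largest weights, highest first."""
    ranked = sorted(zip(vocab, weights), key=lambda pair: pair[1], reverse=True)
    return [word for word, _ in ranked[:k]]

# Hypothetical vocabulary slice and weights, for illustration only.
vocab = ["click", "meeting", "guarante", "thanks"]
weights = [0.47, -0.31, 0.38, -0.12]

print(top_indicators(weights, vocab, k=2))  # ['click', 'guarante']
```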
Left: Linear SVM with margin boundaries. Center: Non-linear RBF kernel SVM. Right: Gaussian kernel function for different sigma values.
Exercises from Andrew Ng's Machine Learning course on Coursera, completed by Keivan Hassani Monfared.
