Email Classifier

- Venkatakrishnan Parthasarathy

Problem Statement: Train an ML model to classify emails as spam or non-spam (ham), given a data set with which to train and validate the model.

Approach: To arrive at a stable, accurate model, a series of steps is followed. This is the approach taken in this assignment:

(1) Organize the data set and make it ready for preparation
(2) Go through all the data and clean them
(3) Shuffle the data and split into training and test set
(4) On the training set, extract the features that might contribute to the classification
(5) Get insights on how the data set is based on the features extracted
(6) Do a second round of cleaning based on the data extracted to make the model more
unbiased
(7) Extract the features again for training
(8) Train models based on the features and do cross validation
(9) Choose the model with the best accuracy and save it for testing
(10) Test the model for its accuracy and show metrics based on the results

Design of the python script:

  • The script executes only the requested step, which makes it faster and easier to debug

  • The step to run is selected through a command line argument

  • The output of each step is stored in a file using pickle, so the next step can simply read it when the program is called with the respective argument

  • Storing the objects of intermediate steps in files lets us skip re-running previous steps, which saves a considerable amount of time (a sketch of this design follows)
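Since the repository's exact file names aren't shown here, the step and pickle file names below are illustrative assumptions; this is a minimal sketch of the step-dispatch design just described.

import pickle
import sys

def save_step(obj, path):
    # Persist a step's output so later steps can skip recomputing it.
    with open(path, "wb") as f:
        pickle.dump(obj, f)

def load_step(path):
    # Read a previous step's output back from disk.
    with open(path, "rb") as f:
        return pickle.load(f)

def step1():
    data = {"example": 1}           # e.g. the cleaned data set
    save_step(data, "step1.pkl")

def step2():
    data = load_step("step1.pkl")   # reuse step 1's result without redoing it
    print(len(data))                # placeholder for feature extraction, etc.

if __name__ == "__main__":
    step = int(sys.argv[1])         # the requested step number
    {1: step1, 2: step2}[step]()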

Data preparation and cleaning:

  • All the ham and spam mails are put into two directories ("dataset/allspam/*.txt" and "dataset/allham/*.txt")

  • A function “Initialize” is defined to do the preliminary cleaning steps

  • Traverse through the files and make a list of all the words in the email body

  • Looking at the content of the mails, some carry no meaningful message and just have the word “Subject:” at the start

  • The first element of every word list is “Subject:”, so it is removed from all messages during traversal

  • Emails that become empty after removing “Subject:” are disposed of

  • Stop words are removed from the messages to make the feature list smaller and cleaner

  • The lists of mails are put in a pandas data-frame with columns ‘content’ and ‘spam’, holding the email body and a Boolean spam/ham flag respectively

  • The data frame is shuffled for an even distribution of spam and ham across the data frame, and then split into training and test sets (70%:30%)
  • The resulting data-sets are written to a file for the next step
  • Now the training data-set can be used for exploratory analysis; the test data-set should not be touched (a sketch of this preparation step follows)
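A minimal sketch of this preparation step, assuming the dataset/allham/ and dataset/allspam/ layout above; file names such as train.pkl are illustrative, and NLTK's stopword list must already be downloaded.

import glob
import pandas as pd
from nltk.corpus import stopwords   # requires nltk.download("stopwords") once

STOP = set(stopwords.words("english"))

def load_mails(pattern, is_spam):
    rows = []
    for path in glob.glob(pattern):
        with open(path, errors="ignore") as f:
            words = f.read().split()
        if words and words[0].lower() == "subject:":
            words = words[1:]                         # drop the leading "Subject:"
        words = [w for w in words if w.lower() not in STOP]
        if words:                                     # dispose of now-empty mails
            rows.append({"content": words, "spam": is_spam})
    return rows

rows = load_mails("dataset/allham/*.txt", False) + \
       load_mails("dataset/allspam/*.txt", True)
df = pd.DataFrame(rows)

# Shuffle for an even spam/ham distribution, then split 70%:30%.
df = df.sample(frac=1, random_state=42).reset_index(drop=True)
cut = int(0.7 * len(df))
df.iloc[:cut].to_pickle("train.pkl")
df.iloc[cut:].to_pickle("test.pkl")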

Data extraction for exploration:

  • Load the training data-set from file and separate ham and spam again.

  • The email content is currently a list of words; it is converted to space-separated sentences for use with CountVectorizer()

  • CountVectorizer is used to get the bag of words so we can explore the word frequencies in different types of emails

  • Dump the word frequencies in a file for visualization in next step.

  • Get the length of each email and dump it to a file so further analysis can be done on that as well (see the sketch below)
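A sketch of this extraction step, continuing the assumed file names from the preparation sketch.

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

train_df = pd.read_pickle("train.pkl")
train_df["text"] = train_df["content"].str.join(" ")   # word lists -> sentences

def word_freq(texts):
    cv = CountVectorizer()
    counts = cv.fit_transform(texts)                   # documents x vocabulary
    totals = counts.sum(axis=0).A1                     # one total per word
    return pd.Series(totals, index=cv.get_feature_names_out()).sort_values(ascending=False)

spam_freq = word_freq(train_df.loc[train_df["spam"], "text"])
ham_freq = word_freq(train_df.loc[~train_df["spam"], "text"])
spam_freq.to_pickle("spam_freq.pkl")                   # for visualization in the next step
ham_freq.to_pickle("ham_freq.pkl")

# Email lengths (number of meaningful words), dumped for later analysis.
train_df["content"].str.len().to_pickle("lengths.pkl")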

Data exploration for insights:

  • Load the dumped pandas series of word frequencies in spam and ham

  • By looking at the most frequent words in the ham and spam, we can notice that quite a few words are present in both. The following image shows the common words highlighted in yellow among the top 30 most frequent words.

  • These common words would not help improve the accuracy of the classification model, so we add them to the list of stop words and perform cleaning and extraction again to get more insight into the data.

  • Even though some words like 1, 100, 1000 are present in both types, we are not going to remove them. There is a possibility that smaller numbers are usually used in transactional or ham messages, while numbers like 1000000000000 could possibly appear more frequently in spam mails.

  • We will also not be removing words like “WWW” because spam email might contain URLs more frequently than ham ones.

  • Plotting the word frequencies after the second round of cleaning, using the pygal plotting library (see the plotting sketch at the end of this section):

    • If we look at the bar plots, we notice that the word “enron”, the name of the company, has a very large footprint in the ham plot. If we wanted to train a generic email classifier, we would need to exclude company-specific words like this.
  • However, we assume this classification model is particularly for that company, and hence keep the word for better accuracy.
  • The maximum number of times a frequent word occurs in spam emails is usually way less than ham’s (5000 – 40000)
  • Let’s look at the length of the emails. The average length calculated is the average number of meaningful words present in a mail: spam messages (136.15 words) are usually shorter than ham messages (174.39 words).
  • Plotting a pyramid plot for length of emails:


  • From the plot, we can see a few lengthy emails in ham; very rarely are spam mails extremely lengthy. This is quite plausible because many ham emails are replies or forwards, so the previous message history is also present in such mails.
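A minimal sketch of the bar plot mentioned above, assuming the pickled frequency Series from the exploration sketch; pygal renders the chart to an SVG file.

import pandas as pd
import pygal

spam_freq = pd.read_pickle("spam_freq.pkl")   # assumed file names
ham_freq = pd.read_pickle("ham_freq.pkl")

top = list(ham_freq.index[:30])               # 30 most frequent ham words
chart = pygal.Bar(title="Top 30 word frequencies", x_label_rotation=45)
chart.x_labels = top
chart.add("ham", [int(ham_freq[w]) for w in top])
chart.add("spam", [int(spam_freq.get(w, 0)) for w in top])
chart.render_to_file("word_freq.svg")         # open the SVG in a browser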

Extracting features for training:

  • Using TfidfVectorizer() to make a normalized document-term matrix.
  • Save the vectorizer to a file for use on the test set.
  • Generate labels for training the model, using 0 for ham and 1 for spam.
  • Save the fitted feature matrix and labels for the training step using pickle.
  • We’re using “protocol = 4” in the pickle dump to store objects more than 4 GB in size, as the feature matrix files are 4.6 GB and 9.5 GB respectively (a sketch follows).
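A sketch of this feature-extraction step under the same assumed file names; protocol=4 is what allows pickle to write objects larger than 4 GB.

import pickle
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

train_df = pd.read_pickle("train.pkl")

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_df["content"].str.join(" "))  # normalized document-term matrix
y_train = train_df["spam"].astype(int).values                          # 0 = ham, 1 = spam

# protocol=4 supports pickling objects larger than 4 GB.
with open("vectorizer.pkl", "wb") as f:
    pickle.dump(vectorizer, f, protocol=4)      # reused later on the test set
with open("features.pkl", "wb") as f:
    pickle.dump((X_train, y_train), f, protocol=4)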

Training:

  • Load the feature matrix and the labels

  • Make a dictionary of models so we can easily compare accuracy among different models

  • In LogisticRegression, we choose “ovr” as multi_class because it is suited to binary classification (spam vs. ham), as is the “liblinear” solver

  • Train and calculate the accuracy of the different models in the dictionary. We’ll be using K-Fold cross-validation with k = 15, so as to cover data that a single fixed validation set would have left out.

◦ Naive-Bayes: 98.67971094992707%
◦ LinearSVC: 99.15090797779823%
◦ LogisticRegression: 98.59050654038633%
◦ Model with max accuracy: LinearSVC
  • Compare the classification_report and accuracy of different models to pick the best model.

  • This snippet (sketched below) automatically chooses the model with the best accuracy, trains it and saves it to a file for testing.
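The snippet itself is not reproduced in this copy, so the following is a sketch of what it plausibly does, using the three models and k = 15 cross-validation described above; file names carry over from the earlier sketches.

import pickle
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

with open("features.pkl", "rb") as f:
    X_train, y_train = pickle.load(f)

models = {
    "Naive-Bayes": MultinomialNB(),
    "LinearSVC": LinearSVC(),
    "LogisticRegression": LogisticRegression(multi_class="ovr", solver="liblinear"),
}

# Mean 15-fold cross-validation accuracy for each model.
scores = {name: np.mean(cross_val_score(m, X_train, y_train, cv=15))
          for name, m in models.items()}
for name, score in scores.items():
    print("%s: %s%%" % (name, score * 100))

best = max(scores, key=scores.get)
print("Model with max accuracy:", best)

# Refit the winner on the full training set and save it for the test step.
models[best].fit(X_train, y_train)
with open("model.pkl", "wb") as f:
    pickle.dump(models[best], f)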

  • Gauge chart ranging from 0.9-1.0 for comparing accuracy

  • All accuracy calculations done till now are on the training set, by cross-validation

Testing the model:

  • The test data-frame which was previously saved is loaded, and a feature matrix is prepared just like for the training set

  • The feature matrix is prepared with the previously saved vectorizer, as the shape would change if a new one were fitted

  • Open the saved model file and predict the test set

  • Calculating the out of sample error for the model:

  • We get the out-of-sample error (RMS) and accuracy:

Out of sample Error: 1.0051%
Accuracy: 98.99%
  • Using sklearn.metrics.classification_report to get more information about the test.

  • ROC Curve for the classifier:

  • AUC values and the confusion matrix are calculated as follows (a sketch of the whole test step appears after the deductions below):

  • By looking at the confusion matrix, we get the following deductions:

Number of:
Ham messages classified correctly = 4821
Spam messages classified correctly = 5172
Ham messages misclassified = 78
Spam messages misclassified = 24
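A minimal sketch of this test step, under the same assumed file names as the earlier sketches. The report labels the error "RMS", but its roughly 1% value matches the plain misclassification rate (which, for 0/1 labels, equals the mean squared error), so that is what the sketch computes.

import pickle
import numpy as np
import pandas as pd
from sklearn.metrics import (accuracy_score, classification_report,
                             confusion_matrix, roc_curve, auc)

test_df = pd.read_pickle("test.pkl")
with open("vectorizer.pkl", "rb") as f:
    vectorizer = pickle.load(f)            # reuse: keeps the feature shape fixed
with open("model.pkl", "rb") as f:
    model = pickle.load(f)

X_test = vectorizer.transform(test_df["content"].str.join(" "))
y_test = test_df["spam"].astype(int).values
y_pred = model.predict(X_test)

err = np.mean((y_pred - y_test) ** 2)      # misclassification rate for 0/1 labels
print("Out of sample Error: %.4f%%" % (err * 100))
print("Accuracy: %.2f%%" % (accuracy_score(y_test, y_pred) * 100))
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))

# Hard predictions give a coarse ROC; decision_function scores would give a smoother curve.
fpr, tpr, _ = roc_curve(y_test, y_pred)
print("AUC:", auc(fpr, tpr))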
  • For use in production, the error rate should be at most 2.5%. The current model might be biased towards the Enron organization, as many company-specific terms are involved. So, for this particular organization, the model can be used in production.
  • However, for a generic classifier or for other companies, such bias must be removed and/or the model trained on a much larger dataset from multiple organizations covering a variety of use cases.

Libraries/Tools used for the assignment:

  • Pandas - Used for storing mails and their classification in a data-frame
  • Numpy - Numpy arrays and sum operations
  • sklearn - For classifiers, model selection, splitting, training, metrics
  • pygal - plotting
  • matplotlib - plotting ROC graph
  • nltk - For getting stopwords
  • Pickle - Saving and loading files
  • VSCode - IDE and debugging

APPENDIX – Running the python script

Initial steps:

Place all the pre-processed ham files in the dataset/allham/ directory and the spam files in the dataset/allspam/ directory. There are two approaches:

  1. If you have VSCode installed: it’s fairly simple, as a launch.json file is attached. Open the directory in VSCode, start debugging and choose the step you want to execute. Previous steps must be executed in order to produce the files required by the later ones.

  2. If not: you need to pass a command line argument (the step number) to the python file.

E.g.:
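The script’s actual file name isn’t shown in this copy, so the name below is a placeholder:

python emailclassifier.py 1    # run step 1 (e.g. data preparation)
python emailclassifier.py 4    # run step 4 (e.g. feature extraction), and so on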
NOTE: This might not work on Mac OSX because it may fail to resolve relative paths there.
