A machine learning project to classify emails as spam or ham (not spam) using natural language processing (NLP). The model uses techniques like TF-IDF vectorization and compares multiple classification algorithms including Logistic Regression, Naive Bayes, and Support Vector Machines (SVM).
- Text preprocessing with TF-IDF
- Trained and validated on the `completespamassasin` dataset
- Compared multiple models: Logistic Regression, Naive Bayes, and SVM
- Best performance with SVM (98%+ accuracy)
- Save and reuse model for prediction
- CLI script to test your own email
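To make the TF-IDF step concrete, here is a minimal pure-Python sketch of the weighting. It assumes whitespace tokenization and the smoothed IDF formula that scikit-learn's `TfidfVectorizer` uses by default. The `tfidf` function and the sample sentences are hypothetical and only for illustration; the project itself relies on the library implementation.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per document, L2-normalised."""
    # Simplification: lowercase and split on whitespace only.
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))

    # Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1.
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1
           for term, count in df.items()}

    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)  # raw term counts in this document
        weights = {term: freq * idf[term] for term, freq in tf.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({t: w / norm for t, w in weights.items()} if norm else {})
    return vectors


# Hypothetical sample texts, not taken from the dataset.
docs = ["win a free prize now", "meeting notes for the team", "free prize inside"]
vectors = tfidf(docs)
```

Terms that occur in fewer documents (such as "win" here) receive a higher weight than terms shared across documents (such as "free"), which is what lets a classifier pick up on distinctive spam vocabulary.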
- Source: Kaggle - arXiv Spam Dataset
- File used: `completespamassasin.csv`
- Columns:
  - `Body`: The content of the email
  - `Label`: 0 = Ham, 1 = Spam
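Loading a file with this layout could look like the sketch below. It uses a tiny in-memory stand-in with made-up rows instead of the real `completespamassasin.csv`, and it assumes some rows may have an empty `Body` that should be dropped. The `load_emails` helper is hypothetical and not part of the repository.

```python
import io

import pandas as pd

# Hypothetical stand-in for completespamassasin.csv (made-up rows).
sample_csv = io.StringIO(
    "Body,Label\n"
    "Win a free prize now,1\n"
    "Meeting moved to 3pm,0\n"
    ",1\n"
)


def load_emails(source):
    """Read the CSV and keep only usable Body/Label rows."""
    df = pd.read_csv(source)
    # Assumption: rows without any body text are not useful for training.
    df = df.dropna(subset=["Body"])
    df["Body"] = df["Body"].astype(str)
    df["Label"] = df["Label"].astype(int)  # 0 = Ham, 1 = Spam
    return df[["Body", "Label"]].reset_index(drop=True)


emails = load_emails(sample_csv)
```

For the real dataset, pass the file path instead, for example `load_emails("completespamassasin.csv")`.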
- Python 3.8+
- Virtual environment (optional but recommended)

Install the dependencies with `pip install -r requirements.txt`. If `requirements.txt` is not provided, you can install manually with `pip install pandas numpy scikit-learn matplotlib seaborn`.

- Open the Jupyter notebook `spam_classifier.ipynb` or your Python script.
- Load and preprocess the dataset.
- Vectorize text using TF-IDF.
- Train using the best model (e.g., SVM).
- Save the model:

      import joblib
      joblib.dump(svm_model, 'spam_classifier_model.pkl')
      joblib.dump(vectorizer, 'tfidf_vectorizer.pkl')

Run the `classify_email.py` script to test any email string:

    python classify_email.py "Win a brand new iPhone now! Click here to claim."

Output:

    Prediction: SPAM

This uses the saved model and vectorizer (.pkl files).
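The repository's `classify_email.py` is not shown here, so the following is only a sketch of how such a script might work, assuming the two `.pkl` files sit in the working directory and the labels follow the dataset's 0 = Ham, 1 = Spam convention. The function names `load_artifacts`, `classify`, and `main` are hypothetical.

```python
import joblib

# Label convention taken from the dataset description: 0 = Ham, 1 = Spam.
LABELS = {0: "HAM", 1: "SPAM"}


def load_artifacts(model_path="spam_classifier_model.pkl",
                   vectorizer_path="tfidf_vectorizer.pkl"):
    """Load the fitted classifier and TF-IDF vectorizer saved with joblib."""
    return joblib.load(model_path), joblib.load(vectorizer_path)


def classify(text, model, vectorizer):
    """Vectorize a single email string and map the prediction to a label."""
    features = vectorizer.transform([text])
    return LABELS[int(model.predict(features)[0])]


def main(argv):
    """Command-line entry point; returns a process exit code."""
    if len(argv) != 2:
        print('Usage: python classify_email.py "<email text>"')
        return 1
    model, vectorizer = load_artifacts()
    print(f"Prediction: {classify(argv[1], model, vectorizer)}")
    return 0

# When saved as a script, the entry point would be:
#     import sys; sys.exit(main(sys.argv))
```

The important detail is that the email is transformed with the same fitted vectorizer that was saved during training. Fitting a new vectorizer at prediction time would produce features the model has never seen.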
| Model | Accuracy |
|---|---|
| Logistic Regression | 96.6% |
| Naive Bayes | 89.7% |
| SVM (Best) | 98.6% |
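A comparison like the one in the table could be set up roughly as follows. This is a sketch under assumptions: the README does not state the train/test split, the SVM variant, or any hyperparameters, so `LinearSVC`, the 25% test split, and the toy texts below are illustrative choices. The accuracies it produces on toy data will not match the reported figures.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC


def compare_models(texts, labels, test_size=0.25, seed=42):
    """Fit three classifiers on TF-IDF features and return test accuracies."""
    X_train, X_test, y_train, y_test = train_test_split(
        texts, labels, test_size=test_size, random_state=seed, stratify=labels
    )
    # Fit the vectorizer on training text only, then reuse it for the test set.
    vectorizer = TfidfVectorizer()
    X_train_vec = vectorizer.fit_transform(X_train)
    X_test_vec = vectorizer.transform(X_test)

    models = {
        "Logistic Regression": LogisticRegression(max_iter=1000),
        "Naive Bayes": MultinomialNB(),
        "SVM": LinearSVC(),  # assumption: a linear SVM
    }
    scores = {}
    for name, model in models.items():
        model.fit(X_train_vec, y_train)
        scores[name] = accuracy_score(y_test, model.predict(X_test_vec))
    return scores


# Hypothetical toy data (1 = Spam, 0 = Ham), only to make the sketch runnable.
spam = [
    "win a free prize now", "claim your free iphone today",
    "cheap loans click here", "you won the lottery claim now",
    "free money click this link", "exclusive offer win cash",
    "urgent claim your reward", "click here for a free gift",
]
ham = [
    "meeting moved to three pm", "please review the attached report",
    "lunch tomorrow with the team", "notes from the project call",
    "can you send the invoice", "see you at the conference",
    "the build passed on main", "reminder about the team meeting",
]
texts = spam + ham
labels = [1] * len(spam) + [0] * len(ham)
scores = compare_models(texts, labels)
```

With the real data, `texts` and `labels` would come from the `Body` and `Label` columns of `completespamassasin.csv`.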
Feel free to fork this repo and improve on:
- Text preprocessing (lemmatization, stemming)
- Adding Flask or Streamlit UI
- Deploying to web
Author: [Ndongmo Christian] Email: [christianhonore2003@gmail.com] GitHub: @ndongchrist