A complete machine learning pipeline for classifying emails as Ham (legitimate), Spam, or Phishing using multiple datasets.
This project combines data preprocessing, TF-IDF vectorization, SMOTE oversampling, and Random Forest classification, and provides a FastAPI backend for real-time predictions.
With the rise of phishing and spam emails, automated detection is crucial for cybersecurity.
This project:
- Loads and preprocesses multiple email datasets
- Converts email text into TF-IDF vectors
- Handles imbalanced classes using SMOTE
- Trains a Random Forest classifier
- Exposes the trained model through a FastAPI REST API
It is suitable for AI/ML practitioners, cybersecurity enthusiasts, and anyone interested in real-time email classification.
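
To make the TF-IDF step above more concrete, here is a minimal standard-library sketch that computes the weights by hand. It assumes raw term counts for TF and the smoothed IDF `ln((1 + n) / (1 + df)) + 1` (the default formula in scikit-learn's `TfidfVectorizer`), and it skips length normalization. The helper names `tokenize` and `tfidf` and the sample emails are hypothetical; the project's actual vectorizer settings may differ.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf(documents):
    """Return one {term: weight} dict per document.

    TF is the raw term count; IDF is ln((1 + n) / (1 + df)) + 1.
    No length normalization is applied in this sketch.
    """
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
    return [
        {t: count * idf[t] for t, count in Counter(tokens).items()}
        for tokens in tokenized
    ]


# Hypothetical sample emails, for illustration only.
emails = [
    "Verify your account now",
    "Meeting notes for your team",
    "Win a free prize now",
]
weights = tfidf(emails)
```

Terms that appear in fewer emails (such as "verify") receive a higher weight than terms shared across emails (such as "your"), which is what lets the classifier focus on distinctive vocabulary.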
- 📥 Combines datasets: SpamAssassin, CEAS_08, Nazario, Nigerian_Fraud
- 🧹 Cleans and merges subject & body of emails
- 📊 Converts text to numerical features with TF-IDF
- 🔁 Balances data with SMOTE
- 🤖 Trains a Random Forest classifier
- ⚡ Serves predictions through FastAPI endpoints
- 📈 Generates confusion matrix and classification metrics
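
The SMOTE balancing step listed above can be illustrated with a small standard-library sketch. It shows only the core idea (interpolating between a minority sample and one of its nearest minority neighbours) and is not a full implementation; in practice a library such as imbalanced-learn would normally be used. The function name `smote_like_oversample`, its parameters, and the toy vectors are hypothetical.

```python
import random


def smote_like_oversample(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between minority samples.

    For each synthetic point: pick a random minority sample, pick one of its
    k nearest minority neighbours, and take a random point on the segment
    joining them. Requires at least two minority samples.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority samples by squared Euclidean distance to base.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


# Toy 2-D feature vectors standing in for TF-IDF rows of a rare class.
phishing = [[0.0, 1.0], [0.2, 0.8], [0.1, 0.9]]
new_points = smote_like_oversample(phishing, n_new=4)
```

Because every synthetic point lies on a segment between two real minority samples, the new rows stay inside the region the minority class already occupies rather than being exact duplicates.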
phishing_spam_api/
│
├── app.py # FastAPI application
├── model/
│ ├── model.pkl # Trained model (generated via notebook)
│ ├── vectorizer.pkl # TF-IDF Vectorizer (generated via notebook)
│ └── train_model.py # Script to train and save model/vectorizer
├── data/ # Original CSV datasets
├── phishing-and-spam-detection.ipynb # Notebook for training & downloading artifacts
├── requirements.txt
└── README.md

git clone https://github.com/Bilal-73/Phishing-and-Spam-Detection.git
cd Phishing-and-Spam-Detection
The trained model and vectorizer are not included due to file size. You can generate them by running the Jupyter notebook (phishing-and-spam-detection.ipynb).
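
Since the API depends on these generated files, it can help to fail early with a clear message when they are missing. The sketch below is one possible way to do that; the helper `load_artifacts` is hypothetical, and it assumes the artifacts were saved with Python's `pickle` module under `model/model.pkl` and `model/vectorizer.pkl`, as the project layout suggests.

```python
import pickle
from pathlib import Path


def load_artifacts(model_dir="model"):
    """Load the pickled model and vectorizer, failing early if either is missing."""
    model_dir = Path(model_dir)
    loaded = {}
    for name in ("model", "vectorizer"):
        path = model_dir / f"{name}.pkl"
        if not path.exists():
            raise FileNotFoundError(
                f"{path} not found; generate it with the notebook first"
            )
        with path.open("rb") as fh:
            loaded[name] = pickle.load(fh)
    return loaded["model"], loaded["vectorizer"]
```

Calling `load_artifacts()` once at startup would surface a missing file immediately instead of on the first prediction request. Only unpickle files you generated yourself, since loading a pickle can execute arbitrary code.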
Bilal Imran
- 💼 AI / ML & Full-Stack Enthusiast
- 🔗 GitHub: https://github.com/Bilal-73
⭐ Show Your Support

If you found this project useful:
- ⭐ Star the repository
- 🍴 Fork it
- 💡 Suggest improvements