A machine learning-powered web application that detects phishing emails using Natural Language Processing and Logistic Regression. Built with Python, Streamlit, and scikit-learn.
- Real-time Detection: Instantly classify emails as Safe or Phishing
- High Accuracy: Trained on 175,000+ email samples
- Web Interface: User-friendly Streamlit application
- Confidence Scoring: Shows prediction confidence percentages
- Modular Design: Separate scripts for data processing, training, and prediction
This project implements a complete machine learning pipeline for phishing email detection:
- Data Processing: Clean and preprocess email dataset
- Feature Engineering: TF-IDF vectorization for text analysis
- Model Training: Logistic Regression classifier
- Web Deployment: Interactive Streamlit application
- Standalone Prediction: Command-line prediction script
- Language: Python 3
- Machine Learning: scikit-learn, pandas, numpy
- Web Framework: Streamlit
- Text Processing: TF-IDF Vectorization
- Model Persistence: joblib
Phishing Detector/
├── app.py                        # Streamlit web application
├── data_checker.py               # Data exploration script
├── data_cleaning.py              # Data preprocessing
├── model_train.py                # Model training script
├── predict.py                    # Standalone prediction
├── Phishing_Email.csv            # Original dataset
├── cleaned_phishing_emails.csv   # Processed dataset
├── AI_Models/
│   ├── phishing_model.joblib     # Trained model
│   └── vectorizer.joblib         # TF-IDF vectorizer
└── README.md                     # Project documentation
pip install streamlit pandas scikit-learn joblib numpy
- Clone or download the project:
  git clone <repository-url>
  cd "Phishing Detector"
- Install dependencies:
  pip install -r requirements.txt
- Run the Streamlit app:
  streamlit run app.py
- Open your browser to http://localhost:8501
- Paste email content and click "Check Email"
python predict.py
If you want to retrain the model:
- Check the dataset:
  python data_checker.py
- Clean the data:
  python data_cleaning.py
- Train the model:
  python model_train.py
- Algorithm: Logistic Regression
- Features: 7,000 TF-IDF features
- Training Data: ~140,000 emails (80% split)
- Test Data: ~35,000 emails (20% split)
- Evaluation: Accuracy, Precision, Recall, F1-Score
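The training setup listed above can be sketched roughly as follows. This is a minimal illustration on a toy dataset, not the contents of model_train.py: the example texts, the 1/0 label encoding, `stratify`, and `random_state=42` are all assumptions made so the snippet runs on its own.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Toy stand-in for the cleaned dataset; 1 = phishing, 0 = safe (assumed encoding).
texts = [
    "you won a prize click here to claim",
    "verify your account password immediately",
    "urgent your account is suspended click the link",
    "claim your free gift card now",
    "confirm your bank details to avoid suspension",
    "meeting moved to 2 pm tomorrow",
    "please review the attached quarterly report",
    "lunch on friday with the team",
    "here are the notes from today's call",
    "can you send me the project schedule",
]
labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

# 80% training / 20% test split, keeping both classes in each part.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=42
)

# Learn the vocabulary on the training split only, then reuse it for the test split.
vectorizer = TfidfVectorizer(stop_words="english", max_features=7000)
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

model = LogisticRegression(max_iter=1000)
model.fit(X_train_vec, y_train)

test_accuracy = model.score(X_test_vec, y_test)
print("Test accuracy:", test_accuracy)
```

Fitting the vectorizer on the training split alone keeps test emails from influencing the vocabulary, which would otherwise make the reported metrics look better than they are.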
If this project helped you or you found it useful, consider buying me a coffee! Your support helps maintain and improve this project.
Other ways to support:
- Star this repository
- Fork and contribute
- Share with others
- Report issues and bugs
- Removes stop words (common English words)
- Converts text to lowercase
- Handles missing data
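As a rough sketch of the cleaning steps above, the snippet below drops rows with missing values and lowercases the text using pandas. The column names `Email Text` and `Email Type` and the choice to drop (rather than fill) missing rows are assumptions; check them against the actual CSV and data_cleaning.py. Stop-word removal is assumed to happen later, in the TF-IDF vectorizer (`stop_words='english'`).

```python
import pandas as pd

# Hypothetical column names; adjust them to match the real CSV header.
TEXT_COL = "Email Text"
LABEL_COL = "Email Type"


def clean_emails(df):
    """Drop rows with missing text or label and lowercase the email text."""
    cleaned = df.dropna(subset=[TEXT_COL, LABEL_COL]).copy()
    cleaned[TEXT_COL] = cleaned[TEXT_COL].astype(str).str.lower().str.strip()
    # Remove rows that are empty once surrounding whitespace is stripped.
    cleaned = cleaned[cleaned[TEXT_COL] != ""]
    return cleaned.reset_index(drop=True)


raw = pd.DataFrame(
    {
        TEXT_COL: ["Click HERE to claim your prize!", None, "  ", "Meeting at 2 PM"],
        LABEL_COL: ["Phishing Email", "Safe Email", "Safe Email", "Safe Email"],
    }
)
cleaned = clean_emails(raw)
print(cleaned)
```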
- TF-IDF Vectorization: Converts email text to numerical features
- Term Frequency: How often words appear in each email
- Inverse Document Frequency: How rare words are across all emails
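To make the two terms concrete, here is a small hand-rolled TF-IDF calculation using only the standard library. It uses the smoothed IDF formula that scikit-learn applies by default, ln((1 + n) / (1 + df)) + 1, but skips the final L2 normalization, so the numbers are illustrative rather than identical to `TfidfVectorizer` output. The three example emails are made up.

```python
import math
from collections import Counter

docs = [
    "claim your prize now",
    "your meeting is tomorrow",
    "your invoice is attached",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: the number of emails that contain each word.
doc_freq = Counter(word for tokens in tokenized for word in set(tokens))


def idf(word):
    """Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + doc_freq[word])) + 1


def tfidf(tokens):
    """Raw term count times IDF for every word in one email."""
    counts = Counter(tokens)
    return {word: count * idf(word) for word, count in counts.items()}


weights = tfidf(tokenized[0])
for word, weight in sorted(weights.items(), key=lambda item: -item[1]):
    print(f"{word:>6}: {weight:.3f}")
```

"your" appears in every email, so it gets the lowest possible weight, while "prize" appears in only one and is weighted higher.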
- Logistic Regression: Binary classifier (Safe vs Phishing)
- Probability Scores: Confidence percentages for predictions
Email Text → TF-IDF Transform → Model Prediction → Result + Confidence
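The flow above might look like this in code. `classify_email` is a hypothetical helper, not a function taken from predict.py or app.py, and the tiny model trained inline only exists so the sketch runs without the saved files in AI_Models/. The "Phishing"/"Safe" label strings are also assumed.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression


def classify_email(text, model, vectorizer):
    """Return (predicted label, confidence) for a single email."""
    features = vectorizer.transform([text])            # TF-IDF transform
    probabilities = model.predict_proba(features)[0]   # model prediction
    best = probabilities.argmax()
    # Result plus confidence: the most likely class and its probability.
    return model.classes_[best], float(probabilities[best])


# Stand-in model; the real app would load the trained model and vectorizer.
train_texts = [
    "click here to claim your prize",
    "verify your password now",
    "meeting tomorrow at 2 pm",
    "notes from the team call",
]
train_labels = ["Phishing", "Phishing", "Safe", "Safe"]
vectorizer = TfidfVectorizer(stop_words="english")
model = LogisticRegression(max_iter=1000).fit(
    vectorizer.fit_transform(train_texts), train_labels
)

label, confidence = classify_email("Claim your prize now", model, vectorizer)
print(f"{label} ({confidence:.1%})")
```

Looking the label up through `model.classes_` avoids hard-coding which probability column belongs to which class.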
- Paste suspicious email content
- Click "Check Email"
- Get instant result with confidence score
Phishing Email Example:
Congratulations! You've won $1000! Click here immediately to claim your prize!
Safe Email Example:
Hi John, just confirming our meeting tomorrow at 2 PM. Please let me know if this works.
- Purpose: Convert text to numbers for machine learning
- Max Features: Vocabulary limited to the 7,000 most frequent terms in the corpus
- Stop Words: Removes common words like "the", "and", "is"
- Why: Fast, interpretable, good for text classification
- Output: Binary classification with probability scores
- Training: Up to 1,000 solver iterations, so the optimizer has room to converge
- Saved Files: Both model and vectorizer saved separately
- Consistency: Ensures same preprocessing for new predictions
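A minimal sketch of saving and reloading both artifacts with joblib is shown below. The file names match the project tree, but a temporary directory stands in for AI_Models/ and the training data is a toy example, so treat this as an illustration rather than the project's actual script.

```python
import tempfile
from pathlib import Path

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

texts = [
    "claim your prize now",
    "verify your password",
    "team meeting at noon",
    "report attached for review",
]
labels = [1, 1, 0, 0]  # assumed encoding: 1 = phishing, 0 = safe

vectorizer = TfidfVectorizer(stop_words="english")
model = LogisticRegression(max_iter=1000).fit(vectorizer.fit_transform(texts), labels)

# A temporary directory stands in for AI_Models/ so no files are left behind.
with tempfile.TemporaryDirectory() as tmp:
    model_path = Path(tmp) / "phishing_model.joblib"
    vectorizer_path = Path(tmp) / "vectorizer.joblib"
    joblib.dump(model, model_path)
    joblib.dump(vectorizer, vectorizer_path)

    loaded_model = joblib.load(model_path)
    loaded_vectorizer = joblib.load(vectorizer_path)

# The reloaded pair should reproduce the original predictions.
sample = ["please verify your password"]
original = model.predict(vectorizer.transform(sample))
restored = loaded_model.predict(loaded_vectorizer.transform(sample))
print(original, restored)
```

The vectorizer has to be saved alongside the model because the model's coefficients are tied to the exact vocabulary and column order learned during training.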
- Input validation for email content
- No storage of user email data
- Secure model file handling
- Error handling for corrupted files
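One way the input validation mentioned above could be approached is sketched here. The function name `validate_email_text` and the 50,000-character limit are hypothetical and not taken from app.py.

```python
MAX_EMAIL_CHARS = 50_000  # hypothetical limit, not taken from the project


def validate_email_text(text):
    """Return the stripped text, or raise ValueError if the input is unusable."""
    if not isinstance(text, str):
        raise ValueError("Email content must be text.")
    cleaned = text.strip()
    if not cleaned:
        raise ValueError("Email content is empty.")
    if len(cleaned) > MAX_EMAIL_CHARS:
        raise ValueError(f"Email content exceeds {MAX_EMAIL_CHARS} characters.")
    return cleaned


print(validate_email_text("  Hi John, just confirming our meeting.  "))
```

Raising an exception lets the caller decide how to report the problem, for example as a warning in the web interface or as an error message on the command line.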
# In model_train.py
vectorizer = TfidfVectorizer(
    stop_words='english',
    max_features=7000,   # Adjust number of features
    ngram_range=(1, 2)   # Add bigrams
)

# Replace LogisticRegression with other algorithms
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(n_estimators=100)

The model provides:
- Prediction confidence (probability scores)
- Classification reports during training
- Accuracy metrics on test data
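For reference, the metrics above can be computed with scikit-learn as in the sketch below. The label vectors are invented for illustration (1 = phishing, 0 = safe is an assumed encoding) and do not reflect this project's results.

```python
from sklearn.metrics import (
    accuracy_score,
    classification_report,
    precision_recall_fscore_support,
)

# Made-up ground truth and predictions for eight emails.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]

accuracy = accuracy_score(y_true, y_pred)
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="binary", pos_label=1
)
print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")

# Per-class breakdown; target_names follow the sorted label order (0, then 1).
print(classification_report(y_true, y_pred, target_names=["Safe", "Phishing"]))
```

For phishing detection, recall on the phishing class is usually worth watching closely, since a missed phishing email tends to cost more than a false alarm.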
- Fork the repository
- Create a feature branch
- Make your changes
- Test thoroughly
- Submit a pull request
This project is open source and available under the MIT License.
- "Model files not found" error
  - Run python model_train.py first to create the model files
- Import errors
  - Install missing packages: pip install package_name
- Dataset not found
  - Ensure Phishing_Email.csv is in the project directory
- Streamlit not opening
  - Check if port 8501 is available
  - Try: streamlit run app.py --server.port 8502
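If you are unsure whether a port is available, a quick check like the one below can help. `port_is_free` is a hypothetical helper written for this README, and it only tests the local loopback address.

```python
import socket


def port_is_free(port, host="127.0.0.1"):
    """Return True if a socket can be bound to host:port, False otherwise."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            # Typically "address already in use"; permission errors land here too.
            return False
    return True


for candidate in (8501, 8502):
    print(candidate, "free" if port_is_free(candidate) else "in use")
```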
For questions or support, please open an issue in the repository.