A transformer-based deep learning model that classifies text messages as spam or ham using BERT (Bidirectional Encoder Representations from Transformers).
This project includes complete preprocessing, class balancing, sentence embeddings, model training, evaluation, and inference using TensorFlow & TensorFlow Hub.
Project Highlights
- Built using BERT (uncased, L-12, H-768, A-12)
- Achieves ~93% accuracy, with ~90–95% precision and ~90–95% recall
- Uses TF Hub BERT Preprocessing + Encoder layers
- Balanced imbalanced dataset using downsampling
- Evaluated using confusion matrix, classification report, precision, recall, F1-score
- Predicts new messages with high confidence
- Demonstrates semantic similarity using BERT embeddings
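Semantic similarity between two messages can be scored by taking the cosine similarity of their BERT embeddings. A minimal sketch of the similarity computation, assuming the embedding vectors have already been produced by the encoder (short hypothetical vectors stand in for the real 768-dimensional pooled outputs):

```python
import numpy as np

def cosine_similarity(a, b):
    # cos(theta) = (a . b) / (|a| * |b|)
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical low-dimensional embeddings standing in for BERT pooled outputs
v_spam_a = [0.90, 0.10, 0.80, 0.20]   # "win a free prize now"
v_spam_b = [0.85, 0.15, 0.75, 0.30]   # "claim your free reward"
v_ham    = [0.10, 0.90, 0.20, 0.80]   # "see you at lunch"

# Similar (spam-like) messages score closer to 1.0 than dissimilar ones
assert cosine_similarity(v_spam_a, v_spam_b) > cosine_similarity(v_spam_a, v_ham)
```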
Dataset
Dataset from Kaggle: SMS Spam Collection Dataset
Class distribution:
- Ham: 4825 messages
- Spam: 747 messages
Strong class imbalance, handled via random downsampling.
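The downsampling step keeps all spam rows and randomly samples an equal number of ham rows. A sketch with pandas, using a toy frame in place of the Kaggle CSV (column names `Category`/`Message` are assumptions about the loaded dataset):

```python
import pandas as pd

# Toy frame standing in for the SMS Spam Collection dataset
df = pd.DataFrame({
    "Category": ["ham"] * 10 + ["spam"] * 3,
    "Message": [f"msg {i}" for i in range(13)],
})

spam = df[df["Category"] == "spam"]
ham = df[df["Category"] == "ham"]

# Randomly downsample the majority (ham) class to the minority count
ham_down = ham.sample(n=len(spam), random_state=42)

# Recombine and shuffle the balanced frame
balanced = pd.concat([ham_down, spam]).sample(frac=1, random_state=42)
print(balanced["Category"].value_counts())
```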
Data Preprocessing
- Removed imbalance using random downsampling
- Converted categories into binary labels (spam = 1, ham = 0)
- Train-test split using stratified sampling
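The labeling and split steps above can be sketched as follows; the stratified split preserves the spam/ham ratio in both partitions (toy data and the `spam` column name are illustrative assumptions):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy balanced frame standing in for the downsampled dataset
df = pd.DataFrame({
    "Category": ["ham", "spam"] * 20,
    "Message": [f"msg {i}" for i in range(40)],
})

# Binary labels: spam = 1, ham = 0
df["spam"] = df["Category"].apply(lambda c: 1 if c == "spam" else 0)

# Stratified split keeps the class ratio identical in train and test
X_train, X_test, y_train, y_test = train_test_split(
    df["Message"], df["spam"],
    test_size=0.2, stratify=df["spam"], random_state=42,
)
```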
Model Architecture
The BERT pipeline:
- BERT Preprocessing Layer (TF Hub)
- BERT Encoder Layer (TF Hub)
- Dropout
- Dense Layer (sigmoid activation)
Only the final dense layer is trainable; the frozen BERT encoder acts as a feature extractor.
Model Performance

