An end-to-end NLP project to classify SMS messages as Spam or Ham (Not Spam) using traditional machine learning, TF–IDF features, and a Streamlit web interface.
This project demonstrates a full ML workflow:
- Data ingestion from Kaggle’s SMS Spam Collection Dataset
- Text preprocessing with NLTK (cleaning, stopword removal, stemming)
- Feature extraction using TF–IDF
- Model training & evaluation with Logistic Regression (and optional Naive Bayes)
- Model persistence with joblib
- Interactive web app built with Streamlit
- Deployment-ready for Streamlit Community Cloud
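The workflow above can be sketched roughly as follows. This is a minimal illustration rather than the project's actual notebook code: the helper names `clean_text` and `train_and_save`, the toy messages, and the hyperparameters are all assumptions made for the example, and scikit-learn's built-in stopword list is used as a fallback when the NLTK stopwords corpus has not been downloaded.

```python
import re
import tempfile
from pathlib import Path

import joblib
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Prefer NLTK's stopword list; fall back to scikit-learn's built-in list
# if the NLTK corpus is not installed (an assumption made for this sketch).
try:
    from nltk.corpus import stopwords

    STOPWORDS = set(stopwords.words("english"))
except LookupError:
    STOPWORDS = set(ENGLISH_STOP_WORDS)

STEMMER = PorterStemmer()


def clean_text(text: str) -> str:
    """Lowercase, keep letters only, drop stopwords, and stem each token."""
    tokens = re.sub(r"[^a-z]+", " ", text.lower()).split()
    return " ".join(STEMMER.stem(t) for t in tokens if t not in STOPWORDS)


def train_and_save(messages, labels, model_dir: Path) -> float:
    """Fit TF-IDF + Logistic Regression, persist both, return test accuracy."""
    cleaned = [clean_text(m) for m in messages]
    X_train, X_test, y_train, y_test = train_test_split(
        cleaned, labels, test_size=0.25, stratify=labels, random_state=42
    )
    vectorizer = TfidfVectorizer()
    model = LogisticRegression(max_iter=1000)
    model.fit(vectorizer.fit_transform(X_train), y_train)
    accuracy = accuracy_score(y_test, model.predict(vectorizer.transform(X_test)))

    # File names follow the models/ directory described in this README.
    model_dir.mkdir(parents=True, exist_ok=True)
    joblib.dump(model, model_dir / "spam_model.pkl")
    joblib.dump(vectorizer, model_dir / "vectorizer.pkl")
    return accuracy


# Hypothetical toy data (1 = spam, 0 = ham); the real project uses spam.csv.
MESSAGES = [
    "WINNER!! You have won a free prize, call now to claim",
    "Free entry in a weekly competition, text WIN to 80086",
    "URGENT! Your mobile number has won 2000 cash, call now",
    "Claim your free ringtone now, reply YES",
    "Are we still meeting for lunch today?",
    "I'll call you when I get home",
    "Can you pick up some milk on the way back?",
    "See you at the gym tomorrow morning",
]
LABELS = [1, 1, 1, 1, 0, 0, 0, 0]

if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        print("test accuracy:", train_and_save(MESSAGES, LABELS, Path(tmp)))
```

With only eight toy messages the reported accuracy is not meaningful; the point is the order of the steps. Note that the vectorizer is saved alongside the model, since new messages must be transformed with the same fitted vocabulary at prediction time.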
- Name: SMS Spam Collection Dataset
- Source (Kaggle): https://www.kaggle.com/datasets/uciml/sms-spam-collection-dataset
- Instances: ~5.5k SMS messages labeled as ham or spam
Please refer to the dataset page for licensing and citation details.
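Loading the dataset might look like the sketch below. It assumes the Kaggle CSV keeps the label in a column named `v1` and the message text in `v2`, and that the file is Latin-1 encoded; check these against your downloaded copy. The function name `load_sms_dataset` and the in-memory sample are hypothetical.

```python
import io

import pandas as pd


def load_sms_dataset(source) -> pd.DataFrame:
    """Load the SMS CSV and return columns: label, text, target (1 = spam)."""
    # Assumption: labels live in "v1" and message text in "v2"; any extra
    # unnamed columns in the file are ignored via usecols.
    df = pd.read_csv(source, encoding="latin-1", usecols=["v1", "v2"])
    df = df.rename(columns={"v1": "label", "v2": "text"})
    df["target"] = df["label"].map({"ham": 0, "spam": 1})
    return df


# Tiny in-memory stand-in for data/spam.csv so the sketch runs on its own.
SAMPLE_CSV = io.StringIO(
    "v1,v2,Unnamed: 2\n"
    "ham,See you at lunch,\n"
    "spam,WIN a free prize now,\n"
)
sample = load_sms_dataset(SAMPLE_CSV)
print(sample)
```

For the real data, the call would be `load_sms_dataset("data/spam.csv")`.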
- Language: Python
- Libraries:
  - pandas, numpy
  - scikit-learn (TF–IDF, Logistic Regression, Naive Bayes, metrics)
  - nltk (stopwords, stemming)
  - joblib (model serialization)
  - streamlit (web app)
sms-spam-classifier/
├── app.py # Streamlit app
├── requirements.txt # Python dependencies
├── README.md # Project documentation
├── data/
│   └── spam.csv # Kaggle dataset (placed here by you)
├── models/
│   ├── spam_model.pkl # Trained Logistic Regression model
│   └── vectorizer.pkl # TF–IDF vectorizer
└── notebooks/
    └── sms_spam_classifier.ipynb # Training & evaluation notebook
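
Given this layout, an `app.py` could load the two saved artifacts and classify a message roughly as sketched below. This is an illustrative assumption, not the project's actual app: the helper names `load_artifacts` and `classify`, the `1 = spam` label encoding, and the UI text are hypothetical, and the optional `preprocess` argument stands in for whatever cleaning was applied during training.

```python
from pathlib import Path

import joblib

MODEL_DIR = Path("models")


def load_artifacts(model_dir: Path = MODEL_DIR):
    """Load the persisted classifier and TF-IDF vectorizer from model_dir."""
    model = joblib.load(model_dir / "spam_model.pkl")
    vectorizer = joblib.load(model_dir / "vectorizer.pkl")
    return model, vectorizer


def classify(text: str, model, vectorizer, preprocess=None) -> str:
    """Return "spam" or "ham" for a single message."""
    # The same cleaning used at training time should be applied here.
    if preprocess is not None:
        text = preprocess(text)
    prediction = model.predict(vectorizer.transform([text]))[0]
    # Assumption: the model was trained with 1 = spam and 0 = ham.
    return "spam" if prediction == 1 else "ham"


def main() -> None:
    """Streamlit UI: a text box, a button, and the predicted label."""
    import streamlit as st  # imported lazily so the helpers work without it

    st.title("SMS Spam Classifier")
    message = st.text_area("Enter an SMS message")
    if st.button("Classify") and message.strip():
        model, vectorizer = load_artifacts()
        if classify(message, model, vectorizer) == "spam":
            st.error("Spam")
        else:
            st.success("Ham (not spam)")
```

In an actual `app.py`, `main()` would be called at the bottom of the file and the app started with `streamlit run app.py`. Wrapping `load_artifacts` with `st.cache_resource` would avoid reloading the files on every rerun.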