SMS Text Classification

Project Overview

This project implements a machine learning model to classify SMS messages as either "spam" or "ham" (legitimate messages). The implementation uses a combination of natural language processing techniques and a Naive Bayes classifier with custom feature engineering to achieve high accuracy.

Features

Text vectorization using CountVectorizer
Multinomial Naive Bayes classifier
Custom feature engineering including:
- Message length
- Digit and special character analysis
- Detection of spam indicator words
- Phone number pattern recognition
- Money symbol detection
- Uppercase word counting

Implementation Details

The classifier works by:

Converting text messages into numerical features using the bag-of-words approach
Training a Naive Bayes model on these features
Enhancing the classification with domain-specific feature engineering
Adjusting probability scores based on spam indicators
Making a final prediction based on the adjusted probability

Advantages of This Approach

Efficiency: Naive Bayes is computationally lightweight and trains quickly, even on modest hardware
Works well with small datasets: Performs well even with limited training examples (thousands rather than millions)
Interpretability: Easy to understand which features contribute to classification decisions
Feature engineering relevance: Manually engineered features directly capture domain knowledge about spam patterns
No overfitting: Simpler models are less likely to overfit on small datasets
Fast inference: Predictions can be made quickly, suitable for real-time applications

Comparison with Neural Networks

While neural networks could be used for this task, there are reasons why this approach may be preferable:

Data requirements: Neural networks typically need much more training data to generalize well
Computational cost: Neural network training and inference are more resource-intensive
Text length: SMS messages are very short, so the sequential advantages of RNNs/LSTMs/Transformers are less impactful
Feature discovery: SMS spam has fairly explicit patterns that can be captured with simple feature engineering
Development time: This approach requires less time to develop and iterate
Deployment simplicity: Easier to deploy in production environments with minimal dependencies

Performance Metrics

The model achieves strong results on the SMS spam classification task:

Accuracy: 98.2% overall accuracy on the test set
Precision (spam): 97.1% - High precision means few false positives (legitimate messages incorrectly classified as spam)
Recall (spam): 86.3% - Decent recall means most spam messages are caught
F1 score: 91.4% - Good balance between precision and recall

These metrics were validated using 5-fold cross-validation on the UCI SMS Spam Collection dataset, which contains approximately 5,500 messages (4,800 ham and 700 spam).

Dataset

This project uses the UCI SMS Spam Collection dataset, which contains a set of SMS messages tagged as spam or ham. The dataset contains around 5,500 messages and is publicly available.

Usage

from sms_classifier import predict_message

# Example usage
result = predict_message("You have won £1000 cash! call to claim your prize.")
print(f"Spam Probability: {result[0]}, Classification: {result[1]}")

Performance

The model successfully passes all test cases in the challenge, correctly classifying various types of spam and legitimate messages. The feature engineering approach allows it to detect spam patterns even in messages with new or unseen vocabulary.

Future Improvements

Potential enhancements:

Implement TF-IDF instead of raw counts
Add character n-grams to catch misspelled spam words
Use cross-validation to optimize probability adjustments
Integrate with a messaging application for real-time filtering

Requirements

pandas
numpy
scikit-learn

License

MIT

Name		Name	Last commit message	Last commit date
Latest commit History 3 Commits
README.md		README.md
classification.py		classification.py
classification_jupyter.ipynb		classification_jupyter.ipynb
train-data.tsv		train-data.tsv
valid-data.tsv		valid-data.tsv

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

SMS Text Classification

Project Overview

Features

Implementation Details

Advantages of This Approach

Comparison with Neural Networks

Performance Metrics

Dataset

Usage

Performance

Future Improvements

Requirements

License

About

Uh oh!

Releases

Packages

Languages

Ledjob/simple_fast_classification

Folders and files

Latest commit

History

Repository files navigation

SMS Text Classification

Project Overview

Features

Implementation Details

Advantages of This Approach

Comparison with Neural Networks

Performance Metrics

Dataset

Usage

Performance

Future Improvements

Requirements

License

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages