Sentiment Classifier 🎬

A sentiment analysis system that classifies movie reviews using TF-IDF vectorization and Logistic Regression. The project demonstrates a modular machine learning pipeline and reaches ~90% accuracy on the IMDB movie review dataset.

🌟 Features

Clean Architecture: Modular design with separate components for data loading, model training, and evaluation
Automated Hyperparameter Tuning: Uses RandomizedSearchCV for efficient parameter optimization
Comprehensive Evaluation: Includes confusion matrix, ROC curves, and detailed classification metrics
Easy to Use: Simple API for training and prediction
Well Documented: Extensive documentation and examples
Production Ready: Includes model serialization and loading capabilities

📁 Project Structure

```
Text_Classification_Model/
├── src/
│   └── sentiment_classifier/
│       ├── __init__.py           # Package initialization
│       ├── data_loader.py        # Data loading and preprocessing
│       ├── model.py              # Sentiment classifier model
│       └── evaluator.py          # Model evaluation and visualization
├── data/
│   ├── labeledTrainData.tsv      # Training dataset (not in git)
│   └── moviedata.csv             # Movie review data (not in git)
├── notebooks/
│   └── TFIDF_LR_Model.ipynb      # Exploratory analysis notebook
├── examples/
│   └── train_and_evaluate.py     # Example usage script
├── tests/                        # Unit tests (to be added)
├── docs/                         # Additional documentation
├── requirements.txt              # Project dependencies
├── setup.py                      # Package installation script
├── .gitignore                    # Git ignore rules
└── README.md                     # This file
```

🚀 Quick Start

Installation

Clone the repository

```
git clone https://github.com/yourusername/Text_Classification_Model.git
cd Text_Classification_Model
```

Create a virtual environment (recommended)

```
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
```

Install dependencies

```
pip install -r requirements.txt
```

Download NLTK stopwords

```
python -c "import nltk; nltk.download('stopwords')"
```

Install the package (optional, for development)

```
pip install -e .
```

Basic Usage

```
from sentiment_classifier import SentimentClassifier, DataLoader, ModelEvaluator

# Load the data and hold out 30% of it for testing
data_loader = DataLoader('data/labeledTrainData.tsv', test_size=0.3, random_state=42)
X_train, X_test, y_train, y_test = data_loader.load_and_split()

# Create and train the model
classifier = SentimentClassifier()
classifier.train(X_train, y_train)

# Evaluate the model on the held-out split
evaluator = ModelEvaluator(classifier)
results = evaluator.evaluate(X_test, y_test)

# Make predictions on new reviews
predictions = classifier.predict(["This movie was fantastic!", "Terrible waste of time"])
print(predictions)  # expected labels: 1 (positive), then 0 (negative)
```
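
The feature list mentions model serialization, but this README does not show the package's own save and load calls. As a rough sketch only, the snippet below persists a stand-in scikit-learn pipeline with joblib and checks that the restored copy predicts the same labels. The toy reviews, the file name `sentiment_model.joblib`, and the use of `TfidfVectorizer` in place of the package's classifier are all assumptions made for illustration.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Stand-in model (assumption): a plain scikit-learn pipeline, not the
# package's SentimentClassifier, whose save/load API is not shown here.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(
    ["great film", "wonderful acting", "terrible plot", "boring and dull"],
    [1, 1, 0, 0],
)

samples = ["great acting", "dull plot"]

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical file name; any path works.
    model_path = os.path.join(tmp_dir, "sentiment_model.joblib")
    joblib.dump(model, model_path)      # serialize the fitted pipeline
    restored = joblib.load(model_path)  # load it back without retraining

# The restored pipeline should reproduce the original predictions.
same_predictions = list(restored.predict(samples)) == list(model.predict(samples))
print(same_predictions)
```

Saving the whole pipeline, rather than only the classifier, keeps the fitted vocabulary and IDF weights together with the model, so raw text can be passed straight to the restored object.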

Running the Example

```
python examples/train_and_evaluate.py
```

📊 Model Performance

On a held-out test split of 7,500 reviews (30% of the IMDB movie review dataset), the model achieves:

Accuracy: ~90%
Precision: ~90% (balanced across classes)
Recall: ~90% (balanced across classes)
ROC AUC Score: ~0.96

Sample Results

```
Classification Report:
              precision    recall  f1-score   support
           0       0.91      0.88      0.90      3785
           1       0.88      0.91      0.90      3715
   avg/total       0.90      0.90      0.90      7500

ROC AUC Score: 0.9607
```
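
For readers who want to see how a report like the one above can be produced, here is a minimal sketch using scikit-learn's metric functions directly. The labels and scores are made-up toy values, not results from this project, and the 0.5 decision threshold is an assumption; the package's `ModelEvaluator` may compute these differently.

```python
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score

# Toy ground truth and predicted probabilities of the positive class
# (made-up values for illustration only).
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8, 0.45, 0.7, 0.9, 0.65]

# Hard labels come from thresholding the scores (0.5 is an assumption).
y_pred = [int(score >= 0.5) for score in y_score]

report = classification_report(y_true, y_pred, digits=2)
matrix = confusion_matrix(y_true, y_pred)  # rows: true class, columns: predicted class
auc = roc_auc_score(y_true, y_score)       # uses the scores, not the hard labels

print("Classification Report:")
print(report)
print(matrix)
print(f"ROC AUC Score: {auc:.4f}")
```

Note that `roc_auc_score` is given the continuous scores: it measures how well the model ranks positives above negatives, which thresholded labels cannot express.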

🔧 Model Pipeline

The classifier uses a scikit-learn Pipeline with three stages:

CountVectorizer: Converts text to token count matrix
- Removes English stopwords
- Supports n-grams (unigrams, bigrams, trigrams)
- Filters rare words using min_df parameter
TfidfTransformer: Transforms counts to TF-IDF features
- Weights each term by its inverse document frequency, so words that appear in most reviews count less
- Normalizes feature vectors
LogisticRegression: Binary classification
- L2 regularization to prevent overfitting
- Optimized with hyperparameter tuning
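
The three stages above could be assembled roughly as follows. This is a sketch under stated assumptions, not the code in `model.py`: the helper name `build_pipeline`, the step names, the default parameter values, and the toy reviews are hypothetical, and scikit-learn's built-in English stop word list is used here in place of the NLTK list so that the example stays self-contained.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline


def build_pipeline(ngram_range=(1, 2), min_df=1, C=10.0):
    """Hypothetical helper that chains the three stages described above."""
    return Pipeline([
        # Stage 1: token counts, dropping stop words and rare terms.
        ("vect", CountVectorizer(stop_words="english",
                                 ngram_range=ngram_range,
                                 min_df=min_df)),
        # Stage 2: reweight counts by inverse document frequency and normalize.
        ("tfidf", TfidfTransformer()),
        # Stage 3: logistic regression (L2-regularized by default).
        ("clf", LogisticRegression(C=C, max_iter=1000)),
    ])


# Toy data for illustration only; min_df=1 because the corpus is tiny.
reviews = [
    "a wonderful, moving film",
    "fantastic acting and a great story",
    "terrible plot and awful pacing",
    "boring, dull and a waste of time",
]
labels = [1, 1, 0, 0]

pipeline = build_pipeline()
pipeline.fit(reviews, labels)
train_predictions = list(pipeline.predict(reviews))
print(train_predictions)
```

Wrapping the stages in a single `Pipeline` means the vocabulary and IDF weights are learned only from the training data passed to `fit`, and the same transformations are applied automatically at prediction time.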

Hyperparameter Tuning

The model uses RandomizedSearchCV to efficiently search the following parameter space:

ngram_range: (1,1), (1,2), (1,3)
min_df: 2, 3, 4
C (inverse regularization strength; larger values mean weaker regularization): 1, 10, 20
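
A search over this parameter space might be set up as sketched below. The step names (`vect`, `tfidf`, `clf`), the values of `n_iter` and `cv`, the accuracy scoring, and the repeated toy reviews are assumptions chosen to keep the example small and runnable; the project's actual search settings are not given in this README.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ("vect", CountVectorizer(stop_words="english")),
    ("tfidf", TfidfTransformer()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# The parameter space listed above, addressed as <step name>__<parameter>.
param_distributions = {
    "vect__ngram_range": [(1, 1), (1, 2), (1, 3)],
    "vect__min_df": [2, 3, 4],
    "clf__C": [1, 10, 20],
}

# Toy corpus: sentences are repeated so that terms survive min_df up to 4
# inside each cross-validation fold.
positive = ["great film loved it", "great acting great story",
            "loved this great movie", "great fun loved the cast"] * 5
negative = ["awful film hated it", "awful acting awful story",
            "hated this awful movie", "awful mess hated the cast"] * 5
reviews = positive + negative
labels = [1] * len(positive) + [0] * len(negative)

search = RandomizedSearchCV(
    pipeline,
    param_distributions,
    n_iter=5,          # sample 5 of the 27 possible combinations (assumption)
    cv=2,              # small fold count for the toy corpus (assumption)
    scoring="accuracy",
    random_state=42,
)
search.fit(reviews, labels)

print(search.best_params_)
print(search.best_score_)
```

Unlike an exhaustive grid search, `RandomizedSearchCV` fits only `n_iter` sampled combinations, which is where the efficiency comes from when the space grows.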

📓 Jupyter Notebook

For interactive exploration, check out the Jupyter notebook:

```
jupyter notebook notebooks/TFIDF_LR_Model.ipynb
```

The notebook includes:

Detailed explanations of each step
Data exploration and visualization
Model training and evaluation
Performance metrics and plots

🎯 Use Cases

This sentiment classifier can be applied to:

Movie/Product Reviews: Classify customer sentiment
Social Media Monitoring: Analyze public opinion
Customer Feedback: Automatically categorize support tickets
Market Research: Understand consumer attitudes

📚 Dataset

The model is trained on the IMDB movie review dataset:

Source: Kaggle - Word2Vec NLP Tutorial
Size: 25,000 labeled movie reviews
Balance: 50% positive, 50% negative reviews
Format: TSV file with 'sentiment' and 'review' columns

Note: Due to size, data files are not included in the repository. Download them from Kaggle and place in the data/ directory.
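
Once the file is in place, loading and splitting it could look roughly like this. The sketch reads a tiny in-memory stand-in with made-up rows instead of the real `data/labeledTrainData.tsv`, and the stratified split is an assumption; the package's `DataLoader` may preprocess or split the data differently.

```python
import io

import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up rows standing in for the real TSV file (illustration only).
rows = [
    (1, "A great film with a moving story"),
    (1, "Loved the cast and the soundtrack"),
    (1, "Funny, warm and beautifully shot"),
    (1, "One of the best movies this year"),
    (1, "A clever script and superb acting"),
    (0, "Dull plot and wooden dialogue"),
    (0, "A complete waste of two hours"),
    (0, "Poorly paced and badly edited"),
    (0, "The ending made no sense at all"),
    (0, "Forgettable characters, weak story"),
]
sample_tsv = "sentiment\treview\n" + "\n".join(f"{s}\t{r}" for s, r in rows)

# For the real dataset, pass the file path instead of the StringIO object.
df = pd.read_csv(io.StringIO(sample_tsv), sep="\t")

# Hold out 30% for testing; stratify keeps the 50/50 class balance (assumption).
X_train, X_test, y_train, y_test = train_test_split(
    df["review"],
    df["sentiment"],
    test_size=0.3,
    random_state=42,
    stratify=df["sentiment"],
)

print(len(X_train), len(X_test))
```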

🛠️ Development

Running Tests

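Unit tests are still to be added (the tests/ directory is a placeholder). Once they exist, run them with:

```
pytest tests/
```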

Code Style

This project follows PEP 8 style guidelines. Format code with black and check it with flake8:

```
black src/
flake8 src/
```

Adding Features

Fork the repository
Create a feature branch (git checkout -b feature/amazing-feature)
Commit your changes (git commit -m 'Add amazing feature')
Push to the branch (git push origin feature/amazing-feature)
Open a Pull Request

📖 Documentation

For detailed API documentation, see the docstrings in each module:

src/sentiment_classifier/data_loader.py - Data loading utilities
src/sentiment_classifier/model.py - Model implementation
src/sentiment_classifier/evaluator.py - Evaluation tools

🤝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments

Dataset: Kaggle IMDB Dataset
Libraries: scikit-learn, NLTK, pandas, matplotlib
Inspiration: Text classification tutorials and best practices from the ML community

📞 Contact

For questions or feedback, please open an issue on GitHub.

🔗 Resources

Made with ❤️ by the open-source community

benstanbury/Text_Classification_Model
