A professional sentiment analysis system using TF-IDF vectorization and Logistic Regression for movie review classification. This project demonstrates best practices in machine learning pipeline development, achieving ~90% accuracy on the IMDB movie review dataset.
- Clean Architecture: Modular design with separate components for data loading, model training, and evaluation
- Automated Hyperparameter Tuning: Uses RandomizedSearchCV for efficient parameter optimization
- Comprehensive Evaluation: Includes confusion matrix, ROC curves, and detailed classification metrics
- Easy to Use: Simple API for training and prediction
- Well Documented: Extensive documentation and examples
- Production Ready: Includes model serialization and loading capabilities
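
As a rough illustration of the serialization feature: the package's own save and load methods are not shown in this README, so the sketch below assumes a plain scikit-learn pipeline persisted with joblib. The toy corpus and the file name `sentiment_model.joblib` are made up for the example.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny toy corpus, for illustration only (not the IMDB data).
texts = ["great film", "loved it", "awful film", "hated it"]
labels = [1, 1, 0, 0]

pipeline = make_pipeline(CountVectorizer(), TfidfTransformer(), LogisticRegression())
pipeline.fit(texts, labels)

# Persist the fitted pipeline to a temporary directory, then load it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "sentiment_model.joblib")  # hypothetical file name
    joblib.dump(pipeline, model_path)
    restored = joblib.load(model_path)

# The restored pipeline should reproduce the original predictions.
same_predictions = list(restored.predict(texts)) == list(pipeline.predict(texts))
print(same_predictions)
```

Saving the whole pipeline, rather than only the classifier, keeps the fitted vocabulary and IDF weights together with the model so that raw text can be passed straight to `predict` after loading.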
Text_Classification_Model/
├── src/
│   └── sentiment_classifier/
│       ├── __init__.py          # Package initialization
│       ├── data_loader.py       # Data loading and preprocessing
│       ├── model.py             # Sentiment classifier model
│       └── evaluator.py         # Model evaluation and visualization
├── data/
│   ├── labeledTrainData.tsv     # Training dataset (not in git)
│   └── moviedata.csv            # Movie review data (not in git)
├── notebooks/
│   └── TFIDF_LR_Model.ipynb     # Exploratory analysis notebook
├── examples/
│   └── train_and_evaluate.py    # Example usage script
├── tests/                       # Unit tests (to be added)
├── docs/                        # Additional documentation
├── requirements.txt             # Project dependencies
├── setup.py                     # Package installation script
├── .gitignore                   # Git ignore rules
└── README.md                    # This file
- Clone the repository
  git clone https://github.com/yourusername/Text_Classification_Model.git
  cd Text_Classification_Model
- Create a virtual environment (recommended)
  python -m venv venv
  source venv/bin/activate  # On Windows: venv\Scripts\activate
- Install dependencies
  pip install -r requirements.txt
- Download NLTK stopwords
  python -c "import nltk; nltk.download('stopwords')"
- Install the package (optional, for development)
  pip install -e .
from sentiment_classifier import SentimentClassifier, DataLoader, ModelEvaluator
# Load and split data
data_loader = DataLoader('data/labeledTrainData.tsv', test_size=0.3, random_state=42)
X_train, X_test, y_train, y_test = data_loader.load_and_split()
# Create and train the model
classifier = SentimentClassifier()
classifier.train(X_train, y_train)
# Evaluate the model
evaluator = ModelEvaluator(classifier)
results = evaluator.evaluate(X_test, y_test)
# Make predictions
predictions = classifier.predict(["This movie was fantastic!", "Terrible waste of time"])
print(predictions)  # [1, 0] (1 = positive, 0 = negative)

Alternatively, run the example script:
python examples/train_and_evaluate.py

The model achieves the following performance on the IMDB movie review dataset:
- Accuracy: ~90%
- Precision: ~90% (balanced across classes)
- Recall: ~90% (balanced across classes)
- ROC AUC Score: ~0.96
Classification Report:
              precision    recall  f1-score   support
           0       0.91      0.88      0.90      3785
           1       0.88      0.91      0.90      3715
   avg/total       0.90      0.90      0.90      7500
ROC AUC Score: 0.9607
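
For readers who want to see how metrics of this kind are produced, here is a minimal sketch using scikit-learn's metric functions directly. The labels and probabilities are invented for illustration and have nothing to do with the results above; `ModelEvaluator` may compute its report differently.

```python
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score

# Hypothetical true labels and predicted positive-class probabilities
# for eight reviews (illustrative values only).
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_prob = [0.1, 0.4, 0.35, 0.8, 0.45, 0.9, 0.7, 0.95]

# Threshold the probabilities at 0.5 to obtain hard class predictions.
y_pred = [int(p >= 0.5) for p in y_prob]

matrix = confusion_matrix(y_true, y_pred)
print(matrix)
print(classification_report(y_true, y_pred, digits=2))

# ROC AUC is computed from the scores, not from the thresholded predictions.
auc = roc_auc_score(y_true, y_prob)
print(f"ROC AUC Score: {auc:.4f}")
```

Precision, recall and F1 depend on the 0.5 threshold, whereas ROC AUC summarizes ranking quality across all thresholds, which is why the two can differ noticeably.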
The classifier uses a scikit-learn Pipeline with three stages:
- CountVectorizer: Converts text to a token count matrix
  - Removes English stopwords
  - Supports n-grams (unigrams, bigrams, trigrams)
  - Filters rare words using the min_df parameter
- TfidfTransformer: Transforms counts to TF-IDF features
  - Weighs terms by importance
  - Normalizes feature vectors
- LogisticRegression: Binary classification
  - L2 regularization to prevent overfitting
  - Optimized with hyperparameter tuning
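
The three stages above can be sketched as a plain scikit-learn `Pipeline`. This is an approximation rather than the code in `model.py`: the step names (`vect`, `tfidf`, `clf`), the built-in `stop_words="english"` list (the project may use NLTK's list instead), the chosen parameter values, and the toy corpus are all assumptions.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    # Stage 1: token counts, with stopword removal and unigrams + bigrams.
    # min_df=1 keeps every term because the toy corpus is tiny.
    ("vect", CountVectorizer(stop_words="english", ngram_range=(1, 2), min_df=1)),
    # Stage 2: reweight counts by inverse document frequency, then L2-normalize rows.
    ("tfidf", TfidfTransformer()),
    # Stage 3: logistic regression (L2 regularization is scikit-learn's default).
    ("clf", LogisticRegression(C=10, max_iter=1000)),
])

# Made-up training examples, for illustration only.
texts = [
    "a wonderful, moving film",
    "brilliant acting and a great story",
    "a dull, boring mess",
    "terrible plot and awful acting",
]
labels = [1, 1, 0, 0]

pipeline.fit(texts, labels)

# predict_proba returns one row per input: [P(negative), P(positive)].
probs = pipeline.predict_proba(["a wonderful film"])
print(probs)
```

Wrapping the vectorizer and classifier in one pipeline means the vocabulary and IDF weights are learned only from the training split, which avoids leaking test data into the features.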
The model uses RandomizedSearchCV to efficiently search the following parameter space:
- ngram_range: (1,1), (1,2), (1,3)
- min_df: 2, 3, 4
- C (regularization): 1, 10, 20
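
A search over this parameter space might be set up as follows. The parameter values come from the list above; the step names, `n_iter`, `cv`, the scoring metric and the repeated toy sentences are assumptions chosen so the example runs quickly, not the project's actual settings.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Keys use the "<step name>__<parameter>" convention for pipeline steps.
param_distributions = {
    "vect__ngram_range": [(1, 1), (1, 2), (1, 3)],
    "vect__min_df": [2, 3, 4],
    "clf__C": [1, 10, 20],
}

search = RandomizedSearchCV(
    pipeline,
    param_distributions=param_distributions,
    n_iter=5,          # sample 5 of the 27 possible combinations (assumed budget)
    cv=2,              # small fold count to suit the toy data
    scoring="accuracy",
    random_state=42,
)

# Toy corpus: each sentence is repeated so every term still meets min_df=4
# inside each training fold.
texts = ["a great movie with great acting"] * 10 + ["a terrible movie with awful acting"] * 10
labels = [1] * 10 + [0] * 10

search.fit(texts, labels)
print(search.best_params_)
print(search.best_score_)
```

Random sampling evaluates only `n_iter` combinations instead of the full grid, trading a small chance of missing the best setting for a shorter search.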
For interactive exploration, check out the Jupyter notebook:
jupyter notebook notebooks/TFIDF_LR_Model.ipynb

The notebook includes:
- Detailed explanations of each step
- Data exploration and visualization
- Model training and evaluation
- Performance metrics and plots
This sentiment classifier can be applied to:
- Movie/Product Reviews: Classify customer sentiment
- Social Media Monitoring: Analyze public opinion
- Customer Feedback: Automatically categorize support tickets
- Market Research: Understand consumer attitudes
The model is trained on the IMDB movie review dataset:
- Source: Kaggle - Word2Vec NLP Tutorial
- Size: 25,000 labeled movie reviews
- Balance: 50% positive, 50% negative reviews
- Format: TSV file with 'sentiment' and 'review' columns
Note: Due to size, data files are not included in the repository. Download them from Kaggle and place in the data/ directory.
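
As a rough sketch of what loading and splitting a file in this format involves, the example below reads a tab-separated table with pandas and makes a 70/30 split. It uses a few made-up rows held in memory instead of the real `data/labeledTrainData.tsv`, and the stratified split is an assumption; `DataLoader` may behave differently.

```python
import io

import pandas as pd
from sklearn.model_selection import train_test_split

# In-memory stand-in for the TSV file (invented rows, not real IMDB reviews).
reviews = [
    (1, "Loved every minute"), (0, "Painfully slow"),
    (1, "A joy to watch"), (0, "Weak script"),
    (1, "Superb cast"), (0, "Forgettable"),
    (1, "Beautifully shot"), (0, "A total letdown"),
    (1, "Instant classic"), (0, "Not worth the ticket"),
]
tsv_text = "sentiment\treview\n" + "\n".join(f"{s}\t{r}" for s, r in reviews)

# sep="\t" tells pandas the columns are tab-separated.
df = pd.read_csv(io.StringIO(tsv_text), sep="\t")

# Fraction of each class, to confirm the data is balanced.
class_balance = df["sentiment"].value_counts(normalize=True).to_dict()
print(class_balance)

# 70/30 split; stratify keeps the class ratio similar in both parts (assumed choice).
X_train, X_test, y_train, y_test = train_test_split(
    df["review"],
    df["sentiment"],
    test_size=0.3,
    random_state=42,
    stratify=df["sentiment"],
)
print(len(X_train), len(X_test))
```

To try this on the real data, replace the `io.StringIO(...)` argument with the path to the downloaded file.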
Run the test suite with:
pytest tests/

This project follows PEP 8 style guidelines. Format and lint code using:
black src/
flake8 src/

- Fork the repository
- Create a feature branch (git checkout -b feature/amazing-feature)
- Commit your changes (git commit -m 'Add amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request
For detailed API documentation, see the docstrings in each module:
- src/sentiment_classifier/data_loader.py - Data loading utilities
- src/sentiment_classifier/model.py - Model implementation
- src/sentiment_classifier/evaluator.py - Evaluation tools
Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.
This project is licensed under the MIT License - see the LICENSE file for details.
- Dataset: Kaggle IMDB Dataset
- Libraries: scikit-learn, NLTK, pandas, matplotlib
- Inspiration: Text classification tutorials and best practices from the ML community
For questions or feedback, please open an issue on GitHub.
- TF-IDF Wikipedia
- Logistic Regression Wikipedia
- Scikit-learn Text Classification Tutorial
- NLTK Book
- ROC Curves Explained
Made with ❤️ by the open-source community