Use tfidf and logistic regression to produce a simple sentiment analysis model on film data.

benstanbury/Text_Classification_Model

Sentiment Classifier 🎬

Python 3.7+ License: MIT scikit-learn

A professional sentiment analysis system using TF-IDF vectorization and Logistic Regression for movie review classification. This project demonstrates best practices in machine learning pipeline development, achieving ~90% accuracy on the IMDB movie review dataset.

🌟 Features

  • Clean Architecture: Modular design with separate components for data loading, model training, and evaluation
  • Automated Hyperparameter Tuning: Uses RandomizedSearchCV for efficient parameter optimization
  • Comprehensive Evaluation: Includes confusion matrix, ROC curves, and detailed classification metrics
  • Easy to Use: Simple API for training and prediction
  • Well Documented: Extensive documentation and examples
  • Production Ready: Includes model serialization and loading capabilities
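As a rough illustration of the serialization feature, the sketch below saves and reloads a fitted scikit-learn pipeline with joblib. This is an assumption about the mechanism: the package's own save/load API is not shown in this README, and the file name and toy reviews are hypothetical.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hypothetical toy data, only large enough to fit a model.
toy_reviews = [
    "A wonderful, brilliant film",
    "Brilliant acting and a wonderful story",
    "A terrible, dreadful film",
    "Dreadful acting and a terrible story",
]
toy_labels = [1, 1, 0, 0]

model = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", LogisticRegression(max_iter=1000)),
]).fit(toy_reviews, toy_labels)

with tempfile.TemporaryDirectory() as tmp_dir:
    # Assumed file name; any path works.
    model_path = os.path.join(tmp_dir, "sentiment_model.joblib")
    joblib.dump(model, model_path)       # serialize the fitted pipeline
    restored = joblib.load(model_path)   # load it back from disk

# The restored pipeline should reproduce the original predictions.
same_predictions = list(restored.predict(toy_reviews)) == list(model.predict(toy_reviews))
```

Persisting the whole pipeline, rather than the classifier alone, keeps the fitted vocabulary and IDF weights together with the model.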

📁 Project Structure

Text_Classification_Model/
├── src/
│   └── sentiment_classifier/
│       ├── __init__.py           # Package initialization
│       ├── data_loader.py        # Data loading and preprocessing
│       ├── model.py              # Sentiment classifier model
│       └── evaluator.py          # Model evaluation and visualization
├── data/
│   ├── labeledTrainData.tsv      # Training dataset (not in git)
│   └── moviedata.csv             # Movie review data (not in git)
├── notebooks/
│   └── TFIDF_LR_Model.ipynb      # Exploratory analysis notebook
├── examples/
│   └── train_and_evaluate.py     # Example usage script
├── tests/                        # Unit tests (to be added)
├── docs/                         # Additional documentation
├── requirements.txt              # Project dependencies
├── setup.py                      # Package installation script
├── .gitignore                    # Git ignore rules
└── README.md                     # This file

🚀 Quick Start

Installation

  1. Clone the repository

    git clone https://github.com/benstanbury/Text_Classification_Model.git
    cd Text_Classification_Model
  2. Create a virtual environment (recommended)

    python -m venv venv
    source venv/bin/activate  # On Windows: venv\Scripts\activate
  3. Install dependencies

    pip install -r requirements.txt
  4. Download NLTK stopwords

    python -c "import nltk; nltk.download('stopwords')"
  5. Install the package (optional, for development)

    pip install -e .

Basic Usage

from sentiment_classifier import SentimentClassifier, DataLoader, ModelEvaluator

# Load and split data
data_loader = DataLoader('data/labeledTrainData.tsv', test_size=0.3, random_state=42)
X_train, X_test, y_train, y_test = data_loader.load_and_split()

# Create and train the model
classifier = SentimentClassifier()
classifier.train(X_train, y_train)

# Evaluate the model
evaluator = ModelEvaluator(classifier)
results = evaluator.evaluate(X_test, y_test)

# Make predictions
predictions = classifier.predict(["This movie was fantastic!", "Terrible waste of time"])
print(predictions)  # [1, 0] (1 = positive, 0 = negative)

Running the Example

python examples/train_and_evaluate.py

📊 Model Performance

On a held-out 30% test split (7,500 reviews) of the IMDB movie review dataset, the model achieves approximately:

  • Accuracy: ~90%
  • Precision: ~90% (balanced across classes)
  • Recall: ~90% (balanced across classes)
  • ROC AUC Score: ~0.96

Sample Results

Classification Report:
              precision    recall  f1-score   support
           0       0.91      0.88      0.90      3785
           1       0.88      0.91      0.90      3715
   avg/total       0.90      0.90      0.90      7500

ROC AUC Score: 0.9607
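A report of this shape can be produced with scikit-learn's metrics module. The sketch below uses hand-made labels and probabilities (hypothetical values, not the project's results) to show which calls are involved; whether `ModelEvaluator` uses exactly these calls is an assumption.

```python
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score

# Hypothetical ground truth and predicted probabilities of the positive class.
y_true = [0, 0, 0, 1, 1, 1]
y_proba = [0.1, 0.4, 0.6, 0.3, 0.8, 0.9]

# Threshold the probabilities at 0.5 to obtain hard class labels.
y_pred = [int(p >= 0.5) for p in y_proba]

report = classification_report(y_true, y_pred, digits=2)  # per-class precision/recall/F1
matrix = confusion_matrix(y_true, y_pred)                 # rows = true class, columns = predicted
auc = roc_auc_score(y_true, y_proba)                      # computed from scores, not hard labels

print(report)
print(matrix)
print(f"ROC AUC Score: {auc:.4f}")
```

Note that ROC AUC is computed from the predicted probabilities, while precision and recall depend on the chosen decision threshold.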

🔧 Model Pipeline

The classifier uses a scikit-learn Pipeline with three stages:

  1. CountVectorizer: Converts text to token count matrix

    • Removes English stopwords
    • Supports n-grams (unigrams, bigrams, trigrams)
    • Filters rare words using min_df parameter
  2. TfidfTransformer: Transforms counts to TF-IDF features

    • Weighs terms by importance
    • Normalizes feature vectors
  3. LogisticRegression: Binary classification

    • L2 regularization to prevent overfitting
    • Optimized with hyperparameter tuning
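The three stages above can be sketched with scikit-learn as follows. This is a minimal illustration, not the code in `model.py`: the function name `build_pipeline`, the step names, the default values, and the use of scikit-learn's built-in English stopword list (rather than NLTK's) are all assumptions.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline


def build_pipeline(ngram_range=(1, 2), min_df=1, C=10.0):
    """Assemble a count -> TF-IDF -> logistic regression pipeline (illustrative defaults)."""
    return Pipeline([
        # Stage 1: token counts, dropping stopwords and terms seen in fewer than min_df documents.
        ("vect", CountVectorizer(stop_words="english", ngram_range=ngram_range, min_df=min_df)),
        # Stage 2: reweight counts by TF-IDF and L2-normalize each document vector.
        ("tfidf", TfidfTransformer()),
        # Stage 3: binary classifier; the L2 penalty is scikit-learn's default.
        ("clf", LogisticRegression(C=C, max_iter=1000)),
    ])


# Hypothetical toy data; min_df=1 is used only because the corpus is tiny.
toy_reviews = [
    "A wonderful, brilliant film",
    "Brilliant acting and a wonderful story",
    "A terrible, dreadful film",
    "Dreadful acting and a terrible story",
]
toy_labels = [1, 1, 0, 0]

pipeline = build_pipeline()
pipeline.fit(toy_reviews, toy_labels)
toy_predictions = pipeline.predict(["a wonderful film", "a dreadful film"])
```

Wrapping the stages in a single `Pipeline` means the vocabulary and IDF weights are learned only from the training data passed to `fit`, which avoids leaking test data into the features.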

Hyperparameter Tuning

The model uses RandomizedSearchCV to efficiently search the following parameter space:

  • ngram_range: (1,1), (1,2), (1,3)
  • min_df: 2, 3, 4
  • C (regularization): 1, 10, 20
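A search over this parameter space might look like the sketch below. The step names (`vect`, `tfidf`, `clf`), `n_iter`, `cv`, and the repeated toy reviews are assumptions made for illustration; the actual settings live in the project's source.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ("vect", CountVectorizer(stop_words="english")),
    ("tfidf", TfidfTransformer()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Parameter space from the list above, addressed as <step name>__<parameter>.
param_distributions = {
    "vect__ngram_range": [(1, 1), (1, 2), (1, 3)],
    "vect__min_df": [2, 3, 4],
    "clf__C": [1, 10, 20],
}

# Hypothetical toy corpus, repeated so every term survives min_df=4 in each CV fold.
toy_reviews = ["wonderful brilliant film"] * 10 + ["terrible dreadful film"] * 10
toy_labels = [1] * 10 + [0] * 10

search = RandomizedSearchCV(
    pipeline,
    param_distributions,
    n_iter=5,          # sample 5 of the 27 possible combinations
    cv=2,              # small fold count, chosen only for the toy data
    scoring="accuracy",
    random_state=42,
)
search.fit(toy_reviews, toy_labels)
print(search.best_params_, search.best_score_)
```

Unlike an exhaustive grid search, `RandomizedSearchCV` evaluates only `n_iter` sampled combinations, which trades completeness for a shorter tuning run.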

📓 Jupyter Notebook

For interactive exploration, check out the Jupyter notebook:

jupyter notebook notebooks/TFIDF_LR_Model.ipynb

The notebook includes:

  • Detailed explanations of each step
  • Data exploration and visualization
  • Model training and evaluation
  • Performance metrics and plots

🎯 Use Cases

This sentiment classifier can be applied to:

  • Movie/Product Reviews: Classify customer sentiment
  • Social Media Monitoring: Analyze public opinion
  • Customer Feedback: Automatically categorize support tickets
  • Market Research: Understand consumer attitudes

📚 Dataset

The model is trained on the IMDB movie review dataset:

  • Source: Kaggle - Word2Vec NLP Tutorial
  • Size: 25,000 labeled movie reviews
  • Balance: 50% positive, 50% negative reviews
  • Format: TSV file with 'sentiment' and 'review' columns

Note: Because of their size, the data files are not included in the repository. Download them from Kaggle and place them in the data/ directory.
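Loading a file in this format and splitting it could be sketched as below. An in-memory sample stands in for `data/labeledTrainData.tsv` so the snippet runs without the download; the sample rows are hypothetical, and the 70/30 split with `random_state=42` mirrors the Basic Usage example rather than the internals of `DataLoader`.

```python
import io

import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the TSV file: a header row plus ten labeled reviews.
rows = ["sentiment\treview"]
rows += [f"1\tA wonderful film, take {i}" for i in range(5)]
rows += [f"0\tA dreadful film, take {i}" for i in range(5)]
sample_tsv = "\n".join(rows)

# For the real dataset, pass the file path instead of the StringIO buffer.
reviews = pd.read_csv(io.StringIO(sample_tsv), sep="\t")

X_train, X_test, y_train, y_test = train_test_split(
    reviews["review"],
    reviews["sentiment"],
    test_size=0.3,     # 30% held out for evaluation
    random_state=42,   # fixed seed for a reproducible split
)
print(len(X_train), len(X_test))
```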

🛠️ Development

Running Tests

pytest tests/  # note: tests/ is currently a placeholder (unit tests to be added)

Code Style

This project follows PEP 8 style guidelines. Format the code with black and lint it with flake8:

black src/
flake8 src/

Adding Features

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

📖 Documentation

For detailed API documentation, see the docstrings in each module:

  • src/sentiment_classifier/data_loader.py - Data loading utilities
  • src/sentiment_classifier/model.py - Model implementation
  • src/sentiment_classifier/evaluator.py - Evaluation tools

🀝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments

  • Dataset: Kaggle IMDB Dataset
  • Libraries: scikit-learn, NLTK, pandas, matplotlib
  • Inspiration: Text classification tutorials and best practices from the ML community

📞 Contact

For questions or feedback, please open an issue on GitHub.

🔗 Resources


Made with ❤️ by the open-source community
