A mini project that takes in user comments (e.g., from a social media post or a product announcement), analyzes them using Natural Language Processing (NLP), and categorizes each comment by its underlying emotion or intent, such as praise, hate, constructive criticism, spam, or questions.


💬 Comment Categorization & Reply Assistant Tool

A minimal, student-friendly NLP project that automatically categorizes user comments into 8 distinct categories and suggests appropriate reply templates.

🎯 Project Objective

Help brands and content creators efficiently manage user feedback by automatically categorizing comments as:

  • Praise - Positive appreciation
  • Support - Encouragement
  • Constructive Criticism - Helpful negative feedback
  • Hate/Abuse - Offensive content
  • Threat - Dangerous/threatening content
  • Emotional - Personal/emotional responses
  • Spam - Irrelevant promotional content
  • Question - User inquiries

πŸ“ Project Structure

comment_categorise/
│
├── README.md                          # This file
├── requirements.txt                   # Python dependencies
│
├── data/
│   ├── generate_dataset.py           # Generate synthetic training data
│   └── comments_dataset.csv          # Generated dataset (1199+ samples)
│
├── models/                            # Trained models (created after training)
│   ├── comment_classifier.pkl
│   └── label_encoder.pkl
│
├── src/                               # Source code
│   ├── __init__.py
│   ├── utils.py                      # Preprocessing utilities
│   ├── train.py                      # Model training script
│   └── predict.py                    # Prediction script
│
├── app.py                             # Streamlit web UI
│
└── outputs/                           # Categorized results

🚀 Quick Start

1. Setup Environment

# Create virtual environment (recommended)
python -m venv venv

# Activate virtual environment (Windows PowerShell)
.\venv\Scripts\Activate.ps1
# On macOS/Linux: source venv/bin/activate

# Install dependencies
pip install -r requirements.txt

2. Generate Dataset

python data/generate_dataset.py

This creates data/comments_dataset.csv with ~1200 labeled comments across 8 categories.
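
The sketch below shows one way such a generator could be written; the template phrases, row counts, and column names (comment, category) are illustrative assumptions, not the actual contents of data/generate_dataset.py.

# Illustrative sketch of a synthetic-comment generator (not the actual script)
import random
import pandas as pd

TEMPLATES = {  # hypothetical seed phrases per category
    "praise": ["Amazing work!", "Loved the animation.", "This is brilliant!"],
    "support": ["Keep going!", "Rooting for you.", "You've got this!"],
    "constructive": ["The pacing felt slow.", "The voiceover could be clearer."],
    "hate_abuse": ["This is trash.", "Quit now."],
    "threat": ["You'll regret posting this."],
    "emotional": ["This made me cry.", "Brings back so many memories."],
    "spam": ["Check out my channel!", "Buy followers at example.com"],
    "question": ["Can you make one on topic X?", "What software did you use?"],
}

rows = [
    {"comment": random.choice(phrases), "category": label}
    for label, phrases in TEMPLATES.items()
    for _ in range(150)  # 8 categories x 150 = 1200 rows
]
pd.DataFrame(rows).to_csv("data/comments_dataset.csv", index=False)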

3. Train Model

python src/train.py --data data/comments_dataset.csv --output models

Expected output:

  • Classification report with accuracy metrics
  • Saved model files in models/ directory

4. Test Predictions

Single comment:

python src/predict.py --text "Amazing work! Loved the animation."

Batch CSV:

python src/predict.py --input data/comments_dataset.csv --output outputs/categorized_comments.csv

5. Launch Web UI (Bonus)

streamlit run app.py

Open browser at http://localhost:8501

📊 Sample Outputs

Single Comment Prediction

Input: "Amazing work! Loved the animation."
Predicted Category: PRAISE
Confidence: 82.29%

Suggested Reply:
"Thank you so much for your kind words! 😊 We're thrilled you enjoyed it. 
Stay tuned for more content!"
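
Reply suggestions like this can come from a simple category-to-template mapping. A minimal sketch (the dictionary name, template wording, and suggest_reply helper are illustrative, not necessarily how the app stores them):

# Illustrative mapping from predicted category to a suggested reply
REPLY_TEMPLATES = {
    "praise": "Thank you so much for your kind words! 😊 We're thrilled you enjoyed it.",
    "support": "Thanks for the encouragement - it means a lot!",
    "constructive": "Thanks for the honest feedback, we'll work on it.",
    "hate_abuse": "We're sorry you feel that way. Please keep the conversation respectful.",
    "threat": "This comment has been flagged for review.",
    "emotional": "Thank you for sharing - we're glad it resonated with you.",
    "spam": "This comment appears to be spam and may be removed.",
    "question": "Great question! We'll follow up with an answer soon.",
}

def suggest_reply(category: str) -> str:
    """Return a canned reply for the predicted category."""
    return REPLY_TEMPLATES.get(category, "Thanks for your comment!")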

Batch Processing Results

The tool generates a CSV file (outputs/categorized_comments.csv) with:

  • Original comment text
  • Predicted category
  • Confidence score

Sample output:

| Comment | Predicted Category | Confidence |
|---------|--------------------|------------|
| Amazing work! Loved the animation. | praise | 0.8229 |
| The animation was okay but the voiceover felt off. | constructive | 0.7863 |
| This is trash, quit now. | hate_abuse | 0.7546 |
| Can you make one on topic X? | question | 0.7836 |

UI Screenshots

  • Streamlit UI - Single Prediction: single comment analysis with confidence scores and reply template
  • Streamlit UI - Batch Upload: batch CSV upload for processing multiple comments
  • Streamlit UI - Analytics: category distribution visualization

Note: To capture screenshots, run the Streamlit app and use your browser's screenshot tool or Snipping Tool (Windows).

🛠️ Technical Stack

| Component | Technology |
|-----------|------------|
| Language | Python 3.8+ |
| ML Framework | scikit-learn |
| NLP | NLTK (tokenization, lemmatization) |
| Feature Extraction | TF-IDF |
| Classifier | Logistic Regression (multinomial) |
| Web UI | Streamlit |
| Visualization | Matplotlib, Seaborn |

📊 Model Architecture

Input Comment
    ↓
Text Cleaning (lowercase, remove URLs/mentions)
    ↓
Tokenization (split into words)
    ↓
Lemmatization (reduce to base forms)
    ↓
TF-IDF Vectorization (convert to numerical features)
    ↓
Logistic Regression Classifier
    ↓
Predicted Category + Confidence Scores
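
In scikit-learn terms, the TF-IDF and classifier stages map onto a two-step pipeline. A minimal sketch (the hyperparameters shown are illustrative, not necessarily those used in src/train.py):

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# TF-IDF turns preprocessed comments into sparse numerical features;
# logistic regression then scores each of the 8 categories.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2), min_df=2)),
    ("clf", LogisticRegression(max_iter=1000)),
])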

πŸ“ Code Explanation

src/utils.py - Preprocessing

def preprocess_comment(text: str) -> str:
    """
    Clean, tokenize, and lemmatize comment text.
    
    Steps:
    1. Lowercase conversion
    2. Remove URLs and @mentions
    3. Remove special characters
    4. Tokenize into words
    5. Lemmatize (running → run)
    6. Join back to string for TF-IDF
    """

src/train.py - Model Training

  • Loads CSV dataset
  • Applies preprocessing
  • Splits into train/test (80/20)
  • Trains TF-IDF + LogisticRegression pipeline
  • Evaluates on test set
  • Saves model and label encoder
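
A condensed sketch of that flow, assuming it is run from the repository root and that the CSV uses comment/category columns (argument parsing omitted; hyperparameters are illustrative):

import joblib
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder
from src.utils import preprocess_comment

df = pd.read_csv("data/comments_dataset.csv")
df["clean"] = df["comment"].apply(preprocess_comment)   # column names are assumptions

encoder = LabelEncoder()
y = encoder.fit_transform(df["category"])

X_train, X_test, y_train, y_test = train_test_split(
    df["clean"], y, test_size=0.2, random_state=42, stratify=y
)

model = Pipeline([("tfidf", TfidfVectorizer()), ("clf", LogisticRegression(max_iter=1000))])
model.fit(X_train, y_train)
print(classification_report(y_test, model.predict(X_test), target_names=encoder.classes_))

joblib.dump(model, "models/comment_classifier.pkl")
joblib.dump(encoder, "models/label_encoder.pkl")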

src/predict.py - Prediction

  • Loads trained model
  • Preprocesses input text
  • Returns predicted category + confidence scores
  • Supports single text or batch CSV
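
A minimal sketch of the single-comment path (the predict_comment helper is illustrative; the model paths match the project layout):

import joblib
from src.utils import preprocess_comment

model = joblib.load("models/comment_classifier.pkl")
encoder = joblib.load("models/label_encoder.pkl")

def predict_comment(text: str):
    """Return (category, confidence) for a single comment."""
    probs = model.predict_proba([preprocess_comment(text)])[0]
    idx = probs.argmax()
    return encoder.inverse_transform([idx])[0], float(probs[idx])

category, confidence = predict_comment("Amazing work! Loved the animation.")
print(f"Predicted Category: {category.upper()} ({confidence:.2%})")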

app.py - Streamlit UI

  • Interactive web interface
  • Single comment analysis with reply templates
  • Batch CSV upload and categorization
  • Visual analytics (bar charts)
  • Download categorized results
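
A stripped-down sketch of such an interface (widget layout and variable names are illustrative, and only the single-comment path is shown):

import joblib
import streamlit as st
from src.utils import preprocess_comment

model = joblib.load("models/comment_classifier.pkl")
encoder = joblib.load("models/label_encoder.pkl")

st.title("Comment Categorization & Reply Assistant")
comment = st.text_area("Paste a comment")

if st.button("Analyze") and comment.strip():
    probs = model.predict_proba([preprocess_comment(comment)])[0]
    category = encoder.inverse_transform([probs.argmax()])[0]
    st.metric("Predicted category", category)
    st.write(f"Confidence: {probs.max():.2%}")
    # a reply-template lookup (see the mapping sketched under Sample Outputs) would go here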

🎓 Learning Outcomes

Students will understand:

  1. Text Preprocessing: Cleaning, tokenization, lemmatization
  2. Feature Engineering: TF-IDF vectorization for text
  3. Classification: Multi-class logistic regression
  4. Model Evaluation: Precision, recall, F1-score
  5. Deployment: Building interactive ML apps with Streamlit

🌟 Bonus Features Implemented

✅ Reply templates for each category
✅ Streamlit web UI with upload
✅ Confidence scores for predictions
✅ Visual analytics (category distribution chart)
✅ Batch processing with CSV export
✅ Well-documented, modular code

📈 Results

Expected Performance (on synthetic data):

  • Overall Accuracy: ~95%+
  • Per-category F1-scores: 0.90-0.99

Note: Real-world performance depends on training data quality and diversity.

🔄 Next Steps / Improvements

  1. Better Dataset: Use real social media comments (Twitter API, Reddit)
  2. Advanced Models: Try BERT/DistilBERT for better accuracy
  3. Imbalanced Data: Add class weights or SMOTE sampling
  4. More Features: Sentiment scores, toxicity detection
  5. Deployment: Host on Streamlit Cloud or Hugging Face Spaces

👤 Author

  • Jay Nagose

📄 License

This project is for educational purposes.


Need help? Check the inline code comments or run the scripts with the --help flag.
