A minimal, student-friendly NLP project that automatically sorts user comments into 8 distinct categories and suggests an appropriate reply template for each.
It helps brands and content creators manage user feedback efficiently by categorizing each comment as one of:
- Praise - Positive appreciation
- Support - Encouragement
- Constructive Criticism - Helpful negative feedback
- Hate/Abuse - Offensive content
- Threat - Dangerous/threatening content
- Emotional - Personal/emotional responses
- Spam - Irrelevant promotional content
- Question - User inquiries
comment_categorise/
│
├── README.md                 # This file
├── requirements.txt          # Python dependencies
│
├── data/
│   ├── generate_dataset.py   # Generate synthetic training data
│   └── comments_dataset.csv  # Generated dataset (1199+ samples)
│
├── models/                   # Trained models (created after training)
│   ├── comment_classifier.pkl
│   └── label_encoder.pkl
│
├── src/                      # Source code
│   ├── __init__.py
│   ├── utils.py              # Preprocessing utilities
│   ├── train.py              # Model training script
│   └── predict.py            # Prediction script
│
├── app.py                    # Streamlit web UI
│
└── outputs/                  # Categorized results
# Create virtual environment (recommended)
python -m venv venv
# Activate virtual environment
.\venv\Scripts\Activate.ps1
# Install dependencies
pip install -r requirements.txt
# Generate the synthetic dataset
python data/generate_dataset.py
This creates data/comments_dataset.csv with ~1200 labeled comments across 8 categories.
# Train the model
python src/train.py --data data/comments_dataset.csv --output models
Expected output:
- Classification report with accuracy metrics
- Saved model files in the models/ directory
Single comment:
python src/predict.py --text "Amazing work! Loved the animation."
Batch CSV:
python src/predict.py --input data/comments_dataset.csv --output outputs/categorized_comments.csv
Web UI:
streamlit run app.py
Then open http://localhost:8501 in your browser.
Input: "Amazing work! Loved the animation."
Predicted Category: PRAISE
Confidence: 82.29%
Suggested Reply:
"Thank you so much for your kind words! We're thrilled you enjoyed it.
Stay tuned for more content!"
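The reply suggestion shown above can be pictured as a simple lookup from predicted category to template text. The following is only a sketch under assumptions: `REPLY_TEMPLATES`, `DEFAULT_REPLY`, and `suggest_reply` are hypothetical names, and only the praise text is taken from the example above; the other templates are placeholders, not the project's actual wording.

```python
from typing import Dict

# Only the "praise" text comes from the example above; the other
# templates and the fallback are hypothetical placeholders.
REPLY_TEMPLATES: Dict[str, str] = {
    "praise": (
        "Thank you so much for your kind words! "
        "We're thrilled you enjoyed it. Stay tuned for more content!"
    ),
    "question": "Great question! We'll follow up with an answer soon.",
    "constructive": "Thanks for the honest feedback. We'll work on it.",
}
DEFAULT_REPLY = "Thanks for taking the time to comment."


def suggest_reply(category: str) -> str:
    """Look up a reply template by predicted category (case-insensitive)."""
    return REPLY_TEMPLATES.get(category.strip().lower(), DEFAULT_REPLY)


print(suggest_reply("PRAISE"))
```

A fallback reply keeps the lookup safe when a category has no template yet.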
The tool generates a CSV file (outputs/categorized_comments.csv) with:
- Original comment text
- Predicted category
- Confidence score
Sample output:
| Comment | Predicted Category | Confidence |
|---|---|---|
| Amazing work! Loved the animation. | praise | 0.8229 |
| The animation was okay but the voiceover felt off. | constructive | 0.7863 |
| This is trash, quit now. | hate_abuse | 0.7546 |
| Can you make one on topic X? | question | 0.7836 |
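As a rough illustration of how rows like those in the table above could be written out, here is a sketch using Python's standard `csv` module. The header names (`comment`, `predicted_category`, `confidence`) and the `write_results` helper are assumptions for illustration; the actual column names in outputs/categorized_comments.csv may differ.

```python
import csv
import io
from typing import Iterable, TextIO, Tuple

Row = Tuple[str, str, float]  # (comment, predicted category, confidence)


def write_results(rows: Iterable[Row], handle: TextIO) -> None:
    """Write categorized comments as CSV with a header row."""
    writer = csv.writer(handle)
    # Hypothetical header names; the real output file may use others.
    writer.writerow(["comment", "predicted_category", "confidence"])
    for comment, category, confidence in rows:
        writer.writerow([comment, category, f"{confidence:.4f}"])


rows = [
    ("Amazing work! Loved the animation.", "praise", 0.8229),
    ("Can you make one on topic X?", "question", 0.7836),
]

# An in-memory buffer stands in for the output file here.
buffer = io.StringIO()
write_results(rows, buffer)
print(buffer.getvalue())
```

To write a real file, open it with `open(path, "w", newline="", encoding="utf-8")` and pass the handle instead of the buffer.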
Single comment analysis with confidence scores and reply template
Batch CSV upload for processing multiple comments
Category distribution visualization
Note: To capture screenshots, run the Streamlit app and use your browser's screenshot tool or Snipping Tool (Windows).
| Component | Technology |
|---|---|
| Language | Python 3.8+ |
| ML Framework | scikit-learn |
| NLP | NLTK (tokenization, lemmatization) |
| Feature Extraction | TF-IDF |
| Classifier | Logistic Regression (multinomial) |
| Web UI | Streamlit |
| Visualization | Matplotlib, Seaborn |
Input Comment
↓
Text Cleaning (lowercase, remove URLs/mentions)
↓
Tokenization (split into words)
↓
Lemmatization (reduce to base forms)
↓
TF-IDF Vectorization (convert to numerical features)
↓
Logistic Regression Classifier
↓
Predicted Category + Confidence Scores
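The first stages of the flow above (cleaning, tokenization, lemmatization) can be sketched with the standard library alone. This is not the project's utils.py: `clean_and_tokenize` and the regular expressions are assumptions, and lemmatization is left as a pluggable callable that defaults to doing nothing, so the sketch runs without NLTK data. With NLTK installed, `WordNetLemmatizer().lemmatize` could be passed in instead.

```python
import re
from typing import Callable, List

# Assumed patterns: http(s)/www links and @mentions, then anything
# that is not a lowercase letter or whitespace.
URL_OR_MENTION = re.compile(r"https?://\S+|www\.\S+|@\w+")
NON_LETTERS = re.compile(r"[^a-z\s]")


def clean_and_tokenize(
    text: str, lemmatize: Callable[[str], str] = lambda word: word
) -> List[str]:
    """Lowercase, strip URLs/@mentions and special characters, split, lemmatize."""
    text = text.lower()
    text = URL_OR_MENTION.sub(" ", text)
    text = NON_LETTERS.sub(" ", text)
    tokens = text.split()
    return [lemmatize(token) for token in tokens]


tokens = clean_and_tokenize("Loved it!! See https://example.com @creator")
print(" ".join(tokens))  # loved it see
```

Joining the tokens back into one string gives the form a TF-IDF vectorizer expects as input.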
def preprocess_comment(text: str) -> str:
    """
    Clean, tokenize, and lemmatize comment text.

    Steps:
    1. Lowercase conversion
    2. Remove URLs and @mentions
    3. Remove special characters
    4. Tokenize into words
    5. Lemmatize (running → run)
    6. Join back to string for TF-IDF
    """

src/train.py:
- Loads CSV dataset
- Applies preprocessing
- Splits into train/test (80/20)
- Trains TF-IDF + LogisticRegression pipeline
- Evaluates on test set
- Saves model and label encoder
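The training steps listed above map onto a few scikit-learn calls. The sketch below is a minimal illustration, not the project's train.py: the tiny in-memory dataset stands in for data/comments_dataset.csv, the hyperparameters are assumptions, and saving the model to disk is omitted.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder

# Toy data standing in for the CSV dataset (illustrative only).
texts = [
    "amazing work loved it", "great video well done", "this is fantastic",
    "brilliant animation", "loved the editing",
    "how did you make this", "can you make one on topic x",
    "what software do you use", "when is the next video",
    "where can i learn this",
]
labels = ["praise"] * 5 + ["question"] * 5

# Encode string labels as integers.
encoder = LabelEncoder()
y = encoder.fit_transform(labels)

# 80/20 split, stratified so both classes appear in the test set.
X_train, X_test, y_train, y_test = train_test_split(
    texts, y, test_size=0.2, stratify=y, random_state=42
)

# TF-IDF features feeding a logistic regression classifier.
model = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
print(classification_report(
    y_test,
    y_pred,
    labels=list(range(len(encoder.classes_))),
    target_names=list(encoder.classes_),
    zero_division=0,
))
```

Wrapping the vectorizer and classifier in one `Pipeline` means the same TF-IDF vocabulary is applied at prediction time without extra bookkeeping.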
src/predict.py:
- Loads trained model
- Preprocesses input text
- Returns predicted category + confidence scores
- Supports single text or batch CSV
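To illustrate how a predicted category and its confidence scores might be derived, the sketch below assumes the classifier exposes one probability per class (as scikit-learn's `predict_proba` does) and picks the highest. `top_prediction` is a hypothetical helper, and the probabilities are made-up inputs, not model output.

```python
from typing import Dict, Sequence, Tuple


def top_prediction(
    classes: Sequence[str], probabilities: Sequence[float]
) -> Tuple[str, float, Dict[str, float]]:
    """Return the most likely category, its confidence, and all scores (high to low)."""
    if not classes or len(classes) != len(probabilities):
        raise ValueError("classes and probabilities must be non-empty and equal in length")
    ranked = sorted(zip(classes, probabilities), key=lambda pair: pair[1], reverse=True)
    scores = dict(ranked)
    best_category, best_score = ranked[0]
    return best_category, best_score, scores


# Made-up probabilities for illustration only.
category, confidence, scores = top_prediction(
    ["praise", "question", "spam"], [0.82, 0.11, 0.07]
)
print(f"Predicted Category: {category.upper()}")
print(f"Confidence: {confidence:.2%}")
```

Returning the full ranked dictionary alongside the top pick lets a caller display runner-up categories as well.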
app.py:
- Interactive web interface
- Single comment analysis with reply templates
- Batch CSV upload and categorization
- Visual analytics (bar charts)
- Download categorized results
Students will understand:
- Text Preprocessing: Cleaning, tokenization, lemmatization
- Feature Engineering: TF-IDF vectorization for text
- Classification: Multi-class logistic regression
- Model Evaluation: Precision, recall, F1-score
- Deployment: Building interactive ML apps with Streamlit
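For the evaluation metrics named above, a small worked example may help. The sketch computes precision, recall, and F1 for one class from scratch; the labels are made up, and `precision_recall_f1` is a hypothetical helper rather than part of the project (scikit-learn's `classification_report` reports the same quantities per class).

```python
from typing import Sequence, Tuple


def precision_recall_f1(
    y_true: Sequence[str], y_pred: Sequence[str], label: str
) -> Tuple[float, float, float]:
    """Compute precision, recall, and F1 for a single class label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Made-up labels: one "praise" comment is misclassified as "spam".
y_true = ["praise", "praise", "spam", "question", "spam"]
y_pred = ["praise", "spam", "spam", "question", "spam"]

precision, recall, f1 = precision_recall_f1(y_true, y_pred, "spam")
print(f"spam: precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
# spam: precision=0.67 recall=1.00 f1=0.80
```

Here "spam" has two true positives, one false positive, and no false negatives, so recall is perfect while precision drops to 2/3.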
- Reply templates for each category
- Streamlit web UI with upload
- Confidence scores for predictions
- Visual analytics (category distribution chart)
- Batch processing with CSV export
- Well-documented, modular code
Expected Performance (on synthetic data):
- Overall Accuracy: roughly 95% or higher
- Per-category F1-scores: 0.90-0.99
Note: Real-world performance depends on training data quality and diversity.
- Better Dataset: Use real social media comments (Twitter API, Reddit)
- Advanced Models: Try BERT/DistilBERT for better accuracy
- Imbalanced Data: Add class weights or SMOTE sampling
- More Features: Sentiment scores, toxicity detection
- Deployment: Host on Streamlit Cloud or Hugging Face Spaces
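As one concrete take on the class-weight idea above, the sketch below computes "balanced" weights as n_samples / (n_classes * class_count), the same heuristic scikit-learn applies for `class_weight="balanced"`. The label counts are made up, and `balanced_class_weights` is a hypothetical helper.

```python
from collections import Counter
from typing import Dict, Sequence


def balanced_class_weights(labels: Sequence[str]) -> Dict[str, float]:
    """Weight each class inversely to its frequency."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {
        label: n_samples / (n_classes * count)
        for label, count in counts.items()
    }


# Made-up imbalanced label distribution.
labels = ["praise"] * 6 + ["spam"] * 3 + ["threat"] * 1
weights = balanced_class_weights(labels)
for label, weight in weights.items():
    print(f"{label}: {weight:.3f}")
```

Rare classes such as "threat" receive the largest weight, so their misclassifications cost more during training. A dictionary like this can be passed as `class_weight` to scikit-learn's `LogisticRegression`, or `class_weight="balanced"` can be used directly.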
- Jay Nagose
This project is for educational purposes.
Need help? Check the inline code comments or run any script with the --help flag.