Mah-En/Sentiment-Explorer-Persian-Comment-Classification

A sentiment classification pipeline for Persian-language product reviews using Word2Vec embeddings and logistic regression. The project includes text preprocessing, embedding, model training, and prediction on unseen test data.


Persian Sentiment Classification: Sentiment Explorer

A comprehensive project aimed at building a sentiment classification pipeline for Persian-language product reviews using modern natural language processing techniques and traditional machine learning algorithms.


Overview

This project was designed to process and classify Persian comments by sentiment orientation (recommended, not_recommended, or no_idea). The task involved several key stages:

  • Preprocessing Persian text data
  • Training a Word2Vec model to create word embeddings
  • Constructing sentence embeddings by averaging word vectors
  • Using a Logistic Regression classifier to predict sentiment
  • Generating structured outputs for evaluation

This report documents the design rationale, development process, and insights gathered throughout the implementation.


1. Sentiment Label Distribution

This pie chart illustrates the balance among sentiment categories in the dataset. All three classes—recommended, not_recommended, and no_idea—are evenly distributed (≈33.3% each), ensuring no significant class imbalance.


2. Review Lengths Before and After Preprocessing

The left histogram shows raw review word counts before cleaning, while the right shows token counts after preprocessing. Most reviews are short (<50 words/tokens), and preprocessing slightly reduces length without altering the overall distribution.


3. Boxplot of Review Lengths

This boxplot compares review lengths before and after preprocessing. Outliers (>400 words/tokens) remain visible, but the median and interquartile range are consistent, confirming preprocessing did not distort content structure.


Dataset Description

Two datasets were provided:

  • train.csv: Contains Persian comments with a column called recommendation_status indicating the sentiment label.
  • test.csv: Contains new comments with no labels; the goal is to predict their sentiment class.

The datasets had already undergone initial cleaning. There were no missing values or noisy entries requiring removal. However, numerical encoding and detailed analysis of the text structure were necessary.


Data Preprocessing

Preprocessing Persian-language text presents unique challenges: the same letter can appear in both Arabic and Persian Unicode forms, and digits and punctuation exist in Persian-specific variants.

Key Steps Implemented:

  1. Text Normalization: Converted different forms of Persian letters to a standard format (e.g., Arabic "ي" to Persian "ی").
  2. Tokenization: Split sentences into words.
  3. Digit Removal: Removed both Persian (e.g., ۱۲۳) and Latin (123) digits.
  4. Punctuation Removal: Eliminated symbols such as !, ؟, ،, etc.
  5. Stopword Removal: Removed frequently occurring, semantically light words (e.g., "که", "از", "برای").
  6. Stemming: Reduced words to their root forms using Persian language rules.
  7. Whitespace Cleaning: Removed excessive spaces and line breaks.

All steps were encapsulated in a single preprocessing function that could be applied to any Persian sentence.
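The steps above can be sketched in a single function. The version below is a simplified, standard-library-only illustration: the character map and stopword list are abbreviated stand-ins for the real ones, and stemming (step 6) is omitted since it requires language-specific rules, e.g. from the Hazm toolkit.

```python
import re

# Illustrative normalization map: Arabic letter variants -> Persian forms
NORMALIZE_MAP = str.maketrans({"ي": "ی", "ك": "ک", "ة": "ه"})

# Abbreviated stopword list (the real list would be much longer)
STOPWORDS = {"که", "از", "برای", "و", "به", "در"}

DIGITS_RE = re.compile(r"[0-9۰-۹]+")             # Latin and Persian digits
PUNCT_RE = re.compile(r"[!؟،؛«»\.,:;\(\)\-_]+")  # common punctuation marks

def preprocess(text: str) -> list[str]:
    """Normalize, strip digits/punctuation, tokenize, and drop stopwords."""
    text = text.translate(NORMALIZE_MAP)              # 1. normalization
    text = DIGITS_RE.sub(" ", text)                   # 3. digit removal
    text = PUNCT_RE.sub(" ", text)                    # 4. punctuation removal
    tokens = text.split()                             # 2. tokenization + 7. whitespace cleanup
    return [t for t in tokens if t not in STOPWORDS]  # 5. stopword removal
```

Splitting on whitespace after substituting spaces for digits and punctuation handles step 7 for free, since `str.split()` collapses runs of spaces and line breaks.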


4. Top 20 Most Frequent Words

This bar chart highlights the most common Persian words after preprocessing. Frequent tokens such as “و” (and) and “می” (verb prefix) dominate; these counts help characterize the vocabulary distribution and identify potential additional stopwords.


5. Co-occurrence Heatmap of Top 10 Words

This heatmap shows word co-occurrence frequencies among the top 10 tokens. Darker colors indicate higher co-occurrence, helping visualize contextual relationships in the corpus for embedding quality checks.
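One common way to build such a matrix is to count, for every pair of top tokens, how often both appear in the same preprocessed review. A standard-library sketch, with a toy tokenized corpus standing in for the real one:

```python
from collections import Counter
from itertools import combinations

def cooccurrence_counts(token_lists, top_words):
    """Count how often each pair of top_words occurs in the same document."""
    top = set(top_words)
    pair_counts = Counter()
    for tokens in token_lists:
        present = sorted(top.intersection(tokens))  # sort for a canonical pair order
        for a, b in combinations(present, 2):
            pair_counts[(a, b)] += 1
    return pair_counts

# Toy corpus of tokenized reviews (illustrative only)
docs = [["خوب", "قیمت", "کیفیت"], ["خوب", "قیمت"], ["کیفیت", "بد"]]
counts = cooccurrence_counts(docs, ["خوب", "قیمت", "کیفیت"])
```

The resulting `Counter` maps each word pair to its document-level co-occurrence count, which can be reshaped into a 10×10 matrix for the heatmap.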


Word Embedding with Word2Vec

To represent words numerically, a custom Word2Vec model was trained on the preprocessed comments. Word2Vec enabled the project to capture semantic relationships between words by mapping them into a continuous vector space.

Each sentence was converted into a fixed-size vector by averaging its word vectors — a technique known as sentence embedding by mean pooling.
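Mean pooling itself is only a few lines of code. The sketch below assumes the trained model has been reduced to a plain word-to-vector mapping (with gensim, `model.wv` plays this role) and that out-of-vocabulary tokens are simply skipped:

```python
def sentence_vector(tokens, word_vectors, dim):
    """Average the vectors of in-vocabulary tokens; zero vector if none match."""
    vecs = [word_vectors[t] for t in tokens if t in word_vectors]
    if not vecs:
        return [0.0] * dim  # fallback for fully out-of-vocabulary sentences
    # Average component-wise across all word vectors
    return [sum(components) / len(vecs) for components in zip(*vecs)]

# Toy 3-dimensional embeddings (illustrative values, not real trained vectors)
wv = {"خوب": [1.0, 0.0, 2.0], "قیمت": [3.0, 2.0, 0.0]}
vec = sentence_vector(["خوب", "قیمت", "ناشناخته"], wv, dim=3)  # -> [2.0, 1.0, 1.0]
```

The zero-vector fallback is an assumption for sentences with no known words; the fixed output size makes the result directly usable as classifier input.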


6. PCA Projection of Word Embeddings

The 2D PCA plot projects selected Word2Vec embeddings to reveal semantic clustering. Words with similar meanings or contexts tend to group together, validating the embedding space structure.


7. Sentence Embedding Norm Distribution

The histogram displays the norms of the sentence vectors. Most sentence embeddings cluster between 4 and 6, suggesting consistent embedding magnitudes across the corpus.


8. Confusion Matrix

This matrix evaluates model performance on the validation set. Strong diagonal dominance indicates accurate classification for all three categories, though some misclassifications between no_idea and the other labels remain.


Sentiment Classification Model

Model Used:

  • Logistic Regression — a simple yet effective linear classifier, applied here in multi-class mode to cover the three sentiment labels.

Data Split:

  • 80% of the training set was used to train the model.
  • 20% was used as a validation set to assess model performance.

Evaluation Metric:

  • Accuracy — the percentage of correct predictions on the validation set.
  • The model achieved roughly 67% accuracy on held-out data, comfortably above the minimum acceptable threshold of 50%.
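For illustration, the split and the accuracy metric can be sketched without scikit-learn (the project itself would more likely use train_test_split and accuracy_score); the seed value here is an arbitrary assumption for reproducibility:

```python
import random

def train_val_split(samples, labels, val_ratio=0.2, seed=42):
    """Shuffle indices once, then split 80/20 into train and validation sets."""
    idx = list(range(len(samples)))
    random.Random(seed).shuffle(idx)
    cut = int(len(idx) * (1 - val_ratio))
    train_idx, val_idx = idx[:cut], idx[cut:]
    return ([samples[i] for i in train_idx], [labels[i] for i in train_idx],
            [samples[i] for i in val_idx], [labels[i] for i in val_idx])

def accuracy(y_true, y_pred):
    """Fraction of predictions matching the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Toy data: 10 sentence vectors (here just ids) with alternating labels
X = list(range(10))
y = ["recommended" if i % 2 else "not_recommended" for i in X]
X_tr, y_tr, X_va, y_va = train_val_split(X, y)
```

Shuffling before splitting matters here: without it, any ordering in the training file (e.g. labels grouped together) would leak into the validation set.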

9. Logistic Regression Coefficients

The line plot visualizes feature coefficients for each sentiment class. Peaks and troughs represent influential features driving classification decisions across classes.


Prediction Function

A general-purpose function was developed to classify new comments using the trained pipeline. It:

  1. Preprocesses the comment text.
  2. Converts it to a sentence vector.
  3. Feeds it to the trained classifier.
  4. Returns a label:
    • recommended
    • not_recommended
    • no_idea (fallback for ambiguous input)
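Glued together, those four steps look roughly like the sketch below. The preprocess callable, toy vectors, and classifier stub are placeholders for the trained components, and treating a fully out-of-vocabulary comment as no_idea is an assumption about how the fallback is triggered:

```python
def predict_comment(text, preprocess, word_vectors, classifier):
    """Classify a raw Persian comment with the trained pipeline components."""
    tokens = preprocess(text)                            # 1. preprocess the text
    in_vocab = [t for t in tokens if t in word_vectors]
    if not in_vocab:
        return "no_idea"                                 # 4. fallback for ambiguous input
    vecs = [word_vectors[t] for t in in_vocab]
    sentence = [sum(c) / len(vecs) for c in zip(*vecs)]  # 2. mean-pooled sentence vector
    return classifier(sentence)                          # 3. trained classifier

# Toy stand-ins for the trained components (illustrative only)
wv = {"عالی": [1.0, 1.0]}
clf = lambda v: "recommended" if sum(v) > 0 else "not_recommended"
label = predict_comment("عالی بود!", lambda s: s.replace("!", "").split(), wv, clf)
```

Passing the preprocessing function and model objects as arguments keeps the function general-purpose: the same code serves validation, test inference, and ad-hoc queries.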

10. ROC Curves for Multi-Class Sentiment Classification

The ROC curves demonstrate the classifier’s ability to distinguish between classes.

  • AUC values:

    • not_recommended: 0.87
    • recommended: 0.88
    • no_idea: 0.75

These values indicate strong separability for the first two classes, while no_idea is slightly harder to differentiate.

Test Set Inference & Submission

Predictions were made on the test.csv dataset. Each entry was classified and stored in a new DataFrame with a single column, class:

class
not_recommended
recommended
...

This DataFrame was saved as a CSV and archived as result.zip for final evaluation and submission.
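Producing such an archive needs only the standard library; a minimal sketch, assuming a single class column and the file names result.csv / result.zip:

```python
import csv
import zipfile

def save_submission(predictions, csv_path="result.csv", zip_path="result.zip"):
    """Write predicted labels to a one-column CSV and archive it as a zip."""
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["class"])       # header matching the structure above
        for label in predictions:
            writer.writerow([label])
    # Compress the CSV into the submission archive
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.write(csv_path)

save_submission(["not_recommended", "recommended"])
```

Writing with an explicit UTF-8 encoding matters for Persian labels or any Persian text columns, since the default encoding is platform-dependent.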


11. t-SNE Visualization of Sentence Embeddings

This scatter plot projects high-dimensional sentence embeddings into 2D space using t-SNE. While some clustering is visible, overlap among classes reflects semantic ambiguity between certain sentiments.


12. Model Accuracy on Train vs Test Sets

This bar chart compares accuracy on the training and test sets. The nearly identical values (~67%) show that the model generalizes well without significant overfitting.


Summary of Achievements

  • ✅ Successfully cleaned and normalized Persian-language comments
  • ✅ Trained Word2Vec model to capture semantic similarity
  • ✅ Created an end-to-end sentiment classification pipeline
  • ✅ Reached high classification accuracy on validation data
  • ✅ Generated standardized output for evaluation

Future Work

While the current pipeline performs reliably, the following enhancements could yield stronger results:

  • 📌 Switch to transformer-based models such as ParsBERT or multilingual BERT for contextual embeddings
  • 📌 Use FastText for better handling of out-of-vocabulary Persian words
  • 📌 Add an explainability layer for model predictions (e.g., LIME, SHAP)
  • 📌 Tune hyperparameters using cross-validation or grid search
  • 📌 Support a finer-grained sentiment scheme (positive, neutral, negative)

Author

This project was developed as part of a structured machine learning assignment focused on natural language processing with a concentration in Persian text mining.


References

  • Gensim Word2Vec Documentation
  • scikit-learn API Reference
  • Hazm (Python toolkit for Persian NLP)
  • fastText by Facebook AI
  • ParsBERT: A Transformer-based Model for Persian Language Understanding
