A comprehensive project aimed at building a sentiment classification pipeline for Persian-language product reviews using modern natural language processing techniques and traditional machine learning algorithms.
This project was designed to process and classify Persian comments by sentiment orientation (recommended, not recommended, or no idea). The task involved several key stages:
- Preprocessing Persian text data
- Training a Word2Vec model to create word embeddings
- Constructing sentence embeddings by averaging word vectors
- Using a Logistic Regression classifier to predict sentiment
- Generating structured outputs for evaluation
This report documents the design rationale, development process, and insights gathered throughout the implementation.
This pie chart illustrates the balance among sentiment categories in the dataset. All three classes—recommended, not_recommended, and no_idea—are evenly distributed (≈33.3% each), ensuring no significant class imbalance.
The left histogram shows raw review word counts before cleaning, while the right shows token counts after preprocessing. Most reviews are short (<50 words/tokens), and preprocessing slightly reduces length without altering overall distribution.
This boxplot compares review lengths before and after preprocessing. Outliers (>400 words/tokens) remain visible, but the median and interquartile range show consistency, confirming preprocessing did not distort content structure.
Two datasets were provided:
- train.csv: Contains Persian comments with a column called recommendation_status indicating the sentiment label.
- test.csv: Contains new comments with no labels; the goal is to predict their sentiment class.
The datasets had already undergone initial cleaning. There were no missing values or noisy entries requiring removal. However, numerical encoding and detailed analysis of the text structure were necessary.
Preprocessing Persian-language text presents unique challenges:
- Text Normalization: Converted different forms of Persian letters to a standard format (e.g., Arabic "ي" to Persian "ی").
- Tokenization: Split sentences into words.
- Digit Removal: Removed both Persian (e.g., ۱۲۳) and Latin (123) digits.
- Punctuation Removal: Eliminated symbols such as !, ؟, and ،.
- Stopword Removal: Removed frequently occurring, semantically light words (e.g., "که" meaning "that", "از" meaning "from", "برای" meaning "for").
- Stemming: Reduced words to their root forms using Persian language rules.
- Whitespace Cleaning: Removed excessive spaces and line breaks.
All steps were encapsulated in a single preprocessing function that could be applied to any Persian sentence.
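A minimal sketch of such a function, assuming the Hazm toolkit listed in the references; the function name, regexes, and exact ordering are illustrative rather than the project's actual code:

```python
import re

from hazm import Normalizer, Stemmer, word_tokenize, stopwords_list

normalizer = Normalizer()           # unifies letter forms, e.g. Arabic "ي" -> Persian "ی"
stemmer = Stemmer()                 # rule-based Persian stemmer
stop_words = set(stopwords_list())  # Hazm's built-in Persian stopword list


def preprocess(text: str) -> list[str]:
    """Illustrative Persian preprocessing pipeline (not the project's exact code)."""
    text = normalizer.normalize(text)                   # text normalization
    text = re.sub(r"[0-9\u06F0-\u06F9]+", " ", text)    # remove Latin and Persian digits
    text = re.sub(r'[!?؟،؛.,:;()"«»]+', " ", text)      # remove punctuation
    tokens = word_tokenize(text)                        # tokenization
    tokens = [t for t in tokens if t not in stop_words] # stopword removal
    return [stemmer.stem(t) for t in tokens]            # stemming to root forms
```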
This bar chart highlights the most common Persian words after preprocessing. Frequent tokens like “و” (and) and “می” (verb prefix) dominate, which are useful for understanding vocabulary distribution and identifying potential stopwords.
This heatmap shows word co-occurrence frequencies among the top 10 tokens. Darker colors indicate higher co-occurrence, helping visualize contextual relationships in the corpus for embedding quality checks.
To represent words numerically, a custom Word2Vec model was trained on the preprocessed comments. Word2Vec enabled the project to capture semantic relationships between words by mapping them into a continuous vector space.
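The sketch below shows how such a model might be trained with Gensim; the hyperparameters are assumptions for illustration, not the project's reported settings:

```python
from gensim.models import Word2Vec

# `tokenized_comments` is assumed to be a list of token lists
# produced by the preprocessing function above.
w2v = Word2Vec(
    sentences=tokenized_comments,
    vector_size=100,  # dimensionality of the word vectors (assumed)
    window=5,         # context window size (assumed)
    min_count=2,      # ignore tokens seen fewer than 2 times (assumed)
    workers=4,
    seed=42,
)
```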
Each sentence was converted into a fixed-size vector by averaging its word vectors — a technique known as sentence embedding by mean pooling.
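A minimal mean-pooling sketch, assuming the Gensim model from the previous step; the zero-vector fallback for sentences with no in-vocabulary tokens is an assumption:

```python
import numpy as np


def sentence_vector(tokens: list[str], model) -> np.ndarray:
    """Average the Word2Vec vectors of a sentence's in-vocabulary tokens."""
    vecs = [model.wv[t] for t in tokens if t in model.wv]
    if not vecs:
        # Assumed fallback: no known tokens -> zero vector
        return np.zeros(model.vector_size)
    return np.mean(vecs, axis=0)
```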
The 2D PCA plot projects selected Word2Vec embeddings to reveal semantic clustering. Words with similar meanings or contexts tend to group together, validating the embedding space structure.
The histogram displays the norms of sentence vectors. Most sentence embeddings cluster between 4 and 6, which is typical for normalized vector spaces and suggests consistent embedding magnitudes.
This matrix evaluates model performance on the validation set. Strong diagonal dominance indicates accurate classification for all three categories, though some misclassifications between no_idea and the other labels remain.
- Logistic Regression: a simple yet effective linear classifier, applied here in its multi-class form to cover the three sentiment labels.
- 80% of the training set was used to train the model.
- 20% was used as a validation set to assess model performance.
- Accuracy — the percentage of correct predictions on the validation set.
- The model achieved accuracy well above the minimum acceptable threshold of 50%.
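A hedged sketch of this setup with scikit-learn, assuming X holds the stacked sentence vectors and y the labels (both names are illustrative):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# 80/20 train/validation split (stratification is an assumption)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)

print("Validation accuracy:", accuracy_score(y_val, clf.predict(X_val)))
```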
The line plot visualizes feature coefficients for each sentiment class. Peaks and troughs represent influential features driving classification decisions across classes.
A general-purpose function was developed to classify new comments using the trained pipeline. It:
- Preprocesses the comment text.
- Converts it to a sentence vector.
- Feeds it to the trained classifier.
- Returns a label: recommended, not_recommended, or no_idea (a fallback for ambiguous input).
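An illustrative version of this helper, reusing the sketches above; the ambiguity check behind the no_idea fallback is an assumption:

```python
def classify_comment(text: str) -> str:
    tokens = preprocess(text)                   # step 1: preprocess the comment
    if not any(t in w2v.wv for t in tokens):    # assumed ambiguity check:
        return "no_idea"                        # no in-vocabulary tokens
    vec = sentence_vector(tokens, w2v)          # step 2: sentence vector
    return clf.predict(vec.reshape(1, -1))[0]   # step 3: predicted label
```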
The ROC curves demonstrate the classifier’s ability to distinguish between classes.
AUC values:

- not_recommended: 0.87
- recommended: 0.88
- no_idea: 0.75

These values indicate strong separability for the first two classes, while no_idea is slightly harder to differentiate.
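One way to compute such per-class AUC values is a one-vs-rest calculation with scikit-learn, sketched below using the variable names from the earlier training sketch:

```python
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import label_binarize

probs = clf.predict_proba(X_val)                          # class probabilities
y_val_bin = label_binarize(y_val, classes=clf.classes_)   # one-vs-rest targets
for i, label in enumerate(clf.classes_):
    print(label, round(roc_auc_score(y_val_bin[:, i], probs[:, i]), 2))
```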
Predictions were made on the test.csv dataset. Each entry was classified and stored in a new DataFrame with the following structure:
| class |
|---|
| not_recommended |
| recommended |
| ... |
This DataFrame was saved as a CSV and archived as result.zip for final evaluation and submission.
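A short sketch of this export step; the file names follow the report, while the DataFrame construction and the predicted_labels variable are illustrative:

```python
import zipfile

import pandas as pd

# `predicted_labels` is assumed to hold the test-set predictions
result = pd.DataFrame({"class": predicted_labels})
result.to_csv("result.csv", index=False)

# Archive the CSV as result.zip for submission
with zipfile.ZipFile("result.zip", "w") as zf:
    zf.write("result.csv")
```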
This scatter plot projects high-dimensional sentence embeddings into 2D space using t-SNE. While some clustering is visible, overlap among classes reflects semantic ambiguity between certain sentiments.
This bar chart compares accuracy on the training and test sets. The nearly identical values (~67%) show that the model generalizes well without significant overfitting.
- ✅ Successfully cleaned and normalized Persian-language comments
- ✅ Trained Word2Vec model to capture semantic similarity
- ✅ Created an end-to-end sentiment classification pipeline
- ✅ Reached classification accuracy well above the 50% acceptance threshold on validation data
- ✅ Generated standardized output for evaluation
While the current pipeline performs reliably, the following enhancements could yield stronger results:
- 📌 Switch to transformer-based models such as ParsBERT or multilingual BERT for contextual embeddings
- 📌 Use of FastText for better handling of out-of-vocabulary words in Persian
- 📌 Add explainability layer for model predictions (e.g., LIME, SHAP)
- 📌 Hyperparameter tuning using cross-validation or grid search
- 📌 Multi-class sentiment support (positive, neutral, negative) for finer-grained analysis
This project was developed as part of a structured machine learning assignment focused on natural language processing with a concentration in Persian text mining.
- Gensim Word2Vec Documentation
- scikit-learn API Reference
- Hazm (Python toolkit for Persian NLP)
- fastText by Facebook AI
- ParsBERT: A Transformer-based Model for Persian Language Understanding