A comprehensive project aimed at building a sentiment classification pipeline for Persian-language product reviews using modern natural language processing techniques and traditional machine learning algorithms.
This project was designed to process and classify Persian comments by sentiment orientation (recommended, not recommended, or no idea). The task involved several key stages:
- Preprocessing Persian text data
- Training a Word2Vec model to create word embeddings
- Constructing sentence embeddings by averaging word vectors
- Using a Logistic Regression classifier to predict sentiment
- Generating structured outputs for evaluation
This report documents the design rationale, development process, and insights gathered throughout the implementation.
This pie chart illustrates the balance among sentiment categories in the dataset. All three classes—recommended, not_recommended, and no_idea—are evenly distributed (≈33.3% each), ensuring no significant class imbalance.
The left histogram shows raw review word counts before cleaning, while the right shows token counts after preprocessing. Most reviews are short (<50 words/tokens), and preprocessing slightly reduces length without altering overall distribution.
This boxplot compares review lengths before and after preprocessing. Outliers (>400 words/tokens) remain visible, but the median and interquartile range show consistency, confirming preprocessing did not distort content structure.
Two datasets were provided:
- train.csv: Contains Persian comments with a column called recommendation_status indicating the sentiment label.
- test.csv: Contains new comments with no labels; the goal is to predict their sentiment class.
The datasets had already undergone initial cleaning. There were no missing values or noisy entries requiring removal. However, numerical encoding and detailed analysis of the text structure were necessary.
Preprocessing Persian-language text presents unique challenges:
- Text Normalization: Converted different forms of Persian letters to a standard format (e.g., Arabic "ي" to Persian "ی").
- Tokenization: Split sentences into words.
- Digit Removal: Removed both Persian (e.g., ۱۲۳) and Latin (123) digits.
- Punctuation Removal: Eliminated symbols such as !, ؟, and ،.
- Stopword Removal: Removed frequently occurring, semantically light words (e.g., "که" meaning "that", "از" meaning "from", "برای" meaning "for").
- Stemming: Reduced words to their root forms using Persian language rules.
- Whitespace Cleaning: Removed excessive spaces and line breaks.
All steps were encapsulated in a single preprocessing function that could be applied to any Persian sentence.
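A minimal sketch of such a function, assuming the Hazm toolkit listed in the references; the function name, regexes, and exact ordering are illustrative rather than the project's actual code:

```python
import re

from hazm import Normalizer, Stemmer, word_tokenize, stopwords_list

normalizer = Normalizer()           # unifies letter forms, e.g. Arabic "ي" -> Persian "ی"
stemmer = Stemmer()                 # rule-based Persian stemmer
stop_words = set(stopwords_list())  # Hazm's built-in Persian stopword list


def preprocess(text: str) -> list[str]:
    """Illustrative Persian preprocessing pipeline (not the project's exact code)."""
    text = normalizer.normalize(text)                   # text normalization
    text = re.sub(r"[0-9\u06F0-\u06F9]+", " ", text)    # remove Latin and Persian digits
    text = re.sub(r'[!?؟،؛.,:;()"«»]+', " ", text)      # remove punctuation
    tokens = word_tokenize(text)                        # tokenization
    tokens = [t for t in tokens if t not in stop_words] # stopword removal
    return [stemmer.stem(t) for t in tokens]            # stemming to root forms
```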
This bar chart highlights the most common Persian words after preprocessing. Frequent tokens like “و” (and) and “می” (verb prefix) dominate, which are useful for understanding vocabulary distribution and identifying potential stopwords.
This heatmap shows word co-occurrence frequencies among the top 10 tokens. Darker colors indicate higher co-occurrence, helping visualize contextual relationships in the corpus for embedding quality checks.
To represent words numerically, a custom Word2Vec model was trained on the preprocessed comments. Word2Vec enabled the project to capture semantic relationships between words by mapping them into a continuous vector space.
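The sketch below shows how such a model might be trained with Gensim; the hyperparameters are assumptions for illustration, not the project's reported settings:

```python
from gensim.models import Word2Vec

# `tokenized_comments` is assumed to be a list of token lists
# produced by the preprocessing function above.
w2v = Word2Vec(
    sentences=tokenized_comments,
    vector_size=100,  # dimensionality of the word vectors (assumed)
    window=5,         # context window size (assumed)
    min_count=2,      # ignore tokens seen fewer than 2 times (assumed)
    workers=4,
    seed=42,
)
```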
Each sentence was converted into a fixed-size vector by averaging its word vectors — a technique known as sentence embedding by mean pooling.
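A minimal mean-pooling sketch, assuming the Gensim model from the previous step; the zero-vector fallback for sentences with no in-vocabulary tokens is an assumption:

```python
import numpy as np


def sentence_vector(tokens: list[str], model) -> np.ndarray:
    """Average the Word2Vec vectors of a sentence's in-vocabulary tokens."""
    vecs = [model.wv[t] for t in tokens if t in model.wv]
    if not vecs:
        # Assumed fallback: no known tokens -> zero vector
        return np.zeros(model.vector_size)
    return np.mean(vecs, axis=0)
```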
The 2D PCA plot projects selected Word2Vec embeddings to reveal semantic clustering. Words with similar meanings or contexts tend to group together, validating the embedding space structure.
The histogram displays the norms of sentence vectors. Most sentence embeddings cluster between 4 and 6, which is typical for normalized vector spaces and suggests consistent embedding magnitudes.
This matrix evaluates model performance on the validation set. Strong diagonal dominance indicates accurate classification for all three categories, though some misclassifications between no_idea and the other labels remain.
- Logistic Regression: a simple yet effective linear classifier, applied here in its multi-class form to cover the three sentiment labels.
- 80% of the training set was used to train the model.
- 20% was used as a validation set to assess model performance.
- Accuracy — the percentage of correct predictions on the validation set.
- The model achieved accuracy well above the minimum acceptable threshold of 50%.
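A hedged sketch of this setup with scikit-learn, assuming X holds the stacked sentence vectors and y the labels (both names are illustrative):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# 80/20 train/validation split (stratification is an assumption)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)

print("Validation accuracy:", accuracy_score(y_val, clf.predict(X_val)))
```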
The line plot visualizes feature coefficients for each sentiment class. Peaks and troughs represent influential features driving classification decisions across classes.
A general-purpose function was developed to classify new comments using the trained pipeline. It:
- Preprocesses the comment text.
- Converts it to a sentence vector.
- Feeds it to the trained classifier.
- Returns a label: recommended, not_recommended, or no_idea (a fallback for ambiguous input).
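An illustrative version of this helper, reusing the sketches above; the ambiguity check behind the no_idea fallback is an assumption:

```python
def classify_comment(text: str) -> str:
    tokens = preprocess(text)                   # step 1: preprocess the comment
    if not any(t in w2v.wv for t in tokens):    # assumed ambiguity check:
        return "no_idea"                        # no in-vocabulary tokens
    vec = sentence_vector(tokens, w2v)          # step 2: sentence vector
    return clf.predict(vec.reshape(1, -1))[0]   # step 3: predicted label
```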
The ROC curves demonstrate the classifier’s ability to distinguish between classes.
AUC values:

- not_recommended: 0.87
- recommended: 0.88
- no_idea: 0.75

These values indicate strong separability for the first two classes, while no_idea is slightly harder to differentiate.
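One way to compute such per-class AUC values is a one-vs-rest calculation with scikit-learn, sketched below using the variable names from the earlier training sketch:

```python
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import label_binarize

probs = clf.predict_proba(X_val)                          # class probabilities
y_val_bin = label_binarize(y_val, classes=clf.classes_)   # one-vs-rest targets
for i, label in enumerate(clf.classes_):
    print(label, round(roc_auc_score(y_val_bin[:, i], probs[:, i]), 2))
```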
Predictions were made on the test.csv dataset. Each entry was classified and stored in a new DataFrame with the following structure:
| class |
|---|
| not_recommended |
| recommended |
| ... |
This DataFrame was saved as a CSV and archived as result.zip for final evaluation and submission.
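A short sketch of this export step; the file names follow the report, while the DataFrame construction and the predicted_labels variable are illustrative:

```python
import zipfile

import pandas as pd

# `predicted_labels` is assumed to hold the test-set predictions
result = pd.DataFrame({"class": predicted_labels})
result.to_csv("result.csv", index=False)

# Archive the CSV as result.zip for submission
with zipfile.ZipFile("result.zip", "w") as zf:
    zf.write("result.csv")
```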
This scatter plot projects high-dimensional sentence embeddings into 2D space using t-SNE. While some clustering is visible, overlap among classes reflects semantic ambiguity between certain sentiments.
This bar chart compares accuracy on the training and test sets. The nearly identical values (~67%) show that the model generalizes well without significant overfitting.
- ✅ Successfully cleaned and normalized Persian-language comments
- ✅ Trained Word2Vec model to capture semantic similarity
- ✅ Created an end-to-end sentiment classification pipeline
- ✅ Reached classification accuracy well above the 50% acceptance threshold on validation data
- ✅ Generated standardized output for evaluation
While the current pipeline performs reliably, the following enhancements could yield stronger results:
- 📌 Switch to transformer-based models such as ParsBERT or multilingual BERT for contextual embeddings
- 📌 Use of FastText for better handling of out-of-vocabulary words in Persian
- 📌 Add explainability layer for model predictions (e.g., LIME, SHAP)
- 📌 Hyperparameter tuning using cross-validation or grid search
- 📌 Multi-class sentiment support (positive, neutral, negative) for finer-grained analysis
This project was developed as part of a structured machine learning assignment focused on natural language processing with a concentration in Persian text mining.
- Gensim Word2Vec Documentation
- scikit-learn API Reference
- Hazm (Python toolkit for Persian NLP)
- fastText by Facebook AI
- ParsBERT: A Transformer-based Model for Persian Language Understanding