NLP pipeline to analyze Urdu social media posts, classifying them into positive, negative, or neutral sentiments. The tool will serve multiple purposes, such as helping brands understand customer feedback, influencers gauging audience engagement, and businesses assessing public sentiment toward their products and services.
Phase 1: Text Preprocessing for Urdu Text
- Stopword Removal: o Develop a custom list of Urdu stopwords (e.g., "اور", "یہ", "کہ"). Write a function to remove these stopwords from your dataset of social media posts. o Challenges to Address: Handle words that are often considered stopwords but may carry sentiment (e.g., "نہیں" (no), "برا" (bad)).
- Punctuation, Emojis, and Hashtags: o Remove unnecessary punctuation, emojis, URLs, and hashtags that don’t contribute to sentiment. o Bonus Task: Use a dictionary to translate common emojis into sentiment (e.g., 😊 = positive, 😢 = negative).
- Short Conversations: o Write a rule-based function to filter out very short posts or those with less than three words, as they may not carry sufficient sentiment.
Phase 2: Stemming and Lemmatization for Urdu Text
- Stemming: o Implement or utilize a stemming algorithm for Urdu. The algorithm should reduce word variants to their base form (e.g., "اچھا", "اچھی", "اچھے" → "اچھا"). o Challenges to Address: Handling word inflections due to gender and plurality in Urdu.
- Lemmatization: o Implement lemmatization for Urdu, which requires using dictionaries or rules to return words to their dictionary form. o Expected Output: For example, "چل رہی" (is moving) should be reduced to "چل" (move).
Phase 3: Feature Extraction from Urdu Text
- Tokenization: o Implement word tokenization for Urdu text, ensuring that the Urdu script is properly segmented into words. You can use existing tokenizers or build your own. o Deliverable: Provide a tokenized version of several Urdu social media posts.
- Tf-IDF (Term Frequency-Inverse Document Frequency): o Apply the Tf-IDF algorithm to extract relevant terms from the dataset. Identify the most important terms contributing to sentiment in Urdu text. o Expected Output: A table showing the top 10 words with the highest TF-IDF scores from the dataset.
- Word2Vec: o Train a Word2Vec model on your dataset to capture the relationship between Urdu words based on context. o Deliverable: List the top 5 words most similar to the word "اچھا" (good) using the trained model.
Phase 4: N-grams Analysis
- Unigram, Bigram, and Trigram Analysis: o Create unigrams, bigrams, and trigrams from the dataset of Urdu text. o Deliverable: List the top 10 most common bigrams and trigrams in the dataset, along with their frequencies. o Challenges to Address: Proper tokenization for Urdu to avoid breaking the words incorrectly (due to right-to-left script).
Phase 5: Sentiment Classification Model
- Model Building: o Using the features extracted (from Tf-IDF or Word2Vec), build a machine learning model (e.g., Logistic Regression, SVM, or Naive Bayes) to classify the sentiment of the Urdu posts. o Deliverable: Show the accuracy, precision, recall, and F1-score of your sentiment classifier using a test set of Urdu text