This documentation provides a deep-dive analysis of the 70,000+ localized product reviews processed during the "Fake Review Detection and Trust Analytics" project. The goal is to maximize transparency in the data lifecycle—from initial acquisition to the deployment of the Intelligent Trust Framework.
The first stage of our pipeline focuses on the integrity of the data.
- Raw Volume: 70,000 reviews across 7 distinct applications (Amazon, Flipkart, etc.).
- Missing Value Handling: We implemented iterative imputation and deletion for non-critical features.
- Sentiment Polarization: Early EDA identified a negative skew in rating distributions, suggesting a predominance of high-rating reviews in the dataset.
We engineered a suite of textual and behavioral features to distinguish fake reviews:
- Linguistic Features: Analyzing sentence length, punctuation density, and repetition patterns.
- Behavioral Features:
- Rating Deviation: Measuring how much a review rating differs from the product mean.
- Review Density: Tracking the frequency of reviews from a single user profile.
Utilized a hybrid lexicon-based and machine-learning approach to calculate:
- Subjectivity Scores: Highly subjective reviews often correlate with deceptive content.
- Polarity Indices: Extreme sentiment inconsistency (e.g., a 5-star rating with a negative text body) was flagged as a major indicator of fraudulent activity.
The core detection engine uses a Hybrid ML model (Random Forest/LSTM) to classify reviews.
- Accuracy: Proven to identify 1,876 fake reviews (2.69% of the total volume).
- Precision vs. Recall: Optimized for high precision to avoid false positives—genuine users should not be penalized.
| Metric ID | Analysis Component | Value / Observation | Status |
|---|---|---|---|
| AUD-001 | Total Records | 70,000+ | ✅ VERIFIED |
| AUD-002 | Fake Reviews | 1,876 | ✅ DETECTED |
| AUD-003 | Accuracy Ratio | 97.31% | ✅ OPTIMIZED |
| AUD-004 | Rating Skewness | -0.83 (Neg) | ✅ ANALYZED |
| AUD-005 | App Count | 7 Platforms | ✅ MAPPED |
A cross-examination of ratings and helpfulness votes revealed that fake reviews often receive significantly fewer "helpful" votes, yet they tend to have higher capitalization density in their text. This "shouting" pattern is a hallmark of manipulative intent.
Detailed distribution analysis across apps like Amazon Shopping, Flipkart, and Zepto showed that the "Amazon Shopping" platform had the highest review volume, but the distribution of fake reviews remained relatively uniform across ecommerce categories, suggesting that fraudulent actors are not platform-specific.
Developed a multi-factor score for every user:
- Factor A: Historical Rating Consistency.
- Factor B: Temporal Review Frequency.
- Factor C: Sentiment Accuracy vs. Community Average.
The integration of behavioral features and sentiment-based linguistic analysis has significantly improved the detection rate of deceptive content. This framework provides a scalable solution for marketplaces to maintain the integrity of their platforms and foster consumer trust.
This log was generated as part of the 23CSE452 Business Analytics project Documentation Phase.