Online reviews significantly shape consumer trust and purchasing decisions. This project introduces a data-driven system to detect deceptive reviews, assess reviewer credibility, and quantify the impact of manipulation on consumer trust. By integrating NLP text analytics, behavioral analysis, and trust modeling, we deliver actionable insights for e-commerce platforms and businesses.
Fake-Review-Detection/ # Root Project Folder 🛡️
├── dashboard images/ # Individual visualization exports 🖼️
│ ├── Fake vs Genuine Reviews.png # Model comparison chart
│ ├── Final_dashboard.png # Full BI Dashboard preview
│ ├── Rating_distribution.png # Global rating stats
│ ├── Review Distribution by App.png # Cross-platform breakdown
│ ├── Review_trend_over_time.png # Time-series spike detector
│ ├── Sentiment_vs_rating.png # Correlation visualization
│ └── Top_suspicious_reviewers.png # RCI-flagged user list
├── docs/ # Strategic Documentation 📜
│ ├── 01_DATA_ACQUISITION.md # Sourcing & Ingestion logic
│ ├── 02_EXPLORATORY_ANALYSIS.md # Statistical deep-dives
│ ├── 03_DATA_CLEANING.md # NLP Pre-processing steps
│ ├── 04_SENTIMENT_ANALYSIS.md # Polarity & Subjectivity logic
│ ├── 05_FAKE_REVIEW_DETECTION.md # ML Engine & RCI Score
│ └── 06_BUSINESS_INTELLIGENCE.md # Business value & Insights
├── notebooks/ # Step-by-Step Development 📓
│ ├── 01_Data_Acquisition.ipynb # Raw data fetching
│ ├── 02_Data_Cleaning.ipynb # NLP refining & noise removal
│ ├── 03_Sentiment_Analysis.ipynb # Polarity & Subjectivity experiments
│ ├── 04_Fake_Detection.ipynb # ML model training (Random Forest)
│ └── 05_Visualization.ipynb # Chart & Graph generation
├── api.py # Core ML Engine Integration Endpoint
├── ARCHITECTURE.md # Technical design & hierarchy
├── COMPREHENSIVE_DOCUMENTATION.md # Combined project overview
├── DATA_INSIGHTS.md # High-level analytical report
├── DEPLOY.md # Environment setup guide
├── fake_review_model.pkl # Trained Random Forest model
├── feature_names.pkl # Saved model feature names
├── Fake_Review_Analytics.twbx # Tableau BI Workbook
└── requirements.txt # Python dependency list
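As a rough illustration of how the two pickle artifacts above might be consumed (for example by `api.py`), the sketch below loads a pickle file and orders a feature dictionary to match the saved feature names. It assumes `feature_names.pkl` holds a list of strings and that the model exposes a scikit-learn style `predict` method; the feature name `review_length` is hypothetical. Only unpickle files from a trusted source, since `pickle.load` can execute arbitrary code.

```python
import pickle
from pathlib import Path


def load_pickle(path):
    """Load a Python object from a pickle file (trusted files only)."""
    with Path(path).open("rb") as fh:
        return pickle.load(fh)


def build_feature_row(feature_names, features):
    """Order a dict of feature values to match the saved feature names.

    Features missing from the dict default to 0.0 (an assumption made for
    this sketch; the real pipeline may handle missing values differently).
    """
    return [float(features.get(name, 0.0)) for name in feature_names]


# Hypothetical usage, assuming the files sit in the working directory:
# model = load_pickle("fake_review_model.pkl")
# names = load_pickle("feature_names.pkl")
# row = build_feature_row(names, {"review_length": 42})
# prediction = model.predict([row])
```

Keeping the feature order in a separate file lets the serving code rebuild inputs in the same column order the model saw during training.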
Data Acquisition ➔ EDA ➔ Data Cleaning ➔ Sentiment Analysis ➔ ML Detection ➔ BI Dashboard
- Data Ingestion: Harvesting 70,000+ localized reviews from Amazon, Flipkart, Zepto, and Shopsy.
- Exploratory Data Analysis: Identifying statistical anomalies, rating skewness, and "Review Bursting" patterns.
- Advanced Pre-processing: NLP pipeline involving tokenization, stopword removal, and lemmatization.
- Sentiment Profiling: Applying lexicon-based scoring to detect rating-sentiment contradictions.
- Hybrid ML Detection: Using a Random Forest engine to calculate the Reviewer Credibility Index (RCI).
- Business Intelligence: Visualizing real-time fraud trends and product score corrections on a BI dashboard.
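The pre-processing and sentiment profiling steps above can be sketched in plain Python. This is a simplified illustration, not the project's pipeline: the word lists are toy examples, lemmatization is omitted, and `contradicts_rating` with its `margin` parameter is a hypothetical helper.

```python
import re

# Toy lexicon and stopword list, invented for illustration only;
# the project's actual lexicon is not specified in this README.
POSITIVE = {"good", "great", "excellent", "love", "perfect"}
NEGATIVE = {"bad", "poor", "terrible", "broken", "worst"}
STOPWORDS = {"the", "a", "an", "is", "it", "this", "and", "was", "i"}


def tokenize(text):
    """Lowercase the text, keep alphabetic tokens, and drop stopwords."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]


def polarity(text):
    """Return a score in [-1, 1]: (positive - negative) / matched words."""
    tokens = tokenize(text)
    pos = sum(t in POSITIVE for t in tokens)
    neg = sum(t in NEGATIVE for t in tokens)
    matched = pos + neg
    return 0.0 if matched == 0 else (pos - neg) / matched


def contradicts_rating(text, stars, margin=0.5):
    """Flag a review whose text polarity disagrees with its star rating."""
    score = polarity(text)
    if stars >= 4:
        return score <= -margin  # high rating, clearly negative text
    if stars <= 2:
        return score >= margin   # low rating, clearly positive text
    return False                 # 3-star reviews are treated as neutral
```

For instance, a 5-star review reading "Terrible product, broken on arrival" would be flagged, while a 5-star "Great phone, love it" would not.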
- Total Reviews Analyzed: 70,000+
- Fake Reviews Detected: 1,876 (2.69%)
- Model Accuracy: 97.31%
- Data Source: Multi-platform localized product reviews (Amazon, Flipkart, Zepto, and Shopsy)
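With fake reviews making up only a small share of the data, the classes are heavily imbalanced, so accuracy is easiest to interpret next to a majority-class baseline and per-class precision and recall. The sketch below shows the arithmetic. The confusion-matrix counts in the usage lines are made-up placeholders, not project results, and the baseline assumes the true fake rate is close to the detected rate.

```python
def majority_baseline_accuracy(minority_rate):
    """Accuracy of a classifier that always predicts the majority class."""
    if not 0.0 <= minority_rate <= 1.0:
        raise ValueError("minority_rate must be between 0 and 1")
    return max(minority_rate, 1.0 - minority_rate)


def precision_recall(tp, fp, fn):
    """Precision and recall for the positive (fake) class."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Baseline for the detected fake rate reported above (2.69%).
baseline = majority_baseline_accuracy(0.0269)

# Placeholder counts for illustration only (not measured values).
precision, recall = precision_recall(tp=80, fp=20, fn=20)

print(f"baseline accuracy: {baseline:.4f}")
print(f"precision: {precision:.2f}, recall: {recall:.2f}")
```

A model adds value on the fake class when its accuracy exceeds this baseline and its recall on fake reviews is well above zero.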
This project is divided into six distinct stages. For a deep dive into the methodology of each stage, see the corresponding document in `docs/`:
| Stage | Focus Area | Documentation |
|---|---|---|
| 📦 | Data Acquisition | 01_DATA_ACQUISITION.md |
| 📈 | Exploratory Analysis | 02_EXPLORATORY_ANALYSIS.md |
| 🧼 | Data Cleaning & Pre-processing | 03_DATA_CLEANING.md |
| 🧠 | Sentiment & Linguistic Profiling | 04_SENTIMENT_ANALYSIS.md |
| 🤖 | Fake Review Detection & ML Engine | 05_FAKE_REVIEW_DETECTION.md |
| 📊 | Business Intelligence & Visualization | 06_BUSINESS_INTELLIGENCE.md |
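- Rating Distribution: `dashboard images/Rating_distribution.png`
- Review Trend Over Time: `dashboard images/Review_trend_over_time.png`
- Fake vs Genuine Reviews: `dashboard images/Fake vs Genuine Reviews.png`
- Sentiment vs Rating Correlation: `dashboard images/Sentiment_vs_rating.png`
- Top Suspicious Reviewers: `dashboard images/Top_suspicious_reviewers.png`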

- Market Integrity: Automated detection reduces manual moderation costs by 90%.
- Consumer Confidence: High-trust environments drive better conversion and brand loyalty.
- Strategic Intelligence: Identification of malicious "Review Boosting" or "Smear Campaigns."
- Product Score Correction: Recalculating true star ratings after removing fraud.
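Product score correction can be illustrated with a short sketch: average the star ratings twice, once over all reviews and once over only those not flagged as fake. The `(stars, is_flagged)` tuple layout and the function name are assumptions for illustration.

```python
from statistics import mean


def corrected_rating(reviews):
    """Return (raw_mean, corrected_mean) for (stars, is_flagged) pairs.

    The corrected mean ignores reviews flagged as fake. Either value is
    None when there are no reviews to average.
    """
    reviews = list(reviews)
    all_stars = [stars for stars, _ in reviews]
    genuine = [stars for stars, flagged in reviews if not flagged]
    raw = mean(all_stars) if all_stars else None
    corrected = mean(genuine) if genuine else None
    return raw, corrected


# Two flagged 5-star reviews inflate the raw average.
sample = [(5, True), (5, True), (2, False), (3, False)]
raw, corrected = corrected_rating(sample)
print(raw, corrected)
```

Reporting both values side by side shows how much flagged reviews shifted the displayed score.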
- Multi-lingual Support: Regional Indian dialect detection.
- Real-time API: Low-latency endpoint for live review vetting.
- Image Deception Detection: Computer vision for product photo verification.
| Name | Role | Responsibilities |
|---|---|---|
| M. Balaji Sakthivel | Project Manager | Data Sourcing & Strategy |
| M. Hasini Reddy | Data Engineer | Pipeline & Feature Engineering |
| Madhav Sreejith | Data Analyst | NLP & Text Analytics |
| Shivani | Analytics Engineer | ML Engine & Trust Framework |
| Kavin K | Business Analyst | BI Dashboard & Validation |
| Priority | Risk Category | Potential Impact | Mitigation Strategy |
|---|---|---|---|
| 🔴 | Data Quality & Labels | Noisy data can weaken ML accuracy | Multi-source verification & robust cleaning pipelines |
| 🟠 | Model Generalization | Overfitting to app-specific UX patterns | Training on a diverse dataset (Q-Comm, Fashion, Marketplace) |
| 🟠 | AI-Generated Spam | LLM-generated reviews bypassing filters | Dynamic RCI scoring based on linguistic complexity |
| 🟡 | System Scalability | Dashboard latency with 70k+ records | Optimized data indexing & efficient pickle model loading |
| 🟡 | Privacy Compliance | Accidental exposure of user PII | Complete anonymization & obfuscation of reviewer handles |
| 🔵 | False Positives | Genuine reviews flagged as deceptive | Human-in-the-loop threshold for high-value moderation |
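The human-in-the-loop mitigation in the last row could look like the following sketch, where only confident predictions are handled automatically and borderline cases go to a moderator. The thresholds and labels are hypothetical, and `fake_probability` is assumed to be the classifier's estimated probability that a review is fake (for a scikit-learn Random Forest, a value from `predict_proba`).

```python
def triage(fake_probability, auto_flag=0.9, needs_review=0.5):
    """Route a review based on its estimated probability of being fake.

    Returns "auto_flag" for confident detections, "human_review" for
    borderline scores, and "publish" otherwise. The default thresholds
    are illustrative and would need tuning on validation data.
    """
    if not 0.0 <= fake_probability <= 1.0:
        raise ValueError("fake_probability must be between 0 and 1")
    if needs_review > auto_flag:
        raise ValueError("needs_review must not exceed auto_flag")
    if fake_probability >= auto_flag:
        return "auto_flag"
    if fake_probability >= needs_review:
        return "human_review"
    return "publish"
```

Raising `auto_flag` sends more cases to moderators and lowers the chance that a genuine review is removed without a human check.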
Developed for the 23CSE452 Business Analytics course. All data is anonymized for academic purposes.
