kkk0070/Fake-Review-Detection

🛡️ Fake Review Detection & Trust Analytics

📌 Project Overview

Online reviews significantly shape consumer trust and purchasing decisions. This project introduces a data-driven system to detect deceptive reviews, assess reviewer credibility, and quantify the impact of manipulation on consumer trust. By integrating NLP text analytics, behavioral analysis, and trust modeling, we deliver actionable insights for e-commerce platforms and businesses.

📂 Complete Project Structure

Fake-Review-Detection/                     # Root Project Folder 🛡️
├── dashboard images/                      # Individual visualization exports 🖼️
│   ├── Fake vs Genuine Reviews.png       # Model comparison chart
│   ├── Final_dashboard.png                # Full BI Dashboard preview
│   ├── Rating_distribution.png            # Global rating stats
│   ├── Review Distribution by App.png    # Cross-platform breakdown
│   ├── Review_trend_over_time.png        # Time-series spike detector
│   ├── Sentiment_vs_rating.png            # Correlation visualization
│   └── Top_suspicious_reviewers.png       # RCI-flagged user list
├── docs/                                  # Strategic Documentation 📜
│   ├── 01_DATA_ACQUISITION.md             # Sourcing & Ingestion logic
│   ├── 02_EXPLORATORY_ANALYSIS.md         # Statistical deep-dives
│   ├── 03_DATA_CLEANING.md                # NLP Pre-processing steps
│   ├── 04_SENTIMENT_ANALYSIS.md           # Polarity & Subjectivity logic
│   ├── 05_FAKE_REVIEW_DETECTION.md        # ML Engine & RCI Score
│   └── 06_BUSINESS_INTELLIGENCE.md        # Business value & Insights
├── notebooks/                             # Step-by-Step Development 📓
│   ├── 01_Data_Acquisition.ipynb          # Raw data fetching
│   ├── 02_Data_Cleaning.ipynb             # NLP refining & noise removal
│   ├── 03_Sentiment_Analysis.ipynb        # Polarity & Subjectivity experiments
│   ├── 04_Fake_Detection.ipynb            # ML model training (Random Forest)
│   └── 05_Visualization.ipynb             # Chart & Graph generation
├── api.py                                 # Core ML Engine Integration Endpoint
├── ARCHITECTURE.md                        # Technical design & hierarchy
├── COMPREHENSIVE_DOCUMENTATION.md         # Combined project overview
├── DATA_INSIGHTS.md                       # High-level analytical report
├── DEPLOY.md                              # Environment setup guide
├── fake_review_model.pkl                  # Trained Random Forest model
├── feature_names.pkl                      # Saved model feature names
├── Fake_Review_Analytics.twbx              # Tableau BI Workbook
└── requirements.txt                       # Python dependency list
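
As a rough illustration of how the two pickle artifacts in the tree might be consumed, the sketch below loads them once and caches the result. The helper name `load_artifacts` and the assumption that `feature_names.pkl` holds a list of column names are illustrative only; the actual loading code in `api.py` may differ.

```python
import pickle
from functools import lru_cache
from pathlib import Path


@lru_cache(maxsize=None)
def load_artifacts(model_path="fake_review_model.pkl",
                   names_path="feature_names.pkl"):
    """Load the pickled model and feature names once; later calls reuse them.

    Hypothetical helper, not taken from api.py.
    Only unpickle files you trust: pickle can execute arbitrary code.
    """
    with Path(model_path).open("rb") as fh:
        model = pickle.load(fh)
    with Path(names_path).open("rb") as fh:
        feature_names = pickle.load(fh)
    return model, feature_names
```

Caching keeps repeated requests from re-reading the files from disk. Unpickling a scikit-learn model also requires a compatible scikit-learn version to be installed.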

🛡️ Project Methodology

Data Acquisition → EDA → Data Cleaning → Sentiment Analysis → ML Detection → BI Dashboard

  1. Data Ingestion: Harvesting 70,000+ localized reviews from Amazon, Flipkart, Zepto, and Shopsy.
  2. Exploratory Data Analysis: Identifying statistical anomalies, rating skewness, and "Review Bursting" patterns.
  3. Advanced Pre-processing: NLP pipeline involving tokenization, stopword removal, and lemmatization.
  4. Sentiment Profiling: Applying lexicon-based scoring to detect rating-sentiment contradictions.
  5. Hybrid ML Detection: Using a Random Forest engine to calculate the Reviewer Credibility Index (RCI).
  6. Business Intelligence: Visualizing real-time fraud trends and product score corrections on a BI dashboard.
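
Steps 3 and 4 above can be sketched with the standard library alone. This is a minimal illustration, not the project's pipeline: the tiny `LEXICON`, the `STOPWORDS` set, and the `margin` threshold are hypothetical placeholders, and lemmatization is omitted because it needs an external NLP library.

```python
import re

# Hypothetical mini-lexicon and stopword list, for illustration only.
LEXICON = {"good": 1.0, "great": 1.0, "love": 1.0, "excellent": 1.0,
           "bad": -1.0, "poor": -1.0, "terrible": -1.0, "broken": -1.0}
STOPWORDS = {"the", "a", "an", "is", "it", "this", "and", "was", "i"}


def tokenize(text):
    """Lowercase, split on non-letters, and drop stopwords."""
    return [t for t in re.findall(r"[a-z]+", text.lower())
            if t not in STOPWORDS]


def polarity(text):
    """Mean lexicon score over matched tokens, in [-1, 1]; 0.0 if none match."""
    scores = [LEXICON[t] for t in tokenize(text) if t in LEXICON]
    return sum(scores) / len(scores) if scores else 0.0


def contradicts(rating, text, margin=0.5):
    """Flag reviews whose star rating and text polarity point opposite ways.

    rating is 1-5 stars; margin is an assumed threshold, not a project value.
    """
    p = polarity(text)
    return (rating >= 4 and p <= -margin) or (rating <= 2 and p >= margin)
```

For example, `contradicts(5, "Terrible product, it was broken")` flags a five-star rating attached to strongly negative text, which is the kind of rating-sentiment mismatch step 4 looks for.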

📊 Quick Statistics

  • Total Reviews Analyzed: 70,000+
  • Fake Reviews Detected: 1,876 (2.69%)
  • Model Accuracy: 97.31%
  • Data Source: Multi-platform localized product reviews (Amazon, Flipkart, Zepto, etc.)
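
With fake reviews making up under 3% of the data, accuracy is usually read alongside precision and recall for the fake class. The sketch below shows how the three metrics follow from confusion-matrix counts. The function name and the counts in the usage line are made up for illustration and are not the project's results.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, and recall for the positive (fake) class.

    tp/fp/fn/tn are confusion-matrix counts; empty denominators yield 0.0.
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall}


# Made-up counts for illustration only (not the project's results).
metrics = classification_metrics(tp=40, fp=10, fn=10, tn=940)
```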

🗺️ Project Roadmap & Documentation

This project is divided into six distinct stages. For a deep dive into each stage's methodology, see the documents listed below (in docs/):

| Stage | Focus Area | Documentation |
| --- | --- | --- |
| 📦 | Data Acquisition | 01_DATA_ACQUISITION.md |
| 📈 | Exploratory Analysis | 02_EXPLORATORY_ANALYSIS.md |
| 🧼 | Cleaning & Pre-processing | 03_DATA_CLEANING.md |
| 🧠 | Sentiment & Linguistic Profiling | 04_SENTIMENT_ANALYSIS.md |
| 🤖 | Fake Review Detection & ML Engine | 05_FAKE_REVIEW_DETECTION.md |
| 📊 | Business Intelligence & Visualization | 06_BUSINESS_INTELLIGENCE.md |

🖼️ Analysis & Visualizations

🏁 Final Analytics Dashboard

Final Dashboard (image: dashboard images/Final_dashboard.png)

📈 Key Statistical Findings

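  • Rating Distribution (image: dashboard images/Rating_distribution.png)
  • Review Trend Over Time (image: dashboard images/Review_trend_over_time.png)
  • Fake vs Genuine Reviews (image: dashboard images/Fake vs Genuine Reviews.png)
  • Sentiment vs Rating Correlation (image: dashboard images/Sentiment_vs_rating.png)
  • Top Suspicious Reviewers (image: dashboard images/Top_suspicious_reviewers.png)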

🚀 Business Value & Future Path

Core Business Value

  • Market Integrity: Automated detection reduces manual moderation costs by 90%.
  • Consumer Confidence: High-trust environments drive better conversion and brand loyalty.
  • Strategic Intelligence: Identification of malicious "Review Boosting" or "Smear Campaigns."
  • Product Score Correction: Recalculating true star ratings after removing fraud.
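
Product score correction can be illustrated as recomputing a product's mean rating after dropping flagged reviews. This is a minimal sketch under assumptions: the function name and the `(rating, is_fake)` input shape are hypothetical, and the project may weight or filter reviews differently.

```python
def corrected_rating(reviews):
    """Return (raw_mean, corrected_mean) star ratings for one product.

    reviews: iterable of (rating, is_fake) pairs.
    Either mean is None when there are no reviews to average.
    """
    reviews = list(reviews)  # allow generators; we iterate twice
    ratings = [rating for rating, _ in reviews]
    genuine = [rating for rating, is_fake in reviews if not is_fake]
    raw = sum(ratings) / len(ratings) if ratings else None
    corrected = sum(genuine) / len(genuine) if genuine else None
    return raw, corrected


# Illustrative data: two flagged 5-star reviews inflate the raw mean.
raw, corrected = corrected_rating(
    [(5, True), (5, True), (4, False), (2, False), (3, False)]
)
```

In this made-up example the raw mean is 3.8 stars, while the mean over the three unflagged reviews is 3.0.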

Future Enhancements

  • Multi-lingual Support: Regional Indian dialect detection.
  • Real-time API: Low-latency endpoint for live review vetting.
  • Image Deception Detection: Computer vision for product photo verification.

👥 The Team

| Name | Role | Responsibilities |
| --- | --- | --- |
| M. Balaji Sakthivel | Project Manager | Data Sourcing & Strategy |
| M. Hasini Reddy | Data Engineer | Pipeline & Feature Engineering |
| Madhav Sreejith | Data Analyst | NLP & Text Analytics |
| Shivani | Analytics Engineer | ML Engine & Trust Framework |
| Kavin K | Business Analyst | BI Dashboard & Validation |

⚠️ Risk Assessment & Mitigation

| Priority | Risk Category | Potential Impact | Mitigation Strategy |
| --- | --- | --- | --- |
| 🔴 | Data Quality & Labels | Noisy data can weaken ML accuracy | Multi-source verification & robust cleaning pipelines |
| 🟠 | Model Generalization | Favoring specific app UX patterns | Training on a diverse dataset (Q-Comm, Fashion, Marketplace) |
| 🟠 | AI-Generated Spam | LLM-generated reviews bypassing filters | Dynamic RCI scoring based on linguistic complexity |
| 🟡 | System Scalability | Dashboard latency with 70k+ records | Optimized data indexing & efficient pickle model loading |
| 🟡 | Privacy Compliance | Accidental exposure of user PII | Complete anonymization & obfuscation of reviewer handles |
| 🔵 | False Positives | Genuine reviews flagged as deceptive | Human-in-the-loop threshold for high-value moderation |
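
The human-in-the-loop mitigation for false positives could take the form of score bands, where only high-confidence predictions are acted on automatically. The sketch below is one possible shape: the function name, the action labels, and both threshold values are hypothetical and would need tuning against real moderation outcomes.

```python
def route_review(fake_probability, auto_flag=0.9, needs_review=0.6):
    """Map a model's fake-review probability to a moderation action.

    Thresholds are assumed values for illustration, not project settings.
    """
    if not 0.0 <= fake_probability <= 1.0:
        raise ValueError("fake_probability must be between 0 and 1")
    if fake_probability >= auto_flag:
        return "flag"            # confident enough to act automatically
    if fake_probability >= needs_review:
        return "human_review"    # uncertain band goes to a moderator
    return "publish"
```

Widening the band between `needs_review` and `auto_flag` sends more reviews to moderators, which trades moderation effort for fewer genuine reviews being flagged automatically.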

Developed for the 23CSE452 Business Analytics course. All data is anonymized for academic purposes.
