A natural language processing project for sentiment analysis of Amazon product reviews, developed as part of MIT's SCM256 course. It fine-tunes transformer models, including BERT and RoBERTa, for five-class (1-5 star) sentiment classification and uses the results to derive supply chain insights.
This repository contains a complete NLP pipeline for analyzing Amazon Grocery & Gourmet Food reviews to extract actionable business intelligence for supply chain management. The project combines:
- Advanced NLP Models: BERT, RoBERTa fine-tuning for sentiment classification
- Large-Scale Data Processing: Efficient handling of millions of Amazon reviews
- Multi-Model Comparison: Comparative analysis across different transformer architectures
- Supply Chain Applications: Sentiment-driven insights for inventory and logistics optimization
Understanding customer sentiment at scale is crucial for supply chain optimization. This project addresses:
- Demand Forecasting: Sentiment trends as leading indicators of demand changes
- Quality Control: Early detection of product quality issues through review analysis
- Supplier Performance: Sentiment analysis for supplier and product evaluation
- Risk Management: Identifying potential supply chain disruptions through customer feedback
Amazon_Local_Transfer/
├── Amazon_EDA_v2.ipynb                    # Comprehensive exploratory data analysis
├── BERT_for_Amazon/                       # Core BERT implementation
│   ├── Amazon_sentiment_analysis.py       # Main BERT training pipeline
│   ├── BERT_delayed.py                    # Delayed shipment analysis
│   ├── BERT_expire.py                     # Product expiration analysis
│   ├── GPT_call.py                        # GPT API integration
│   ├── Deepseek_call.py                   # Deepseek API integration
│   └── finetuned_model/                   # Trained model artifacts
├── BERT_for_Amazon_Expanded/              # Scaled BERT implementation
│   ├── Amazon_expanded_optimized_v2.py    # Optimized training pipeline
│   ├── Amazon_expanded_distributed.py     # Distributed training setup
│   └── daily_trained/                     # Incremental training models
├── BERT_for_Amazon_LowRating/             # Low-rating specific analysis
├── BERT_for_Amazon_combined/              # Multi-model ensemble approach
├── roBERTa_for_Amazon/                    # RoBERTa model implementation
└── checkpoint-*/                          # Model checkpoints and training states
- BERT Base: Fine-tuned for 5-class sentiment classification (1-5 stars)
- RoBERTa: Robustly optimized BERT approach for improved performance
- Ensemble Methods: Combined predictions from multiple transformer models
- Specialized Models: Targeted analysis for low ratings and specific use cases
- Preprocessing Pipeline: HTML cleaning, tokenization, lemmatization
- Feature Engineering: Combined review text and summary analysis
- Multi-threading: Parallel processing for large-scale data handling
- Memory Optimization: Efficient data loading and batch processing
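The cleaning and feature-combination steps above can be sketched roughly as follows. This is a minimal, standard-library illustration and not the repository's actual pipeline: the function names `clean_text` and `combine_fields` are hypothetical, a regular expression stands in for BeautifulSoup, and tokenization and lemmatization are left to the downstream tokenizer and NLTK.

```python
import html
import re

TAG_RE = re.compile(r"<[^>]+>")   # anything that looks like an HTML tag
WS_RE = re.compile(r"\s+")        # runs of whitespace


def clean_text(raw: str) -> str:
    """Strip HTML tags, decode entities, and collapse whitespace."""
    no_tags = TAG_RE.sub(" ", raw or "")
    decoded = html.unescape(no_tags)
    return WS_RE.sub(" ", decoded).strip()


def combine_fields(summary: str, review_text: str) -> str:
    """Join the cleaned summary and review body into one model input.

    Assumption: summary first, then the body, separated by a period.
    Empty fields are dropped so no stray separator is produced.
    """
    parts = [clean_text(summary), clean_text(review_text)]
    return ". ".join(p for p in parts if p)


if __name__ == "__main__":
    print(combine_fields("Loved it", "<p>Arrived  fast &amp; fresh</p>"))
```

Placing the summary before the body is one possible design choice; it keeps the short headline inside the model's context window even when a long review is truncated.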
- Delayed Shipment Analysis: Sentiment correlation with logistics performance
- Product Expiration Tracking: Quality control through review sentiment
- Supplier Performance: Vendor evaluation through customer feedback
- Demand Signal Detection: Early warning systems for inventory management
- GPU Optimization: CUDA-enabled training for large models
- Distributed Training: Multi-node training capabilities
- Incremental Learning: Daily model updates with new review data
- Checkpoint Management: Robust model versioning and recovery
pip install torch transformers datasets scikit-learn pandas numpy nltk beautifulsoup4
pip install polars matplotlib seaborn tqdm
- Data Exploration:
jupyter notebook Amazon_EDA_v2.ipynb
- BERT Training:
python BERT_for_Amazon/Amazon_sentiment_analysis.py
- Optimized Training:
python BERT_for_Amazon_Expanded/Amazon_expanded_optimized_v2.py
- Amazon Reviews Dataset: Grocery & Gourmet Food reviews in JSONL format
- Required Fields: reviewText, summary, overall (rating), asin, reviewTime
- Scale: Optimized for millions of reviews with efficient memory usage
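As an illustration of the input format, the sketch below streams a JSONL source one line at a time and keeps only records that carry all of the required fields. It is a hedged example rather than the project's loader: `iter_reviews` is a hypothetical name, and silently skipping malformed or incomplete lines is an assumption about the desired behavior.

```python
import io
import json
from typing import Iterator, TextIO

REQUIRED_FIELDS = ("reviewText", "summary", "overall", "asin", "reviewTime")


def iter_reviews(handle: TextIO) -> Iterator[dict]:
    """Yield one review dict per valid JSONL line.

    Lines that are blank, not valid JSON, not JSON objects, or missing a
    required field are skipped (an assumption; a real pipeline might log them).
    """
    for line in handle:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue
        if isinstance(record, dict) and all(f in record for f in REQUIRED_FIELDS):
            yield record


if __name__ == "__main__":
    # Made-up sample data for demonstration only.
    sample = io.StringIO(
        '{"reviewText": "Fresh", "summary": "Good", "overall": 5.0, '
        '"asin": "B000000000", "reviewTime": "09 3, 2014"}\n'
        '{"reviewText": "No rating", "summary": "?"}\n'
        "not json\n"
    )
    for review in iter_reviews(sample):
        print(review["asin"], review["overall"])
```

Because the function is a generator over an open file handle, memory use stays roughly constant regardless of how many reviews the file contains.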
- Accuracy: Multi-class classification accuracy (1-5 stars)
- Precision/Recall/F1: Per-class performance analysis
- Confusion Matrix: Detailed classification performance
- Training Loss: Convergence monitoring and optimization
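For reference, the classification metrics listed above can be computed as in the following standard-library sketch. The project installs scikit-learn, which provides equivalent functions, so this is only meant to make the definitions concrete; the function names and the example labels are illustrative, not taken from the repository.

```python
from collections import Counter

LABELS = (1, 2, 3, 4, 5)  # star ratings used as class labels


def accuracy(y_true, y_pred):
    """Fraction of reviews whose predicted rating equals the true rating."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def confusion_matrix(y_true, y_pred, labels=LABELS):
    """Rows are true ratings, columns are predicted ratings."""
    counts = Counter(zip(y_true, y_pred))
    return [[counts[(t, p)] for p in labels] for t in labels]


def per_class_scores(y_true, y_pred, labels=LABELS):
    """Return {label: (precision, recall, f1)}; undefined ratios become 0.0."""
    matrix = confusion_matrix(y_true, y_pred, labels)
    scores = {}
    for i, label in enumerate(labels):
        tp = matrix[i][i]
        predicted = sum(row[i] for row in matrix)  # column total
        actual = sum(matrix[i])                    # row total
        precision = tp / predicted if predicted else 0.0
        recall = tp / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores[label] = (precision, recall, f1)
    return scores


if __name__ == "__main__":
    y_true = [5, 5, 4, 1, 1, 3]  # made-up labels for demonstration
    y_pred = [5, 4, 4, 1, 2, 3]
    print(accuracy(y_true, y_pred))
    print(per_class_scores(y_true, y_pred)[5])
```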
- BERT vs RoBERTa: Comparative analysis of transformer architectures
- Fine-tuning Strategies: Different approaches to domain adaptation
- Ensemble Performance: Combined model predictions for improved accuracy
- Computational Efficiency: Training time and resource utilization analysis
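One common way to combine predictions from several classifiers is to average their class probabilities (soft voting). The sketch below assumes each model emits five logits per review, one per star rating; whether the repository's ensemble uses this scheme, weighted or otherwise, is not stated in this README, so treat `ensemble_predict` and its `weights` parameter as hypothetical.

```python
import math


def softmax(logits):
    """Convert raw logits to probabilities, shifting by the max for stability."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def ensemble_predict(logits_per_model, weights=None):
    """Average per-model class probabilities.

    Returns (star_rating, averaged_probabilities). Class index 0 maps to
    1 star, index 4 to 5 stars. Weights, if given, are assumed to sum to 1.
    """
    n_models = len(logits_per_model)
    weights = weights or [1.0 / n_models] * n_models
    probs = [softmax(logits) for logits in logits_per_model]
    n_classes = len(probs[0])
    avg = [sum(w * p[c] for w, p in zip(weights, probs)) for c in range(n_classes)]
    best = max(range(n_classes), key=avg.__getitem__)
    return best + 1, avg


if __name__ == "__main__":
    bert_logits = [0.0, 0.0, 0.0, 1.0, 3.0]     # made-up values
    roberta_logits = [0.0, 0.0, 0.0, 2.5, 2.0]  # made-up values
    rating, probs = ensemble_predict([bert_logits, roberta_logits])
    print(rating, [round(p, 3) for p in probs])
```

Averaging probabilities rather than raw logits keeps each model's contribution on the same scale, which matters when the models are calibrated differently.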
- Rating Distribution: Analysis of 1-5 star rating patterns
- Temporal Trends: Seasonal and temporal sentiment variations
- Product Categories: Category-specific sentiment characteristics
- Review Length Correlation: Relationship between review length and sentiment
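A temporal trend of the kind described above could be computed by bucketing reviews by month, as in this sketch. Two assumptions are made here and should be checked against the actual data: the star rating in `overall` is used as a simple proxy for sentiment, and `reviewTime` is assumed to be formatted like `"09 3, 2014"` (month, day, year). The function name `monthly_mean_rating` is hypothetical.

```python
from collections import defaultdict
from datetime import datetime


def monthly_mean_rating(reviews, time_format="%m %d, %Y"):
    """Return {(year, month): mean star rating}, in chronological order.

    time_format is an assumption about how reviewTime is stored; pass a
    different format string if the dataset uses another layout.
    """
    totals = defaultdict(lambda: [0.0, 0])  # (year, month) -> [sum, count]
    for review in reviews:
        when = datetime.strptime(review["reviewTime"], time_format)
        bucket = totals[(when.year, when.month)]
        bucket[0] += float(review["overall"])
        bucket[1] += 1
    return {key: total / count for key, (total, count) in sorted(totals.items())}


if __name__ == "__main__":
    # Made-up records for demonstration only.
    reviews = [
        {"reviewTime": "10 1, 2014", "overall": 1.0},
        {"reviewTime": "09 3, 2014", "overall": 5.0},
        {"reviewTime": "09 21, 2014", "overall": 3.0},
    ]
    for month, mean in monthly_mean_rating(reviews).items():
        print(month, mean)
```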
- Quality Indicators: Sentiment as early warning for quality issues
- Logistics Performance: Correlation between delivery experience and sentiment
- Supplier Insights: Vendor performance through customer feedback analysis
- Demand Forecasting: Sentiment trends as demand predictors
- Input Processing: Tokenization with BERT/RoBERTa tokenizers
- Feature Combination: Review text + summary concatenation
- Classification Head: 5-class sentiment classification layer
- Training Strategy: Fine-tuning with domain-specific data
- Batch Processing: Efficient data loading with optimal batch sizes
- Memory Management: Gradient checkpointing and mixed precision training
- Parallel Processing: Multi-threading for data preprocessing
- Model Checkpointing: Regular saving for training recovery
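The batching and multi-threaded preprocessing ideas above might look something like the following. This is a generic sketch, not the repository's code: `batched`, `preprocess_batch`, and `preprocess_parallel` are hypothetical names, and the per-batch work is a trivial placeholder standing in for real cleaning or tokenization.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import islice


def batched(items, batch_size):
    """Yield consecutive lists of at most batch_size items."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def preprocess_batch(batch):
    """Placeholder per-batch work: trim and lowercase each text."""
    return [text.strip().lower() for text in batch]


def preprocess_parallel(texts, batch_size=2, workers=4):
    """Process batches on a thread pool; results keep the input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(preprocess_batch, batched(texts, batch_size))
        return [item for batch in results for item in batch]


if __name__ == "__main__":
    print(preprocess_parallel(["  Great tea ", "STALE", " Arrived late"]))
```

Threads help most when the per-batch work releases the interpreter lock (I/O, or tokenizers implemented in native code); for pure-Python CPU-bound cleaning, a process pool is usually the better fit.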
- GPT Integration: Comparison with OpenAI models
- Deepseek Integration: Alternative LLM comparison
- Model Serving: Inference pipeline for real-time predictions
- Batch Prediction: Efficient processing of large review datasets
- Inventory Planning: Sentiment-driven demand forecasting
- Quality Assurance: Early detection of product quality issues
- Supplier Management: Data-driven supplier performance evaluation
- Customer Experience: Proactive identification of service issues
- Risk Mitigation: Early warning system for potential issues
- Cost Reduction: Optimized inventory based on sentiment trends
- Revenue Enhancement: Improved customer satisfaction through insights
- Strategic Planning: Long-term trend analysis for business decisions
This project demonstrates advanced concepts in:
- Transfer Learning: Fine-tuning pre-trained transformers for domain-specific tasks
- Large-Scale NLP: Processing millions of text documents efficiently
- Multi-Model Ensembles: Combining different architectures for improved performance
- Supply Chain Analytics: Practical NLP applications in operations management
- Devlin, J., et al. (2018). BERT: Pre-training of Deep Bidirectional Transformers
- Liu, Y., et al. (2019). RoBERTa: A Robustly Optimized BERT Pretraining Approach
- Rogers, A., et al. (2020). A Primer in BERTology: What We Know About How BERT Works
- Qiu, X., et al. (2020). Pre-trained Models for Natural Language Processing: A Survey
This project is licensed under the MIT License - see the LICENSE file for details.
Please read CONTRIBUTING.md for details on our code of conduct and the process for submitting pull requests.
Developed as part of MIT SCM256 - advancing the application of natural language processing in supply chain management and operations research.