A comprehensive analytics platform for scraping, analyzing, and serving product data from Amazon. The project includes web scraping capabilities, a REST API, and a RAG-based conversational AI system for data insights.
## Features

### Web Scraping
- Automated scraping of products from Amazon
- Collects detailed product information and user reviews
- Scheduled scraping with configurable intervals
- Data persistence in PostgreSQL
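
For a rough sense of how a Playwright + BeautifulSoup flow can work, here is a minimal sketch; the URL, CSS selectors, and `scrape_search_page` name are illustrative assumptions, not the project's actual code (which lives in `app/scraper/amazon_scraper.py`):

```python
# Minimal scrape-flow sketch: render with Playwright, parse with BeautifulSoup.
# Selectors and the search URL are assumptions; real markup may differ.
from bs4 import BeautifulSoup
from playwright.sync_api import sync_playwright

def scrape_search_page(url: str) -> list:
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url)
        html = page.content()
        browser.close()

    soup = BeautifulSoup(html, "html.parser")
    products = []
    for item in soup.select("div.s-result-item"):  # assumed result selector
        title = item.select_one("h2")
        price = item.select_one("span.a-offscreen")
        if title:
            products.append({
                "title": title.get_text(strip=True),
                "price": price.get_text(strip=True) if price else None,
            })
    return products

if __name__ == "__main__":
    print(scrape_search_page("https://www.amazon.com/s?k=watches"))
```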
### REST API
- Built with FastAPI for high performance
- Comprehensive product search and filtering
- Review management
- Pagination support
- Top products analytics
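
As a sketch of how a filterable, paginated endpoint like this can be wired up in FastAPI (the in-memory rows and field names are assumptions; the real app queries PostgreSQL via SQLAlchemy):

```python
# Sketch of a paginated, filterable /products route. The sample rows and
# field names are assumptions, not the project's actual schema.
from typing import Optional
from fastapi import FastAPI, Query

app = FastAPI()

PRODUCTS = [  # stand-in for the PostgreSQL-backed query
    {"id": 1, "brand": "Seiko", "price": 250.0, "title": "Seiko 5 Automatic"},
    {"id": 2, "brand": "Casio", "price": 45.0, "title": "Casio F-91W"},
]

@app.get("/products")
def list_products(
    brand: Optional[str] = None,
    min_price: float = 0.0,
    max_price: float = 1e12,
    page: int = Query(1, ge=1),
    limit: int = Query(10, ge=1, le=100),
):
    rows = [
        p for p in PRODUCTS
        if (brand is None or p["brand"] == brand)
        and min_price <= p["price"] <= max_price
    ]
    start = (page - 1) * limit
    return {"page": page, "limit": limit, "items": rows[start:start + limit]}
```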
### AI Insights
- RAG (Retrieval Augmented Generation) system
- Conversational interface for data analysis
- Built on Weaviate with OpenAI integration
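
A minimal retrieve-then-generate loop could look like the sketch below, assuming a Weaviate class named `Product` with `title` and `description` fields, the v3 Python client, and an OpenAI chat model; the actual pipeline in `app/rag/pipeline.py` may differ:

```python
# RAG sketch: retrieve similar products from Weaviate, then ask the LLM to
# answer grounded in them. Class name, fields, and model are assumptions.
import weaviate
from openai import OpenAI

def answer(question: str) -> str:
    wv = weaviate.Client("http://localhost:8080")  # v3-style client
    # near_text requires a vectorizer module (e.g. text2vec-openai) enabled.
    hits = (
        wv.query.get("Product", ["title", "description"])
        .with_near_text({"concepts": [question]})
        .with_limit(3)
        .do()
    )
    context = "\n".join(
        f"{p['title']}: {p['description']}"
        for p in hits["data"]["Get"]["Product"]
    )
    llm = OpenAI()  # reads OPENAI_API_KEY from the environment
    resp = llm.chat.completions.create(
        model="gpt-4o-mini",  # assumed model name
        messages=[
            {"role": "system",
             "content": "Answer using only the provided product data."},
            {"role": "user",
             "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return resp.choices[0].message.content
```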
## Tech Stack

- Backend: Python 3.9+, FastAPI
- Database: PostgreSQL, SQLAlchemy
- Scraping: Playwright, BeautifulSoup4
- AI/ML: Weaviate, OpenAI, Docker
- Frontend: Streamlit
## Project Structure

```
amazon_products_analytics/
├── app/
│ ├── api/
│ │ ├── routers/
│ │ ├── __init__.py
│ │ ├── exceptions.py
│ │ └── response_models.py
│ ├── frontend/
│ │ ├── static/
│ │ ├── __init__.py
│ │ └── app.py
│ ├── scraper/
│ │ ├── amazon_scraper.py
│ │ └── scheduler.py
│ ├── rag/
│ │ ├── chatbot.py
│ │ └── pipeline.py
│ ├── __init__.py
│ ├── main.py
│ ├── config.py
│ └── database.py
├── scripts/
├── tests/
├── .gitignore
├── docker-compose.yml
├── Dockerfile
├── README.md
└── requirements.txt
```

## Prerequisites

- Python 3.9+
- Docker
- PostgreSQL
- OpenAI API Key
- Weaviate
- Chrome WebDriver
## Installation

- Clone the repository:

  ```bash
  git clone https://github.com/abir0/amazon-products-analytics.git
  cd amazon-products-analytics
  ```

- Create and activate a virtual environment:

  ```bash
  python -m venv venv
  source venv/bin/activate  # On Windows: .\venv\Scripts\activate
  ```

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```
- Configure environment variables in a `.env` file in the root directory.
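
  For illustration, these variables might be loaded in `app/config.py` roughly like this (a sketch; the variable names are assumptions, not the project's actual settings):

  ```python
  # Hypothetical settings loader using pydantic's v1-style BaseSettings
  # (in pydantic v2 this class lives in the separate pydantic-settings package).
  # Variable names are assumptions; see app/config.py for the real ones.
  from pydantic import BaseSettings

  class Settings(BaseSettings):
      database_url: str = "postgresql://postgres:postgres@localhost:5432/amazon_products"
      openai_api_key: str = ""
      weaviate_url: str = "http://localhost:8080"

      class Config:
          env_file = ".env"  # values in .env override the defaults above

  settings = Settings()
  ```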
- Set up a PostgreSQL database named `amazon_products`:

  ```bash
  sudo -u postgres psql
  CREATE DATABASE amazon_products;
  \q
  ```

- Start the Weaviate database using Docker Compose:
  ```bash
  cd app/rag
  docker compose up -d
  cd -
  ```

- Start the backend app:
  ```bash
  python3 app/app.py
  ```

- Start the frontend app:

  ```bash
  streamlit run app/frontend/app.py
  ```

## Usage

Start the server:

```bash
python3 app/app.py
```

The API will be available at http://localhost:8001.
Once the server is running, visit:
- Swagger UI: http://localhost:8001/docs
With the server running (`python3 app/app.py`), you can query the RAG system:

```bash
curl -X 'POST' \
  'http://localhost:8001/rag/query?question=hi' \
  -H 'accept: application/json' \
  -d ''
```
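
The same query from Python with the requests library:

```python
# Python equivalent of the curl example above.
import requests

resp = requests.post(
    "http://localhost:8001/rag/query",
    params={"question": "hi"},
    headers={"accept": "application/json"},
)
print(resp.json())
```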
## API Endpoints

- `GET /products`: Search, filter, and sort products.
  - Query params: `brand`, `min_price`, `max_price`, `page`, `limit`, `sort_by`
  - Example: `GET /products?brand=Seiko&min_price=100&max_price=500&page=1&limit=10`
- `GET /products/top`: Retrieve top-rated products based on reviews.
  - Example: `GET /products/top`
- `GET /products/{product_id}/reviews`: Get reviews for a specific product.
  - Query params: `page`, `limit`
  - Example: `GET /products/123/reviews?page=1&limit=5`
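
For instance, the `/products` endpoint above can be called from Python like this:

```python
# Calls GET /products with the documented example filters.
import requests

resp = requests.get(
    "http://localhost:8001/products",
    params={"brand": "Seiko", "min_price": 100, "max_price": 500,
            "page": 1, "limit": 10},
)
print(resp.json())
```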
## Deployment

This project is containerized and can be deployed to any cloud provider that supports Docker containers. Here is my approach to deploying it on AWS:
- **API & Scraping Service**:
  - EC2 for hosting the FastAPI app. EC2 gives full control over the instance, so the environment is highly customizable.
  - ECS for container orchestration, which will be useful once there are many different products to scrape.
- **Database**:
  - Amazon RDS for the PostgreSQL database. RDS is one of the most versatile database services: it is highly configurable and makes it easier to migrate to a different open-source engine later. For example, if we need to sync with another datastore such as Elasticsearch, that can be configured readily.
- **AI Model**:
  - SageMaker for deploying the LLM as an API endpoint. Endpoints auto-scale, so serving fine-tuned models from SageMaker is straightforward, and custom training and evaluation pipelines can be set up through the same service.
- **Storage**:
  - S3 for storing scraped product images. S3 offers multiple storage tiers, and pairing it with CloudFront or another CDN keeps access latency low for users in different regions.
- **Monitoring**:
  - CloudWatch for logs and performance monitoring of the services.
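
As a small illustration of the storage piece, uploading a scraped image to S3 with boto3 might look like this (the bucket and key names are placeholders, not part of this project):

```python
# Sketch: upload a scraped product image to S3. Bucket/key are placeholders;
# credentials come from the standard AWS configuration chain.
import boto3

s3 = boto3.client("s3")
with open("images/product_123.jpg", "rb") as f:
    s3.upload_fileobj(f, "my-products-bucket", "images/product_123.jpg")
```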
Created by Abir.