This project simulates the role of a Data Analyst in an international e-commerce company. The goal is to analyze sales, customers, products, and countries to answer key business questions such as:
- Who are the top customers?
- Which products are the best sellers?
- Which countries generate the most revenue?
- What are the sales trends over time?
Source: UCI Machine Learning Repository – Online Retail Dataset (UK)
- Real transactions from a UK-based online retailer (2010–2011)
- Fields: InvoiceNo, StockCode, Description, Quantity, InvoiceDate, UnitPrice, CustomerID, Country
- Format: Provided as .xlsx, converted to PostgreSQL and pandas DataFrame for analysis
The project follows a complete data analysis pipeline, using industry-standard tools:
- Data Import: Loaded the dataset into a PostgreSQL database (SQL)
- Data Exploration & Cleaning: Wrote SQL queries for initial exploration and cleaning
- Data Extraction: Imported query results into Python using
pandasandSQLAlchemy - Data Cleaning & Transformation: Performed further cleaning, transformation, and aggregation in
pandas - Visualization: Created insightful visualizations using
matplotlib - Export: Exported results to CSV for reporting and documentation
- Documentation: Documented the project with a README and visualization images
The analysis addresses real-world business questions, including:
- How many orders were placed each month?
- What is the total and average revenue per country?
- What are the top 10 most sold products?
- Who are the top 10 customers by total spending?
- Which customers purchase most frequently?
- Which customers have the highest average order value?
- Which products are most frequently returned (negative quantity)?
- Is there seasonality in product sales?
The project includes three visualizations:
- Top 10 countries by total revenue
- Top 10 most sold products
- Distribution of sales among top products
- SQL (PostgreSQL) for data storage and querying
- Python (
pandas,sqlalchemy,matplotlib) - CSV/Excel for data export
dataCleaning.py: Data cleaning and quality checksdataVisualization.py: Data extraction, analysis, and visualizationdb_config.py: Database connection configurationanalysis_queries.sql: SQL queries for exploration and aggregationOnlineRetailCSV.csv/OnlineRetail_cleaned.csv: Raw and cleaned datasets- Visualization images:
Top 10 Most Sold Products.pngDistribution of Top 10 Most Sold Products.pngTotal Revenue - Top 10 Countries.png
README.md: Project documentation