This project represents a deep dive into data analytics and visualization, centered on the "Online Retail Data Set", a comprehensive dataset with over 541,000 transactional records from an online retailer. Undertaken as Task 3, the project aimed to create actionable visualizations to support a hypothetical retail expansion strategy, addressing specific questions posed by the CEO and CMO. The process involved meticulous data cleaning, advanced visualization design in Tableau Public, and the integration of storytelling to derive strategic insights. This README documents every phase, from initial data acquisition to the final publication of interactive visualizations, serving as both a technical record and a testament to my evolving skills in data science and user experience design.
The project builds on my previous experiences, such as troubleshooting Python indentation errors in a trading app, refining overlapping diagrams in LaTeX, and creating structured Kanban dashboards. The resulting work—four Tableau Public visualizations—offers a narrative-driven approach to retail strategy, blending technical rigor with human-centered design principles. This document is intended for portfolio use, educational sharing, and professional networking, particularly on platforms like GitHub and LinkedIn.
- Data Quality: Clean the dataset to eliminate invalid entries (e.g., negative quantities, zero/negative unit prices) and ensure analytical integrity.
- Visualization Goals: Develop four distinct, interactive visualizations to answer executive queries about revenue trends, market performance, customer value, and global demand.
- Tool Utilization: Leverage Tableau Public for its free, robust visualization capabilities, aligning with my interest in clear, user-friendly diagrams.
- Insight Generation: Provide data-driven insights to guide retail expansion, enhancing decision-making for the CEO and CMO.
- Documentation: Create a detailed record of the process to reflect my technical growth and share with the community.
- Programming Language: Python 3.x (for data preprocessing).
- Libraries:
  - pandas: Data manipulation and analysis.
  - numpy: Numerical computations.
  - openpyxl: Excel file handling.
  - psutil: System resource monitoring.
  - matplotlib: Initial visualization testing (optional).
- Visualization Tool: Tableau Public Desktop (latest version, e.g., 2023.3 as of April 2025).
- Operating System: Windows (PowerShell, working directory `C:\Users\AMIT\downloads\Data_Cleaning`).
- Version Control: Git (for managing code and documentation).
- Editor: Notepad++ or any IDE (e.g., VS Code, per past coding discussions).
- Data Source: "Online_Retail_Data_Set.xlsx" (original) and "Online_Retail_Data_Set_Cleaned.csv" (final).
- Source Identification: The project started with the "Online Retail Data Set," sourced from the UCI Machine Learning Repository (referenced in task resources). The dataset, detailed by Daqing Chen et al. (2012), includes transactional data from 2010-2011.
- File Download: Obtained `Online_Retail_Data_Set.xlsx` from the provided resource link, saving it to `C:\Users\AMIT\downloads\Data_Cleaning`.
- Initial Exploration: Opened the file in Excel to confirm the columns (`InvoiceNo`, `StockCode`, `Description`, `Quantity`, `InvoiceDate`, `UnitPrice`, `CustomerID`, `Country`) and the approximate row count (541,909).
- Python Installation: Verified Python 3.x was installed (ran `python --version` in PowerShell).
- Dependency Installation: Installed the required libraries via pip: `pip install pandas openpyxl numpy psutil`. Encountered a `ModuleNotFoundError` for `openpyxl` during initial runs, resolved by installing it as advised (a quick import check is sketched after this list).
- Directory Organization: Created a working directory (`Data_Cleaning`) to store `Online_Retail_Data_Set.xlsx`, `clean.py`, and output files.
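Before the first run of the cleaning script, a quick import check along these lines can confirm the environment is ready (a minimal sketch, assuming these four modules are the only hard requirements):

```python
# Minimal environment check: confirm the libraries used by clean.py import cleanly.
import importlib

for module_name in ("pandas", "numpy", "openpyxl", "psutil"):
    try:
        module = importlib.import_module(module_name)
        print(f"{module_name} {getattr(module, '__version__', '(version unknown)')} OK")
    except ModuleNotFoundError:
        # Same failure mode as the openpyxl error hit during the first runs.
        print(f"{module_name} missing - install it with: pip install {module_name}")
```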
- Objective: Address task requirements to remove returns (negative quantities) and erroneous unit prices (zero or negative).
- Script Development:
  - Initial Code: Wrote `clean.py` to load and process the data.
    - Used `pandas.read_excel` with `engine='openpyxl'` to handle the Excel format.
    - Implemented `psutil.virtual_memory()` to monitor system resources (initial threshold 2GB).
  - First Run Issues:
    - Executed `python clean.py` on April 17, 2025, 23:01, logging a warning: "Low memory available (<2GB). Consider increasing RAM or processing in chunks."
    - Encountered an error, "Insufficient system resources; aborting", because only 1.25GB of memory was available.
    - Adjusted the threshold to 1GB and removed the abort logic, enabling chunked processing.
  - Code Refinement (an illustrative sketch of this logic appears after this section):
    - Added a `load_and_validate_data` function to read the file, checking for the required columns (`Quantity`, `UnitPrice`).
    - Implemented `clean_data` to filter for `Quantity >= 1` and `UnitPrice > 0`, dropping NA values.
    - Introduced `deduplicate=True` to remove 5,268 duplicate rows, reducing the dataset to ~530,104 rows initially.
    - Capped `UnitPrice` at the 99th percentile (~16.98) and `Revenue` at ~179.00 using `quantile` and `clip`.
    - Optimized data types (`int32` for `Quantity`, `float32` for `UnitPrice` and `Revenue`) in `save_cleaned_data`.
  - Second Run Success:
    - Reran on April 17, 2025, 23:05, with 1.06GB of memory available, completing in ~41 seconds.
    - Logged the shape (530,104, 9) and saved the output as `Online_Retail_Data_Set_Cleaned.csv.gz` with a backup.
  - Final Adjustment:
    - Reran with deduplication, reducing the dataset to 525,836 rows, saved as `Online_Retail_Data_Set_Cleaned.csv`.
    - Validated the result with `describe()` output, confirming clean data.
- Challenges:
  - An initial `chunksize` error in `read_excel` (an unsupported parameter) was fixed by loading the full dataset with an optimized `dtype`.
  - Memory constraints required iterative optimization, drawing on past Python troubleshooting (e.g., indentation fixes).
- Output: A compressed, optimized CSV ready for visualization.
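For reference, the sketch below reconstructs the core logic of `clean.py` as described above. It is illustrative rather than verbatim: the function names (`load_and_validate_data`, `clean_data`, `save_cleaned_data`) and the `deduplicate` flag follow the notes, while details such as deriving a `Revenue` column (Quantity × UnitPrice) and the exact NA-handling scope are assumptions.

```python
# Illustrative reconstruction of clean.py's core steps (not the verbatim script).
import pandas as pd
import psutil

MEMORY_THRESHOLD_GB = 1.0          # lowered from the original 2 GB threshold
REQUIRED_COLUMNS = {"Quantity", "UnitPrice"}

def check_memory() -> None:
    """Warn (rather than abort) when available memory is low."""
    available_gb = psutil.virtual_memory().available / 1024 ** 3
    if available_gb < MEMORY_THRESHOLD_GB:
        print(f"Warning: low memory ({available_gb:.2f} GB available); "
              "consider chunked processing.")

def load_and_validate_data(path: str) -> pd.DataFrame:
    """Read the Excel file and confirm the required columns exist."""
    df = pd.read_excel(path, engine="openpyxl")
    missing = REQUIRED_COLUMNS - set(df.columns)
    if missing:
        raise ValueError(f"Missing required columns: {missing}")
    return df

def clean_data(df: pd.DataFrame, deduplicate: bool = True) -> pd.DataFrame:
    """Drop returns, invalid prices, NA values, and (optionally) duplicate rows."""
    df = df.dropna(subset=["Quantity", "UnitPrice"])        # assumed NA-handling scope
    df = df[(df["Quantity"] >= 1) & (df["UnitPrice"] > 0)].copy()
    if deduplicate:
        df = df.drop_duplicates()
    # Derive Revenue, then cap outliers at the 99th percentile with quantile/clip.
    df["Revenue"] = df["Quantity"] * df["UnitPrice"]
    df["UnitPrice"] = df["UnitPrice"].clip(upper=df["UnitPrice"].quantile(0.99))
    df["Revenue"] = df["Revenue"].clip(upper=df["Revenue"].quantile(0.99))
    return df

def save_cleaned_data(df: pd.DataFrame, path: str) -> pd.DataFrame:
    """Downcast numeric columns and write the CSV (gzip inferred from a .gz suffix)."""
    df = df.astype({"Quantity": "int32", "UnitPrice": "float32", "Revenue": "float32"})
    df.to_csv(path, index=False)
    return df

if __name__ == "__main__":
    check_memory()
    raw = load_and_validate_data("Online_Retail_Data_Set.xlsx")
    cleaned = clean_data(raw, deduplicate=True)
    cleaned = save_cleaned_data(cleaned, "Online_Retail_Data_Set_Cleaned.csv.gz")
    print(cleaned.shape)
    print(cleaned.describe())
```

Filtering out returns and zero-priced rows before deriving `Revenue` keeps the percentile caps from being skewed by invalid records.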
- Tool Selection: Chose Tableau Public for its free access and interactive features, aligning with my preference for clear visuals (e.g., fixing label overlaps from past diagram projects).
- Setup:
  - Installation: Downloaded Tableau Public (latest version) from the task resource link and installed it on Windows.
  - Data Connection: Connected `Online_Retail_Data_Set_Cleaned.csv` in Tableau Public, verifying ~525,836 rows.
  - Data Type Check: Ensured `InvoiceDate` was typed as Date and the numeric fields as appropriate types via right-click > Change Data Type.
- Visualization Creation:
  - Q1: Monthly Revenue Trends in 2011 (Line Chart)
    - Dragged `InvoiceDate` to Columns (Month granularity), filtered to 2011.
    - Added `Revenue` to Rows and selected the Line chart type.
    - Enhanced with a blue line (#4C78A8), a trend line, tooltips ("Month: <Month(InvoiceDate)>, Revenue: <SUM(Revenue)> $"), and a filter.
    - Title: "Monthly Revenue Trends in 2011 (Seasonal Forecast)", formatted with a 14pt font and gridlines.
  - Q2: Top 10 Expansion Markets (Side-by-Side Bar Chart)
    - Filtered `Country` to exclude "United Kingdom" and applied a Top 10 filter by Revenue.
    - Used `Measure Names` and `Measure Values` for side-by-side bars, colored green (#80C040) for Revenue and light green (#C0E0A0) for Quantity.
    - Added labels, tooltips ("Country: <Country>, Revenue: <SUM(Revenue)> $, Quantity: <SUM(Quantity)>"), and highlighting.
    - Title: "Top 10 Expansion Markets by Revenue & Quantity (Excl. UK)".
  - Q3: Top 10 High-Value Customers (Column Chart)
    - Filtered `CustomerID` for non-null values, applied a Top 10 filter by Revenue, sorted descending.
    - Used a red-to-orange gradient (#FF6F61 to #FFB300), added labels and tooltips ("CustomerID: <CustomerID>, Revenue: <SUM(Revenue)> $").
    - Title: "Top 10 High-Value Customers by Revenue".
  - Q4: Global Demand Opportunities (Filled Map)
    - Set `Country` to Detail and `Quantity` to Color, filtered out "United Kingdom".
    - Applied a yellow-to-red gradient (#FFC107 to #D32F2F), fit to Entire View, and added tooltips ("Country: <Country>, Demand: <SUM(Quantity)>").
    - Title: "Global Demand Opportunities (Excl. UK)".
- Export Challenge:
  - Noticed that Worksheet > Export > Image was unavailable. Used a workaround: right-click > Copy > Image, pasted into Paint, and saved as PNGs (e.g., `Q1_Revenue_2011.png`).
- Publishing:
  - Uploaded each visualization to Tableau Public (`profile/aryan.7571`), ensuring interactivity and public access.
- Insights (cross-checked in the pandas sketch after this list):
  - Q1: December peak (30% of revenue) indicates holiday impact and a forecasting opportunity.
  - Q2: Netherlands ($1.2M revenue, 50K units) suggests high-value market potential.
  - Q3: Top customer (e.g., 12346, $250K) highlights retention focus.
  - Q4: Germany (150K units) and France (120K units) signal expansion targets.
- Storytelling: Used color psychology (blue for trust, green for growth, warm tones for engagement) and annotations to connect data to strategy, reflecting past UX learnings (e.g., Kanban dashboards).
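The figures in the Insights list above come from the Tableau views; a quick pandas cross-check along the following lines can reproduce the same aggregates from the cleaned CSV (column names follow the dataset, and exact values depend on the cleaning run):

```python
# Cross-check the Tableau insights directly against the cleaned CSV.
import pandas as pd

df = pd.read_csv("Online_Retail_Data_Set_Cleaned.csv", parse_dates=["InvoiceDate"])

# Q1: monthly revenue in 2011 and December's share of the year.
df_2011 = df[df["InvoiceDate"].dt.year == 2011]
monthly = df_2011.groupby(df_2011["InvoiceDate"].dt.month)["Revenue"].sum()
print(monthly)
print(f"December share: {monthly.get(12, 0) / monthly.sum():.1%}")

# Q2: top 10 markets outside the UK by revenue and quantity.
non_uk = df[df["Country"] != "United Kingdom"]
print(non_uk.groupby("Country")[["Revenue", "Quantity"]].sum().nlargest(10, "Revenue"))

# Q3: top 10 customers by revenue.
print(df.dropna(subset=["CustomerID"]).groupby("CustomerID")["Revenue"].sum().nlargest(10))

# Q4: unit demand by country, excluding the UK.
print(non_uk.groupby("Country")["Quantity"].sum().sort_values(ascending=False).head(10))
```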
- README Creation: Wrote this detailed Markdown file to document every step.
- LinkedIn Preparation: Crafted a post with carousel images and Tableau links for engagement.
- GitHub Setup: Planned a repo to host code, README, and visuals.
- Memory Constraints: System with 1.06GB available required script optimization (lowered threshold to 0.5GB, chunked processing).
- Export Issue: Tableau Public’s missing export option was resolved with manual copy-paste, ensuring all visuals were captured.
- Learning Curve: As a coding beginner (e.g., past indentation errors), I relied on iterative guidance, documented here for growth.
- Automation: Develop a Python script with `tableau-api-lib` to automate image export.
- Dashboard Integration: Create a unified Tableau dashboard linking all visuals.
- Tool Comparison: Explore Power BI for alternative insights.
- Advanced Analytics: Add predictive modeling (e.g., ARIMA) to Q1 for enhanced forecasting (a rough sketch follows this list).
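For the forecasting idea in the last item, a future iteration might look something like the following statsmodels sketch; the library choice, the monthly resampling, and the ARIMA order are placeholders rather than part of the delivered Task 3 work:

```python
# Future-work sketch: forecast monthly revenue with a simple ARIMA model.
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

df = pd.read_csv("Online_Retail_Data_Set_Cleaned.csv", parse_dates=["InvoiceDate"])
monthly_revenue = df.set_index("InvoiceDate")["Revenue"].resample("MS").sum()

# The (1, 1, 1) order is a placeholder; a real model would be tuned via ACF/PACF or AIC.
model = ARIMA(monthly_revenue, order=(1, 1, 1)).fit()
print(model.forecast(steps=3))  # projected revenue for the next three months
```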
- Data Cleaning: Mastered Python-based data preprocessing with Pandas.
- Visualization: Gained proficiency in Tableau Public, applying UX principles.
- Storytelling: Enhanced ability to narrate data insights, aligning with business strategy.
- Problem-Solving: Overcame memory and export challenges with creative workarounds.
This project encapsulates my journey from raw data to strategic visualization, reflecting skills tailored for roles like Technical Specialist at Barclays. The visualizations are not mere charts but a narrative tool, blending technical expertise with human-centered design. Explore the interactive versions below to see the full impact!
- Q1: Monthly Revenue Trends in 2011
- Q2: Top 10 Expansion Markets by Revenue & Quantity
- Q3: Top 10 High-Value Customers by Revenue
- Q4: Global Demand Opportunities Map (Excl. UK)
- Clone: `git clone <repo-url>` to access the files.
- Explore: View `clean.py` for the cleaning script and the PNGs for the visuals.
- Contribute: Suggest improvements or fork the repo!
Gratitude to the xAI community, task resource providers, and my peers for support. Special thanks to the iterative guidance that shaped this project.