Fabry200/Retail_sales_analysis

E-commerce Customer Analytics & RFM Segmentation

Overview

This project focuses on analyzing customer purchasing behavior from an e-commerce dataset to derive actionable insights, perform customer segmentation, and understand revenue distribution. The analysis aims to identify high-value customers, analyze sales trends, and provide a foundation for targeted marketing strategies.

Table of Contents

  1. Project Goal
  2. Dataset
  3. Methodology
  4. Key Insights & Results
  5. Setup and Usage
  6. Technologies Used

Project Goal

The primary goals of this project are:

  • To understand overall sales performance and customer purchasing patterns.
  • To identify the most valuable customers through RFM (Recency, Frequency, Monetary) analysis.
  • To segment customers into distinct groups using K-Means clustering, enabling personalized strategies.
  • To analyze revenue contributions by country and over time.

Dataset

The analysis uses the "Online Retail" dataset, which contains transactional data from a UK-based online retail store.

  • Source: OnlineRetail.csv
  • Key Features:
    • InvoiceNo: Invoice number. Unique for each transaction.
    • StockCode: Product (item) code.
    • Description: Product description.
    • Quantity: Quantity of each product per transaction.
    • InvoiceDate: Invoice date and time.
    • UnitPrice: Unit price.
    • CustomerID: Customer number. Unique for each customer.
    • Country: Country where the customer resides.

Methodology

The project follows a standard data science pipeline:

Data Preprocessing

  • Loading Data: Data is loaded from OnlineRetail.csv using pandas, with encoding='ISO-8859-1' so that non-ASCII characters are decoded correctly.
  • Handling Missing Values: Rows with a missing CustomerID are removed, because CustomerID is required for customer-level analysis.
  • Data Cleaning:
    • Transactions with Quantity less than or equal to 0 (returns or cancelled orders) are removed to focus on valid purchases.
    • Country names are mapped to numerical indices for consistent processing.
    • The StockCode and Description columns are not used in the analysis and are meant to be dropped; the line that drops them may need to be called explicitly.
  • Feature Engineering: Revenue (Quantity * UnitPrice) is calculated for each transaction.
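
The preprocessing steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function names load_transactions and clean_transactions, the CountryIndex column, and the tiny sample DataFrame are assumptions made for the example. Storing the country index in a separate column (rather than overwriting Country) is also an assumption.

```python
import pandas as pd


def load_transactions(path="OnlineRetail.csv"):
    # ISO-8859-1 is the encoding named in the text; the date format of the
    # CSV may need explicit handling depending on the copy of the dataset.
    return pd.read_csv(path, encoding="ISO-8859-1", parse_dates=["InvoiceDate"])


def clean_transactions(df):
    # Rows without a CustomerID cannot be attributed to a customer.
    df = df.dropna(subset=["CustomerID"])
    # Quantity <= 0 marks returns or cancelled orders; keep valid purchases.
    df = df[df["Quantity"] > 0].copy()
    # Map each country name to an integer index (order of first appearance).
    df["CountryIndex"] = pd.factorize(df["Country"])[0]
    # Drop the text columns that later steps do not use.
    df = df.drop(columns=["StockCode", "Description"], errors="ignore")
    # Revenue per transaction line.
    df["Revenue"] = df["Quantity"] * df["UnitPrice"]
    return df


# Made-up sample standing in for OnlineRetail.csv (hypothetical values).
sample = pd.DataFrame({
    "InvoiceNo": ["1", "2", "3", "4"],
    "StockCode": ["A", "B", "C", "D"],
    "Description": ["a", "b", "c", "d"],
    "Quantity": [2, -1, 5, 3],
    "InvoiceDate": pd.to_datetime(
        ["2011-01-05", "2011-01-06", "2011-02-10", "2011-03-01"]
    ),
    "UnitPrice": [1.5, 2.0, 4.0, 1.0],
    "CustomerID": [10.0, 10.0, None, 11.0],
    "Country": ["United Kingdom", "United Kingdom", "France", "Netherlands"],
})
cleaned = clean_transactions(sample)
print(cleaned[["CustomerID", "Country", "CountryIndex", "Revenue"]])
```

On the real data, the same function would be applied to the result of load_transactions().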

Exploratory Data Analysis (EDA)

  • Overall Metrics: The script computes the average number of orders, the average number of objects bought, and the average revenue per customer, along with the total gross revenue.
    • Output example: "The average person does X orders, buys Y objects, and generates a revenue of: Z."
    • Output example: "For a gross total revenue of: $A."
  • Revenue by Country: The project calculates the total revenue, percentage of total revenue, average revenue per transaction, and total population (number of transactions) for each country. Results are presented in a sorted DataFrame.
  • Monthly Revenue Trend: Total revenue per period is visualized as a bar chart. Periods are built by stepping through the date range with delta = timedelta(days=32) and selecting each window with a date mask, so the bars approximate calendar months rather than matching them exactly. The mask is intended to avoid counting the same transactions in more than one period.
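
As a rough sketch of the EDA calculations described above, the snippet below computes per-customer averages, a revenue-by-country table, and revenue per 32-day window. The sample data, the column names in the result tables, and the exact mask (half-open windows starting at the earliest invoice date) are assumptions for illustration; the project's own mask logic may differ.

```python
from datetime import timedelta

import pandas as pd

# Made-up cleaned transactions (hypothetical values).
tx = pd.DataFrame({
    "CustomerID": [10, 10, 11, 12],
    "InvoiceNo": ["1", "2", "3", "4"],
    "Country": ["United Kingdom", "United Kingdom", "Netherlands", "United Kingdom"],
    "Quantity": [2, 1, 10, 4],
    "Revenue": [20.0, 10.0, 30.0, 20.0],
    "InvoiceDate": pd.to_datetime(
        ["2011-01-05", "2011-01-20", "2011-02-10", "2011-03-15"]
    ),
})

# Overall metrics: aggregate per customer, then average across customers.
per_customer = tx.groupby("CustomerID").agg(
    orders=("InvoiceNo", "count"),
    objects=("Quantity", "sum"),
    revenue=("Revenue", "sum"),
)
avg = per_customer.mean()
total_revenue = tx["Revenue"].sum()

# Revenue by country, sorted by total revenue.
by_country = tx.groupby("Country").agg(
    total_revenue=("Revenue", "sum"),
    avg_revenue=("Revenue", "mean"),
    population=("Revenue", "size"),  # number of transactions
)
by_country["pct_of_total"] = 100 * by_country["total_revenue"] / total_revenue
by_country = by_country.sort_values("total_revenue", ascending=False)

# Revenue per 32-day window; half-open windows so no row is counted twice.
delta = timedelta(days=32)
start, end = tx["InvoiceDate"].min(), tx["InvoiceDate"].max()
window_revenue = {}
while start <= end:
    mask = (tx["InvoiceDate"] >= start) & (tx["InvoiceDate"] < start + delta)
    window_revenue[start.date()] = tx.loc[mask, "Revenue"].sum()
    start += delta

print(avg, by_country, window_revenue, sep="\n\n")
```

The window_revenue dictionary can then be passed to a Matplotlib bar chart, with the window start dates as labels.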

Customer Segmentation (RFM Analysis & K-Means Clustering)

  • RFM Calculation:
    • Recency: Days since the customer's last purchase, calculated relative to a current_date of 2011-12-09.
    • Frequency: The total number of transactions for each customer.
    • Monetary: The total revenue generated by each customer across all their transactions.
  • Outlier Handling: Outliers in Recency, Frequency, and Monetary values are filtered using quantile-based thresholds (specifically, excluding values outside the 3rd and 97th percentiles) to ensure robust clustering.
  • Data Visualization (3D Scatter Plot): An initial 3D scatter plot visualizes the distribution of customers based on their raw RFM features before clustering.
  • Data Scaling: RFM features are scaled using StandardScaler to normalize their ranges, ensuring that no single feature dominates the distance calculations in K-Means clustering.
  • Optimal K Determination: The Silhouette Score is used to evaluate the cohesion and separation of clusters for different numbers of clusters (k), ranging from 2 to 19. A plot visualizes these scores to help identify an optimal k value.
  • K-Means Clustering: K-Means clustering is applied using the chosen number of clusters (in this code, n_clusters=3). Each customer is assigned to a specific cluster.
  • Cluster Visualization: A 3D scatter plot visualizes the clustered customers, with each cluster represented by a different color, providing a clear visual separation of segments.
  • Cluster Analysis: The total revenue and total frequency are calculated for each identified cluster. These metrics provide insights into the value and purchasing habits of each customer segment. The percentage contribution of each cluster to the total revenue is also calculated, offering a direct comparison of their impact.
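
The segmentation steps above could look roughly like the sketch below. It runs on synthetic transactions generated with NumPy, so the numbers it prints are not the project's results. The column names, random_state, and n_init values are assumptions for the example; the reference date, the 3rd/97th percentile filter, the k range of 2 to 19, and n_clusters=3 follow the description above. The 3D scatter plots are omitted.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic transactions standing in for the cleaned dataset (hypothetical).
rng = np.random.default_rng(0)
n = 3000
tx = pd.DataFrame({
    "CustomerID": rng.integers(1, 201, size=n),
    "InvoiceDate": pd.Timestamp("2010-12-01")
    + pd.to_timedelta(rng.integers(0, 373, size=n), unit="D"),
    "Revenue": rng.gamma(2.0, 10.0, size=n).round(2),
})

# RFM table, with Recency measured against the fixed reference date.
current_date = pd.Timestamp("2011-12-09")
rfm = tx.groupby("CustomerID").agg(
    Recency=("InvoiceDate", lambda d: (current_date - d.max()).days),
    Frequency=("InvoiceDate", "count"),
    Monetary=("Revenue", "sum"),
)

# Keep customers whose R, F and M all lie within the 3rd-97th percentiles.
low, high = rfm.quantile(0.03), rfm.quantile(0.97)
keep = (rfm.ge(low, axis="columns") & rfm.le(high, axis="columns")).all(axis=1)
rfm = rfm[keep].copy()

# Standardize so no single feature dominates the Euclidean distances.
X = StandardScaler().fit_transform(rfm)

# Silhouette score for k = 2..19.
scores = {}
for k in range(2, 20):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# Final clustering with the k used in the project.
rfm["Cluster"] = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Total revenue, total frequency and revenue share per cluster.
summary = rfm.groupby("Cluster")[["Monetary", "Frequency"]].sum()
summary["RevenueShare"] = 100 * summary["Monetary"] / summary["Monetary"].sum()
print(summary)
```

Note that RevenueShare here is computed against the revenue of the customers that survived the outlier filter, not against the gross total.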

Key Insights & Results

  • Overall Performance:
    • The average customer places approximately 92 orders, purchases around 1,194 objects, and generates an average revenue of about $2,054.
    • The gross total revenue for the analyzed period is approximately $8,911,407.9, with outliers included.
  • Country-wise Contribution:
    • The United Kingdom and the Netherlands stand out as significant revenue contributors.
    • The United Kingdom accounts for the largest share of revenue (82%), with a significant average spend per customer.
  • Monthly Sales Trends:
    • The monthly revenue bar chart shows how sales are distributed across the year.
    • Revenue is fairly consistent overall, with peaks and dips at specific times of the year that point to seasonal demand or growth.
    • Peak revenue months are September, October, and November; the lowest-revenue months are February, April, and December.
  • Customer Segments (K=3 Clusters):
    • Based on RFM metrics, customers are segmented into 3 distinct groups. The revenue shares below are relative to the combined revenue of the three clusters (about $4,365,736, after outlier filtering), not to the gross total:
      • Cluster 0: This segment generated approximately $2,163,807 across 121,107 transactions, representing 49.56% of the clustered revenue.
      • Cluster 1: This segment generated approximately $394,207 across 24,356 transactions, representing 9.03% of the clustered revenue.
      • Cluster 2: This segment generated approximately $1,807,722 across 109,794 transactions, representing 41.41% of the clustered revenue.
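
The cluster percentages can be checked with a few lines of arithmetic. The snippet below uses only the revenue figures reported above and shows that the shares are computed against the sum of the three clusters rather than the gross total.

```python
# Revenue per cluster, as reported in the results above.
cluster_revenue = {0: 2163807, 1: 394207, 2: 1807722}

# Combined revenue of the three clusters.
total = sum(cluster_revenue.values())

# Each cluster's share of that combined revenue, in percent.
shares = {c: round(100 * r / total, 2) for c, r in cluster_revenue.items()}

print(total)   # 4365736
print(shares)  # {0: 49.56, 1: 9.03, 2: 41.41}
```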

Setup and Usage

To run this project:

  1. Clone the repository:
    git clone [YOUR_REPOSITORY_URL]
    cd [YOUR_PROJECT_NAME_FOLDER]
  2. Ensure you have the dataset: Place OnlineRetail.csv in the root directory of the project.
  3. Install dependencies:
    pip install numpy pandas matplotlib scikit-learn
  4. Run the script: You can execute the Python code cells sequentially in an environment like VS Code with the Python extension or a Jupyter Notebook, or save the entire code as a .py file and run it from your terminal:
    python your_script_name.py

Technologies Used

  • Python 3.x
  • NumPy
  • Pandas
  • Matplotlib
  • Scikit-learn

About

Data science project about online business sales
