This project focuses on analyzing customer purchasing behavior from an e-commerce dataset to derive actionable insights, perform customer segmentation, and understand revenue distribution. The analysis aims to identify high-value customers, analyze sales trends, and provide a foundation for targeted marketing strategies.
The primary goals of this project are:
- To understand overall sales performance and customer purchasing patterns.
- To identify the most valuable customers through RFM (Recency, Frequency, Monetary) analysis.
- To segment customers into distinct groups using K-Means clustering, enabling personalized strategies.
- To analyze revenue contributions by country and over time.
The analysis uses the "Online Retail" dataset, which contains transactional data from a UK-based online retail store.
- Source: `OnlineRetail.csv`
- Key Features:
  - `InvoiceNo`: Invoice number. Unique for each transaction.
  - `StockCode`: Product (item) code.
  - `Description`: Product description.
  - `Quantity`: The quantities of each product per transaction.
  - `InvoiceDate`: Invoice date and time.
  - `UnitPrice`: Unit price.
  - `CustomerID`: Customer number. Unique for each customer.
  - `Country`: Country where the customer resides.
The project follows a standard data science pipeline:
- Loading Data: Data is loaded from `OnlineRetail.csv` using pandas. The `encoding='ISO-8859-1'` option is used for correct character interpretation.
- Handling Missing Values: Rows with a missing `CustomerID` are removed, because the customer identifier is essential for customer-centric analysis.
- Data Cleaning:
  - Transactions with `Quantity` less than or equal to 0 (returns or cancelled orders) are removed to focus on valid purchases.
  - `Country` names are mapped to numerical indices for consistent processing.
  - The `StockCode` and `Description` columns are prepared for removal, though the specific line that drops them might need to be called explicitly.
- Feature Engineering:
  - `Revenue` (`Quantity * UnitPrice`) is calculated for each transaction.
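The preprocessing steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function name `clean_transactions`, the `CountryIdx` column, and the tiny sample frame are assumptions made for the example, and the country index order (first appearance) is likewise assumed.

```python
import pandas as pd


def clean_transactions(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the cleaning steps described above (hypothetical helper)."""
    # Drop rows that have no customer identifier.
    df = df.dropna(subset=["CustomerID"]).copy()
    # Keep only valid purchases; returns and cancellations have Quantity <= 0.
    df = df[df["Quantity"] > 0].copy()
    # Map each country name to an integer index.
    # Assumption: indices follow order of first appearance, stored in a new column.
    codes, _names = pd.factorize(df["Country"])
    df["CountryIdx"] = codes
    # Drop product-level columns that the customer-level analysis does not use.
    df = df.drop(columns=["StockCode", "Description"], errors="ignore")
    # Revenue of each transaction line.
    df["Revenue"] = df["Quantity"] * df["UnitPrice"]
    return df


# With the real dataset, loading would look like:
# raw = pd.read_csv("OnlineRetail.csv", encoding="ISO-8859-1")
# Made-up sample rows so the sketch runs on its own:
raw = pd.DataFrame({
    "InvoiceNo": ["1001", "1002", "C1003", "1004"],
    "StockCode": ["A1", "B2", "C3", "D4"],
    "Description": ["item a", "item b", "item c", "item d"],
    "Quantity": [6, 2, -1, 3],
    "InvoiceDate": ["2011-01-05", "2011-01-06", "2011-01-07", "2011-01-08"],
    "UnitPrice": [2.5, 3.0, 2.0, 7.5],
    "CustomerID": [17850.0, None, 17850.0, 13047.0],
    "Country": ["United Kingdom", "United Kingdom", "United Kingdom", "France"],
})

clean = clean_transactions(raw)
print(clean[["CustomerID", "Quantity", "CountryIdx", "Revenue"]])
```

In this sample, the row without a `CustomerID` and the row with a negative `Quantity` are both dropped, leaving two valid purchases.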
- Overall Metrics: Calculations are performed to derive the average number of orders, average number of objects bought, and average revenue per customer, along with the total gross revenue.
  - Output example: "The average person does X orders, buys Y objects, and generates a revenue of: Z."
  - Output example: "For a gross total revenue of: $A."
- Revenue by Country: The project calculates the total revenue, percentage of total revenue, average revenue per transaction, and total population (number of transactions) for each country. Results are presented in a sorted DataFrame.
- Monthly Revenue Trend: The total revenue generated in each month is visualized using a bar chart. A window of `delta = timedelta(days=32)` combined with a date mask segments revenue approximately by calendar month, aiming to reduce overlap between periods.
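As an illustration of the country and monthly breakdowns, the sketch below aggregates a small made-up table. Note that it groups by exact calendar month with `dt.to_period("M")`, which is a different technique from the `timedelta(days=32)` window and mask described above; the column names `TotalRevenue`, `AvgRevenue`, `Transactions`, and `PctOfTotal` are assumptions chosen for the example.

```python
import pandas as pd

# Made-up transactions standing in for the cleaned dataset.
tx = pd.DataFrame({
    "InvoiceDate": pd.to_datetime(
        ["2011-01-05", "2011-01-20", "2011-02-03", "2011-02-15"]
    ),
    "Country": ["United Kingdom", "France", "United Kingdom", "United Kingdom"],
    "Revenue": [100.0, 50.0, 30.0, 20.0],
})

# Revenue by country: total, mean per transaction, and transaction count,
# sorted so the largest contributor comes first.
by_country = (
    tx.groupby("Country")["Revenue"]
    .agg(TotalRevenue="sum", AvgRevenue="mean", Transactions="count")
    .sort_values("TotalRevenue", ascending=False)
)
by_country["PctOfTotal"] = (
    100 * by_country["TotalRevenue"] / by_country["TotalRevenue"].sum()
)

# Monthly revenue, grouped by calendar month.
monthly = tx.groupby(tx["InvoiceDate"].dt.to_period("M"))["Revenue"].sum()

print(by_country)
print(monthly)
# A bar chart could then be drawn with, for example, monthly.plot(kind="bar").
```

Grouping by period guarantees that each transaction falls into exactly one month, which may be simpler to reason about than a fixed-length window.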
- RFM Calculation:
  - Recency: Days since the customer's last purchase, calculated relative to a `current_date` of `2011-12-09`.
  - Frequency: The total number of transactions for each customer.
  - Monetary: The total revenue generated by each customer across all their transactions.
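The RFM table described above could be built along these lines. This is a sketch on made-up rows; it assumes that each row counts as one transaction for Frequency, which may differ from how the original script counts.

```python
import pandas as pd

# Made-up transactions: three customers with different purchase histories.
tx = pd.DataFrame({
    "CustomerID": [1, 1, 2, 3, 3, 3],
    "InvoiceNo": ["A1", "A2", "B1", "C1", "C2", "C3"],
    "InvoiceDate": pd.to_datetime([
        "2011-11-01", "2011-12-01", "2011-09-10",
        "2011-10-01", "2011-11-15", "2011-12-09",
    ]),
    "Revenue": [20.0, 30.0, 100.0, 5.0, 10.0, 15.0],
})

# Reference date used for Recency, as stated in the text.
current_date = pd.Timestamp("2011-12-09")

rfm = tx.groupby("CustomerID").agg(
    # Days between the reference date and the customer's latest purchase.
    Recency=("InvoiceDate", lambda d: (current_date - d.max()).days),
    # Assumption: every row is one transaction.
    Frequency=("InvoiceNo", "count"),
    # Total revenue per customer.
    Monetary=("Revenue", "sum"),
)

print(rfm)
```

Customer 3 bought on the reference date itself, so their Recency is 0; lower Recency means a more recent purchase.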
- Outlier Handling: Outliers in Recency, Frequency, and Monetary values are filtered using quantile-based thresholds (specifically, excluding values outside the 3rd and 97th percentiles) to ensure robust clustering.
- Data Visualization (3D Scatter Plot): An initial 3D scatter plot visualizes the distribution of customers based on their raw RFM features before clustering.
- Data Scaling: RFM features are scaled using `StandardScaler` to normalize their ranges, ensuring that no single feature dominates the distance calculations in K-Means clustering.
- Optimal K Determination: The Silhouette Score is used to evaluate the cohesion and separation of clusters for different numbers of clusters (`k`), ranging from 2 to 19. A plot visualizes these scores to help identify an optimal `k` value.
- K-Means Clustering: K-Means clustering is applied using the chosen number of clusters (in this code, `n_clusters=3`). Each customer is assigned to a specific cluster.
- Cluster Visualization: A 3D scatter plot visualizes the clustered customers, with each cluster represented by a different color, providing a clear visual separation of segments.
- Cluster Analysis: The total revenue and total frequency are calculated for each identified cluster. These metrics provide insights into the value and purchasing habits of each customer segment. The percentage contribution of each cluster to the total revenue is also calculated, offering a direct comparison of their impact.
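The clustering steps above (outlier filtering, scaling, silhouette scan, K-Means, and per-cluster totals) might look roughly like the sketch below. It runs on synthetic RFM values, so the scores and clusters it produces say nothing about the real dataset. The inclusive percentile bounds, `n_init=10`, and `random_state=0` are assumptions for reproducibility, not settings taken from the project.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic RFM table standing in for the real one (values are made up).
rng = np.random.default_rng(0)
rfm = pd.DataFrame({
    "Recency": rng.integers(0, 365, size=300),
    "Frequency": rng.integers(1, 200, size=300),
    "Monetary": rng.gamma(2.0, 500.0, size=300),
})

# Keep customers whose three features all lie within the 3rd-97th percentiles.
low, high = rfm.quantile(0.03), rfm.quantile(0.97)
mask = ((rfm >= low) & (rfm <= high)).all(axis=1)
rfm_f = rfm[mask].copy()

# Standardize so that no feature dominates the Euclidean distances.
X = StandardScaler().fit_transform(rfm_f)

# Silhouette score for k = 2..19; higher means better-separated clusters.
scores = {}
for k in range(2, 20):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# Final model with the number of clusters used in the project.
rfm_f["Cluster"] = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Total revenue and frequency per cluster, plus each cluster's revenue share.
summary = rfm_f.groupby("Cluster")[["Monetary", "Frequency"]].sum()
summary["RevenuePct"] = 100 * summary["Monetary"] / summary["Monetary"].sum()

print(max(scores, key=scores.get))
print(summary)
```

Because the revenue shares are computed after filtering, they are relative to the revenue of the retained customers rather than the whole dataset.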
- Overall Performance:
  - The average customer places approximately `92` orders, purchases around `1194` objects, and generates an average revenue of `$2054`.
  - The gross total revenue for the analyzed period is approximately `$8911407.9`, with outliers included.
- Country-wise Contribution:
  - The analysis identifies countries like `United Kingdom` and `Netherlands` as significant revenue contributors, highlighting their importance to the business.
  - The United Kingdom accounts for the largest share of revenue (82%), with a significant average spend per customer.
- Monthly Sales Trends:
  - The bar chart of monthly revenue shows sales patterns throughout the year.
  - Revenue is fairly consistent overall, with peaks and dips at specific times of the year that point to seasonal demand or growth.
  - Peak revenue months are September, October, and November; lower revenue months are February, April, and December.
- Customer Segments (K=3 Clusters):
  - Based on RFM metrics, customers are segmented into 3 distinct groups, each with unique characteristics and revenue contributions:
    - Cluster 0: This segment generated approximately `$2163807` with `121107` transactions, representing `49.56%` of the total revenue across the three clusters.
    - Cluster 1: This segment generated approximately `$394207` with `24356` transactions, representing `9.03%` of the total revenue across the three clusters.
    - Cluster 2: This segment generated approximately `$1807722` with `109794` transactions, representing `41.41%` of the total revenue across the three clusters.
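The reported cluster shares can be checked with a few lines of arithmetic. The sketch below uses only the cluster revenues listed above; the variable names are illustrative.

```python
# Revenue per cluster, as reported above.
cluster_revenue = {0: 2163807, 1: 394207, 2: 1807722}

# Combined revenue of the three clusters.
total = sum(cluster_revenue.values())

# Each cluster's share of that combined revenue, in percent.
shares = {c: round(100 * r / total, 2) for c, r in cluster_revenue.items()}

print(total)   # 4365736
print(shares)  # {0: 49.56, 1: 9.03, 2: 41.41}
```

The combined cluster revenue (4365736) is well below the gross total reported earlier, which suggests the percentages are relative to the customers that remain after outlier filtering.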
To run this project:
- Clone the repository: `git clone [YOUR_REPOSITORY_URL]`, then `cd [YOUR_PROJECT_NAME_FOLDER]`.
- Ensure you have the dataset: Place `OnlineRetail.csv` in the root directory of the project.
- Install dependencies: `pip install numpy pandas matplotlib scikit-learn`
- Run the script: You can execute the Python code cells sequentially in an environment like VS Code with the Python extension or a Jupyter Notebook, or save the entire code as a `.py` file and run it from your terminal: `python your_script_name.py`
- Python 3.x
- NumPy
- Pandas
- Matplotlib
- Scikit-learn