Problem Statement • Dataset • EDA • Data Cleaning & Transformation • Data Visualisation • Executive Summary • Recommendations • Extensions
The goal of this project is to categorise customers of an eCommerce website into distinct segments using RFM (Recency, Frequency, Monetary) analysis. By identifying these segments, targeted marketing strategies can be created to enhance customer engagement and ultimately improve sales performance.
RFM analysis is a marketing technique used to identify and segment customers based on 3 key metrics:
- Recency (R): Measures how recently a customer made their last purchase
- Frequency (F): Tracks how often a customer makes purchases within a specific timeframe
- Monetary (M): Assesses the total amount of money a customer spends
By analysing these three metrics, businesses can classify customers into distinct segments (like "Champions," "At Risk," and "New Customers") and tailor marketing strategies accordingly. This enables the business to improve customer retention and drive sales growth.
Example image of RFM Segments dashboard by John Abbasi
The eCommerce purchase history from electronics store dataset from Kaggle is used for this project. The dataset was transformed into a SQLite database with the following fields:
event_time: The date & time the event occurred in yyyy-MM-dd HH:mm:ss UTC formatorder_id: Order IDproduct_id: Product IDcategory_id: Product category IDcategory_code: Name for given category (if present)brand: Brand nameprice: Product priceuser_id: User ID associated with event
- The dataset extends until 2020-11-21. However, it contains a significant number of purchases from 1970-01-01, this is likely some form of error as the only years present in the dataset are 2020 and 1970
userandpricecontained null values, indicating missing data. This data encompasses 16% of the total rows- 1,637,398 of
user_idcontained an empty string, which equated to 62% the total rows. This required further investigation as this was a significant proportion of the dataset - All
pricevalues that were null also had nulluser_idvalues. It was discovered these records'category_codeandbrandfields contained thepriceanduser_idvalues for these records respectively - 1,198,255 rows had duplicate
order_idvalues. If aorder_idwas duplicated, it indicates that the user purchased multiple products in a single order
Full description of EDA can be found in EDA.sql
To create the RFM model, the dataset was cleaned and transformed.
The transformed dataset contains the following fields:
user: Corresponds touser_idlast_purchase: The date in yyyy-MM-dd format ofuser's last purchasenum_purchase: The total number of times the user has ordered from the websitetotal_spend: The total amount ($) the user has spent on the websiterecency_percentile: Score out of 5 based on how recent theuserhas purchased from the websitefrequency_percentile: Score out of 5 based on how many times theuserhas purchased from the websitemonetary_percentile: Score out of 5 based the total amount theuserhas spent on the website
Data cleaning summary:
- Only purchases from the year 2020 were included to avoid skewing the recency percentile scores
- The null
priceanduser_idvalues that were misplaced in thecategory_codeandbrandfields were corrected - Empty string
user_ids were not included as they could not be assigned scores - As empty string
user_ids were not included, a large proportion of the dataset was deemed unusable
A dashboard on Tableau was created to visualise the data. Link to live dashboard
- In 2020, the website has driven over 234,474 unique customers who have collectively purchased a total of 403,245 times and spent $118,372,962.
- The top 20 percentile of highest paying customers (Monetary score of 5), on average, spent $1,719. This is 3.5x more than the next 20 percentile as this group spent an average of $485.
- The most frequent customers made 4 purchases in 2020 while the lower 80th percentile purchased an average of just 1-1.6 times.
- The highest value customers with an RFM score of 5,5,5; 5,5,4; or 5,5,3 are in the Champions segment. This equates to 13,424 customers or 5.7% of our total users. Collectively, this group of customers have ordered 43,446 times, which equates to 10.8% of the sites total purchases. In terms of total spent, this equates to $18,505,684, or 15.6% of the sites total sales.
- Loyal Customers are the next most valued customers, making up a total of 15.2% of total customers. This is the 4th largest segment as their segment covers a wider range of RFM scores. This segment purchased a total of $48,483,112 over 114,226 orders. This is significant as it equates to 41.0% of the sites total sales across 28.3% of the total purchases. This means that despite Champions and Loyal Customers making up only 20.9% of the site's users, they derive 51.8% of the sites total sales.
- Customers who have a high monetary value but have not purchased recently are placed in the Can't Lose Them and At Risk segments. Combined, they make up 21.8% of the site's customers.
- Regarding the frequency score, the average number of purchases for the lower 3 buckets are all 1. This means the customers may not well-segmented, as customers with only 1 purchase can be arbitrarily placed in any of these 3 buckets. This may reduce the effectiveness of the RFM model in identifying distinct behavioral patterns and customer value within certain segments.
- Given the limitation regarding the frequency segmentation, it may be best to focus on the segments which have a frequency + monetary score of 4+ as users with a frequency score in the 1-3 range exhibit the same behaviour. This means segments other than Champions, Loyal Customers, Can't Lose Them, and At Risk may be arbitrary and not useful.
- As Champions and Loyal Customers make up a majority of the sites total sales, it is important to keep these customers engaged. Consider trialing a loyalty program that offers exclusive discounts or special offers to encourage repeat purchases.
- For the Can't Lose Them and At Risk segments, introducing limited-time discounts can serve as an effective strategy for re-engagement. While these offers could potentially reduce margins in the short term, the long-term value of reactivating high-value customers may be far more significant for the company’s overall performance. By encouraging these customers to return, we can foster loyalty and ultimately increase their lifetime value, making the investment worthwhile in the broader context of customer retention and revenue growth.
- To assess the effectiveness of these strategies, A/B testing can be employed to compare responses from different customer segments. For the Champions and Loyal Customers, trialing a loyalty program with exclusive discounts or offers can be tested against a control group that does not receive these incentives. This can determine the impact on repeat purchases and engagement.
- Similarly, limited-time discounts for the Can't Lose Them and At Risk segments can be analysed through A/B testing, where one group receives the discounts while another group does not. By carefully measuring metrics such as conversion rates, customer reactivation, and overall sales performance, we can draw insights regarding the long-term value these strategies bring, ensuring that our approach not only engages customers effectively but also aligns with the profitability goals of the website.
To enhance this project further, implementing clustering techniques using Python can be a valuable extension. By applying algorithms such as K-Means, more nuanced customer segments can be created beyond the existing RFM scores. These techniques will allow patterns and group customers to be identified based on their purchasing behaviour, preferences, and interactions with the brand, allowing for a more granular analysis.


