This project is a comprehensive data analytics study of a customer database for a retail food company. The goal was to identify distinct customer segments, analyze spending behaviors, and optimize marketing campaign performance.
Using Python, we transitioned from raw data cleaning to K-Means Clustering and Linear Regression Residual Analysis to uncover "hidden" spending patterns that demographics alone could not explain.
The dataset (ifood_df.csv) contains 2,205 customer records (after cleaning) with 39 features, including:
- Demographics: Age, Income, Marital Status, Education, Household Structure (Kids/Teens).
- Behavioral: Recency (days since last purchase), Complaints, Web/Store/Catalog Visits.
- Spending (`Mnt`): Monetary value spent on Wines, Fruits, Meat, Fish, Sweets, and Gold.
- Campaign History: Acceptance of 5 previous marketing campaigns and the current response.
- Python 3.x
- Pandas & NumPy: Data manipulation and statistical analysis.
- Seaborn & Matplotlib: Advanced data visualization.
- Scikit-Learn: K-Means Clustering, StandardScaler, Linear Regression, Random Forest.
Before analysis, the raw data underwent a rigorous health check:
- Duplicate Removal: Identified and removed 184 duplicate rows so that repeated records would not skew the statistics or bias the models.
- Feature Pruning: Dropped constant columns (`Z_CostContact`, `Z_Revenue`) that added zero variance.
- Data Integrity: Recalculated the `MntTotal` column to ensure it mathematically equaled the sum of all individual product categories (Wines + Meat + Fruits + etc.), fixing discrepancies in 2,000+ rows.
- Feature Engineering: Created new features such as `Children` (Total Kids + Teens) and `Has_Child` for family structure analysis.
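The cleaning steps above can be sketched roughly as follows. This is a minimal illustration rather than the notebook's actual code: the product and household column names (`MntWines`, `MntMeatProducts`, `Kidhome`, `Teenhome`, and so on) are assumed, the helper `clean_customers` is hypothetical, and the four-row frame is made-up toy data.

```python
import pandas as pd

# Assumed column names; adjust them to match the actual CSV.
PRODUCT_COLS = [
    "MntWines", "MntFruits", "MntMeatProducts",
    "MntFishProducts", "MntSweetProducts", "MntGoldProds",
]


def clean_customers(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper mirroring the four cleaning steps."""
    out = df.drop_duplicates().copy()
    # Drop columns that hold a single value (zero variance).
    constant_cols = [c for c in out.columns if out[c].nunique(dropna=False) <= 1]
    out = out.drop(columns=constant_cols)
    # Recompute the total so it always equals the sum of the categories.
    out["MntTotal"] = out[PRODUCT_COLS].sum(axis=1)
    # Family-structure features.
    out["Children"] = out["Kidhome"] + out["Teenhome"]
    out["Has_Child"] = (out["Children"] > 0).astype(int)
    return out.reset_index(drop=True)


# Toy data: rows 2 and 3 are duplicates and MntTotal is deliberately stale.
raw = pd.DataFrame({
    "Income": [30000, 80000, 80000, 52000],
    "Kidhome": [1, 0, 0, 2],
    "Teenhome": [0, 0, 0, 1],
    "MntWines": [20, 600, 600, 90],
    "MntFruits": [5, 40, 40, 3],
    "MntMeatProducts": [15, 300, 300, 30],
    "MntFishProducts": [2, 60, 60, 4],
    "MntSweetProducts": [3, 30, 30, 2],
    "MntGoldProds": [5, 50, 50, 6],
    "MntTotal": [45, 1030, 1030, 129],
    "Z_CostContact": [3, 3, 3, 3],
    "Z_Revenue": [11, 11, 11, 11],
})
clean = clean_customers(raw)
print(clean[["MntTotal", "Children", "Has_Child"]])
```

Detecting constant columns with `nunique` avoids hard-coding `Z_CostContact` and `Z_Revenue`, though on a very small sample it could also drop a column that merely happens to be constant.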
We utilized NumPy and Seaborn to understand the shape of the business:
- The "Income Effect": Identified an exponential relationship between Income and Spending (Correlation: 0.82).
- Product Mix: Discovered that Wines (50%) and Meat (27%) together account for roughly 77% of total revenue.
- The "Child Penalty": Spending drops by ~50% with one child and craters with 2+ children.
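The three summaries above (income correlation, product mix, spending by number of children) could be computed along these lines. The function names are hypothetical, the column names are assumed, and the toy numbers are invented so the arithmetic is easy to follow; they are not the project's data.

```python
import numpy as np
import pandas as pd

# Assumed column names; adjust them to match the actual CSV.
PRODUCT_COLS = [
    "MntWines", "MntFruits", "MntMeatProducts",
    "MntFishProducts", "MntSweetProducts", "MntGoldProds",
]


def income_spend_correlation(df: pd.DataFrame) -> float:
    """Pearson correlation between income and total spending."""
    return float(np.corrcoef(df["Income"], df["MntTotal"])[0, 1])


def product_mix(df: pd.DataFrame) -> pd.Series:
    """Share of total spending per product category, largest first."""
    totals = df[PRODUCT_COLS].sum()
    return (totals / totals.sum()).sort_values(ascending=False)


def spend_by_children(df: pd.DataFrame) -> pd.Series:
    """Average total spending for each number of children."""
    return df.groupby("Children")["MntTotal"].mean()


# Invented toy data, for illustration only.
toy = pd.DataFrame({
    "Income": [20000, 40000, 60000, 80000],
    "Children": [2, 1, 1, 0],
    "MntWines": [10, 60, 130, 300],
    "MntFruits": [5, 10, 15, 30],
    "MntMeatProducts": [10, 40, 70, 150],
    "MntFishProducts": [5, 10, 20, 35],
    "MntSweetProducts": [5, 10, 15, 20],
    "MntGoldProds": [5, 10, 15, 20],
})
toy["MntTotal"] = toy[PRODUCT_COLS].sum(axis=1)

corr = income_spend_correlation(toy)
mix = product_mix(toy)
by_children = spend_by_children(toy)
print(mix)
print(by_children)
```

Note that a Pearson coefficient measures linear association, so a strongly curved relationship is worth checking on a scatter plot as well.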
We used K-Means Clustering on standardized features (Income, Total Spending, Recency) to identify 4 distinct personas. The optimal
| Cluster Name | Profile | Strategy |
|---|---|---|
| Active Whales | High Income, High Spend, Active (<30 days). | Retain: Cross-sell premium items. |
| Churning VIPs | High Income, High Spend, Inactive (>70 days). | Win-Back: Target with "Campaign 5" (their favorite). |
| Promising Recent | Low Income, Low Spend, Active (<30 days). | Nurture: Offer lower-ticket deals to build habits. |
| At-Risk Budget | Low Income, Low Spend, Inactive (>70 days). | Automate: Move to low-cost drip campaigns. |
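A minimal sketch of the clustering step, assuming columns named `Income`, `MntTotal`, and `Recency`. The helper `assign_clusters` and the synthetic customers are hypothetical stand-ins, and the seed and `n_init` values are arbitrary choices, not settings taken from the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

FEATURES = ["Income", "MntTotal", "Recency"]  # assumed column names


def assign_clusters(df: pd.DataFrame, n_clusters: int = 4, seed: int = 42) -> pd.DataFrame:
    """Standardize the features, then label each row with a K-Means cluster id."""
    scaled = StandardScaler().fit_transform(df[FEATURES])
    model = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    out = df.copy()
    out["Cluster"] = model.fit_predict(scaled)
    return out


# Synthetic stand-in data: four well-separated groups, NOT the real customers.
rng = np.random.default_rng(0)
centers = {
    "high_active": (80_000, 1_200, 15),
    "high_inactive": (80_000, 1_200, 85),
    "low_active": (30_000, 100, 15),
    "low_inactive": (30_000, 100, 85),
}
frames = []
for name, (income, spend, recency) in centers.items():
    frames.append(pd.DataFrame({
        "Group": name,
        "Income": rng.normal(income, 3_000, 50),
        "MntTotal": rng.normal(spend, 20, 50),
        "Recency": rng.normal(recency, 5, 50),
    }))

customers = assign_clusters(pd.concat(frames, ignore_index=True))
# Per-cluster averages are what a persona name would be read from.
profile = customers.groupby("Cluster")[FEATURES].mean().round(1)
print(profile)
```

Standardizing first matters because Income is measured in tens of thousands while Recency is measured in days; without it, Income would dominate the distance calculation. The cluster ids returned by K-Means are arbitrary, so persona names have to be assigned by inspecting the profile table.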
To find value beyond simple income brackets, we built a Linear Regression model to predict "Expected Spending" based on Income. We then analyzed the Residuals (Actual - Expected) to find over-performers.
Key Findings:
- Education: "Basic" education customers over-spend relative to their low income (+$233), while PhDs actually under-spend relative to their high income (-$22).
- Family: Having Teens causes a larger drop in discretionary food spending than having Toddlers.
- The "PhD Diet": Deep-dive category analysis revealed that PhDs significantly over-spend on Wine (+$61) but under-spend on Meat, Fish, and Fruits.
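The residual approach can be illustrated with a short sketch. It assumes columns named `Income`, `MntTotal`, and `Education`; the helper `add_spend_residual` is hypothetical, and the toy data is constructed so that each group's offset from the income trend is known in advance (it does not reproduce the figures above).

```python
import pandas as pd
from sklearn.linear_model import LinearRegression


def add_spend_residual(df: pd.DataFrame, target: str = "MntTotal") -> pd.DataFrame:
    """Fit target ~ Income and attach Expected and Residual (actual - expected)."""
    model = LinearRegression().fit(df[["Income"]], df[target])
    out = df.copy()
    out["Expected"] = model.predict(df[["Income"]])
    out["Residual"] = out[target] - out["Expected"]
    return out


# Toy data: spending is 1% of income, plus 50 for "Basic" and minus 50 for "PhD".
incomes = [20000, 40000, 60000, 80000]
toy = pd.DataFrame({
    "Education": ["Basic"] * 4 + ["PhD"] * 4,
    "Income": incomes * 2,
})
offset = toy["Education"].map({"Basic": 50.0, "PhD": -50.0})
toy["MntTotal"] = 0.01 * toy["Income"] + offset

scored = add_spend_residual(toy)
# A positive mean residual marks a group that spends more than income predicts.
by_education = scored.groupby("Education")["Residual"].mean()
print(by_education)
```

Passing a single category column as `target` (for example an assumed `MntWines`) gives per-category residuals, which is the kind of breakdown the category deep-dive relies on.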
The notebook includes several intricate visualizations:
- Cluster Scatter Plot: Visualizing the income gap between VIPs and Mass Market.
- Campaign Acceptance Heatmap: Showing the drastic difference in conversion rates between "Active Whales" (27%) and "At-Risk Budget" (4%).
- Residual Heatmap: A color-coded matrix showing which Education levels over/under-spend on specific food categories.
- Normalized Profile Bar Charts: Comparing the relative strengths of Income vs. Recency across clusters.
Based on the data, the following actions are recommended:
- The "Win-Back" Campaign: Currently, your highest value segment ("Churning VIPs") is drifting away. They historically loved Campaign 5. Re-launch a lookalike of Campaign 5 targeted specifically at this cluster.
- Stop "Grocery" Ads for Families: Families with kids are not buying Meat/Fish from you (likely due to price). Pivot their marketing to "Treats" (Sweets/Gold) or Bulk Deals.
- Target PhDs with Wine: PhDs are "Liquid Dieters." Stop sending them fruit baskets. Market exclusive vintage wines to unlock their wallet share.
- Catalog is King: Analysis showed that `NumCatalogPurchases` is the strongest predictor of a customer spending more than their income suggests. Invest in the print catalog for high-income prospects.
- Install Requirements
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
- Launch the Notebook
jupyter notebook Marketing_Analytics_Project.ipynb
- Download the CSV Used in This Project
This can be found here (Kaggle account may be required)