🛍️ Mall Customer Segmentation and Predictive Modeling

📊 Overview

This project explores customer segmentation using the Mall Customer dataset. It involves data exploration, clustering, dimensionality reduction, and predictive modeling of customer behavior.

📁 Dataset

200 rows × 5 columns

Features:

CustomerID (if used)

Gender

Age

Annual Income (k$)

Spending Score (1–100)

🧹 Preprocessing

Encoded Gender: Male → 1, Female → 0

Scaled numerical features for scale-sensitive models (for example, Logistic Regression)

Created labels for age_group and spending_score (low, medium, high)
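
The preprocessing steps above might look roughly like the following sketch. It runs on randomly generated stand-in data rather than the real dataset, and the column name `Spending Score (1-100)`, the bin edges, and the derived column names are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the Mall Customer data (random values, not real records).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "Gender": rng.choice(["Male", "Female"], size=200),
    "Age": rng.integers(18, 71, size=200),
    "Annual Income (k$)": rng.integers(15, 138, size=200),
    "Spending Score (1-100)": rng.integers(1, 101, size=200),
})

# Encode gender as described above: Male -> 1, Female -> 0.
df["gender"] = df["Gender"].map({"Male": 1, "Female": 0})

# Bin age and spending score into three ordered labels.
# The bin edges are assumptions; the README does not state them.
levels = ["low", "medium", "high"]
df["age_group"] = pd.cut(df["Age"], bins=[17, 30, 50, 70], labels=levels)
df["spending_score_label"] = pd.cut(
    df["Spending Score (1-100)"], bins=[0, 33, 66, 100], labels=levels
)

# Standardize numeric features to zero mean and unit variance.
num_cols = ["Age", "Annual Income (k$)", "Spending Score (1-100)"]
scaled = StandardScaler().fit_transform(df[num_cols])
```

When scaling feeds a supervised model, fitting the scaler inside a pipeline keeps test-fold statistics out of training.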

📈 Exploratory Data Analysis

Visualized distributions using histograms, boxplots, and heatmaps

Key insights:

Females have a higher average spending score than males

Age and income are not strongly correlated with spending

Spending patterns vary significantly across age groups
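
Insights of this kind can be checked numerically with a few group-by and correlation calls. The sketch below uses random stand-in data, so its numbers say nothing about the real dataset; the column names and age bins are assumptions.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data (random values, not the real dataset).
rng = np.random.default_rng(1)
df = pd.DataFrame({
    "Gender": rng.choice(["Male", "Female"], size=200),
    "Age": rng.integers(18, 71, size=200),
    "Annual Income (k$)": rng.integers(15, 138, size=200),
    "Spending Score (1-100)": rng.integers(1, 101, size=200),
})

# Average spending score per gender.
mean_by_gender = df.groupby("Gender")["Spending Score (1-100)"].mean()

# Pearson correlation between the numeric columns (the input to a heatmap).
corr = df[["Age", "Annual Income (k$)", "Spending Score (1-100)"]].corr()

# Spread of spending score within each (assumed) age band.
df["age_group"] = pd.cut(df["Age"], bins=[17, 30, 50, 70],
                         labels=["low", "medium", "high"])
spend_by_age = (
    df.groupby("age_group", observed=True)["Spending Score (1-100)"]
    .agg(["mean", "std"])
)
```

Passing `corr` to `seaborn.heatmap` would produce the kind of heatmap mentioned above.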

📉 PCA (Dimensionality Reduction)

Applied Principal Component Analysis to visualize high-dimensional clusters

The first 2–3 components explained the majority of the variance

PCA plots helped show cluster separation visually
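
A minimal sketch of the PCA step, assuming the features are standardized first and three components are kept. The input is random stand-in data, so the variance ratios it produces are not the project's results.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in feature matrix: gender, age, income, spending score.
rng = np.random.default_rng(2)
X = np.column_stack([
    rng.integers(0, 2, 200),
    rng.integers(18, 71, 200),
    rng.integers(15, 138, 200),
    rng.integers(1, 101, 200),
]).astype(float)

# PCA is scale-sensitive, so standardize before projecting.
X_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=3)
X_pca = pca.fit_transform(X_scaled)

# Cumulative share of variance captured by the first 1, 2, and 3 components.
cum_var = np.cumsum(pca.explained_variance_ratio_)
```

The first two or three columns of `X_pca` are what a 2D or 3D scatter plot would use, colored by cluster label.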

📦 Clustering

✅ KMeans Clustering

Evaluated using the Elbow Method and Silhouette Score

Best silhouette score at k=10, but k=5 chosen for interpretability

Visualized clusters in 2D and 3D with centroids
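
The evaluation loop described above can be sketched as follows. The data comes from `make_blobs` purely for illustration, and the range of k values and the `n_init` setting are assumptions.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic 2D data with five blobs (a stand-in, not the real dataset).
X, _ = make_blobs(n_samples=200, centers=5, n_features=2, random_state=0)
X = StandardScaler().fit_transform(X)

inertias, silhouettes = {}, {}
for k in range(2, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = km.inertia_                          # elbow method input
    silhouettes[k] = silhouette_score(X, km.labels_)   # higher is better

# k with the best silhouette; the README's final choice (k=5) favored
# interpretability over this score.
best_k = max(silhouettes, key=silhouettes.get)

final = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X)
centroids = final.cluster_centers_
```

Plotting `inertias` against k gives the elbow curve, and `centroids` can be overlaid on the cluster scatter plot.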

✅ Agglomerative Clustering

Dendrogram used to determine cut-off (y=80)

Cluster centers manually computed

Silhouette scores compared across linkage strategies (ward, complete, average)
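
One possible way to reproduce these steps is shown below on synthetic `make_blobs` data. The cut height of 80 is taken from the text above, but how many clusters it yields depends on the scale of the data, so the count here will not necessarily match the project's. Fixing `n_clusters=5` for the linkage comparison is an assumption.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in data (not the real dataset).
X, _ = make_blobs(n_samples=200, centers=5, n_features=2, random_state=0)

# Compare silhouette scores across linkage strategies.
scores = {}
for method in ("ward", "complete", "average"):
    labels = AgglomerativeClustering(n_clusters=5, linkage=method).fit_predict(X)
    scores[method] = silhouette_score(X, labels)

# Cut a Ward dendrogram at a fixed height, as one would read off the plot.
Z = linkage(X, method="ward")
cut_labels = fcluster(Z, t=80, criterion="distance")

# AgglomerativeClustering exposes no cluster centers, so compute the
# per-cluster mean of the points manually.
ward_labels = AgglomerativeClustering(n_clusters=5, linkage="ward").fit_predict(X)
centers = np.vstack([X[ward_labels == c].mean(axis=0)
                     for c in np.unique(ward_labels)])
```

`scipy.cluster.hierarchy.dendrogram(Z)` draws the tree from which a cut height like this is chosen.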

🧠 Cluster Interpretation

| Cluster | Profile | Strategy |
|---|---|---|
| 0 | Practical Buyers (avg spenders) | Loyalty offers, practical product focus |
| 1 | Young Big Spenders | Premium campaigns, influencers |
| 2 | Young, Low-Income Spenders | Discounts, trend-driven promotions |
| 3 | Rich but Frugal | Quality/value-focused campaigns |
| 4 | Low Spend, Low Income | Basic essentials, budget-focused ads |

🧪 Predictive Modeling

Task: Predict spending_score_label (0 = low, 1 = mid, 2 = high)

Features used: age_label, gender, annual_income

Models trained:

Logistic Regression (with scaling)

Random Forest

XGBoost

Evaluation:

Accuracy, Precision, F1-score (weighted)

Used Stratified K-Fold Cross-Validation due to small dataset size
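
The evaluation setup above could be sketched like this. The features and labels are random stand-ins, so the scores are meaningless beyond showing the mechanics. XGBoost is left out only to keep the example limited to scikit-learn, and the fold count and hyperparameters are assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in features: age_label, gender, annual_income.
rng = np.random.default_rng(0)
n = 200
X = np.column_stack([
    rng.integers(0, 3, n),
    rng.integers(0, 2, n),
    rng.integers(15, 138, n),
]).astype(float)
y = rng.integers(0, 3, n)  # spending_score_label: 0 = low, 1 = mid, 2 = high

models = {
    # Scaling lives inside the pipeline so it is refit on each training fold.
    "logreg": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=0),
}

# Stratified folds keep the class proportions similar in every split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scoring = ["accuracy", "precision_weighted", "f1_weighted"]

results = {}
for name, model in models.items():
    cv_out = cross_validate(model, X, y, cv=cv, scoring=scoring)
    results[name] = {m: cv_out[f"test_{m}"].mean() for m in scoring}
```

An `xgboost.XGBClassifier` could be added to `models` in the same way, since it follows the scikit-learn estimator interface.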

🛠 Tools

Python, pandas, scikit-learn, matplotlib, seaborn