Author: Tanul Kumar Srivastava

Medium deep-dive: *Why Most Retail Customer Clusters Collapse in Production — and How I Fixed Mine*
In real FMCG and retail systems, customer segmentation breaks down in production due to:
- Feature explosion over time
- Data drift across retraining cycles
- Unstable cluster assignments
- Extremely high noise ratios
- Poor reproducibility
Most clustering systems are built as one-off experiments, not long-running production systems.
This repository demonstrates a production-oriented alternative:
Learn stable behavioral representations first → then cluster
Instead of clustering on raw engineered features, the system:
- Learns compact latent embeddings via a deterministic Autoencoder
- Clusters in latent space using HDBSCAN
- Compares raw-feature vs. latent-space behavior across stability, noise, and cluster size distribution
This is not about inventing new algorithms — it is about building clustering that survives real-world retraining cycles.
```
Raw FMCG Transactions
        │
        ▼
Eligibility Filtering
        │
        ▼
Retailer Feature Engineering
        │
        ├───────────────► Raw Feature Clustering (Baseline)
        │
        ▼
Robust Scaling
        │
        ▼
Autoencoder (Representation Learning)
        │
        ▼
Latent Space Embeddings
        │
        ▼
HDBSCAN Clustering
        │
        ▼
Cluster Assignments + Membership Strengths
        │
        ▼
Cluster Analytics & Stability Metrics
```
This project uses a synthetic-style, anonymized FMCG retail dataset published on Kaggle.
Dataset: tanulkumarsrivastava/sales-dataset
The dataset contains monthly aggregated sales with the schema (shop_id, product_id, year, month), with privacy-preserving transformations applied — no real customer or retailer data is included.
Intended use cases:
- Algorithm benchmarking
- Feature engineering experiments
- Clustering system design
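To ground the schema above, here is a minimal sketch of loading and aggregating data of this shape with pandas. The key columns `(shop_id, product_id, year, month)` come from the dataset description; the `quantity` value column and the sample rows are assumptions for illustration:

```python
import io

import pandas as pd

# Hypothetical sample matching the (shop_id, product_id, year, month) schema;
# the "quantity" column name is an assumption, not from the dataset docs.
csv = io.StringIO(
    "shop_id,product_id,year,month,quantity\n"
    "1,100,2023,1,40\n"
    "1,100,2023,2,35\n"
    "1,200,2023,1,10\n"
    "2,100,2023,1,5\n"
)
sales = pd.read_csv(csv)

# Collapse to one row per retailer-month, the grain behavioral features are built at.
monthly = (
    sales.groupby(["shop_id", "year", "month"], as_index=False)["quantity"].sum()
)
print(monthly)
```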
```
stable-customer-segmentation/
├── core/
│   ├── __init__.py
│   ├── features.py              # RetailClusteringFeatureBuilder
│   ├── logging.py               # LoggerFactory
│   ├── model.py                 # AutoEncoderModelArchitecture
│   └── utils.py
├── autoencoder-training-pipeline.py
├── clustering-inference-pipeline.py
├── clustering-training-pipeline.py
├── feature-preparation-pipeline.py
├── .env.example
├── .gitignore
├── LICENSE
├── pyproject.toml
├── requirement.txt
├── run.sh
└── README.md
```
Note: The `/artifacts` folder (trained models, scalers, outputs) is not included in this repository due to file size. Running the full pipeline via `run.sh` regenerates all artifacts locally.
This is structured like a real ML system, not a notebook dump — reusable modules, explicit pipelines, and dedicated inference scripts.
- Python >=3.11, <3.12
- A Kaggle API key for dataset download
```bash
git clone https://github.com/Tksrivastava/stable-customer-segmentation.git
cd stable-customer-segmentation
cp .env.example .env
```

Edit `.env` and fill in your Kaggle credentials:

```
KAGGLE_USERNAME="your_username"
KAGGLE_KEY="your_api_key"
KAGGLE_DATASET="tanulkumarsrivastava/sales-dataset/"
```

Create and activate a virtual environment:

```bash
python -m venv .venv
```

Linux / macOS:

```bash
source .venv/bin/activate
```

Windows:

```bash
.venv\Scripts\activate
```

Install dependencies:

```bash
pip install -r requirement.txt
pip install -e .
```

Run the full pipeline:

```bash
bash run.sh
```

This executes all four stages in order and saves models and outputs into `/artifacts`.
```bash
python feature-preparation-pipeline.py
```

- Downloads the dataset from Kaggle
- Filters retailers by minimum activity and tenure thresholds
- Builds behavioral features via `RetailClusteringFeatureBuilder`:
| Feature | Description |
|---|---|
| Stability | Consistency of sales volume over time |
| Entropy | Diversity/randomness in purchasing patterns |
| Growth | Trend direction and rate |
| Seasonality | Periodic demand patterns |
| Volume Consistency | Month-over-month variance in quantities |
These are representation-learning-ready signals, not naive aggregates.
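As a rough illustration of what such signals can look like, here are simple stand-in definitions for stability, entropy, and growth. These are assumptions for clarity only; the repo's `RetailClusteringFeatureBuilder` may compute them differently:

```python
import numpy as np
import pandas as pd

def behavioral_signals(monthly_qty: pd.Series, product_mix: pd.Series) -> dict:
    """Illustrative behavioral signals (not the repo's exact formulas)."""
    q = monthly_qty.to_numpy(dtype=float)
    # Stability: inverse coefficient of variation, squashed into (0, 1].
    cv = q.std() / (q.mean() + 1e-9)
    stability = float(1.0 / (1.0 + cv))
    # Entropy: Shannon entropy of the retailer's product mix shares.
    p = product_mix.to_numpy(dtype=float)
    p = p / p.sum()
    entropy = float(-(p * np.log2(p + 1e-12)).sum())
    # Growth: slope of a least-squares line fitted to the monthly series.
    growth = float(np.polyfit(np.arange(len(q)), q, deg=1)[0])
    return {"stability": stability, "entropy": entropy, "growth": growth}

sig = behavioral_signals(
    monthly_qty=pd.Series([40, 42, 38, 45]),
    product_mix=pd.Series([50.0, 30.0, 20.0]),
)
print(sig)
```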
```bash
python autoencoder-training-pipeline.py
```

Trains a deterministic `AutoEncoderModelArchitecture`:
- Symmetric feed-forward architecture (encoder → latent → decoder)
- Linear latent space to preserve geometric structure for distance-based methods
- Encoder-side L2 regularization to prevent identity mapping
- Batch normalization in the encoder for training stability
- Layer normalization on the latent space to control scale drift
- MSE reconstruction objective with early stopping
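A minimal Keras sketch of the architecture described above. Layer widths, regularization strengths, and the latent dimension are assumptions; the repo's `AutoEncoderModelArchitecture` may differ:

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, regularizers

def build_autoencoder(n_features: int, latent_dim: int = 3):
    """Symmetric encoder -> linear latent -> decoder, per the design notes above."""
    inputs = tf.keras.Input(shape=(n_features,))
    # Encoder: L2-regularized dense layers with batch normalization.
    x = layers.Dense(32, activation="relu",
                     kernel_regularizer=regularizers.l2(1e-4))(inputs)
    x = layers.BatchNormalization()(x)
    x = layers.Dense(16, activation="relu",
                     kernel_regularizer=regularizers.l2(1e-4))(x)
    x = layers.BatchNormalization()(x)
    # Linear latent space, layer-normalized to control scale drift.
    latent = layers.Dense(latent_dim, activation="linear")(x)
    latent = layers.LayerNormalization(name="latent")(latent)
    # Symmetric decoder back to the input dimension.
    x = layers.Dense(16, activation="relu")(latent)
    x = layers.Dense(32, activation="relu")(x)
    outputs = layers.Dense(n_features, activation="linear")(x)

    autoencoder = tf.keras.Model(inputs, outputs)
    encoder = tf.keras.Model(inputs, latent)
    autoencoder.compile(optimizer="adam", loss="mse")
    return autoencoder, encoder

autoencoder, encoder = build_autoencoder(n_features=5, latent_dim=3)
# Training sketch with early stopping on validation loss:
# autoencoder.fit(X, X, validation_split=0.2, epochs=200,
#     callbacks=[tf.keras.callbacks.EarlyStopping(patience=10,
#                                                 restore_best_weights=True)])
z = encoder.predict(np.zeros((4, 5), dtype="float32"), verbose=0)
print(z.shape)
```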
Outputs saved to `/artifacts`:

- `autoencoder-model.keras` — full trained autoencoder
- `feature-scaler.pkl` — robust scaler fitted on training data
- `loss-plot.png` — training/validation loss curve
GPU is disabled by design for deterministic, reproducible CPU execution.
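One common way to get that behavior (a sketch, not necessarily the repo's exact setup) is to hide GPUs before TensorFlow initializes and seed every random number generator:

```python
import os
import random

# Hide GPUs *before* TensorFlow initializes so everything runs on CPU.
os.environ["CUDA_VISIBLE_DEVICES"] = "-1"

import numpy as np
import tensorflow as tf

SEED = 42
random.seed(SEED)
np.random.seed(SEED)
tf.random.set_seed(SEED)
# Make remaining TF ops (e.g. reductions) pick deterministic kernels (TF >= 2.9).
tf.config.experimental.enable_op_determinism()

print(tf.config.list_physical_devices("GPU"))
```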
```bash
python clustering-training-pipeline.py
```

Trains two HDBSCAN models for direct comparison:
| Strategy | Input | Purpose |
|---|---|---|
| Raw Feature HDBSCAN | Engineered features | Baseline |
| Latent Space HDBSCAN | Autoencoder embeddings | Production approach |
Both models are saved to /artifacts for use in inference.
```bash
python clustering-inference-pipeline.py
```

Generates full cluster analytics:
- Cluster labels and membership strengths
- Cluster size distributions
- Noise ratios per strategy
- Comparative distribution reports
Outputs saved to `/artifacts`:

- `cluster_predictions.csv`
- `cluster_insights.csv`
| Method | Clusters | Noise Ratio |
|---|---|---|
| Raw Features | 2 | ~99.6% |
| Latent Space | 4 | ~21.5% |
This is not a hyperparameter tuning trick — it is a systemic effect of decoupling representation learning from clustering. The latent space forces the model to learn smooth, structured behavioral manifolds that HDBSCAN can meaningfully partition.
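One way to quantify "survives retraining cycles" is to refit on overlapping data and compare the two label assignments with the adjusted Rand index, which ignores cluster-id permutations. A sketch with hypothetical labels:

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score

# Hypothetical assignments for the same retailers from two retraining cycles.
labels_run1 = np.array([0, 0, 1, 1, 2, 2, -1, 1])
labels_run2 = np.array([1, 1, 0, 0, 2, 2, -1, 0])  # same grouping, ids permuted

# ARI is permutation-invariant: identical partitions score 1.0 even though
# the cluster ids themselves changed between runs.
stability = adjusted_rand_score(labels_run1, labels_run2)
print(stability)
```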
FMCG retail data is high-dimensional, sparse, noisy, and non-linear. This combination addresses that directly:
Autoencoder — compresses signal, removes noise, and learns invariant behavioral structure that survives retraining cycles.
HDBSCAN — handles variable cluster density, models noise as a first-class concept, and works well in learned latent manifolds where distance is more meaningful than in raw feature space.
The architecture is already designed with production evolution in mind.
Experiment Tracking — MLflow

```python
import mlflow

with mlflow.start_run():
    # log_model is flavor-specific; the autoencoder uses the TensorFlow flavor.
    mlflow.tensorflow.log_model(autoencoder, "encoder")
    mlflow.log_metric("noise_ratio", noise_ratio)
    mlflow.log_metric("cluster_count", n_clusters)
```

Feature Store / Database
Replace CSV loading with a proper data layer:
- Snowflake / BigQuery / Postgres for warehouse-backed features
- Feast for time-travel and backfill-safe feature retrieval
Scheduled Batch Inference
The inference pipeline already supports new retailers and new months. Add:
- A scheduler (Airflow / Dagster / Prefect)
- A data sink (warehouse table / REST API)
| Package | Version | Role |
|---|---|---|
| TensorFlow | 2.15.1 | Autoencoder training |
| hdbscan | 0.8.38.post2 | Density-based clustering |
| scikit-learn | 1.4.2 | Preprocessing, metrics |
| pandas | 2.2.2 | Data manipulation |
| numpy | 1.26.4 | Numerical operations |
| plotly + kaleido | 5.24.1 / 0.2.1 | Visualizations |
| kaggle | 1.5.16 | Dataset download |
| python-dotenv | 1.0.1 | Environment config |
Full dependency list in requirement.txt.
MIT License — see LICENSE for details.
For a full breakdown of the system design, failure modes of naive clustering, and the production reasoning behind this architecture:
👉 Why Most Retail Customer Clusters Collapse in Production — and How I Fixed Mine