Author: Tanul Kumar Srivastava

Medium deep-dive: *Why Most Retail Customer Clusters Collapse in Production — and How I Fixed Mine*
In real FMCG and retail systems, customer segmentation breaks down in production due to:
- Feature explosion over time
- Data drift across retraining cycles
- Unstable cluster assignments
- Extremely high noise ratios
- Poor reproducibility
Most clustering systems are built as one-off experiments, not long-running production systems.
This repository demonstrates a production-oriented alternative:
Learn stable behavioral representations first → then cluster
Instead of clustering on raw engineered features, the system:
- Learns compact latent embeddings via a deterministic Autoencoder
- Clusters in latent space using HDBSCAN
- Compares raw-feature vs. latent-space behavior across stability, noise, and cluster size distribution
This is not about inventing new algorithms — it is about building clustering that survives real-world retraining cycles.
```
Raw FMCG Transactions
        │
        ▼
Eligibility Filtering
        │
        ▼
Retailer Feature Engineering
        │
        ├───────────────► Raw Feature Clustering (Baseline)
        │
        ▼
Robust Scaling
        │
        ▼
Autoencoder (Representation Learning)
        │
        ▼
Latent Space Embeddings
        │
        ▼
HDBSCAN Clustering
        │
        ▼
Cluster Assignments + Membership Strengths
        │
        ▼
Cluster Analytics & Stability Metrics
```
This project uses a synthetic-style, anonymized FMCG retail dataset published on Kaggle.
Dataset: tanulkumarsrivastava/sales-dataset
The dataset contains monthly aggregated sales with the schema (shop_id, product_id, year, month), with privacy-preserving transformations applied — no real customer or retailer data is included.
Intended use cases:
- Algorithm benchmarking
- Feature engineering experiments
- Clustering system design
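To ground the schema above, here is a minimal sketch of loading and aggregating data of this shape with pandas. The key columns `(shop_id, product_id, year, month)` come from the dataset description; the `quantity` value column and the sample rows are assumptions for illustration:

```python
import io

import pandas as pd

# Hypothetical sample matching the (shop_id, product_id, year, month) schema;
# the "quantity" column name is an assumption, not from the dataset docs.
csv = io.StringIO(
    "shop_id,product_id,year,month,quantity\n"
    "1,100,2023,1,40\n"
    "1,100,2023,2,35\n"
    "1,200,2023,1,10\n"
    "2,100,2023,1,5\n"
)
sales = pd.read_csv(csv)

# Collapse to one row per retailer-month, the grain behavioral features are built at.
monthly = (
    sales.groupby(["shop_id", "year", "month"], as_index=False)["quantity"].sum()
)
print(monthly)
```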
```
stable-customer-segmentation/
├── core/
│   ├── __init__.py
│   ├── features.py              # RetailClusteringFeatureBuilder
│   ├── logging.py               # LoggerFactory
│   ├── model.py                 # AutoEncoderModelArchitecture
│   └── utils.py
├── autoencoder-training-pipeline.py
├── clustering-inference-pipeline.py
├── clustering-training-pipeline.py
├── feature-preparation-pipeline.py
├── .env.example
├── .gitignore
├── LICENSE
├── pyproject.toml
├── requirement.txt
├── run.sh
└── README.md
```
Note: The `/artifacts` folder (trained models, scalers, outputs) is not included in this repository due to file size. Running the full pipeline via `run.sh` regenerates all artifacts locally.
This is structured like a real ML system, not a notebook dump — reusable modules, explicit pipelines, and dedicated inference scripts.
- Python >=3.11, <3.12
- A Kaggle API key for dataset download
```bash
git clone https://github.com/Tksrivastava/stable-customer-segmentation.git
cd stable-customer-segmentation
cp .env.example .env
```

Edit `.env` and fill in your Kaggle credentials:

```
KAGGLE_USERNAME="your_username"
KAGGLE_KEY="your_api_key"
KAGGLE_DATASET="tanulkumarsrivastava/sales-dataset/"
```

Create and activate a virtual environment:

```bash
python -m venv .venv
```

Linux / macOS:

```bash
source .venv/bin/activate
```

Windows:

```bash
.venv\Scripts\activate
```

Install dependencies:

```bash
pip install -r requirement.txt
pip install -e .
```

Run the full pipeline:

```bash
bash run.sh
```

This executes all four stages in order and saves models and outputs into `/artifacts`.
```bash
python feature-preparation-pipeline.py
```

- Downloads the dataset from Kaggle
- Filters retailers by minimum activity and tenure thresholds
- Builds behavioral features via `RetailClusteringFeatureBuilder`:
| Feature | Description |
|---|---|
| Stability | Consistency of sales volume over time |
| Entropy | Diversity/randomness in purchasing patterns |
| Growth | Trend direction and rate |
| Seasonality | Periodic demand patterns |
| Volume Consistency | Month-over-month variance in quantities |
These are representation-learning-ready signals, not naive aggregates.
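As a rough illustration of what such signals can look like, here are simple stand-in definitions for stability, entropy, and growth. These are assumptions for clarity only; the repo's `RetailClusteringFeatureBuilder` may compute them differently:

```python
import numpy as np
import pandas as pd

def behavioral_signals(monthly_qty: pd.Series, product_mix: pd.Series) -> dict:
    """Illustrative behavioral signals (not the repo's exact formulas)."""
    q = monthly_qty.to_numpy(dtype=float)
    # Stability: inverse coefficient of variation, squashed into (0, 1].
    cv = q.std() / (q.mean() + 1e-9)
    stability = float(1.0 / (1.0 + cv))
    # Entropy: Shannon entropy of the retailer's product mix shares.
    p = product_mix.to_numpy(dtype=float)
    p = p / p.sum()
    entropy = float(-(p * np.log2(p + 1e-12)).sum())
    # Growth: slope of a least-squares line fitted to the monthly series.
    growth = float(np.polyfit(np.arange(len(q)), q, deg=1)[0])
    return {"stability": stability, "entropy": entropy, "growth": growth}

sig = behavioral_signals(
    monthly_qty=pd.Series([40, 42, 38, 45]),
    product_mix=pd.Series([50.0, 30.0, 20.0]),
)
print(sig)
```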
```bash
python autoencoder-training-pipeline.py
```

Trains a deterministic `AutoEncoderModelArchitecture`:
- Symmetric feed-forward architecture (encoder → latent → decoder)
- Linear latent space to preserve geometric structure for distance-based methods
- Encoder-side L2 regularization to prevent identity mapping
- Batch normalization in the encoder for training stability
- Layer normalization on the latent space to control scale drift
- MSE reconstruction objective with early stopping
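A minimal Keras sketch of the architecture described above. Layer widths, regularization strengths, and the latent dimension are assumptions; the repo's `AutoEncoderModelArchitecture` may differ:

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, regularizers

def build_autoencoder(n_features: int, latent_dim: int = 3):
    """Symmetric encoder -> linear latent -> decoder, per the design notes above."""
    inputs = tf.keras.Input(shape=(n_features,))
    # Encoder: L2-regularized dense layers with batch normalization.
    x = layers.Dense(32, activation="relu",
                     kernel_regularizer=regularizers.l2(1e-4))(inputs)
    x = layers.BatchNormalization()(x)
    x = layers.Dense(16, activation="relu",
                     kernel_regularizer=regularizers.l2(1e-4))(x)
    x = layers.BatchNormalization()(x)
    # Linear latent space, layer-normalized to control scale drift.
    latent = layers.Dense(latent_dim, activation="linear")(x)
    latent = layers.LayerNormalization(name="latent")(latent)
    # Symmetric decoder back to the input dimension.
    x = layers.Dense(16, activation="relu")(latent)
    x = layers.Dense(32, activation="relu")(x)
    outputs = layers.Dense(n_features, activation="linear")(x)

    autoencoder = tf.keras.Model(inputs, outputs)
    encoder = tf.keras.Model(inputs, latent)
    autoencoder.compile(optimizer="adam", loss="mse")
    return autoencoder, encoder

autoencoder, encoder = build_autoencoder(n_features=5, latent_dim=3)
# Training sketch with early stopping on validation loss:
# autoencoder.fit(X, X, validation_split=0.2, epochs=200,
#     callbacks=[tf.keras.callbacks.EarlyStopping(patience=10,
#                                                 restore_best_weights=True)])
z = encoder.predict(np.zeros((4, 5), dtype="float32"), verbose=0)
print(z.shape)
```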
Outputs saved to `/artifacts`:

- `autoencoder-model.keras` — full trained autoencoder
- `feature-scaler.pkl` — robust scaler fitted on training data
- `loss-plot.png` — training/validation loss curve
GPU is disabled by design for deterministic, reproducible CPU execution.
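One common way to get that behavior (a sketch, not necessarily the repo's exact setup) is to hide GPUs before TensorFlow initializes and seed every random number generator:

```python
import os
import random

# Hide GPUs *before* TensorFlow initializes so everything runs on CPU.
os.environ["CUDA_VISIBLE_DEVICES"] = "-1"

import numpy as np
import tensorflow as tf

SEED = 42
random.seed(SEED)
np.random.seed(SEED)
tf.random.set_seed(SEED)
# Make remaining TF ops (e.g. reductions) pick deterministic kernels (TF >= 2.9).
tf.config.experimental.enable_op_determinism()

print(tf.config.list_physical_devices("GPU"))
```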
```bash
python clustering-training-pipeline.py
```

Trains two HDBSCAN models for direct comparison:
| Strategy | Input | Purpose |
|---|---|---|
| Raw Feature HDBSCAN | Engineered features | Baseline |
| Latent Space HDBSCAN | Autoencoder embeddings | Production approach |
Both models are saved to /artifacts for use in inference.
```bash
python clustering-inference-pipeline.py
```

Generates full cluster analytics:
- Cluster labels and membership strengths
- Cluster size distributions
- Noise ratios per strategy
- Comparative distribution reports
Outputs saved to `/artifacts`:

- `cluster_predictions.csv`
- `cluster_insights.csv`
| Method | Clusters | Noise Ratio |
|---|---|---|
| Raw Features | 2 | ~99.6% |
| Latent Space | 4 | ~21.5% |
This is not a hyperparameter tuning trick — it is a systemic effect of decoupling representation learning from clustering. The latent space forces the model to learn smooth, structured behavioral manifolds that HDBSCAN can meaningfully partition.
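One way to quantify "survives retraining cycles" is to refit on overlapping data and compare the two label assignments with the adjusted Rand index, which ignores cluster-id permutations. A sketch with hypothetical labels:

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score

# Hypothetical assignments for the same retailers from two retraining cycles.
labels_run1 = np.array([0, 0, 1, 1, 2, 2, -1, 1])
labels_run2 = np.array([1, 1, 0, 0, 2, 2, -1, 0])  # same grouping, ids permuted

# ARI is permutation-invariant: identical partitions score 1.0 even though
# the cluster ids themselves changed between runs.
stability = adjusted_rand_score(labels_run1, labels_run2)
print(stability)
```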
FMCG retail data is high-dimensional, sparse, noisy, and non-linear. This combination addresses that directly:
Autoencoder — compresses signal, removes noise, and learns invariant behavioral structure that survives retraining cycles.
HDBSCAN — handles variable cluster density, models noise as a first-class concept, and works well in learned latent manifolds where distance is more meaningful than in raw feature space.
The architecture is already designed with production evolution in mind.
Experiment Tracking — MLflow

```python
import mlflow

with mlflow.start_run():
    # log_model is flavor-specific; the autoencoder uses the TensorFlow flavor.
    mlflow.tensorflow.log_model(autoencoder, "encoder")
    mlflow.log_metric("noise_ratio", noise_ratio)
    mlflow.log_metric("cluster_count", n_clusters)
```

Feature Store / Database
Replace CSV loading with a proper data layer:
- Snowflake / BigQuery / Postgres for warehouse-backed features
- Feast for time-travel and backfill-safe feature retrieval
Scheduled Batch Inference
The inference pipeline already supports new retailers and new months. Add:
- A scheduler (Airflow / Dagster / Prefect)
- A data sink (warehouse table / REST API)
| Package | Version | Role |
|---|---|---|
| TensorFlow | 2.15.1 | Autoencoder training |
| hdbscan | 0.8.38.post2 | Density-based clustering |
| scikit-learn | 1.4.2 | Preprocessing, metrics |
| pandas | 2.2.2 | Data manipulation |
| numpy | 1.26.4 | Numerical operations |
| plotly + kaleido | 5.24.1 / 0.2.1 | Visualizations |
| kaggle | 1.5.16 | Dataset download |
| python-dotenv | 1.0.1 | Environment config |
Full dependency list in requirement.txt.
MIT License — see LICENSE for details.
For a full breakdown of the system design, failure modes of naive clustering, and the production reasoning behind this architecture:
👉 Why Most Retail Customer Clusters Collapse in Production — and How I Fixed Mine