A multi-level demand prediction pipeline that generates three submission files:
- L1: predicts future demand at eclass level
- L2: predicts future demand at eclass + manufacturer level
- L3: predicts future demand at cluster level inside each eclass using product text and feature-based grouping
This pipeline is designed for large-scale transaction data and uses chunked loading, buyer-history signals, collaborative filtering, association rules, hierarchical cold-start logic, and intra-eclass clustering.
The script produces three levels of predictions:
L1: predicts which eclasses a buyer is likely to need.
Core logic:
- Warm buyers: direct history + light collaborative filtering + association rules
- Cold buyers: hierarchical NACE-based profiles
L2: predicts which eclass|manufacturer combinations a buyer is likely to need.
Core logic:
- Start from strong L1 eclass candidates
- Assign manufacturers only when confidence is high enough
- Supports top-2 manufacturer emission when buyer behavior is split between brands
L3: predicts which feature cluster inside an eclass a buyer is likely to need.
Core logic:
- Cluster products within each eclass using product text
- Use regex feature extraction + TF-IDF + adaptive clustering
- Warm buyers rely on direct cluster history
- Cold buyers rely on segment-level cluster popularity
- Predict eclass first, then refine to manufacturer or cluster
- Avoid using weather, seasonality, or SKU-level signals for L1/L2
- Use eclass as the main abstraction to reduce catalog noise
- Handle cold start with hierarchical industry signals:
  (nace_2digits, emp_bucket) → (nace_section, emp_bucket) → nace_2digits → nace_section → emp_bucket → global
- Build L3 using within-eclass clustering instead of global clustering
- Keep L1 broad, L2 broader but filtered, and L3 structured and interpretable
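The hierarchical cold-start fallback above can be sketched as a chain of segment profiles tried from most to least specific. The key layout, level names, and `top_n` default below are illustrative assumptions, not the script's exact structure:

```python
def cold_start_candidates(buyer: dict, profiles: dict, top_n: int = 10) -> list:
    """Return ranked eclass candidates from the most specific profile available.

    `profiles` maps (level_name, key) to a ranked eclass list, e.g.
    profiles[("nace2", "47")] = ["19020102", ...]. This layout is
    illustrative, not the script's exact structure.
    """
    fallback_keys = [
        ("nace2_emp", (buyer["nace_2digits"], buyer["emp_bucket"])),
        ("section_emp", (buyer["nace_section"], buyer["emp_bucket"])),
        ("nace2", buyer["nace_2digits"]),
        ("section", buyer["nace_section"]),
        ("emp", buyer["emp_bucket"]),
        ("global", None),
    ]
    for level, key in fallback_keys:
        ranked = profiles.get((level, key))
        if ranked:  # first non-empty profile wins
            return ranked[:top_n]
    return []
```

Each level is only consulted when every more specific profile is empty, so a buyer in a well-covered NACE segment never falls back to the global ranking.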
The script expects the following files:
- plis_training_cleaned.csv
- customer_test_cleaned.csv
- les_cs.csv
Optional:
nace_codes.csv
The script writes:
- submission_L1.csv
  Columns: buyer_id, predicted_id
  predicted_id = eclass
- submission_L2.csv
  Columns: buyer_id, predicted_id
  predicted_id = eclass|manufacturer
- submission_L3.csv
  Columns: buyer_id, predicted_id
  predicted_id = cluster_id
  Example: 19020102_C04
- cluster_manifest.csv
  Human-readable description of L3 clusters
  Columns: cluster_id, eclass, n_skus, n_buyers, top_features, example_description
The pipeline assumes this mapping from les_cs.csv:
- cs = 1 → warm-start buyer with usable history
- cs = 0 → cold-start buyer
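The mapping above translates into a simple warm/cold split. A minimal sketch with pandas, assuming les_cs.csv has at least `buyer_id` and `cs` columns:

```python
import pandas as pd

def split_buyers(les_cs: pd.DataFrame) -> tuple[set, set]:
    """Split buyers into warm and cold sets using the cs flag.

    Assumes les_cs has `buyer_id` and `cs` columns, per the mapping above.
    """
    warm = set(les_cs.loc[les_cs["cs"] == 1, "buyer_id"])
    cold = set(les_cs.loc[les_cs["cs"] == 0, "buyer_id"])
    return warm, cold
```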
Warm buyers are scored using behavioral history.
For L1, signals include:
- order frequency
- month coverage
- spend
- spend share
- order share
- recency
- price proxy
- light user-user collaborative filtering
- eclass association rules
For L2:
- strong eclass candidates are selected first
- manufacturer is assigned using buyer-level, segment-level, or global manufacturer preference
For L3:
- direct buyer-cluster history is used
- a cross-sell signal is added from the dominant cluster in the buyer's NACE segment, but only for eclasses that already show real demand
Cold buyers do not rely on transaction history.
For L1/L2:
- predictions come from hierarchical NACE and company-size segment profiles
For L3:
- clusters are ranked by NACE segment popularity
For each buyer-eclass pair, the script computes a heuristic score using:
- spend
- frequency
- month coverage
- spend share within buyer
- order share within buyer
- price signal
- recency
Then it blends:
- direct buyer history
- user-user collaborative filtering
- association rules
This creates a broad but still behavior-driven candidate list.
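The blending step can be sketched as a weighted sum over the three score sources. The weights and the dict-based score representation are illustrative; the script's actual weights live in its config:

```python
def blend_scores(direct: dict, collab: dict, assoc: dict,
                 w: tuple = (0.7, 0.2, 0.1)) -> list:
    """Blend per-eclass scores from three sources into one ranked list.

    Each input maps eclass -> score; missing eclasses score 0.
    The weights are illustrative placeholders.
    """
    eclasses = set(direct) | set(collab) | set(assoc)
    blended = {
        e: w[0] * direct.get(e, 0.0)
           + w[1] * collab.get(e, 0.0)
           + w[2] * assoc.get(e, 0.0)
        for e in eclasses
    }
    # Rank candidates by blended score, highest first
    return sorted(blended, key=blended.get, reverse=True)
```

Because the union of the three sources is ranked, an eclass with no direct history can still surface via collaborative filtering or association rules, which is what keeps the candidate list broad.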
If the buyer has no usable history, the pipeline builds a ranked candidate list from the most specific available profile:
  (nace_2digits, emp_bucket) → (nace_section, emp_bucket) → nace_2digits → nace_section → emp_bucket → global
Each profile ranks eclasses by:
- prevalence in the segment
- average value
- average number of months
- average number of orders
L2 does not directly predict manufacturer from scratch.
Instead it follows:
- generate strong eclass candidates
- choose manufacturer with confidence gating
Manufacturer selection uses:
- buyer-level manufacturer concentration inside the eclass
- NACE 2-digit segment preference
- NACE section preference
- global preference
Manufacturer predictions are only emitted if:
- the eclass is globally brand-stable enough
- the top manufacturer has sufficient share or separation
- segment/global fallback scores are strong enough
The script can emit:
- 0 manufacturers
- 1 manufacturer
- 2 manufacturers
The top-2 case is used when the buyer clearly splits demand between two brands.
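The 0/1/2-manufacturer gating can be sketched as follows. The threshold names and values (`min_share`, `min_gap`, `split_floor`) are illustrative; the script reads its own thresholds from the config:

```python
def pick_manufacturers(shares: dict, min_share: float = 0.5,
                       min_gap: float = 0.15,
                       split_floor: float = 0.35) -> list:
    """Gate manufacturer emission for one buyer-eclass pair.

    `shares` maps manufacturer -> share of the buyer's orders in this
    eclass. Thresholds here are illustrative placeholders.
    Returns 0, 1, or 2 manufacturers.
    """
    if not shares:
        return []
    ranked = sorted(shares.items(), key=lambda kv: kv[1], reverse=True)
    top, top_share = ranked[0]
    if len(ranked) >= 2:
        second, second_share = ranked[1]
        # Top-2 case: demand clearly split between two brands
        if top_share >= split_floor and second_share >= split_floor:
            return [top, second]
        # Single emission needs both share and separation from the runner-up
        if top_share >= min_share and top_share - second_share >= min_gap:
            return [top]
        return []
    return [top] if top_share >= min_share else []
```

Returning an empty list is the deliberate outcome when no manufacturer clears the gate: an L2 row without a confident manufacturer is worse than no L2 row.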
L3 predicts a more granular target: a cluster inside each eclass.
A second pass over the training file loads:
- SKU-like identifier
- product text fields if available
- transaction history needed for buyer-cluster scoring
If no SKU column exists, the script falls back to a pseudo-SKU:
eclass|manufacturer
Product text is normalized and structured features are extracted using regex patterns.
Examples of extracted feature families:
- size
- material
- standards / norms
- pack size
- color
- condition
- numeric bins
These are combined with raw text.
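The regex feature extraction can be sketched for a few of the families above. These patterns are illustrative; the script's actual pattern set is larger and tuned to its catalog:

```python
import re

# Illustrative patterns for three feature families (size, standard, pack).
# Real catalogs need many more patterns and unit variants.
FEATURE_PATTERNS = {
    "size": re.compile(r"\b(\d+(?:[.,]\d+)?\s?(?:mm|cm|m|l|ml|kg|g))\b", re.I),
    "standard": re.compile(r"\b(din|iso|en)\s?\d+\b", re.I),
    "pack": re.compile(r"\b(\d+)\s?(?:pcs|pack|stk)\b", re.I),
}

def extract_features(text: str) -> list:
    """Turn raw product text into feature tokens like 'size=50mm'."""
    text = text.lower()
    tokens = []
    for family, pattern in FEATURE_PATTERNS.items():
        for match in pattern.findall(text):
            tokens.append(f"{family}={match.replace(' ', '')}")
    return tokens
```

Appending these tokens to the raw text before vectorization gives structured attributes extra weight without discarding free-text vocabulary.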
Products are clustered within each eclass, not globally.
Method:
- TF-IDF vectorization on feature-enhanced text
- adaptive k selection using the silhouette score
- clustering with:
  - AgglomerativeClustering for smaller sets
  - KMeans for larger sets
If there is not enough text, clustering falls back to manufacturer grouping.
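A minimal sketch of the per-eclass clustering step with scikit-learn. The `max_k` and `small_set` thresholds are illustrative, and the real script adds the manufacturer fallback and stability checks on top:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.metrics import silhouette_score

def cluster_eclass(texts: list, max_k: int = 6, small_set: int = 200):
    """Cluster one eclass's product texts; k is chosen by silhouette score.

    Thresholds (max_k, small_set) are illustrative placeholders.
    """
    if len(texts) < 3:
        return np.zeros(len(texts), dtype=int)  # too few products to split
    X = TfidfVectorizer().fit_transform(texts)
    dense = X.toarray()
    best_k, best_score = 2, -1.0
    for k in range(2, min(max_k, len(texts) - 1) + 1):
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
        score = silhouette_score(dense, labels)
        if score > best_score:
            best_k, best_score = k, score
    # Agglomerative for small eclasses, KMeans for large ones
    if len(texts) <= small_set:
        return AgglomerativeClustering(n_clusters=best_k).fit_predict(dense)
    return KMeans(n_clusters=best_k, n_init=10, random_state=0).fit_predict(X)
```

Running the search per eclass keeps each silhouette computation small, which is what makes adaptive k selection affordable at catalog scale.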
Cluster IDs are deterministic:
- eclass_C00
- eclass_C01
- etc.
Clusters are kept only if they are stable.
A stable cluster should typically have:
- enough buyers
- enough months of activity
- enough internal coherence
For warm buyers, cluster scores are built from:
- number of distinct months
- average price
- recency multiplier
This creates a buyer-cluster demand signal.
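One way to combine those three signals is a multiplicative score with exponential recency decay. The multiplicative form and the half-life value below are assumptions, not the script's exact formula:

```python
from datetime import date

def cluster_score(months: set, avg_price: float,
                  last_order: date, today: date,
                  half_life_days: int = 180) -> float:
    """Score one buyer-cluster pair: month coverage x price x recency decay.

    The multiplicative form and 180-day half-life are illustrative.
    """
    age_days = (today - last_order).days
    recency = 0.5 ** (age_days / half_life_days)  # halves every half_life_days
    return len(months) * avg_price * recency
```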
- Warm buyers: direct historical clusters + NACE cross-sell signal
- Cold buyers: top segment-ranked clusters for their NACE section
A readable summary is generated for all stable clusters.
This helps evaluate:
- what each cluster represents
- how many SKUs it contains
- which features define it
The script uses a large config dictionary, including settings for:
- warm thresholds
- cold top-N limits
- manufacturer confidence thresholds
- cold-start prevalence thresholds
Examples:
- warm_threshold_l3
- cold_top_n_l3
- min_cluster_buyers
- min_cluster_months
- l3_cross_sell_min_eclass_orders
These parameters control how broad or selective each level is.
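The shape of that config dictionary can be sketched as below. The key names mirror the parameters listed above, but every value is a placeholder, not the script's default:

```python
# Illustrative config fragment; values are placeholders, not defaults.
CONFIG = {
    "warm_threshold_l3": 0.2,              # min score to emit a warm L3 prediction
    "cold_top_n_l3": 5,                    # clusters emitted per cold buyer
    "min_cluster_buyers": 10,              # stability gate: distinct buyers
    "min_cluster_months": 3,               # stability gate: months of activity
    "l3_cross_sell_min_eclass_orders": 2,  # orders needed before cross-sell kicks in
}
```

Raising the stability gates shrinks the set of clusters that survive to the manifest; lowering the warm threshold broadens warm-buyer output.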
Install dependencies with:
pip install pandas numpy scipy scikit-learn