
Stable Hash Splitter

A scikit-learn compatible splitter for deterministic, ID-based train/test splits that prevent data leakage in machine learning workflows.

🔧 The Problem

When datasets grow or get updated, traditional random splits can cause data leakage: samples that were previously in your test set might end up in training during retraining, leading to overly optimistic and invalid model evaluations.

StableHashSplit solves this by assigning samples to train/test sets deterministically based on a hash of a stable identifier (e.g., user ID, transaction ID). Once assigned, a sample stays in the same set forever, ensuring reproducible and reliable evaluations across dataset versions.
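The core idea can be sketched in a few lines (this is an illustration of the technique, not the library's internals): hash each stable ID and send a sample to the test set when its hash falls in the lowest `test_size` fraction of the 32-bit hash space. The `in_test_set` helper below is hypothetical.

```python
from zlib import crc32

def in_test_set(identifier, test_size=0.2):
    # Same ID always produces the same hash, so its assignment never
    # changes as the dataset grows or is reshuffled.
    return crc32(str(identifier).encode()) < test_size * 2**32

ids = [1001, 1002, 1003, 1004, 1005]
split = {i: ("test" if in_test_set(i) else "train") for i in ids}
```

Because assignment depends only on the ID, adding new rows later cannot move an existing sample across the train/test boundary.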

✨ Key Features

  • 🔒 Deterministic & Stable: Same ID always maps to the same split
  • 🤖 Scikit-Learn Compatible: Works seamlessly with GridSearchCV, cross_val_score, and ML pipelines
  • 📊 Flexible Inputs: Supports pandas DataFrames, NumPy arrays, and array-like structures
  • ⚙️ Customizable: Choose your hash function and ID column
  • 🚀 Simple API: Minimal code changes needed

📦 Installation

pip install stable-hash-splitter

🚀 Quick Start

import pandas as pd
from stable_hash_splitter import StableHashSplit

# Sample data with user IDs
data = pd.DataFrame({
    'user_id': [1001, 1002, 1003, 1004, 1005],
    'feature_1': [0.5, 0.3, 0.8, 0.1, 0.9],
    'feature_2': [10, 20, 30, 40, 50],
    'target': [1, 0, 1, 0, 1]
})

# Create stable splitter
splitter = StableHashSplit(test_size=0.2, id_column='user_id')

# Split your data
X_train, X_test, y_train, y_test = splitter.train_test_split(
    data[['user_id', 'feature_1', 'feature_2']],
    data['target']
)

print(f"Train size: {len(X_train)}, Test size: {len(X_test)}")
# Example output (the test fraction is approximate for small datasets):
# Train size: 4, Test size: 1

📚 Advanced Usage

Using with GridSearchCV

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

splitter = StableHashSplit(test_size=0.2, id_column='user_id')
model = RandomForestClassifier()

param_grid = {'n_estimators': [50, 100], 'max_depth': [5, 10]}
grid_search = GridSearchCV(model, param_grid, cv=splitter)
grid_search.fit(X, y)  # X must contain the 'user_id' column

print(f"Best params: {grid_search.best_params_}")

Custom Hash Function

import hashlib

def custom_hash(id_value):
    return int(hashlib.md5(str(id_value).encode()).hexdigest(), 16)

splitter = StableHashSplit(
    test_size=0.3,
    id_column='user_id',
    hash_func=custom_hash
)

📖 API Reference

StableHashSplit

StableHashSplit(test_size=0.2, id_column='id', hash_func=None, random_state=None)

Parameters:

  • test_size (float): Fraction of samples for test set (0 < test_size < 1)
  • id_column (str | int | None): Column name/index with stable IDs. Uses DataFrame index if None
  • hash_func (callable): Function mapping ID to non-negative integer. Defaults to CRC32
  • random_state: Ignored (for scikit-learn compatibility)

Methods:

  • split(X, y=None): Returns train/test indices
  • get_n_splits(): Returns 1 (single split)
  • train_test_split(X, y): Convenience method for direct splitting
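To show how the documented `split()`/`get_n_splits()` protocol fits scikit-learn's CV interface, here is a minimal stand-in class with the same shape (a sketch assuming CRC32 hashing, not the library's source; `HashSplitSketch` is a hypothetical name):

```python
from zlib import crc32

class HashSplitSketch:
    """Illustrative hash-based splitter following the split()/get_n_splits() protocol."""

    def __init__(self, test_size=0.2):
        self.test_size = test_size

    def get_n_splits(self, X=None, y=None, groups=None):
        return 1  # a single, stable split

    def split(self, ids):
        # Yield one (train_indices, test_indices) pair based on ID hashes.
        threshold = self.test_size * 2**32
        test = [i for i, v in enumerate(ids)
                if crc32(str(v).encode()) < threshold]
        train = [i for i in range(len(ids)) if i not in test]
        yield train, test

splitter = HashSplitSketch(test_size=0.2)
train_idx, test_idx = next(splitter.split([1001, 1002, 1003, 1004, 1005]))
```

Because `get_n_splits()` returns 1, tools like `GridSearchCV` treat this as a single holdout split rather than k-fold cross-validation.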

🤝 Contributing

We welcome contributions! Please:

  1. Open an issue to discuss your idea
  2. Fork the repository
  3. Create a feature branch
  4. Submit a pull request

For development setup, see PUBLISH.md.

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Attribution

Inspired by ID-based splitting concepts from Aurélien Géron's "Hands-On Machine Learning with Scikit-Learn and PyTorch". This is an independent implementation.
