A scikit-learn compatible splitter for deterministic, ID-based train/test splits that prevent data leakage in machine learning workflows.
When datasets grow or get updated, traditional random splits can cause data leakage: samples that were previously in your test set might end up in training during retraining, leading to overly optimistic and invalid model evaluations.
StableHashSplit solves this by assigning samples to train/test sets deterministically based on a hash of a stable identifier (e.g., user ID, transaction ID). Once assigned, a sample stays in the same set forever, ensuring reproducible and reliable evaluations across dataset versions.
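The core mechanism can be sketched in a few lines (an illustrative sketch using CRC32, not the package's actual source):

```python
from zlib import crc32

def assign_split(identifier, test_size=0.2):
    # Map the ID's 32-bit hash to [0, 1); IDs whose ratio falls
    # below test_size go to the test set, the rest go to train.
    ratio = crc32(str(identifier).encode()) / 2**32
    return 'test' if ratio < test_size else 'train'

# The assignment depends only on the ID, never on the rest of the
# dataset, so it is identical on every run and every machine.
print(assign_split(1001))
```

Because each ID is judged in isolation, adding or removing other rows can never move an existing sample between sets.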
- 🔒 Deterministic & Stable: Same ID always maps to the same split
- 🤖 Scikit-Learn Compatible: Works seamlessly with `GridSearchCV`, `cross_val_score`, and ML pipelines
- 📊 Flexible Inputs: Supports pandas DataFrames, NumPy arrays, and array-like structures
- ⚙️ Customizable: Choose your hash function and ID column
- 🚀 Simple API: Minimal code changes needed
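The leakage problem that motivates this package can be demonstrated with the standard library alone (a standalone sketch; `StableHashSplit` itself is not used here):

```python
import random
from zlib import crc32

def hash_test_ids(ids, test_size=0.2):
    # Hash-based assignment: membership depends only on the ID itself
    return {i for i in ids if crc32(str(i).encode()) / 2**32 < test_size}

def random_test_ids(ids, test_size=0.2, seed=42):
    # Seeded random assignment: membership depends on the whole population
    rng = random.Random(seed)
    return set(rng.sample(list(ids), int(len(ids) * test_size)))

old = list(range(1000, 1100))   # original dataset
new = list(range(1000, 1200))   # dataset after growth

# Random split: even with a fixed seed, IDs that were in the old test
# set can land in the new training set once the population changes.
leaked = random_test_ids(old) - random_test_ids(new)
print(f"IDs that moved out of the test set after growth: {len(leaked)}")

# Hash split: every old test ID is still in the new test set.
assert hash_test_ids(old) <= hash_test_ids(new)
```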
```bash
pip install stable-hash-splitter
```

```python
import pandas as pd
from stable_hash_splitter import StableHashSplit

# Sample data with user IDs
data = pd.DataFrame({
    'user_id': [1001, 1002, 1003, 1004, 1005],
    'feature_1': [0.5, 0.3, 0.8, 0.1, 0.9],
    'feature_2': [10, 20, 30, 40, 50],
    'target': [1, 0, 1, 0, 1]
})

# Create stable splitter
splitter = StableHashSplit(test_size=0.2, id_column='user_id')

# Split your data
X_train, X_test, y_train, y_test = splitter.train_test_split(
    data[['user_id', 'feature_1', 'feature_2']],
    data['target']
)

print(f"Train size: {len(X_train)}, Test size: {len(X_test)}")
# Output: Train size: 4, Test size: 1
```

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

splitter = StableHashSplit(test_size=0.2, id_column='user_id')
model = RandomForestClassifier()
param_grid = {'n_estimators': [50, 100], 'max_depth': [5, 10]}

grid_search = GridSearchCV(model, param_grid, cv=splitter)
grid_search.fit(X, y)  # X must contain the 'user_id' column
print(f"Best params: {grid_search.best_params_}")
```

```python
import hashlib

def custom_hash(id_value):
    # Any function mapping an ID to a non-negative integer works
    return int(hashlib.md5(str(id_value).encode()).hexdigest(), 16)

splitter = StableHashSplit(
    test_size=0.3,
    id_column='user_id',
    hash_func=custom_hash
)
```

```python
StableHashSplit(test_size=0.2, id_column='id', hash_func=None, random_state=None)
```

Parameters:
- `test_size` (float): Fraction of samples assigned to the test set (0 < test_size < 1)
- `id_column` (str | int | None): Column name or index containing stable IDs; uses the DataFrame index if None
- `hash_func` (callable): Function mapping an ID to a non-negative integer; defaults to CRC32
- `random_state`: Ignored (accepted for scikit-learn compatibility)
Methods:
- `split(X, y=None)`: Returns train/test indices
- `get_n_splits()`: Returns 1 (single split)
- `train_test_split(X, y)`: Convenience method for direct splitting
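These methods follow the scikit-learn cross-validation splitter protocol, which is why the class can be passed as `cv=` to `GridSearchCV`. A minimal illustrative implementation of that protocol (not the package's source) looks like this:

```python
import numpy as np
from zlib import crc32

class MiniHashSplit:
    """Illustrative hash-based splitter following the sklearn CV protocol."""

    def __init__(self, test_size=0.2, id_column='id'):
        self.test_size = test_size
        self.id_column = id_column

    def get_n_splits(self, X=None, y=None, groups=None):
        return 1  # a single deterministic split

    def split(self, X, y=None, groups=None):
        # Hash each ID to [0, 1) and compare against test_size
        ids = X[self.id_column].astype(str)
        mask = np.array([crc32(i.encode()) / 2**32 < self.test_size
                         for i in ids])
        # Yield (train_indices, test_indices), as sklearn expects
        yield np.where(~mask)[0], np.where(mask)[0]
```

Any object exposing `split` and `get_n_splits` with these signatures is accepted wherever scikit-learn takes a `cv` argument.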
We welcome contributions! Please:
- Open an issue to discuss your idea
- Fork the repository
- Create a feature branch
- Submit a pull request
For development setup, see PUBLISH.md.
This project is licensed under the MIT License - see the LICENSE file for details.
Inspired by ID-based splitting concepts from Aurélien Géron's "Hands-On Machine Learning with Scikit-Learn and PyTorch". This is an independent implementation.
- 📧 Email: baraa-hazaa00@hotmail.com
- 🐛 Issues: GitHub Issues
- 📚 Documentation: This README and docstrings