Enterprise-grade data contract management and drift detection for ML/LLM pipelines to prevent silent schema changes and data quality regressions.
The Toolkit Data Contracts and Drift Detection tool provides a lightweight, dependency-free solution for maintaining data quality and consistency in machine learning pipelines. It automatically infers data contracts from samples, validates new data against established contracts, and detects drift before it impacts model performance.
- Automatic Contract Inference: Generate contracts from JSONL samples
- Schema Validation: Enforce data structure and type constraints
- Version Control: Track contract evolution over time
- Flexible Configuration: Allow extra fields, required fields, custom types
- Statistical Profiling: Build comprehensive baseline profiles
- Distribution Analysis: Track changes in data distributions
- Quality Gates: Automated validation with configurable thresholds
- CI/CD Integration: Exit codes for pipeline integration
- Zero Dependencies: Lightweight, easy to deploy
- CLI Interface: Simple command-line tools
- JSON Format: Human-readable contracts and profiles
- Batch Processing: Handle large datasets efficiently
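
To make the contract inference and schema validation features above concrete, here is a minimal standard-library sketch of the general idea. It is not the toolkit's actual implementation or contract format: the function names (`type_name`, `infer_contract`, `validate_record`) and the contract layout (`version`, `allow_extra_fields`, `fields`) are assumptions chosen for illustration.

```python
import json
from typing import Any


def type_name(value: Any) -> str:
    """Map a decoded JSON value to a JSON type name."""
    if value is None:
        return "null"
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return "boolean"
    if isinstance(value, int):
        return "integer"
    if isinstance(value, float):
        return "number"
    if isinstance(value, str):
        return "string"
    if isinstance(value, list):
        return "array"
    return "object"


def infer_contract(jsonl_text: str, allow_extra_fields: bool = False) -> dict:
    """Infer a contract (hypothetical layout) from JSONL text, one object per line."""
    records = [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]
    observed_types: dict = {}
    seen_count: dict = {}
    for record in records:
        for name, value in record.items():
            observed_types.setdefault(name, set()).add(type_name(value))
            seen_count[name] = seen_count.get(name, 0) + 1
    return {
        "version": 1,
        "allow_extra_fields": allow_extra_fields,
        "fields": {
            # A field is treated as required only if every sample contains it.
            name: {"types": sorted(types), "required": seen_count[name] == len(records)}
            for name, types in observed_types.items()
        },
    }


def validate_record(record: dict, contract: dict) -> list:
    """Return a list of human-readable violations; an empty list means valid."""
    errors = []
    for name, spec in contract["fields"].items():
        if name not in record:
            if spec["required"]:
                errors.append(f"missing required field: {name}")
        elif type_name(record[name]) not in spec["types"]:
            errors.append(
                f"{name}: expected {spec['types']}, got {type_name(record[name])}"
            )
    if not contract["allow_extra_fields"]:
        for name in record:
            if name not in contract["fields"]:
                errors.append(f"unexpected field: {name}")
    return errors


samples = '{"id": 1, "text": "hi"}\n{"id": 2, "text": "yo", "score": 0.5}\n'
contract = infer_contract(samples)
print(json.dumps(contract, indent=2))
print(validate_record({"id": "3", "text": "x"}, contract))
```

In this sketch, `score` is inferred as optional because it appears in only one of the two samples, and the final record is rejected because its `id` is a string rather than an integer.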
# Install from source
git clone https://github.com/AKIVA-AI/toolkit-data-contracts.git
cd toolkit-data-contracts
pip install -e ".[dev]"
# Install in production
pip install toolkit-data-contracts-drift

# 1. Infer contract from sample data
toolkit-contracts infer --input samples.jsonl --out contract.json
# 2. Create baseline profile
toolkit-contracts profile --input baseline.jsonl --contract contract.json --out baseline.profile.json
# 3. Validate new data and check for drift
toolkit-contracts check --input new_batch.jsonl --contract contract.json --baseline baseline.profile.json

Commands:
- infer: Generate contract from JSONL samples
- profile: Create baseline profile from data
- check: Validate data and detect drift

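
As a rough illustration of what a profile-then-check flow can look like, the sketch below builds a small per-field profile and compares a new batch against it. The statistics used (null rate, mean, standard deviation), the thresholds, and the names `profile_field` and `detect_drift` are all assumptions for illustration; the toolkit's own profile format and drift metrics may differ.

```python
import statistics


def profile_field(values: list) -> dict:
    """Summarize one field: share of nulls plus mean and stdev of numeric values."""
    numeric = [
        v for v in values
        if isinstance(v, (int, float)) and not isinstance(v, bool)
    ]
    return {
        "count": len(values),
        "null_rate": sum(v is None for v in values) / len(values) if values else 0.0,
        "mean": statistics.fmean(numeric) if numeric else None,
        "stdev": statistics.pstdev(numeric) if numeric else None,
    }


def detect_drift(
    baseline: dict,
    current: dict,
    max_mean_shift: float = 3.0,        # hypothetical threshold, in baseline stdevs
    max_null_rate_delta: float = 0.05,  # hypothetical threshold, absolute change
) -> list:
    """Return the names of the statistics that moved beyond their threshold."""
    findings = []
    if abs(current["null_rate"] - baseline["null_rate"]) > max_null_rate_delta:
        findings.append("null_rate")
    if baseline["mean"] is not None and current["mean"] is not None:
        scale = baseline["stdev"] or 1.0  # fall back to 1.0 for a constant baseline
        if abs(current["mean"] - baseline["mean"]) / scale > max_mean_shift:
            findings.append("mean")
    return findings


baseline_profile = profile_field([10, 11, 9, 10, 10])
new_profile = profile_field([20, 21, None, 19])
print(detect_drift(baseline_profile, new_profile))
```

Measuring the mean shift in units of the baseline standard deviation keeps the threshold independent of each field's scale, which is one common design choice among several.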
Exit codes:
- 0: Validation passed
- 4: Validation failed or drift detected

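
The exit codes above can be used to gate a CI job. The following is a minimal sketch of a Python wrapper around the `check` command, using only the flags shown in the quick start. The helper names (`describe_exit_code`, `build_check_command`, `run_gate`) are hypothetical, and treating any code other than 0 or 4 as a generic error is an assumption, since only those two codes are documented here.

```python
import subprocess

EXIT_PASSED = 0  # documented: validation passed
EXIT_FAILED = 4  # documented: validation failed or drift detected


def describe_exit_code(code: int) -> str:
    """Translate an exit code into a short status label."""
    if code == EXIT_PASSED:
        return "passed"
    if code == EXIT_FAILED:
        return "failed"
    # Assumption: any other code means the tool itself could not run properly.
    return "error"


def build_check_command(input_path: str, contract_path: str, baseline_path: str) -> list:
    """Build the argument list for `toolkit-contracts check`."""
    return [
        "toolkit-contracts", "check",
        "--input", input_path,
        "--contract", contract_path,
        "--baseline", baseline_path,
    ]


def run_gate(input_path: str, contract_path: str, baseline_path: str) -> int:
    """Run the check and return its exit code so the CI job can propagate it."""
    command = build_check_command(input_path, contract_path, baseline_path)
    result = subprocess.run(command, check=False)
    status = describe_exit_code(result.returncode)
    print(f"data contract check: {status} (exit code {result.returncode})")
    return result.returncode


# Example use in a CI script (requires the CLI to be installed and on PATH):
#   import sys
#   sys.exit(run_gate("new_batch.jsonl", "contract.json", "baseline.profile.json"))
```

Returning the original exit code, rather than collapsing it to 0 or 1, lets the surrounding pipeline distinguish a failed validation from other kinds of failure.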
MIT License - see LICENSE file for details.