Batch-level constraint enforcement for tabular data
BatchGuard is a Python library that enforces global and group-level constraints on tabular datasets after they have been generated, sampled, or simulated. Rather than rejecting entire datasets or redesigning generators, BatchGuard minimally adjusts existing data so that high-level rules are satisfied while preserving local realism.
- Group-level constraints: Enforce min/max values, total ranges, and quotas per group
- Multiple enforcement strategies: Redistribution, scaling, and local rebalancing
- Transparent reporting: Track all changes with detailed reports
- Deterministic behavior: Same input always produces the same output
- Pandas integration: Takes a pandas DataFrame as input and returns a modified DataFrame
Install from source:
git clone https://github.com/jaxver/batchguard.git
cd batchguard
pip install -e .

import pandas as pd
from batchguard import BatchGuard, min_max, total_range
# Create sample data
df = pd.DataFrame({
    'account': ['A', 'A', 'A', 'B', 'B', 'B'],
    'amount': [100, 200, 300, 50, 60, 70],
})
# Define constraints
guard = BatchGuard(df)
guard.add_constraint(min_max('account', 'amount', min_val=100, max_val=250))
guard.add_constraint(total_range('account', 'amount', min_total=400))
# Enforce constraints
modified_df, report = guard.enforce()
print(report.to_summary())

Fix datasets where some groups contain only extreme values:
from batchguard import BatchGuard, min_max
# Ensure each account has a realistic range of transaction amounts
guard = BatchGuard(transactions_df)
guard.add_constraint(
    min_max('account_id', 'amount', min_val=10, max_val=1000)
)
fixed_df, report = guard.enforce()

Correct totals that violate known limits:
from batchguard import BatchGuard, total_range
# Ensure scenario totals stay within bounds
guard = BatchGuard(scenarios_df)
guard.add_constraint(
    total_range('scenario', 'value', min_total=0, max_total=1000000)
)
corrected_df, report = guard.enforce()

Fix aggregate inconsistencies without manual editing:
from batchguard import BatchGuard, quota
# Ensure each category has a reasonable number of items
guard = BatchGuard(demo_df)
guard.add_constraint(quota('category', min_count=5, max_count=50))
balanced_df, report = guard.enforce()

Enforce minimum and maximum values per group:
from batchguard import min_max
constraint = min_max(
    group_col='account',
    value_col='amount',
    min_val=10,
    max_val=1000
)

Enforce total sum within a range per group:
from batchguard import total_range
constraint = total_range(
    group_col='category',
    value_col='sales',
    min_total=1000,
    max_total=10000
)

Enforce row count per group:
from batchguard import quota
constraint = quota(
    group_col='region',
    min_count=10,
    max_count=100
)

BatchGuard supports multiple strategies for enforcing constraints:
- Redistribution: Redistributes values proportionally within groups
- Scaling: Scales values proportionally to fit constraints
- Local Rebalancing: Makes minimal targeted adjustments
Strategies are automatically selected based on constraint type, or you can specify them explicitly.
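
To make the difference between these strategies concrete, here is a minimal pandas-only sketch of proportional scaling and of a targeted clip. It is not BatchGuard's implementation: the helper names `scale_group_totals` and `clip_values` are hypothetical, and the sketch assumes positive values and handles only an upper bound on group totals.

```python
import pandas as pd


def scale_group_totals(df, group_col, value_col, max_total):
    """Hypothetical helper: scale each group's values down proportionally
    so that the group total does not exceed max_total."""
    out = df.copy()
    totals = out.groupby(group_col)[value_col].transform("sum")
    # Factor is 1.0 for groups already within the limit, below 1.0 otherwise.
    factor = (max_total / totals).clip(upper=1.0)
    out[value_col] = out[value_col] * factor
    return out


def clip_values(df, value_col, min_val, max_val):
    """Hypothetical helper: move only out-of-range values to the nearest bound."""
    out = df.copy()
    out[value_col] = out[value_col].clip(lower=min_val, upper=max_val)
    return out


df = pd.DataFrame({
    "account": ["A", "A", "A", "B", "B", "B"],
    "amount": [100, 200, 300, 50, 60, 70],
})

# Account A sums to 600, so its values are halved; account B is left alone.
scaled = scale_group_totals(df, "account", "amount", max_total=300)

# Only the values outside [100, 250] change.
clipped = clip_values(df, "amount", min_val=100, max_val=250)
```

Scaling touches every row in a violating group but keeps the relative proportions between rows, while clipping changes only the rows that are out of range. Note that clipping can itself shift a group total, so combining constraint types may need more than one pass.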
BatchGuard provides detailed reporting of all changes:
modified_df, report = guard.enforce()
# Get summary statistics
summary = report.get_summary()
print(f"Rows modified: {summary['total_rows_modified']}")
print(f"Values changed: {summary['total_values_changed']}")
# Get change log
changes = report.get_change_log()
print(changes)
# Export as JSON
json_report = report.to_json()
# Human-readable summary
print(report.to_summary())

Requirements:
- Python 3.8+
- pandas >= 1.5.0
- numpy >= 1.20.0
MIT License - see LICENSE file for details.
Many datasets look valid when inspected row by row, but break obvious rules when viewed as a whole. BatchGuard addresses this gap by:
- Detecting aggregate violations that row-by-row validation misses
- Fixing inconsistencies without regenerating entire datasets
- Preserving local realism while ensuring global consistency
- Providing transparency about what changed and why
BatchGuard doesn't compete with data validators or generators; it complements them by handling a step where datasets are known to break but for which no standard solution exists.
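
As an illustration of that gap, the following pandas-only sketch builds a dataset in which every row passes a per-row range check while one group still exceeds a total limit. The column names and both limits are made up for the example, and BatchGuard itself is not used.

```python
import pandas as pd

df = pd.DataFrame({
    "scenario": ["s1", "s1", "s1", "s2", "s2"],
    "value": [400, 450, 480, 100, 200],
})

# Row-level check: every value lies in an assumed valid range of [0, 500].
rows_ok = df["value"].between(0, 500).all()

# Group-level check: each scenario total must not exceed an assumed limit of 1000.
totals = df.groupby("scenario")["value"].sum()
violating = totals[totals > 1000]

print("all rows valid:", bool(rows_ok))
print("groups over the limit:", list(violating.index))
```

Here the row-level check passes for all five rows, yet scenario `s1` sums to 1330 and fails the group-level check. This is the kind of violation that a row-by-row validator does not see.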