BatchGuard

Batch-level constraint enforcement for tabular data

BatchGuard is a Python library that enforces global and group-level constraints on tabular datasets after they have been generated, sampled, or simulated. Rather than rejecting entire datasets or redesigning generators, BatchGuard minimally adjusts existing data so that high-level rules are satisfied while preserving local realism.

Features

  • Group-level constraints: Enforce min/max values, total ranges, and quotas per group
  • Multiple enforcement strategies: Redistribution, scaling, and local rebalancing
  • Transparent reporting: Track all changes with detailed reports
  • Deterministic behavior: Same input always produces the same output
  • Pandas integration: Works seamlessly with pandas DataFrames

Installation

Install from source:

git clone https://github.com/jaxver/batchguard.git
cd batchguard
pip install -e .

Quick Start

import pandas as pd
from batchguard import BatchGuard, min_max, total_range

# Create sample data
df = pd.DataFrame({
    'account': ['A', 'A', 'A', 'B', 'B', 'B'],
    'amount': [100, 200, 300, 50, 60, 70],
})

# Define constraints
guard = BatchGuard(df)
guard.add_constraint(min_max('account', 'amount', min_val=100, max_val=250))
guard.add_constraint(total_range('account', 'amount', min_total=400))

# Enforce constraints
modified_df, report = guard.enforce()

print(report.to_summary())

Use Cases

1. Synthetic Test Data

Fix datasets where some groups contain only extreme values:

from batchguard import BatchGuard, min_max

# transactions_df: an existing DataFrame with 'account_id' and 'amount' columns
# Keep each account's transaction amounts within a realistic range
guard = BatchGuard(transactions_df)
guard.add_constraint(
    min_max('account_id', 'amount', min_val=10, max_val=1000)
)
fixed_df, report = guard.enforce()

2. Scenario Simulation

Correct totals that violate known limits:

from batchguard import BatchGuard, total_range

# scenarios_df: an existing DataFrame with 'scenario' and 'value' columns
# Keep each scenario's total within known bounds
guard = BatchGuard(scenarios_df)
guard.add_constraint(
    total_range('scenario', 'value', min_total=0, max_total=1000000)
)
corrected_df, report = guard.enforce()

3. Demo Datasets

Fix aggregate inconsistencies without manual editing:

from batchguard import BatchGuard, quota

# demo_df: an existing DataFrame with a 'category' column
# Keep the number of rows per category between 5 and 50
guard = BatchGuard(demo_df)
guard.add_constraint(quota('category', min_count=5, max_count=50))
balanced_df, report = guard.enforce()

Constraint Types

Group Min/Max

Enforce minimum and maximum values per group:

from batchguard import min_max

constraint = min_max(
    group_col='account',
    value_col='amount',
    min_val=10,
    max_val=1000
)
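
To make the rule concrete, here is a minimal pandas-only sketch that does not use BatchGuard itself. It assumes the constraint means that every value in a group must lie within [min_val, max_val]; the helper names `groups_outside_bounds` and `clip_values` are hypothetical, and simple clipping is only one possible way to repair a violation, not necessarily what BatchGuard does internally.

```python
import pandas as pd

def groups_outside_bounds(df, group_col, value_col, min_val, max_val):
    """Return per-group min/max for groups with any value outside the bounds."""
    stats = df.groupby(group_col)[value_col].agg(["min", "max"])
    out_of_range = (stats["min"] < min_val) | (stats["max"] > max_val)
    return stats[out_of_range]

def clip_values(df, value_col, min_val, max_val):
    """Return a copy of df with value_col clipped into [min_val, max_val]."""
    out = df.copy()
    out[value_col] = out[value_col].clip(lower=min_val, upper=max_val)
    return out

df = pd.DataFrame({
    "account": ["A", "A", "B", "B", "C", "C"],
    "amount": [5, 200, 50, 1500, 20, 30],
})

# Accounts A (value below 10) and B (value above 1000) are flagged; C is fine
violations = groups_outside_bounds(df, "account", "amount", 10, 1000)
clipped = clip_values(df, "amount", 10, 1000)
print(violations)
print(clipped)
```

Clipping changes only the offending values and leaves the input DataFrame untouched, which fits the "minimal adjustment" idea described above.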

Group Total Range

Enforce total sum within a range per group:

from batchguard import total_range

constraint = total_range(
    group_col='category',
    value_col='sales',
    min_total=1000,
    max_total=10000
)
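
As an illustration of what a total-range rule implies, the following pandas-only sketch scales each group proportionally so that its sum lands inside the range. The function `scale_group_totals` is a hypothetical helper, and proportional scaling is an assumption about one reasonable repair, not a description of BatchGuard's internals.

```python
import pandas as pd

def scale_group_totals(df, group_col, value_col, min_total=None, max_total=None):
    """Scale values per group so each group's sum falls within the range."""
    out = df.copy()
    # Broadcast each group's current total onto its rows
    totals = out.groupby(group_col)[value_col].transform("sum")
    # The nearest total inside [min_total, max_total]
    target = totals.clip(lower=min_total, upper=max_total)
    # Groups with a zero total cannot be scaled, so leave them unchanged
    factor = (target / totals).where(totals != 0, 1.0)
    out[value_col] = out[value_col] * factor
    return out

df = pd.DataFrame({
    "category": ["X", "X", "Y", "Y", "Z", "Z"],
    "sales": [200, 300, 8000, 12000, 1500, 2500],
})

scaled = scale_group_totals(df, "category", "sales", min_total=1000, max_total=10000)
print(scaled.groupby("category")["sales"].sum())
```

In this sample, group X (total 500) is scaled up to 1000, group Y (total 20000) is scaled down to 10000, and group Z (total 4000) is already in range and stays as it is. Scaling keeps the relative proportions of rows within a group.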

Group Quota

Enforce row count per group:

from batchguard import quota

constraint = quota(
    group_col='region',
    min_count=10,
    max_count=100
)
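
A quota rule can be checked with plain pandas, as in the sketch below. The helper `quota_violations` is hypothetical, and trimming oversized groups to their first `max_count` rows is only an assumed, deterministic way to handle the upper bound. How BatchGuard itself handles groups with too few or too many rows is not specified here.

```python
import pandas as pd

def quota_violations(df, group_col, min_count=None, max_count=None):
    """Return row counts for groups outside [min_count, max_count]."""
    counts = df.groupby(group_col).size()
    lo = float("-inf") if min_count is None else min_count
    hi = float("inf") if max_count is None else max_count
    return counts[(counts < lo) | (counts > hi)]

df = pd.DataFrame({
    "region": ["north", "south", "south", "south", "east", "east"],
    "value": [1, 2, 3, 4, 5, 6],
})

# north has too few rows, south has too many, east is within the quota
violations = quota_violations(df, "region", min_count=2, max_count=2)

# Keep at most 2 rows per region, preserving the original row order
trimmed = df.groupby("region").head(2)
print(violations)
print(trimmed)
```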

Enforcement Strategies

BatchGuard supports multiple strategies for enforcing constraints:

  • Redistribution: Redistributes values proportionally within groups
  • Scaling: Scales values proportionally to fit constraints
  • Local Rebalancing: Makes minimal targeted adjustments

BatchGuard selects a strategy automatically based on the constraint type; you can also specify one explicitly.
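
To give a feel for redistribution, here is a small NumPy sketch for a single group. It caps values at an upper bound and hands the removed excess to the other rows in proportion to their remaining headroom, so the group total is preserved whenever that is feasible. The function `redistribute_excess` and the headroom-proportional rule are assumptions made for illustration; BatchGuard's actual algorithm may differ.

```python
import numpy as np

def redistribute_excess(values, max_val):
    """Cap values at max_val and spread the excess over rows with headroom."""
    values = np.asarray(values, dtype=float)
    capped = np.minimum(values, max_val)
    excess = (values - capped).sum()      # amount removed by capping
    headroom = max_val - capped           # how much each row can still absorb
    total_headroom = headroom.sum()
    if excess == 0 or total_headroom == 0:
        return capped
    # If the excess exceeds the available headroom, the total cannot be
    # preserved; in that case every row ends up at max_val.
    share = min(excess, total_headroom) / total_headroom
    return capped + headroom * share

result = redistribute_excess([100, 200, 300], max_val=250)
print(result)        # [137.5 212.5 250. ]
print(result.sum())  # 600.0
```

Because each row receives at most its own headroom, a single pass is enough and no row is pushed above the bound.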

Reporting

BatchGuard provides detailed reporting of all changes:

modified_df, report = guard.enforce()

# Get summary statistics
summary = report.get_summary()
print(f"Rows modified: {summary['total_rows_modified']}")
print(f"Values changed: {summary['total_values_changed']}")

# Get change log
changes = report.get_change_log()
print(changes)

# Export as JSON
json_report = report.to_json()

# Human-readable summary
print(report.to_summary())
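
The exact structure of the change log is not documented above. As a rough idea of what such a log can contain, the sketch below diffs an original and a modified DataFrame with plain pandas. The function `change_log` and its column names (`row`, `column`, `old`, `new`) are hypothetical, and the sketch assumes both frames share the same index and columns.

```python
import pandas as pd

def change_log(before, after, columns):
    """List every cell in `columns` whose value differs between the frames."""
    records = []
    for col in columns:
        # Treat two missing values as equal so NaN is not reported as a change
        same = (before[col] == after[col]) | (before[col].isna() & after[col].isna())
        for idx in before.index[~same]:
            records.append({
                "row": idx,
                "column": col,
                "old": before.at[idx, col],
                "new": after.at[idx, col],
            })
    return pd.DataFrame(records, columns=["row", "column", "old", "new"])

before = pd.DataFrame({"account": ["A", "A", "B"], "amount": [100, 300, 50]})
after = before.copy()
after.loc[1, "amount"] = 250
after.loc[2, "amount"] = 100

log = change_log(before, after, ["amount"])
print(log)
```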

Requirements

  • Python 3.8+
  • pandas >= 1.5.0
  • numpy >= 1.20.0

License

MIT License. See the LICENSE file for details.

Why BatchGuard?

Many datasets look valid when inspected row by row, but break obvious rules when viewed as a whole. BatchGuard addresses this gap by:

  • Detecting aggregate violations that row-by-row validation misses
  • Fixing inconsistencies without regenerating entire datasets
  • Preserving local realism while ensuring global consistency
  • Providing transparency about what changed and why

BatchGuard doesn't compete with data validators or generators; it complements them by handling a step where datasets commonly break and for which no standard solution exists.
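
The gap between row-level and aggregate validity is easy to reproduce with plain pandas. In the sketch below, which uses made-up sample data and illustrative limits (0 to 500 per row, at most 1000 per scenario), every row passes its own check while one group total still breaks the limit.

```python
import pandas as pd

df = pd.DataFrame({
    "scenario": ["s1", "s1", "s1", "s2"],
    "value": [400, 450, 480, 300],
})

# Row-by-row validation: every value lies within [0, 500]
row_ok = df["value"].between(0, 500).all()

# Aggregate validation: scenario totals must not exceed 1000
totals = df.groupby("scenario")["value"].sum()
violating = totals[totals > 1000]

print(row_ok)      # True
print(violating)   # s1 has a total of 1330
```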
