-
Notifications
You must be signed in to change notification settings - Fork 26
Open
Description
The Problem
The core issue: PolicyEngine has two different concepts of "input variable" that can diverge:
- Structural definition (Variable.is_input_variable()): A variable is a true input if it has no formulas AND no adds AND no subtracts. This is the
"correct" definition. - Runtime definition (sim.input_variables): Any variable that has values stored in the simulation (known periods). This is what gets used when saving
H5 files.
The loophole: Variables with adds (like self_employed_pension_contribution_ald) are structurally NOT inputs - they're aggregations. But if the CPS
dataset contains pre-computed values for them, they appear in sim.input_variables at runtime.
Why it corrupts calculations:
self_employed_pension_contribution_ald
└── adds: [self_employed_pension_contribution_ald_person]
└── formula: min(self_employment_income, contributions)
The CPS has a stale pre-computed value (say $203K) that doesn't match the formula result (say $0 because SE income = 0). When you:
- Load the dataset and calculate fresh → formula runs → correct $0
- Save to H5 including the pseudo-input → stale $203K gets saved
- Reload and calculate → stored $203K overrides the formula → wrong AGI → wrong everything downstream
A Demo to reproduce:
Output:
(lt) baogorek@T14-Gen6:~/devl/long-term/policyengine-us-data/policyengine_us_data/datasets/cps/long_term$ python demo_pseudo_input_bug.py
======================================================================
DEMONSTRATION: The Pseudo-Input Variable Bug
======================================================================
[STEP 1] Loading base dataset and finding a problematic case...
Found 143 tax units with stale pseudo-input > $50k
Example Tax Unit 535:
Stored value: $ 58,172.36
SE income: $ 0.00
Contributions: $ 0.00
Formula should give: $ 0.00
--> Stale value doesn't match formula!
[STEP 2] Fresh calculation (correct behavior)...
After clearing cache and recalculating:
ALD (fresh): $ 0.00
AGI (fresh): $ 342,647.16
[STEP 3] Saving H5 file WITH pseudo-input (simulating bug)...
Saved to /tmp/tmp6i0o2ykx/buggy.h5
Pseudo-input 'self_employed_pension_contribution_ald' was included: True
[STEP 4] Reloading H5 and demonstrating corruption...
After reloading the buggy H5:
ALD (reloaded): $ 58,172.36
AGI (reloaded): $ 284,474.81
COMPARISON:
Fresh Reloaded Difference
---------------------------------------------------------------------------
ALD $ 0.00 $ 58,172.36 $ 58,172.36
AGI $ 342,647.16 $ 284,474.81 $ -58,172.34
[STEP 5] Saving H5 WITHOUT pseudo-input (the fix)...
Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads.
Excluding 3 pseudo-input variables:
- housing_assistance
- self_employed_health_insurance_ald
- self_employed_pension_contribution_ald
Saved to /tmp/tmp6i0o2ykx/fixed.h5
[STEP 6] Verifying the fix...
After reloading the FIXED H5:
ALD (fixed): $ 0.00
AGI (fixed): $ 342,647.16
FINAL COMPARISON:
Fresh Buggy H5 Fixed H5
---------------------------------------------------------------------------
ALD $ 0.00 $ 58,172.36 $ 0.00
AGI $ 342,647.16 $ 284,474.81 $ 342,647.16
Fix successful: True
======================================================================
END OF DEMONSTRATION
======================================================================
Code
"""
Demonstration: Why Pseudo-Input Variables Corrupt H5 Files
This script shows how variables that LOOK like inputs but AGGREGATE
calculated values can corrupt saved datasets.
The culprit: self_employed_pension_contribution_ald
- Has 'adds' attribute (sums person-level values)
- No explicit formula
- PolicyEngine treats it as an "input"
- BUT its component (self_employed_pension_contribution_ald_person) HAS a formula!
The formula: min(self_employment_income, self_employed_pension_contributions)
The bug manifests when:
1. Base data has stale pre-computed values
2. Fresh calculation gives correct values (formula runs)
3. We save to H5 including the pseudo-input (stale value gets stored)
4. We reload and calculate - the stored value overrides the formula
"""
import tempfile
import os
import h5py
import numpy as np
from policyengine_us import Microsimulation
from policyengine_core.enums import Enum
print("=" * 70)
print("DEMONSTRATION: The Pseudo-Input Variable Bug")
print("=" * 70)
YEAR = 2024
PSEUDO_INPUT = "self_employed_pension_contribution_ald"
# =============================================================================
# STEP 1: Load base dataset and find a problematic case
# =============================================================================
print(f"\n[STEP 1] Loading base dataset and finding a problematic case...")
sim = Microsimulation(
dataset="hf://policyengine/policyengine-us-data/enhanced_cps_2024.h5"
)
# All variables at tax_unit level for consistent comparison
stale_values = sim.calculate(PSEUDO_INPUT, period=YEAR).values
se_income = sim.calculate(
"self_employment_income_before_lsr", period=YEAR, map_to="tax_unit"
).values
contributions = sim.calculate(
"self_employed_pension_contributions", period=YEAR, map_to="tax_unit"
).values
# Find tax units where stale value > 0 but formula would give 0
# Formula: min(SE_income, contributions) at person level, summed to tax unit
mask = (stale_values > 50000) & ((se_income == 0) | (contributions == 0))
problem_indices = np.where(mask)[0]
print(
f" Found {len(problem_indices)} tax units with stale pseudo-input > $50k"
)
tu_idx = problem_indices[0]
stale_value = stale_values[tu_idx]
print(f"\n Example Tax Unit {tu_idx}:")
print(f" Stored value: ${stale_value:>12,.2f}")
print(f" SE income: ${se_income[tu_idx]:>12,.2f}")
print(f" Contributions: ${contributions[tu_idx]:>12,.2f}")
print(
f" Formula should give: ${min(se_income[tu_idx], contributions[tu_idx]):>12,.2f}"
)
print(f" --> Stale value doesn't match formula!")
# =============================================================================
# STEP 2: Calculate AGI with fresh formula (correct behavior)
# =============================================================================
print(f"\n[STEP 2] Fresh calculation (correct behavior)...")
# Delete the cached pseudo-input to force recalculation
sim.delete_arrays(PSEUDO_INPUT)
ald_fresh = sim.calculate(PSEUDO_INPUT, period=YEAR).values[tu_idx]
agi_fresh = sim.calculate("adjusted_gross_income", period=YEAR).values[tu_idx]
print(f" After clearing cache and recalculating:")
print(f" ALD (fresh): ${ald_fresh:>12,.2f}")
print(f" AGI (fresh): ${agi_fresh:>12,.2f}")
# =============================================================================
# STEP 3: Save H5 WITH the pseudo-input (buggy behavior)
# =============================================================================
print(f"\n[STEP 3] Saving H5 file WITH pseudo-input (simulating bug)...")
# Reload to get stale values back
sim_buggy = Microsimulation(
dataset="hf://policyengine/policyengine-us-data/enhanced_cps_2024.h5"
)
temp_dir = tempfile.mkdtemp()
buggy_h5_path = os.path.join(temp_dir, "buggy.h5")
# Save including the pseudo-input (this is the bug)
data = {}
input_vars = set(sim_buggy.input_variables)
for variable in sim_buggy.tax_benefit_system.variables:
if variable not in input_vars:
continue
data[variable] = {}
for period in sim_buggy.get_holder(variable).get_known_periods():
values = sim_buggy.get_holder(variable).get_array(period)
if sim_buggy.tax_benefit_system.variables.get(variable).value_type in (
Enum,
str,
):
if hasattr(values, "decode_to_str"):
values = values.decode_to_str().astype("S")
else:
values = values.astype("S")
else:
values = np.array(values)
if values is not None:
data[variable][period] = values
if len(data[variable]) == 0:
del data[variable]
with h5py.File(buggy_h5_path, "w") as f:
for variable, periods in data.items():
grp = f.create_group(variable)
for period, values in periods.items():
grp.create_dataset(str(period), data=values)
print(f" Saved to {buggy_h5_path}")
print(f" Pseudo-input '{PSEUDO_INPUT}' was included: {PSEUDO_INPUT in data}")
# =============================================================================
# STEP 4: Reload and show corruption
# =============================================================================
print(f"\n[STEP 4] Reloading H5 and demonstrating corruption...")
sim_reloaded = Microsimulation(dataset=buggy_h5_path)
# The stale value overrides the formula!
ald_reloaded = sim_reloaded.calculate(PSEUDO_INPUT, period=YEAR).values[tu_idx]
agi_reloaded = sim_reloaded.calculate(
"adjusted_gross_income", period=YEAR
).values[tu_idx]
print(f" After reloading the buggy H5:")
print(f" ALD (reloaded): ${ald_reloaded:>12,.2f}")
print(f" AGI (reloaded): ${agi_reloaded:>12,.2f}")
print(f"\n COMPARISON:")
print(f" {'':30} {'Fresh':>15} {'Reloaded':>15} {'Difference':>15}")
print(f" {'-'*75}")
print(
f" {'ALD':30} ${ald_fresh:>14,.2f} ${ald_reloaded:>14,.2f} ${ald_reloaded - ald_fresh:>14,.2f}"
)
print(
f" {'AGI':30} ${agi_fresh:>14,.2f} ${agi_reloaded:>14,.2f} ${agi_reloaded - agi_fresh:>14,.2f}"
)
# =============================================================================
# STEP 5: Show the fix works
# =============================================================================
print(f"\n[STEP 5] Saving H5 WITHOUT pseudo-input (the fix)...")
sim_fixed = Microsimulation(
dataset="hf://policyengine/policyengine-us-data/enhanced_cps_2024.h5"
)
fixed_h5_path = os.path.join(temp_dir, "fixed.h5")
def get_pseudo_input_variables(sim) -> set:
"""Identify pseudo-inputs that should NOT be saved."""
tbs = sim.tax_benefit_system
pseudo_inputs = set()
for var_name in sim.input_variables:
var = tbs.variables.get(var_name)
if not var:
continue
adds = getattr(var, "adds", None)
if adds and isinstance(adds, list):
for component in adds:
comp_var = tbs.variables.get(component)
if comp_var and len(getattr(comp_var, "formulas", {})) > 0:
pseudo_inputs.add(var_name)
break
subtracts = getattr(var, "subtracts", None)
if subtracts and isinstance(subtracts, list):
for component in subtracts:
comp_var = tbs.variables.get(component)
if comp_var and len(getattr(comp_var, "formulas", {})) > 0:
pseudo_inputs.add(var_name)
break
return pseudo_inputs
# Save EXCLUDING pseudo-inputs
data_fixed = {}
input_vars_fixed = set(sim_fixed.input_variables)
pseudo_inputs = get_pseudo_input_variables(sim_fixed)
input_vars_fixed = input_vars_fixed - pseudo_inputs
print(f" Excluding {len(pseudo_inputs)} pseudo-input variables:")
for var in sorted(pseudo_inputs):
print(f" - {var}")
for variable in sim_fixed.tax_benefit_system.variables:
if variable not in input_vars_fixed:
continue
data_fixed[variable] = {}
for period in sim_fixed.get_holder(variable).get_known_periods():
values = sim_fixed.get_holder(variable).get_array(period)
if sim_fixed.tax_benefit_system.variables.get(variable).value_type in (
Enum,
str,
):
if hasattr(values, "decode_to_str"):
values = values.decode_to_str().astype("S")
else:
values = values.astype("S")
else:
values = np.array(values)
if values is not None:
data_fixed[variable][period] = values
if len(data_fixed[variable]) == 0:
del data_fixed[variable]
with h5py.File(fixed_h5_path, "w") as f:
for variable, periods in data_fixed.items():
grp = f.create_group(variable)
for period, values in periods.items():
grp.create_dataset(str(period), data=values)
print(f"\n Saved to {fixed_h5_path}")
# =============================================================================
# STEP 6: Verify the fix works
# =============================================================================
print(f"\n[STEP 6] Verifying the fix...")
sim_fixed_reloaded = Microsimulation(dataset=fixed_h5_path)
ald_fixed = sim_fixed_reloaded.calculate(PSEUDO_INPUT, period=YEAR).values[
tu_idx
]
agi_fixed = sim_fixed_reloaded.calculate(
"adjusted_gross_income", period=YEAR
).values[tu_idx]
print(f" After reloading the FIXED H5:")
print(f" ALD (fixed): ${ald_fixed:>12,.2f}")
print(f" AGI (fixed): ${agi_fixed:>12,.2f}")
print(f"\n FINAL COMPARISON:")
print(f" {'':30} {'Fresh':>15} {'Buggy H5':>15} {'Fixed H5':>15}")
print(f" {'-'*75}")
print(
f" {'ALD':30} ${ald_fresh:>14,.2f} ${ald_reloaded:>14,.2f} ${ald_fixed:>14,.2f}"
)
print(
f" {'AGI':30} ${agi_fresh:>14,.2f} ${agi_reloaded:>14,.2f} ${agi_fixed:>14,.2f}"
)
# Check if fix worked
ald_matches = abs(ald_fixed - ald_fresh) < 0.01
agi_matches = abs(agi_fixed - agi_fresh) < 0.01
print(f"\n Fix successful: {ald_matches and agi_matches}")
# Cleanup
os.remove(buggy_h5_path)
os.remove(fixed_h5_path)
os.rmdir(temp_dir)
print("\n" + "=" * 70)
print("END OF DEMONSTRATION")
print("=" * 70)Metadata
Metadata
Assignees
Labels
No labels