Skip to content

PolicyEngine has two different concepts of "input variable" that can diverge #417

@baogorek

Description

@baogorek

The Problem

The core issue: PolicyEngine has two different concepts of "input variable" that can diverge:

  1. Structural definition (Variable.is_input_variable()): A variable is a true input if it has no formulas AND no adds AND no subtracts. This is the
    "correct" definition.
  2. Runtime definition (sim.input_variables): Any variable that has values stored in the simulation (known periods). This is what gets used when saving
    H5 files.

The loophole: Variables with adds (like self_employed_pension_contribution_ald) are structurally NOT inputs - they're aggregations. But if the CPS
dataset contains pre-computed values for them, they appear in sim.input_variables at runtime.

Why it corrupts calculations:

self_employed_pension_contribution_ald
└── adds: [self_employed_pension_contribution_ald_person]
└── formula: min(self_employment_income, contributions)

The CPS has a stale pre-computed value (say $203K) that doesn't match the formula result (say $0 because SE income = 0). When you:

  1. Load the dataset and calculate fresh → formula runs → correct $0
  2. Save to H5 including the pseudo-input → stale $203K gets saved
  3. Reload and calculate → stored $203K overrides the formula → wrong AGI → wrong everything downstream

A Demo to reproduce:

Output:

(lt) baogorek@T14-Gen6:~/devl/long-term/policyengine-us-data/policyengine_us_data/datasets/cps/long_term$ python demo_pseudo_input_bug.py 
======================================================================
DEMONSTRATION: The Pseudo-Input Variable Bug
======================================================================

[STEP 1] Loading base dataset and finding a problematic case...
  Found 143 tax units with stale pseudo-input > $50k

  Example Tax Unit 535:
    Stored value:           $   58,172.36
    SE income:              $        0.00
    Contributions:          $        0.00
    Formula should give:    $        0.00
    --> Stale value doesn't match formula!

[STEP 2] Fresh calculation (correct behavior)...
  After clearing cache and recalculating:
    ALD (fresh):  $        0.00
    AGI (fresh):  $  342,647.16

[STEP 3] Saving H5 file WITH pseudo-input (simulating bug)...
  Saved to /tmp/tmp6i0o2ykx/buggy.h5
  Pseudo-input 'self_employed_pension_contribution_ald' was included: True

[STEP 4] Reloading H5 and demonstrating corruption...
  After reloading the buggy H5:
    ALD (reloaded): $   58,172.36
    AGI (reloaded): $  284,474.81

  COMPARISON:
                                             Fresh        Reloaded      Difference
    ---------------------------------------------------------------------------
    ALD                            $          0.00 $     58,172.36 $     58,172.36
    AGI                            $    342,647.16 $    284,474.81 $    -58,172.34

[STEP 5] Saving H5 WITHOUT pseudo-input (the fix)...
Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads.
  Excluding 3 pseudo-input variables:
    - housing_assistance
    - self_employed_health_insurance_ald
    - self_employed_pension_contribution_ald

  Saved to /tmp/tmp6i0o2ykx/fixed.h5

[STEP 6] Verifying the fix...
  After reloading the FIXED H5:
    ALD (fixed):  $        0.00
    AGI (fixed):  $  342,647.16

  FINAL COMPARISON:
                                             Fresh        Buggy H5        Fixed H5
    ---------------------------------------------------------------------------
    ALD                            $          0.00 $     58,172.36 $          0.00
    AGI                            $    342,647.16 $    284,474.81 $    342,647.16

  Fix successful: True

======================================================================
END OF DEMONSTRATION
======================================================================

Code

"""
Demonstration: Why Pseudo-Input Variables Corrupt H5 Files

This script shows how variables that LOOK like inputs but AGGREGATE
calculated values can corrupt saved datasets.

The culprit: self_employed_pension_contribution_ald
- Has 'adds' attribute (sums person-level values)
- No explicit formula
- PolicyEngine treats it as an "input"
- BUT its component (self_employed_pension_contribution_ald_person) HAS a formula!

The formula: min(self_employment_income, self_employed_pension_contributions)

The bug manifests when:
1. Base data has stale pre-computed values
2. Fresh calculation gives correct values (formula runs)
3. We save to H5 including the pseudo-input (stale value gets stored)
4. We reload and calculate - the stored value overrides the formula
"""

import tempfile
import os
import h5py
import numpy as np
from policyengine_us import Microsimulation
from policyengine_core.enums import Enum

print("=" * 70)
print("DEMONSTRATION: The Pseudo-Input Variable Bug")
print("=" * 70)

YEAR = 2024
PSEUDO_INPUT = "self_employed_pension_contribution_ald"

# =============================================================================
# STEP 1: Load base dataset and find a problematic case
# =============================================================================
print(f"\n[STEP 1] Loading base dataset and finding a problematic case...")

sim = Microsimulation(
    dataset="hf://policyengine/policyengine-us-data/enhanced_cps_2024.h5"
)

# All variables at tax_unit level for consistent comparison
stale_values = sim.calculate(PSEUDO_INPUT, period=YEAR).values
se_income = sim.calculate(
    "self_employment_income_before_lsr", period=YEAR, map_to="tax_unit"
).values
contributions = sim.calculate(
    "self_employed_pension_contributions", period=YEAR, map_to="tax_unit"
).values

# Find tax units where stale value > 0 but formula would give 0
# Formula: min(SE_income, contributions) at person level, summed to tax unit
mask = (stale_values > 50000) & ((se_income == 0) | (contributions == 0))
problem_indices = np.where(mask)[0]

print(
    f"  Found {len(problem_indices)} tax units with stale pseudo-input > $50k"
)

tu_idx = problem_indices[0]
stale_value = stale_values[tu_idx]

print(f"\n  Example Tax Unit {tu_idx}:")
print(f"    Stored value:           ${stale_value:>12,.2f}")
print(f"    SE income:              ${se_income[tu_idx]:>12,.2f}")
print(f"    Contributions:          ${contributions[tu_idx]:>12,.2f}")
print(
    f"    Formula should give:    ${min(se_income[tu_idx], contributions[tu_idx]):>12,.2f}"
)
print(f"    --> Stale value doesn't match formula!")

# =============================================================================
# STEP 2: Calculate AGI with fresh formula (correct behavior)
# =============================================================================
print(f"\n[STEP 2] Fresh calculation (correct behavior)...")

# Delete the cached pseudo-input to force recalculation
sim.delete_arrays(PSEUDO_INPUT)

ald_fresh = sim.calculate(PSEUDO_INPUT, period=YEAR).values[tu_idx]
agi_fresh = sim.calculate("adjusted_gross_income", period=YEAR).values[tu_idx]

print(f"  After clearing cache and recalculating:")
print(f"    ALD (fresh):  ${ald_fresh:>12,.2f}")
print(f"    AGI (fresh):  ${agi_fresh:>12,.2f}")

# =============================================================================
# STEP 3: Save H5 WITH the pseudo-input (buggy behavior)
# =============================================================================
print(f"\n[STEP 3] Saving H5 file WITH pseudo-input (simulating bug)...")

# Reload to get stale values back
sim_buggy = Microsimulation(
    dataset="hf://policyengine/policyengine-us-data/enhanced_cps_2024.h5"
)

temp_dir = tempfile.mkdtemp()
buggy_h5_path = os.path.join(temp_dir, "buggy.h5")

# Save including the pseudo-input (this is the bug)
data = {}
input_vars = set(sim_buggy.input_variables)

for variable in sim_buggy.tax_benefit_system.variables:
    if variable not in input_vars:
        continue
    data[variable] = {}
    for period in sim_buggy.get_holder(variable).get_known_periods():
        values = sim_buggy.get_holder(variable).get_array(period)
        if sim_buggy.tax_benefit_system.variables.get(variable).value_type in (
            Enum,
            str,
        ):
            if hasattr(values, "decode_to_str"):
                values = values.decode_to_str().astype("S")
            else:
                values = values.astype("S")
        else:
            values = np.array(values)
        if values is not None:
            data[variable][period] = values
    if len(data[variable]) == 0:
        del data[variable]

with h5py.File(buggy_h5_path, "w") as f:
    for variable, periods in data.items():
        grp = f.create_group(variable)
        for period, values in periods.items():
            grp.create_dataset(str(period), data=values)

print(f"  Saved to {buggy_h5_path}")
print(f"  Pseudo-input '{PSEUDO_INPUT}' was included: {PSEUDO_INPUT in data}")

# =============================================================================
# STEP 4: Reload and show corruption
# =============================================================================
print(f"\n[STEP 4] Reloading H5 and demonstrating corruption...")

sim_reloaded = Microsimulation(dataset=buggy_h5_path)

# The stale value overrides the formula!
ald_reloaded = sim_reloaded.calculate(PSEUDO_INPUT, period=YEAR).values[tu_idx]
agi_reloaded = sim_reloaded.calculate(
    "adjusted_gross_income", period=YEAR
).values[tu_idx]

print(f"  After reloading the buggy H5:")
print(f"    ALD (reloaded): ${ald_reloaded:>12,.2f}")
print(f"    AGI (reloaded): ${agi_reloaded:>12,.2f}")

print(f"\n  COMPARISON:")
print(f"    {'':30} {'Fresh':>15} {'Reloaded':>15} {'Difference':>15}")
print(f"    {'-'*75}")
print(
    f"    {'ALD':30} ${ald_fresh:>14,.2f} ${ald_reloaded:>14,.2f} ${ald_reloaded - ald_fresh:>14,.2f}"
)
print(
    f"    {'AGI':30} ${agi_fresh:>14,.2f} ${agi_reloaded:>14,.2f} ${agi_reloaded - agi_fresh:>14,.2f}"
)

# =============================================================================
# STEP 5: Show the fix works
# =============================================================================
print(f"\n[STEP 5] Saving H5 WITHOUT pseudo-input (the fix)...")

sim_fixed = Microsimulation(
    dataset="hf://policyengine/policyengine-us-data/enhanced_cps_2024.h5"
)

fixed_h5_path = os.path.join(temp_dir, "fixed.h5")


def get_pseudo_input_variables(sim) -> set:
    """Identify pseudo-inputs that should NOT be saved."""
    tbs = sim.tax_benefit_system
    pseudo_inputs = set()
    for var_name in sim.input_variables:
        var = tbs.variables.get(var_name)
        if not var:
            continue
        adds = getattr(var, "adds", None)
        if adds and isinstance(adds, list):
            for component in adds:
                comp_var = tbs.variables.get(component)
                if comp_var and len(getattr(comp_var, "formulas", {})) > 0:
                    pseudo_inputs.add(var_name)
                    break
        subtracts = getattr(var, "subtracts", None)
        if subtracts and isinstance(subtracts, list):
            for component in subtracts:
                comp_var = tbs.variables.get(component)
                if comp_var and len(getattr(comp_var, "formulas", {})) > 0:
                    pseudo_inputs.add(var_name)
                    break
    return pseudo_inputs


# Save EXCLUDING pseudo-inputs
data_fixed = {}
input_vars_fixed = set(sim_fixed.input_variables)
pseudo_inputs = get_pseudo_input_variables(sim_fixed)
input_vars_fixed = input_vars_fixed - pseudo_inputs

print(f"  Excluding {len(pseudo_inputs)} pseudo-input variables:")
for var in sorted(pseudo_inputs):
    print(f"    - {var}")

for variable in sim_fixed.tax_benefit_system.variables:
    if variable not in input_vars_fixed:
        continue
    data_fixed[variable] = {}
    for period in sim_fixed.get_holder(variable).get_known_periods():
        values = sim_fixed.get_holder(variable).get_array(period)
        if sim_fixed.tax_benefit_system.variables.get(variable).value_type in (
            Enum,
            str,
        ):
            if hasattr(values, "decode_to_str"):
                values = values.decode_to_str().astype("S")
            else:
                values = values.astype("S")
        else:
            values = np.array(values)
        if values is not None:
            data_fixed[variable][period] = values
    if len(data_fixed[variable]) == 0:
        del data_fixed[variable]

with h5py.File(fixed_h5_path, "w") as f:
    for variable, periods in data_fixed.items():
        grp = f.create_group(variable)
        for period, values in periods.items():
            grp.create_dataset(str(period), data=values)

print(f"\n  Saved to {fixed_h5_path}")

# =============================================================================
# STEP 6: Verify the fix works
# =============================================================================
print(f"\n[STEP 6] Verifying the fix...")

sim_fixed_reloaded = Microsimulation(dataset=fixed_h5_path)

ald_fixed = sim_fixed_reloaded.calculate(PSEUDO_INPUT, period=YEAR).values[
    tu_idx
]
agi_fixed = sim_fixed_reloaded.calculate(
    "adjusted_gross_income", period=YEAR
).values[tu_idx]

print(f"  After reloading the FIXED H5:")
print(f"    ALD (fixed):  ${ald_fixed:>12,.2f}")
print(f"    AGI (fixed):  ${agi_fixed:>12,.2f}")

print(f"\n  FINAL COMPARISON:")
print(f"    {'':30} {'Fresh':>15} {'Buggy H5':>15} {'Fixed H5':>15}")
print(f"    {'-'*75}")
print(
    f"    {'ALD':30} ${ald_fresh:>14,.2f} ${ald_reloaded:>14,.2f} ${ald_fixed:>14,.2f}"
)
print(
    f"    {'AGI':30} ${agi_fresh:>14,.2f} ${agi_reloaded:>14,.2f} ${agi_fixed:>14,.2f}"
)

# Check if fix worked
ald_matches = abs(ald_fixed - ald_fresh) < 0.01
agi_matches = abs(agi_fixed - agi_fresh) < 0.01

print(f"\n  Fix successful: {ald_matches and agi_matches}")

# Cleanup
os.remove(buggy_h5_path)
os.remove(fixed_h5_path)
os.rmdir(temp_dir)

print("\n" + "=" * 70)
print("END OF DEMONSTRATION")
print("=" * 70)

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions