Add spec-based target pipeline with SOI CD data ingestion by donboyd5 · Pull Request #471 · PSLmodels/tax-microdata-benchmarking

donboyd5 · 2026-03-25T17:54:45Z

PR: Add spec-based target pipeline with SOI CD data ingestion

This is PR 2 of 4 adding congressional district (CD) weighting to TMD.
Based on PR 1 (solver robustness). See PR 1 for the full roadmap.

Summary

Replace the hardcoded target generation logic with a clean three-artifact
architecture that separates concerns by change frequency:

- Shares (stable): SOI geographic distribution, pre-computed from SOI data. Changes only with a new SOI vintage (~annually).
- Spec (recipe): flat CSV, one row per target. WYSIWYG. Changes during recipe tuning.
- Targets (volatile): target = TMD_national_sum * share. Recomputed whenever TMD data or the spec changes.
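
As a rough illustration of the share and target arithmetic, here is a minimal sketch. The area codes, amounts, and function names are made up for the example and are not taken from the PR's code.

```python
# Hypothetical SOI amounts for one variable/bin across three areas.
soi_amounts = {"AK00": 120.0, "MT01": 200.0, "MT02": 180.0}


def compute_shares(amounts):
    """Divide each area's amount by the file's own total, so shares sum to 1."""
    total = sum(amounts.values())
    if total == 0:
        raise ValueError("cannot compute shares from a zero total")
    return {area: value / total for area, value in amounts.items()}


def compute_targets(shares, national_sum):
    """Apply target = TMD_national_sum * share to every area."""
    return {area: national_sum * share for area, share in shares.items()}


shares = compute_shares(soi_amounts)
# national_sum stands in for a TMD national total (illustrative value).
targets = compute_targets(shares, national_sum=1_000.0)
```

Because the shares sum to 1, the area targets add back up to the national sum, which is why only the targets need recomputing when TMD data changes.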

This PR also adds SOI congressional district data ingestion with the
117th-to-118th Congress crosswalk, and a workaround for a confirmed bug in the
source SOI data (column A59664 in dollars instead of $1,000s).
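
A unit workaround of this kind could look like the sketch below. The `UNIT_DIVISORS` table and `normalize_units` helper are hypothetical names used for illustration; only the column name A59664 and the divide-by-1000 rule come from this PR.

```python
# Columns known to be reported in dollars rather than $1,000s in the CD file.
# Hypothetical structure: column name -> divisor applied on ingestion.
UNIT_DIVISORS = {"A59664": 1000.0}


def normalize_units(row, divisors=UNIT_DIVISORS):
    """Return a copy of one data row with affected columns rescaled to $1,000s."""
    fixed = dict(row)
    for column, divisor in divisors.items():
        if column in fixed:
            fixed[column] = fixed[column] / divisor
    return fixed


raw_row = {"A59664": 2_500_000.0, "A00100": 900.0}  # illustrative values
clean_row = normalize_units(raw_row)
```

Keeping the affected columns in one table makes the workaround easy to drop if a later SOI vintage fixes the units.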

Architecture

SOI data + crosswalks -> shares (stable)        <- rarely changes
                              |
TMD data (cached_allvars) -> national sums       <- changes with TMD rebuilds
                              |
          shares * national sums = potential targets
                              |
          target spec -> select from potential    <- changes during recipe tuning
                              |
          per-area _targets.csv files
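
The selection step at the bottom of the diagram can be sketched as follows. The spec column names (`varname`, `agi_bin`) and the key layout of `potential` are assumptions for illustration; the real spec CSV may use different columns.

```python
import csv
import io

# Hypothetical flat spec: one row per target to keep.
SPEC_CSV = """varname,agi_bin
c00100,all
e00200,all
e00200,bin3
"""

# Hypothetical potential targets (shares * national sums), keyed by
# (area, varname, agi_bin).
potential = {
    ("MN", "c00100", "all"): 250.0,
    ("MN", "e00200", "all"): 180.0,
    ("MN", "e00200", "bin3"): 40.0,
    ("MN", "e18400", "all"): 12.0,  # not listed in the spec, so not selected
}


def select_targets(spec_text, potential, area):
    """Keep only the potential targets that the spec lists, in spec order."""
    selected = []
    for row in csv.DictReader(io.StringIO(spec_text)):
        key = (area, row["varname"], row["agi_bin"])
        if key not in potential:
            raise KeyError(f"spec row has no potential target: {key}")
        selected.append({**row, "area": area, "target": potential[key]})
    return selected


mn_targets = select_targets(SPEC_CSV, potential, "MN")
```

Failing loudly on a spec row with no matching potential target keeps the spec "what you see is what you get": every row either produces a target or raises.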

Key design decisions

- Shares use the CD file's own totals as denominators, so they are internally consistent and sum to 1.0 across all CDs for each variable/bin.
- XTOT uses Census 2020 population as a fixed per-area target from the geocorr crosswalk, similar to the state pipeline approach.
- The 117th-to-118th Congress crosswalk handles MT (1 to 2 districts), at-large states (AK, DC, DE, ND, SD, VT, WY), and split districts.
- SOI A59664 bug workaround: CD file column A59664 (EITC, 3+ children) is in dollars instead of $1,000s, so it is divided by 1000 on ingestion. The state file is not affected. @donboyd5 will report the bug to the IRS SOI unit.
- Variable name mapping: SOI raw names (A00100) map to TMD names (c00100) via ALL_SHARING_MAPPINGS in constants.py. Multiple TMD variables can share one SOI proxy (e.g., e01500 and e01700 both use SOI 01700).
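
To make the crosswalk decision above concrete, here is a minimal sketch of reallocating 117th-Congress amounts onto 118th-Congress districts with allocation factors. The district codes, factors, and amounts are invented for the example, and the actual logic in soi_cd_data.py may differ.

```python
import math
from collections import defaultdict

# Hypothetical (cd117, cd118, allocation_factor) rows. Factors for each
# 117th-Congress district must sum to 1 so that totals are preserved.
CROSSWALK = [
    ("MT00", "MT01", 0.5),   # one at-large district split into two
    ("MT00", "MT02", 0.5),
    ("WY00", "WY00", 1.0),   # at-large state, unchanged
    ("XX01", "XX01", 0.75),  # split district in a made-up state
    ("XX01", "XX02", 0.25),
    ("XX02", "XX02", 1.0),
]


def reallocate(amounts_117, crosswalk):
    """Move 117th-Congress amounts onto 118th-Congress districts."""
    # Validate that each old district is fully allocated.
    factor_sums = defaultdict(float)
    for old_cd, _new_cd, factor in crosswalk:
        factor_sums[old_cd] += factor
    for old_cd, total in factor_sums.items():
        if not math.isclose(total, 1.0):
            raise ValueError(f"factors for {old_cd} sum to {total}, not 1")

    # Accumulate each old district's amount into its new districts.
    amounts_118 = defaultdict(float)
    for old_cd, new_cd, factor in crosswalk:
        amounts_118[new_cd] += amounts_117[old_cd] * factor
    return dict(amounts_118)


amounts_117 = {"MT00": 100.0, "WY00": 30.0, "XX01": 80.0, "XX02": 40.0}
amounts_118 = reallocate(amounts_117, CROSSWALK)
```

Checking that the factors sum to 1 for every old district is what guarantees the national total is unchanged by the reallocation.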

New files

| File | Purpose |
| --- | --- |
| `tmd/areas/prepare_shares.py` | Pre-compute SOI geographic shares for states and CDs |
| `tmd/areas/prepare/soi_cd_data.py` | CD SOI data reader, crosswalk, base target construction |
| `tmd/areas/prepare/recipes/cd_target_spec.csv` | CD recipe (107 targets per CD) |
| `tmd/areas/prepare/recipes/state_target_spec.csv` | State recipe (169 targets per state) |
| `tmd/areas/prepare/data/soi_cds/22incd.csv` | Raw 2022 SOI CD data |

Modified files

| File | Change |
| --- | --- |
| `tmd/areas/prepare/constants.py` | Add AreaType.CD, CD_AGI_CUTS, helper functions, extended mappings |
| `tmd/areas/prepare/target_sharing.py` | Add capgains_net synthetic variable, CD share functions |
| `tmd/areas/prepare_targets.py` | Add `prepare_targets_from_spec()`, unified CLI routing |
| `.gitignore` | Whitelist SOI CD CSV and recipe CSV files |

CLI routing

Both states and CDs route through prepare_targets_from_spec(). The old
prepare_state_targets() and prepare_cd_targets() remain in the code but
are no longer called by the CLI.

A prerequisite step — prepare_shares — must be run before prepare_targets
to generate shares files. Shares only need regenerating when SOI data changes.
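
One way to surface the prerequisite early is a small guard like the sketch below. The helper name `require_shares_file` and the example path are hypothetical; only the `prepare_shares --scope` command comes from the test plan in this PR.

```python
from pathlib import Path


def require_shares_file(shares_path, scope):
    """Fail with an actionable message if the shares file has not been built."""
    path = Path(shares_path)
    if not path.exists():
        raise FileNotFoundError(
            f"{path} not found; run "
            f"'python -m tmd.areas.prepare_shares --scope {scope}' first"
        )
    return path
```

Calling such a guard at the start of target preparation would turn a missing shares file into an immediate, self-explanatory error instead of a later failure.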

SALT targeting approach

Available SALT (e18400 income/sales, e18500 real estate) uses Census
state/local finance data for geographic distribution, targeted as all-bins
totals only. This is a change from the old pipeline which targeted per-bin
SALT using Census totals distributed by SOI bin proportions. The new approach
is more defensible:

- SOI SALT is capped at $10K (TCJA), so per-bin SOI shares systematically understate available SALT for high-income filers in high-tax states.
- Census measures actual tax collections, the right source for uncapped SALT.
- Per-bin decomposition would require combining Census totals with cap-distorted SOI bin proportions, an approximation we can avoid.

Deductible SALT (c18300) continues to use SOI shares per-bin — correct because
SOI directly measures what was actually deducted (after the cap).
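
The effect of the cap on geographic shares can be seen in a toy example. All amounts and area labels below are invented for illustration; they are not SOI or Census figures.

```python
SALT_CAP = 10_000.0  # TCJA cap on deductible SALT

# Hypothetical taxes actually paid by two filers in each of two areas.
paid = {
    "high_tax": [40_000.0, 8_000.0],
    "low_tax": [6_000.0, 6_000.0],
}


def area_shares(paid_by_area, cap=None):
    """Compute each area's share of the total, optionally capping per filer."""
    totals = {}
    for area, amounts in paid_by_area.items():
        if cap is not None:
            amounts = [min(amount, cap) for amount in amounts]
        totals[area] = sum(amounts)
    grand_total = sum(totals.values())
    return {area: total / grand_total for area, total in totals.items()}


uncapped = area_shares(paid)           # high_tax holds 48,000 of 60,000
capped = area_shares(paid, SALT_CAP)   # high_tax holds 18,000 of 30,000
```

In this toy case the high-tax area's share falls from 0.8 to 0.6 once the cap is applied, which is the kind of distortion that makes capped data a poor basis for distributing uncapped (available) SALT, while remaining appropriate for deductible SALT.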

Impact on state weights

State targets change from 179 (PR 1, old pipeline) to 169 (this PR, new
spec pipeline). The 10-target reduction comes from replacing 12 per-bin
e18400/e18500 rows with 2 total-only rows. State weights will differ from
PR 1 accordingly. The fingerprint is updated in this PR.

Test plan

make format                                                    # no changes
make lint                                                      # passes clean
make clean && make data                                        # build TMD + run all tests

# State pipeline (new spec-based)
python -m tmd.areas.prepare_shares --scope states              # generate states_shares.csv
python -m tmd.areas.prepare_targets --scope states             # generate 169 state targets
python -m pytest tests/test_prepare_targets.py -v              # verify targets
python -m tmd.areas.solve_weights --scope states --workers 12  # solve state weights; choose num workers
python -m pytest tests/test_state_weight_results.py -v         # verify weights
pytest tests/test_fingerprint.py -v                            # verify reproducibility
python -m tmd.areas.quality_report --scope states              # quality report

# CD target preparation (new spec pipeline, no weight solving in this PR)
python -m tmd.areas.prepare_shares --scope cds                 # generate cds_shares.csv
python -m tmd.areas.prepare_targets --scope cds                # generate 436 CD target files
python -m pytest tests/test_prepare_targets.py -v -k CD        # verify CD shares and targets

Prepared by @donboyd5 and Claude Code

…om_spec
- constants.py: adds AreaType.CD, CD_AGI_CUTS, helper functions, extended ALL_SHARING_MAPPINGS
- target_sharing.py: adds capgains_net to compute_tmd_national_sums, adds CD share functions
- prepare_shares.py (new): share pre-computation with CTC duplicate shares fix
- prepare_targets.py: adds prepare_targets_from_spec() routing through flat spec CSV
- state_target_spec.csv (new): flat state target spec
- cd_target_spec.csv (new): flat CD target spec

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

- soi_cd_data.py (new): loads SOI CD data, applies 117->118 crosswalk, fixes A59664 unit inconsistency (CD file reports dollars rather than $1,000s)
- 22incd.csv (new): raw SOI CD data for 2022
- .gitignore: whitelist soi_cds/ and recipes/ CSV files

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

martinholmer · 2026-03-25T20:14:32Z

@donboyd5, why are these five new tests you've added being skipped in PR #471?

[gw3] [ 22%] PASSED tests/test_prepare_targets.py::test_prepare_mn_end_to_end 
>>>[gw3] [ 23%] SKIPPED tests/test_prepare_targets.py::TestCDShares::test_no_duplicate_cd_shares 
>>>[gw3] [ 23%] SKIPPED tests/test_prepare_targets.py::TestCDShares::test_cd_shares_sum_to_one 
>>>[gw3] [ 24%] SKIPPED tests/test_prepare_targets.py::TestCDShares::test_cd_count_is_436 
>>>[gw3] [ 24%] SKIPPED tests/test_prepare_targets.py::TestCDTargetFiles::test_cd_target_count 
>>>[gw3] [ 24%] SKIPPED tests/test_prepare_targets.py::TestCDTargetFiles::test_cd_target_structure 
[gw3] [ 25%] PASSED tests/test_reweight.py::test_drop_impossible_targets_removes_all_zero_column

martinholmer · 2026-03-25T20:18:31Z

@donboyd5, also in PR #471 these skipped tests don't look right:

tests/test_prepare_targets.py::test_prepare_mn_end_to_end PASSED                                                 [ 66%]
tests/test_prepare_targets.py::TestCDShares::test_no_duplicate_cd_shares SKIPPED (CD shares file not found (...) [ 73%]
tests/test_prepare_targets.py::TestCDShares::test_cd_shares_sum_to_one SKIPPED (CD shares file not found (ru...) [ 80%]
tests/test_prepare_targets.py::TestCDShares::test_cd_count_is_436 SKIPPED (CD shares file not found (run pre...) [ 86%]
tests/test_prepare_targets.py::TestCDTargetFiles::test_cd_target_count SKIPPED (CD target files not found (r...) [ 93%]
tests/test_prepare_targets.py::TestCDTargetFiles::test_cd_target_structure SKIPPED (CD target files not foun...) [100%]
===================== 10 passed, 5 skipped in 5.45s =====

I don't understand why the "CD shares file" was not found or why the "CD targets file" was not found.
I thought these files were being generated in PR #471.

martinholmer · 2026-03-25T20:30:17Z

@donboyd5, when I download the PR #471 changes, I get 125 state violations (instead of the expected 35).

(base) TMD> diff states.act states.exp 
8,10c8,10
< States with violated targets: 30/51
< Total targets: 51 states × 169 = 8619
< Total violated targets: 125
---
> States with violated targets: 17/51
> Total targets: 51 states × 178 = 9124
> Total violated targets: 35

Did you forget to add the commit 13025ec changes to PR #471?

donboyd5 · 2026-03-25T20:32:19Z

@martinholmer, I'm sorry for the sloppiness. Will fix this shortly; am tied up for the next hour.


donboyd5 and others added 3 commits March 25, 2026 11:11

update state fingerprint

07156a2

donboyd5 requested a review from martinholmer March 25, 2026 17:54

Remove PR_MESSAGE.md

8b92cde

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

martinholmer marked this pull request as draft March 25, 2026 19:37

Base automatically changed from pr1-solver-robustness to master March 25, 2026 20:28

donboyd5 wants to merge 4 commits into master from pr2-spec-targets
