Add spec-based target pipeline with SOI CD data ingestion#471
Draft
…om_spec - constants.py: adds AreaType.CD, CD_AGI_CUTS, helper functions, extended ALL_SHARING_MAPPINGS - target_sharing.py: adds capgains_net to compute_tmd_national_sums, adds CD share functions - prepare_shares.py (new): share pre-computation with CTC duplicate shares fix - prepare_targets.py: adds prepare_targets_from_spec() routing through flat spec CSV - state_target_spec.csv (new): flat state target spec - cd_target_spec.csv (new): flat CD target spec Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- soi_cd_data.py (new): loads SOI CD data, applies 117->118 crosswalk, fixes A59664 unit inconsistency (raw data in thousands vs. dollars) - 22incd.csv (new): raw SOI CD data for 2022 - .gitignore: whitelist soi_cds/ and recipes/ CSV files Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Collaborator
@donboyd5, why are these five new tests you've added being skipped in PR #471?
Collaborator
@donboyd5, also in PR #471 these skipped tests don't look right: I don't understand why the "CD shares file" was not found or why the "CD targets file" was not found.
Collaborator
@donboyd5, when I download the PR #471 changes, I get 125 state violations (instead of the expected 35). Did you forget to add the commit 13025ec changes to PR #471?
Collaborator
Author
@martinholmer, I'm sorry for the sloppiness. Will fix this shortly; am tied
up for the next hour.
On Wed, Mar 25, 2026 at 4:30 PM Martin Holmer wrote:
martinholmer left a comment (PSLmodels/tax-microdata-benchmarking#471)
@donboyd5, when I download the PR #471 changes, I get 125 state violations (instead of the expected 35).
(base) TMD> diff states.act states.exp
8,10c8,10
< States with violated targets: 30/51
< Total targets: 51 states × 169 = 8619
< Total violated targets: 125
---
> States with violated targets: 17/51
> Total targets: 51 states × 178 = 9124
> Total violated targets: 35
Did you forget to add the commit 13025ec changes to PR #471?
PR: Add spec-based target pipeline with SOI CD data ingestion
Summary
Replace the hardcoded target generation logic with a clean three-artifact
architecture that separates concerns by change frequency:
data. Changes only with a new SOI vintage (~annually).
recipe tuning.
target = TMD_national_sum * share. Recomputed whenever TMD data or the spec changes.
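The target formula above can be sketched in a few lines. This is a minimal illustration, not the PR's implementation: the function name `compute_targets`, the dictionary layout, the area labels, and all numbers are hypothetical.

```python
def compute_targets(national_sums, shares):
    """Return {(area, variable): target}, with target = national_sum * share."""
    targets = {}
    for (area, variable), share in shares.items():
        targets[(area, variable)] = national_sums[variable] * share
    return targets


# Hypothetical inputs: one TMD national sum and the shares of two areas.
national_sums = {"c18300": 1000.0}
shares = {("AK", "c18300"): 0.25, ("WY", "c18300"): 0.75}

targets = compute_targets(national_sums, shares)
```

Because the shares are stored separately from the national sums, only this multiplication needs to be rerun when TMD data or the spec changes.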
This PR also adds SOI congressional district data ingestion with the
117th-to-118th Congress crosswalk, and a workaround for a confirmed bug in the
source SOI data (column A59664 in dollars instead of $1,000s).
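A rough sketch of the A59664 workaround described above, assuming each SOI record is held as a plain dictionary. The helper name `fix_units`, the set `DOLLAR_COLUMNS`, the column `OTHER_COL`, and the values are hypothetical; only the column name A59664 and the divide-by-1000 rule come from this PR.

```python
# Columns reported in dollars that should be in $1,000s.
# A59664 is the column named in this PR; the set itself is illustrative.
DOLLAR_COLUMNS = {"A59664"}


def fix_units(record):
    """Return a copy of record with dollar-denominated columns rescaled to $1,000s."""
    fixed = dict(record)
    for column in DOLLAR_COLUMNS:
        if column in fixed:
            fixed[column] = fixed[column] / 1000.0
    return fixed


# Hypothetical record: A59664 arrives in dollars, OTHER_COL is already in $1,000s.
raw = {"A59664": 2_500_000.0, "OTHER_COL": 4_000.0}
clean = fix_units(raw)
```

Returning a copy rather than mutating the input keeps the raw record available for comparison against the source file.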
Architecture
Key design decisions
sums to 1.0 across all CDs for each variable/bin.
at-large states (AK, DC, DE, ND, SD, VT, WY), and split districts.
in dollars instead of $1,000s. Divided by 1000 on ingestion. State file is not
affected. @donboyd5 will report the bug to the IRS SOI unit.
via ALL_SHARING_MAPPINGS in constants.py. Multiple TMD variables can share one SOI proxy (e.g., e01500 and e01700 both use SOI 01700).
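The proxy-sharing idea and the sums-to-1.0 property from the list above might look like the sketch below. The actual structure of ALL_SHARING_MAPPINGS is not shown in this PR text, so the dictionary `SHARING_MAPPINGS`, the helper `area_shares`, the area labels, and the amounts are all assumptions made for illustration.

```python
# Hypothetical mapping: TMD variable -> SOI proxy variable code.
SHARING_MAPPINGS = {"e01500": "01700", "e01700": "01700"}


def area_shares(soi_amounts):
    """Convert {area: amount} into shares that sum to 1.0 across areas."""
    total = sum(soi_amounts.values())
    if total == 0:
        raise ValueError("cannot compute shares from a zero total")
    return {area: amount / total for area, amount in soi_amounts.items()}


# Hypothetical SOI amounts for one proxy variable in two areas.
soi_by_variable = {"01700": {"CD-A": 30.0, "CD-B": 70.0}}

# Both TMD variables reuse the shares of the same SOI proxy.
shares_by_tmd_variable = {
    tmd_var: area_shares(soi_by_variable[soi_var])
    for tmd_var, soi_var in SHARING_MAPPINGS.items()
}
```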
New files
- tmd/areas/prepare_shares.py
- tmd/areas/prepare/soi_cd_data.py
- tmd/areas/prepare/recipes/cd_target_spec.csv
- tmd/areas/prepare/recipes/state_target_spec.csv
- tmd/areas/prepare/data/soi_cds/22incd.csv
Modified files
- tmd/areas/prepare/constants.py
- tmd/areas/prepare/target_sharing.py
- tmd/areas/prepare_targets.py: prepare_targets_from_spec(), unified CLI routing
- .gitignore
CLI routing
Both states and CDs route through prepare_targets_from_spec(). The old prepare_state_targets() and prepare_cd_targets() remain in the code but are no longer called by the CLI.
A prerequisite step, prepare_shares, must be run before prepare_targets to generate shares files. Shares only need regenerating when SOI data changes.
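One way to make the ordering requirement explicit is a guard that fails with a clear message when the shares file is missing. This is only a sketch: `require_shares_file` and the path `no_such_dir/cd_shares.csv` are hypothetical and do not appear in the PR.

```python
from pathlib import Path


def require_shares_file(shares_path):
    """Return shares_path as a Path, or raise if prepare_shares has not produced it."""
    path = Path(shares_path)
    if not path.exists():
        raise FileNotFoundError(
            f"shares file not found: {path}; "
            "run prepare_shares before prepare_targets"
        )
    return path


# Usage with a path that is assumed not to exist.
try:
    require_shares_file("no_such_dir/cd_shares.csv")
    found = True
except FileNotFoundError as err:
    found = False
    message = str(err)
```

A message of this kind points directly at the missing prerequisite, which can be easier to diagnose than a silently skipped step.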
SALT targeting approach
Available SALT (e18400 income/sales, e18500 real estate) uses Census
state/local finance data for geographic distribution, targeted as all-bins
totals only. This is a change from the old pipeline which targeted per-bin
SALT using Census totals distributed by SOI bin proportions. The new approach
is more defensible:
understate available SALT for high-income filers in high-tax states
cap-distorted SOI bin proportions — an approximation we can avoid
Deductible SALT (c18300) continues to use SOI shares per-bin — correct because
SOI directly measures what was actually deducted (after the cap).
Impact on state weights
State targets change from 179 (PR 1, old pipeline) to 169 (this PR, new
spec pipeline). The 10-target reduction comes from replacing 12 per-bin
e18400/e18500 rows with 2 total-only rows. State weights will differ from
PR 1 accordingly. The fingerprint is updated in this PR.
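The target-count arithmetic above can be checked directly. The variable names are illustrative; the numbers come from this PR description and from the "51 states × 169 = 8619" line in the diff output quoted in the conversation.

```python
old_targets_per_state = 179  # PR 1, old pipeline
removed_per_bin_rows = 12    # per-bin e18400/e18500 rows dropped
added_total_rows = 2         # all-bins totals for e18400 and e18500

new_targets_per_state = (
    old_targets_per_state - removed_per_bin_rows + added_total_rows
)

n_areas = 51  # the "51 states" in the diff output
total_targets = n_areas * new_targets_per_state
```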
Test plan
Prepared by @donboyd5 and Claude Code