Add stacked dataset builder and P(county|CD) distributions #457
Open: baogorek wants to merge 21 commits into main from district-h5
Core components:
- sparse_matrix_builder.py: Database-driven approach for building calibration matrices
- calibration_utils.py: Shared utilities (cache clearing, constraints, geo helpers)
- matrix_tracer.py: Debugging utility for tracing through sparse matrices
- create_stratified_cps.py: Create stratified sample preserving high-income households
- test_sparse_matrix_builder.py: 6 verification tests for matrix correctness

Data pipeline changes:
- Add GEO_STACKING env var to cps.py and puf.py for geo-stacking data generation
- Add GEO_STACKING_MODE env var to extended_cps.py
- Add CPS_2024_Full, PUF_2023, ExtendedCPS_2023 classes
- Add policy_data.db download to prerequisites
- Add 'make data-geo' target for geo-stacking data pipeline

CI/CD:
- Add geo-stacking dataset build step to workflow
- Add sparse matrix builder test step after geo data generation

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
- Move sparse matrix tests to tests/test_local_area_calibration/
- Split large test file into focused modules (column indexing, same-state, cross-state, geo masking)
- Fix small_enhanced_cps.py enum encoding (decode_to_str before astype)
- Fix create_stratified_cps.py to use local storage instead of HuggingFace
- Remove CPS_2024_Full to keep PR minimal
- Revert ExtendedCPS_2024 to use CPS_2024
…tionality
- Rename GEO_STACKING to LOCAL_AREA_CALIBRATION in cps.py, puf.py, extended_cps.py
- Rename data-geo to data-local-area in Makefile and workflow
- Add create_target_groups function to calibration_utils.py
- Enhance MatrixTracer with get_group_rows method and variable_desc in row catalog
- Add TARGET GROUPS section to print_matrix_structure output
- Add local_area_calibration_setup.ipynb documentation notebook
…t builder
- Add make_county_cd_distributions.py to compute P(county|CD) from Census block data
- Add county_cd_distributions.csv with distributions for all 436 CDs
- Add county_assignment.py module for assigning counties to households
- Add stacked_dataset_builder.py for creating CD-stacked H5 datasets
- Add tests for county assignment functionality
- Update calibration_utils.py with state/CD mapping utilities
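A minimal sketch of how P(county|CD) can be derived from block-level populations and then used to draw a county for each household in a district. This is an illustration under stated assumptions, not the code in make_county_cd_distributions.py or county_assignment.py: the column names (`cd`, `county`, `population`), both function names, and the toy block data are hypothetical.

```python
import numpy as np
import pandas as pd


def county_given_cd(blocks: pd.DataFrame) -> pd.DataFrame:
    """Return P(county | CD) from block-level population counts.

    Assumes hypothetical columns: 'cd', 'county', 'population'.
    """
    # Sum block populations within each (CD, county) pair.
    pop = blocks.groupby(["cd", "county"], as_index=False)["population"].sum()
    # Normalize within each CD so the probabilities for a CD sum to 1.
    cd_total = pop.groupby("cd")["population"].transform("sum")
    pop["probability"] = pop["population"] / cd_total
    return pop[["cd", "county", "probability"]]


def assign_counties(dist: pd.DataFrame, cd: str, n_households: int, seed: int = 0) -> np.ndarray:
    """Draw one county per household from P(county | CD) for a single CD."""
    rows = dist[dist["cd"] == cd]
    rng = np.random.default_rng(seed)
    return rng.choice(
        rows["county"].to_numpy(),
        size=n_households,
        p=rows["probability"].to_numpy(),
    )


# Toy block data (made-up populations, for illustration only).
blocks = pd.DataFrame(
    {
        "cd": ["0601", "0601", "0601", "0602"],
        "county": ["06001", "06001", "06013", "06013"],
        "population": [300, 100, 600, 50],
    }
)
dist = county_given_cd(blocks)
draws = assign_counties(dist, "0601", n_households=5)
```

Population-weighting by block means a county that holds most of a district's residents receives most of that district's households, while a district contained in a single county always maps to it with probability 1.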
Closes #458
- New GitHub Actions workflow (local_area_publish.yaml) that:
  - Triggers on local_area_calibration/ changes, repository_dispatch, or manual
  - Downloads calibration inputs from HF calibration/ folder
  - Builds 51 state + 436 district H5 files with checkpointing
  - Uploads to GCP and HF states/ and districts/ subdirectories
- New publish_local_area.py script with:
  - Per-state and per-district checkpointing for spot instance resilience
  - Immediate upload after each file is built
  - Support for --states-only, --districts-only, --skip-download flags
- Added upload_local_area_file() to data_upload.py for subdirectory uploads
- Added download_calibration_inputs() to huggingface.py
- Added publish-local-area Makefile target
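The per-file checkpointing described above can be sketched as follows. This is a hedged illustration of the general pattern (record each finished item so a restarted job skips it), not the logic in publish_local_area.py; the function names, the `build_and_upload` callback, and the one-name-per-line checkpoint file are all assumptions.

```python
from pathlib import Path
from typing import Callable, Iterable


def load_completed(checkpoint: Path) -> set[str]:
    """Read the names already recorded as built (one per line)."""
    if not checkpoint.exists():
        return set()
    return {line.strip() for line in checkpoint.read_text().splitlines() if line.strip()}


def build_with_checkpoint(
    names: Iterable[str],
    build_and_upload: Callable[[str], None],
    checkpoint: Path,
) -> list[str]:
    """Build each item not yet in the checkpoint; record it right after it succeeds."""
    done = load_completed(checkpoint)
    built = []
    for name in names:
        if name in done:
            continue  # finished in an earlier (possibly interrupted) run
        build_and_upload(name)  # hypothetical callback: build the H5 file, then upload it
        # Append only after success, so a preempted run never marks a partial file as done.
        with checkpoint.open("a") as f:
            f.write(name + "\n")
        built.append(name)
    return built
```

Because the checkpoint is appended after every file rather than at the end of the run, a spot instance that is reclaimed mid-run loses at most the file it was working on.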
- download_private_prerequisites.py: Download from calibration/policy_data.db
- calibration_utils.py: Look for db in storage/calibration/
- conftest.py: Update test fixture path
- huggingface.py: Fix download_calibration_inputs to return correct paths
Create a minimal 50-household H5 fixture with known values for stable testing of the stacked dataset builder without relying on sampled stratified CPS data.
Cast np.arange output to int32 to match column dtype.
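For context, `np.arange` with integer arguments returns NumPy's default integer type, which is int64 on most 64-bit platforms, so values written into an int32 column can end up with a mismatched dtype. A small sketch of the two usual ways to get int32; the variable names are hypothetical and not taken from the PR.

```python
import numpy as np

n_households = 5  # hypothetical count

# Option 1: request the dtype when creating the range.
ids_direct = np.arange(n_households, dtype=np.int32)

# Option 2: cast an existing range after the fact.
ids_cast = np.arange(n_households).astype(np.int32)
```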
c9c6fd8 to 18d635a
…ication
- Add spm_unit_tenure_type mapping from SPM_TENMORTSTATUS in add_spm_variables
- Fix create_stratified_cps.py to use source sim's input_variables instead of empty sim
- Fix stacked_dataset_builder.py to use base_sim's input_variables instead of sparse_sim

The input_variables fix ensures variables like spm_unit_tenure_type are preserved when creating stratified/stacked datasets, since input_variables is only populated from variables that have actual data in the loaded dataset.
…lder
- Add spm-calculator integration for SPM threshold calculation
- Replace random placeholder geoadj with real values from Census ACS rent data
- Add load_cd_geoadj_values() to compute geoadj from median 2BR rents
- Add calculate_spm_thresholds_for_cd() to calculate SPM thresholds per CD
- Add CD rent data CSV and fetch script (requires CENSUS_API_KEY)
- Update .gitignore to track rent CSV
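As a rough illustration of a rent-based geographic adjustment: one common form scales only the housing portion of the threshold by the ratio of local to national median rent and leaves the rest unchanged. The sketch below assumes that form. The function name `geoadj_from_rent`, the `housing_share` default of 0.5, and the rent figures are placeholders chosen for easy arithmetic; they are not taken from load_cd_geoadj_values() or from published Census parameters.

```python
def geoadj_from_rent(local_rent: float, national_rent: float, housing_share: float = 0.5) -> float:
    """Geographic adjustment factor from a local-to-national median rent ratio.

    Assumed form: geoadj = share * (local / national) + (1 - share).
    housing_share is an illustrative placeholder, not an official value.
    """
    if national_rent <= 0:
        raise ValueError("national_rent must be positive")
    if not 0.0 <= housing_share <= 1.0:
        raise ValueError("housing_share must be between 0 and 1")
    rent_index = local_rent / national_rent
    return housing_share * rent_index + (1.0 - housing_share)


# A district whose median rent is 50% above the national median (made-up numbers).
high_cost = geoadj_from_rent(local_rent=1500.0, national_rent=1000.0)
# A district at exactly the national median gets no adjustment.
neutral = geoadj_from_rent(local_rent=1000.0, national_rent=1000.0)
```

Under this form a district at the national median rent always gets a factor of 1, and only the housing share of the threshold moves with local rents.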
- Add upload_local_area_batch_to_hf() to batch multiple files per commit
- Add skip_hf parameter to upload_local_area_file() for GCP-only uploads
- Modify publish_local_area.py to batch HF uploads (10 files per commit)
- Fix at-large district geoadj lookup (XX01 -> XX00 mapping for AK, DE, etc.)
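Two of the ideas in this commit can be sketched in a few lines: grouping files into fixed-size batches so each batch becomes one commit, and falling back from an XX01 district code to XX00 when the rent data lists an at-large district under 00. Both helpers are hypothetical illustrations, not the PR's upload_local_area_batch_to_hf() or its lookup code; in particular, the fallback rule (try XX00 only when XX01 is missing) is an assumption about how the mapping could work.

```python
from typing import Iterator, Sequence


def batched(files: Sequence[str], size: int = 10) -> Iterator[list[str]]:
    """Yield consecutive groups of at most `size` files (one group per commit)."""
    for start in range(0, len(files), size):
        yield list(files[start:start + size])


def rent_lookup_key(cd_geoid: str, rent_keys: set[str]) -> str:
    """Map an at-large district coded XX01 to XX00 when only XX00 is in the rent data."""
    if cd_geoid not in rent_keys and cd_geoid.endswith("01"):
        at_large = cd_geoid[:-2] + "00"
        if at_large in rent_keys:
            return at_large
    return cd_geoid


# 23 hypothetical file names split into batches of 10.
batches = list(batched([f"district_{i}.h5" for i in range(23)]))

# Made-up key set: Alaska appears as 0200 in the rent data, California's 1st as 0601.
keys = {"0200", "0601"}
alaska_key = rent_lookup_key("0201", keys)
california_key = rent_lookup_key("0601", keys)
```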
… to gitignore

Pseudo-inputs are variables with adds/subtracts that aggregate formula-based components. Saving their stale pre-computed values corrupts calculations when the dataset is reloaded.
…e_type
- Accept main's SPM threshold calculation using calculate_spm_thresholds_with_geoadj()
- Preserve branch's spm_unit_tenure_type variable for local area calibration
- Refactor calibration_utils.py to import TENURE_CODE_MAP from utils/spm.py
- Remove duplicate SPM_TENURE_CODE_TO_CALC definition
Summary
- stacked_dataset_builder.py for creating CD-stacked H5 datasets from calibrated weights
- county_assignment.py module for assigning counties to households based on congressional district

Key Features

Test Plan
- test_county_assignment.py)

🤖 Generated with Claude Code