refactor: Extract shared DataFrame normalization logic by dshkol · Pull Request #6 · dshkol/pycancensus

dshkol · 2026-01-15T07:00:35Z

Summary

Consolidate duplicate code between _process_csv_response and _process_geojson_response into a shared _normalize_census_dataframe() function.

Changes

Added _normalize_census_dataframe() that handles:
- Census NA value conversion ('x', 'X', 'F', '...', '-', '')
- Numeric column dtype conversion (Population, Households, etc.)
- Categorical column dtype conversion (Type, Region Name)
- Vector metadata extraction
- Both CSV endpoint names (Population) and GeoJSON short names (pop)
Simplified _process_csv_response from ~60 lines to ~10 lines
Simplified _process_geojson_response from ~70 lines to ~10 lines

Why

DRY principle: Removed ~60 lines of duplicate code
Reduced bug surface: One place to fix data type handling
Consistent behavior: CSV and GeoJSON endpoints now guaranteed to handle data identically

Test plan

All existing tests pass
Added test_normalize_census_dataframe_census_na_values
Added test_normalize_census_dataframe_geojson_short_names
Added test_normalize_produces_equivalent_results

🤖 Generated with Claude Code

This commit makes two code organization improvements: 1. Move all inline imports to the top of their respective files - Follows Python convention (PEP 8) - Improves static analysis and IDE support - Makes dependencies visible at a glance 2. Centralize API URLs in settings.py - CENSUSMAPPER_API_URL for api/v1 endpoints - CENSUSMAPPER_DATA_URL for data_sets endpoints - Single source of truth eliminates drift risk Files modified: - settings.py: Add URL constants - core.py: Move json, io, hashlib imports; use URL constant - vectors.py: Move io, warnings imports; use URL constant - regions.py: Move io import; use URL constant - datasets.py: Move re import; use URL constant - hierarchy.py: Move re import to top - intersect_geometry.py: Reorganize imports; use URL constant - resilience.py: Move atexit import to top All existing tests pass unchanged, confirming no behavioral impact. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

The unconditional import of RPythonBridge was causing CI failures since the cross_validation module is only available locally. Wrap the import in try/except like other test files. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

- Update actions/checkout@v3 → v4 in validate_examples.yml - Update actions/upload-artifact@v3 → v4 in validate_examples.yml - Update codecov/codecov-action@v3 → v4 in ci.yml These v3 actions are deprecated and cause CI warnings/failures. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Replace direct requests.get/post calls with get_session().get/post to leverage the existing resilience infrastructure that provides: - Connection pooling for improved performance - Automatic retries with exponential backoff - Rate limiting (100ms minimum between requests) - Consistent error handling Changes: - core.py: 3 requests.post calls → get_session().post - datasets.py: 1 requests.get call → get_session().get - vectors.py: 1 requests.get call → get_session().get - regions.py: 1 requests.get call → get_session().get - intersect_geometry.py: 1 requests.post call → get_session().post (kept 60s timeout for geometry operations) Removed redundant raise_for_status() calls since ResilientSession handles HTTP errors internally. Updated test mocks to patch get_session instead of requests module. Added test_resilient_session_is_used to verify the session is called. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Consolidate duplicate code between _process_csv_response and _process_geojson_response into a shared _normalize_census_dataframe() function. The new function handles: - Census NA value conversion ('x', 'X', 'F', '...', '-', '') - Numeric column dtype conversion (Population, Households, etc.) - Categorical column dtype conversion (Type, Region Name) - Vector metadata extraction - Both CSV endpoint names (Population) and GeoJSON short names (pop) This reduces code duplication by ~60 lines and ensures consistent data handling across both CSV and GeoJSON endpoints. Added 3 new unit tests: - test_normalize_census_dataframe_census_na_values - test_normalize_census_dataframe_geojson_short_names - test_normalize_produces_equivalent_results Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

dshkol and others added 2 commits January 14, 2026 22:45

dshkol force-pushed the refactor/resilient-session-integration branch from 3b1502e to e3d38ac Compare January 17, 2026 23:47

dshkol force-pushed the refactor/dataframe-processing branch from f7955c0 to 13ef466 Compare January 17, 2026 23:47

dshkol and others added 3 commits January 17, 2026 15:53

style: Format test files with black

f18c0c8

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

dshkol force-pushed the refactor/resilient-session-integration branch from e3d38ac to f18c0c8 Compare January 17, 2026 23:53

dshkol force-pushed the refactor/dataframe-processing branch from 13ef466 to 1a67981 Compare January 17, 2026 23:53

dshkol force-pushed the refactor/resilient-session-integration branch from f18c0c8 to df550d4 Compare January 18, 2026 01:02

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

refactor: Extract shared DataFrame normalization logic#6

refactor: Extract shared DataFrame normalization logic#6
dshkol wants to merge 6 commits intorefactor/resilient-session-integrationfrom
refactor/dataframe-processing

dshkol commented Jan 15, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

dshkol commented Jan 15, 2026

Summary

Changes

Why

Test plan

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant