Skip to content

refactor: Extract shared DataFrame normalization logic#6

Open
dshkol wants to merge 6 commits intorefactor/resilient-session-integrationfrom
refactor/dataframe-processing
Open

refactor: Extract shared DataFrame normalization logic#6
dshkol wants to merge 6 commits intorefactor/resilient-session-integrationfrom
refactor/dataframe-processing

Conversation

@dshkol
Copy link
Owner

@dshkol dshkol commented Jan 15, 2026

Summary

Consolidate duplicate code between _process_csv_response and _process_geojson_response into a shared _normalize_census_dataframe() function.

Changes

  • Added _normalize_census_dataframe() that handles:
    • Census NA value conversion ('x', 'X', 'F', '...', '-', '')
    • Numeric column dtype conversion (Population, Households, etc.)
    • Categorical column dtype conversion (Type, Region Name)
    • Vector metadata extraction
    • Both CSV endpoint names (Population) and GeoJSON short names (pop)
  • Simplified _process_csv_response from ~60 lines to ~10 lines
  • Simplified _process_geojson_response from ~70 lines to ~10 lines

Why

  • DRY principle: Removed ~60 lines of duplicate code
  • Reduced bug surface: One place to fix data type handling
  • Consistent behavior: CSV and GeoJSON endpoints now guaranteed to handle data identically

Test plan

  • All existing tests pass
  • Added test_normalize_census_dataframe_census_na_values
  • Added test_normalize_census_dataframe_geojson_short_names
  • Added test_normalize_produces_equivalent_results

🤖 Generated with Claude Code

dshkol and others added 2 commits January 14, 2026 22:45
This commit makes two code organization improvements:

1. Move all inline imports to the top of their respective files
   - Follows Python convention (PEP 8)
   - Improves static analysis and IDE support
   - Makes dependencies visible at a glance

2. Centralize API URLs in settings.py
   - CENSUSMAPPER_API_URL for api/v1 endpoints
   - CENSUSMAPPER_DATA_URL for data_sets endpoints
   - Single source of truth eliminates drift risk

Files modified:
- settings.py: Add URL constants
- core.py: Move json, io, hashlib imports; use URL constant
- vectors.py: Move io, warnings imports; use URL constant
- regions.py: Move io import; use URL constant
- datasets.py: Move re import; use URL constant
- hierarchy.py: Move re import to top
- intersect_geometry.py: Reorganize imports; use URL constant
- resilience.py: Move atexit import to top

All existing tests pass unchanged, confirming no behavioral impact.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The unconditional import of RPythonBridge was causing CI failures
since the cross_validation module is only available locally.
Wrap the import in try/except like other test files.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@dshkol dshkol force-pushed the refactor/resilient-session-integration branch from 3b1502e to e3d38ac Compare January 17, 2026 23:47
@dshkol dshkol force-pushed the refactor/dataframe-processing branch from f7955c0 to 13ef466 Compare January 17, 2026 23:47
dshkol and others added 3 commits January 17, 2026 15:53
- Update actions/checkout@v3 → v4 in validate_examples.yml
- Update actions/upload-artifact@v3 → v4 in validate_examples.yml
- Update codecov/codecov-action@v3 → v4 in ci.yml

These v3 actions are deprecated and cause CI warnings/failures.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Replace direct requests.get/post calls with get_session().get/post
to leverage the existing resilience infrastructure that provides:
- Connection pooling for improved performance
- Automatic retries with exponential backoff
- Rate limiting (100ms minimum between requests)
- Consistent error handling

Changes:
- core.py: 3 requests.post calls → get_session().post
- datasets.py: 1 requests.get call → get_session().get
- vectors.py: 1 requests.get call → get_session().get
- regions.py: 1 requests.get call → get_session().get
- intersect_geometry.py: 1 requests.post call → get_session().post
  (kept 60s timeout for geometry operations)

Removed redundant raise_for_status() calls since ResilientSession
handles HTTP errors internally.

Updated test mocks to patch get_session instead of requests module.
Added test_resilient_session_is_used to verify the session is called.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@dshkol dshkol force-pushed the refactor/resilient-session-integration branch from e3d38ac to f18c0c8 Compare January 17, 2026 23:53
Consolidate duplicate code between _process_csv_response and
_process_geojson_response into a shared _normalize_census_dataframe()
function.

The new function handles:
- Census NA value conversion ('x', 'X', 'F', '...', '-', '')
- Numeric column dtype conversion (Population, Households, etc.)
- Categorical column dtype conversion (Type, Region Name)
- Vector metadata extraction
- Both CSV endpoint names (Population) and GeoJSON short names (pop)

This reduces code duplication by ~60 lines and ensures consistent
data handling across both CSV and GeoJSON endpoints.

Added 3 new unit tests:
- test_normalize_census_dataframe_census_na_values
- test_normalize_census_dataframe_geojson_short_names
- test_normalize_produces_equivalent_results

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@dshkol dshkol force-pushed the refactor/dataframe-processing branch from 13ef466 to 1a67981 Compare January 17, 2026 23:53
@dshkol dshkol force-pushed the refactor/resilient-session-integration branch from f18c0c8 to df550d4 Compare January 18, 2026 01:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant