Skip to content

feat: Multiple ETL refinements and fixes#260

Merged
petercarbsmith merged 33 commits intomainfrom
rebased_etl_fixes_clean
Apr 14, 2026
Merged

feat: Multiple ETL refinements and fixes#260
petercarbsmith merged 33 commits intomainfrom
rebased_etl_fixes_clean

Conversation

@petercarbsmith
Copy link
Copy Markdown
Collaborator

📄 Description

This PR consolidates several critical ETL improvements, bug fixes, and refactors that were previously developed in the etl_fixes branch. It has been rebased onto the latest upstream/main (commit d961724) to ensure compatibility with recent structural changes to the data portal views and Alembic migration workflow.

Key Workstreams Included:

  1. Field Sample ETL v03 Refactor: Migration from the legacy single-worksheet architecture to a robust, multi-worksheet strategy (SampleMetadata_v03-BioCirV).
  2. County Ag Report ETL: Complete pipeline for extracting and loading California county agricultural production data.
  3. Resource Images ETL: New infrastructure for managing and linking images to resource records.
  4. Fermentation & Pretreatment Fixes: Improved normalization, unit handling (e.g., inoculum_volume_l), duplicate record prevention, and Strain Normalization.
  5. Flow Orchestration: Updated master ETL flow to include new pipelines.

✅ Checklist

  • I ran pre-commit run --all-files and all checks pass
  • Tests added/updated where needed (42 tests passed)
  • Docs added/updated if applicable
  • I have linked the issue this PR closes (if any)

🔗 Related Issues

Resolves # (Multiple internal work items for ETL cleanup and refactoring)

💡 Type of change

Type Checked?
🐞 Bug fix [x]
✨ New feature [x]
📝 Documentation [ ]
♻️ Refactor [x]
🛠️ Build/CI [ ]
Other (explain) [ ]

🧪 How to test

1. Run Integration Tests

pixi run test tests/pipeline/test_field_sample_v03_integration.py
pixi run test tests/pipeline/test_county_ag_report_etl.py
pixi run test tests/pipeline/test_fermentation_record_etl.py
pixi run test tests/pipeline/test_resource_images_etl.py

Expected Results: All 42 tests should pass.

2. Run Quality Checks

pixi run pre-commit-all

Expected Results: All checks should pass.

📝 Notes to reviewers

Implementation Highlights:

  • Field Sample v03: Normalizes data across four worksheets (Sample_IDs, Sample_Desc, Qty_FieldStorage, Producers). Uses multi-way left-joins to ensure all 137 base sample records are preserved.
  • County Ag Report: Maps county names to Place.geoid and handles unit normalizations for crop production.
  • Prefect Infrastructure: Fixed timezone errors in docker-compose.yml (TZ=UTC) and corrected volume mounts.
  • Data Integrity: Added empty-dataframe guards in pretreatment transforms and fixed unit normalization case mismatches in fermentation tests.
  • Strain Normalization: Implemented a robust mapping for fermentation strains. The pipeline now extracts strain names from the BioConv_Method source column, populates a unique strain lookup table, and correctly links FermentationRecord.strain_id during transformation.

Migration Notes:

The PR includes alembic/versions/bd227e99e006_add_fermentation_method_fields_resource_.py which:

  • Adds the necessary fields for fermentation and resource images.
  • Adds a unique constraint on strain.name to support idempotent upserts.
  • Is compatible with the current database head.

- Add two actual migrations (drop incumbents, recreate test view)
- Create alembic/AGENTS.md with migration template patterns
- Create DATA_PORTAL_VIEWS_REFACTOR.md comprehensive guide
- Create Phase 5 next steps plan documenting remaining tasks
- All views ready for one-by-one recreation with new modular approach
- Readonly user permissions and indexes documented
This commit addresses the fragility of SQLAlchemy-generated migrations when
replaying from scratch (teardown→rebuild scenarios).

Problem: When SQLAlchemy models are imported at migration replay time,
if schema has changed since the migration was created, the view fails to build
and breaks the entire migration chain.

Solution: Embed raw SQL as immutable strings in migration files. This is
the industry-standard pattern (Liquibase, Flyway, major Alembic projects).

Changes:

1. alembic/AGENTS.md - UPDATED
   - Clarified that raw SQL snapshots are the recommended approach
   - Added section explaining why (teardown→rebuild safety)
   - Documented both recommended pattern (raw SQL) and legacy pattern (imports)
   - Updated key patterns section

2. alembic/versions/9e8f7a6b5c4d_drop_incumbent_data_portal_views.py - FIXED
   - Changed down_revision from '63c0fedd3446' to '60b08397200f'
   - Resolved alembic multiple heads issue

3. alembic/versions/9e8f7a6b5c4e_recreate_mv_biomass_search_with_raw_sql.py - NEW
   - Example migration showing raw SQL snapshot pattern
   - Demonstrates DROP → COMPILE → CREATE → INDEX → GRANT pattern
   - SQL is embedded as immutable string, not runtime-evaluated

4. alembic/VIEW_SQL_REFERENCE.md - NEW
   - Reference documentation for all compiled view SQL
   - Copy from here when creating new migrations
   - Includes indexes for each view

5. scripts/extract_view_sql.py - NEW
   - Utility to extract compiled SQL from SQLAlchemy view definitions
   - Run this when view definitions change and you need to update migrations

6. scripts/generate_raw_sql_migration.py - NEW
   - Helper script for generating migration templates with raw SQL

Key Benefits:
- Migrations work on any replay, even with future schema changes
- Full audit trail via migration history
- Industry-standard approach
- No runtime dependency on current SQLAlchemy definitions
- Consolidated migration: 9e8f7a6b5c4f_recreate_remaining_8_views_with_raw_sql.py
  - Recreates all 8 remaining materialized views with raw SQL snapshots
  - Single atomic operation (safer than 8 individual migrations)
  - Follows pattern: DROP → CREATE → INDEX → GRANT
  - Syntax verified and ready for application

- Generator script: scripts/generate_view_migrations.py
  - Demonstrates automated migration generation approach
  - Reference for future view migrations if needed

All 8 views included in consolidation:
- mv_biomass_availability
- mv_biomass_composition
- mv_biomass_county_production
- mv_biomass_sample_stats
- mv_biomass_fermentation
- mv_biomass_gasification
- mv_biomass_pricing
- mv_usda_county_production

Previous individual migrations cleaned up (now deleted):
- 9e8f7a6b5c4d_drop_incumbent_data_portal_views.py
- 9e8f7a6b5c4d_recreate_mv_biomass_search_with_modular_approach.py
- 9e8f7a6b5c4e_recreate_mv_biomass_search_with_raw_sql.py
Changed production_energy_content_unit_id to energy_content_unit_id
to match the actual database schema in billion_ton2023_record table.
PostgreSQL GRANT syntax updated to explicitly grant SELECT on each
materialized view individually rather than using bulk ALL syntax.

Views granted permissions:
- mv_biomass_availability
- mv_biomass_composition
- mv_biomass_county_production
- mv_biomass_sample_stats
- mv_biomass_fermentation
- mv_biomass_gasification
- mv_biomass_pricing
- mv_usda_county_production

Migration 9e8f7a6b5c4f now applies successfully.
- Added TZ=UTC environment variable to prefect-server and prefect-worker
- Added /etc/timezone and /etc/localtime volume mounts for timezone support
- Fixes 'whenever.TimeZoneNotFoundError: No time zone found at path /etc/localtime'
  when running Prefect flows

This resolves the issue when attempting to run ETL flows via Prefect CLI.
- Create comprehensive integration test suite (18 tests covering extract, transform, load)
- Add pytest fixtures with realistic mock data (137, 104, 130, 64 rows)
- Register flow with run_prefect_flow.py orchestrator
- Execute flow with real Google Sheets data - all extractors and transforms successful
- Fix critical provider_id population bug: normalize column name 'providercode' (no underscore)
- Pass all pre-commit quality checks (linting, formatting, spell check, YAML validation)
- Test validation: multi-way join preserves all 137 base records, LocationAddress deduplication working, field extraction quality verified
- Remove deprecated src/ca_biositing/pipeline/etl/extract/samplemetadata.py
- Remove old v01/v02 transform files:
  - src/ca_biositing/pipeline/etl/transform/field_sampling/field_sample.py
  - src/ca_biositing/pipeline/etl/transform/field_sampling/location_address.py
- Remove associated old unit tests:
  - src/ca_biositing/pipeline/tests/test_field_sample_transform.py
  - src/ca_biositing/pipeline/tests/test_location_address_transform.py

v03 extractors and transforms are now the canonical implementation:
- sample_ids, sample_desc, qty_field_storage, producers extractors
- field_sample_v03, location_address_v03 transforms
- Comprehensive integration test suite in tests/pipeline/
- Create resource_images extract module using factory pattern
- Create resource_image transform module with normalization and lineage tracking
- Create resource_image load module with upsert pattern
- Update resource_information flow with proper dependency ordering
- Add ResourceImage to models __init__ exports
- Add comprehensive test suite (16 tests, all passing)
- All pre-commit checks passed

Implements Phase 2 of etl_improvements_plan.md with:
- Extract from Google Sheets worksheet '08.0_Resource_images'
- Transform with resource name normalization to resource_id
- Load with upsert on (resource_id, image_url) unique constraint
- Proper ETL lineage tracking and dependency ordering
@read-the-docs-community
Copy link
Copy Markdown

read-the-docs-community bot commented Apr 13, 2026

@mglbleta
Copy link
Copy Markdown
Collaborator

mglbleta commented Apr 14, 2026

Hey @petercarbsmith Peter!

The ETL seems to be failing for me at multiple junctures:

image image

County Ag Report tests also fail due to missing modules (county_ag_report_record / county_ag_report_observation in transform/load analysis). Could it be incomplete commit?

image

I'll take another look tomorrow. Let me know if you want me to resolve the error and upload commit on my end or let you add another commit and try again. Thanks!

@petercarbsmith
Copy link
Copy Markdown
Collaborator Author

Hey @mglbleta. I am guessing the first one is failing because you will have to share your credentials.json email with the Samplemetadata V.03 sheet. Not sure what is up with that county ag report record though. Looks like it is having trouble finding the modules. It's all working for me locally, but I can try to push again and see if it fixes it.

Copy link
Copy Markdown
Collaborator

@mglbleta mglbleta left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey Peter! Finished review. Nothing critical, just some questions from me. You or I can merge after you see my notes.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to add these to DB or are they kind of excess? @petercarbsmith

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh these were me exploring the data with the AI. Actually I should probably just remove these from the repo also!

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same with this exploration file

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is "explore" part of the ETL or a one off exploration you did?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep one off type thing.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey Peter! Curious for other analyses adding to the data_source table, do we anticipate adding to this one transform/analysis/data_source file?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a good flag and honestly yea this crossed my mind while doing this that it seems like the less scalable way of handling data source. I think we will at some point want to revisit this.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will we eventually remove the _v03 suffix? I think it's a little non intuitive for long term for database. not a major issue, just a thought

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmmm yea I should probably go in and clean this up. Thanks!

@petercarbsmith petercarbsmith merged commit bd0ec5e into main Apr 14, 2026
22 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants