feat: Multiple ETL refinements and fixes #260
- Add two actual migrations (drop incumbents, recreate test view)
- Create alembic/AGENTS.md with migration template patterns
- Create DATA_PORTAL_VIEWS_REFACTOR.md comprehensive guide
- Create Phase 5 next steps plan documenting remaining tasks
- All views ready for one-by-one recreation with new modular approach
- Readonly user permissions and indexes documented
This commit addresses the fragility of SQLAlchemy-generated migrations when replaying from scratch (teardown→rebuild scenarios).
Problem: When SQLAlchemy models are imported at migration replay time, if the schema has changed since the migration was created, the view fails to build and breaks the entire migration chain.
Solution: Embed raw SQL as immutable strings in migration files. This is the industry-standard pattern (Liquibase, Flyway, major Alembic projects).
Changes:
1. alembic/AGENTS.md - UPDATED
   - Clarified that raw SQL snapshots are the recommended approach
   - Added section explaining why (teardown→rebuild safety)
   - Documented both recommended pattern (raw SQL) and legacy pattern (imports)
   - Updated key patterns section
2. alembic/versions/9e8f7a6b5c4d_drop_incumbent_data_portal_views.py - FIXED
   - Changed down_revision from '63c0fedd3446' to '60b08397200f'
   - Resolved alembic multiple heads issue
3. alembic/versions/9e8f7a6b5c4e_recreate_mv_biomass_search_with_raw_sql.py - NEW
   - Example migration showing raw SQL snapshot pattern
   - Demonstrates DROP → COMPILE → CREATE → INDEX → GRANT pattern
   - SQL is embedded as immutable string, not runtime-evaluated
4. alembic/VIEW_SQL_REFERENCE.md - NEW
   - Reference documentation for all compiled view SQL
   - Copy from here when creating new migrations
   - Includes indexes for each view
5. scripts/extract_view_sql.py - NEW
   - Utility to extract compiled SQL from SQLAlchemy view definitions
   - Run this when view definitions change and you need to update migrations
6. scripts/generate_raw_sql_migration.py - NEW
   - Helper script for generating migration templates with raw SQL
Key Benefits:
- Migrations work on any replay, even with future schema changes
- Full audit trail via migration history
- Industry-standard approach
- No runtime dependency on current SQLAlchemy definitions
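The raw SQL snapshot pattern described in this commit could look roughly like the sketch below. It is not the contents of the actual migration: the view body, the `resource` table, the index name, and the `readonly_user` role are all placeholders invented for illustration. The statements are built as plain strings and handed to an `execute` callable, so the module never imports the current SQLAlchemy models.

```python
"""Sketch of a raw-SQL snapshot migration for one materialized view."""
from typing import Callable, List

# Hypothetical names, for illustration only.
VIEW_NAME = "mv_biomass_search"
READONLY_ROLE = "readonly_user"

# Frozen snapshot of the compiled view SQL (placeholder body). Because it is
# a literal string, replaying the migration does not depend on model code.
VIEW_SQL = """
SELECT r.id AS resource_id, r.name AS resource_name
FROM resource AS r
"""


def upgrade_statements() -> List[str]:
    """Return the DROP -> CREATE -> INDEX -> GRANT statements in order."""
    return [
        f"DROP MATERIALIZED VIEW IF EXISTS {VIEW_NAME} CASCADE",
        f"CREATE MATERIALIZED VIEW {VIEW_NAME} AS {VIEW_SQL.strip()}",
        f"CREATE UNIQUE INDEX idx_{VIEW_NAME}_resource_id "
        f"ON {VIEW_NAME} (resource_id)",
        f"GRANT SELECT ON {VIEW_NAME} TO {READONLY_ROLE}",
    ]


def downgrade_statements() -> List[str]:
    """Dropping the view also removes its index and grants."""
    return [f"DROP MATERIALIZED VIEW IF EXISTS {VIEW_NAME} CASCADE"]


def run(execute: Callable[[str], object], statements: List[str]) -> None:
    """Apply each statement through the supplied executor, in order."""
    for statement in statements:
        execute(statement)
```

In an Alembic revision file, `upgrade()` would presumably call `run(op.execute, upgrade_statements())` after `from alembic import op`, and `downgrade()` would do the same with `downgrade_statements()`.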
- Consolidated migration: 9e8f7a6b5c4f_recreate_remaining_8_views_with_raw_sql.py
  - Recreates all 8 remaining materialized views with raw SQL snapshots
  - Single atomic operation (safer than 8 individual migrations)
  - Follows pattern: DROP → CREATE → INDEX → GRANT
  - Syntax verified and ready for application
- Generator script: scripts/generate_view_migrations.py
  - Demonstrates automated migration generation approach
  - Reference for future view migrations if needed
All 8 views included in consolidation:
- mv_biomass_availability
- mv_biomass_composition
- mv_biomass_county_production
- mv_biomass_sample_stats
- mv_biomass_fermentation
- mv_biomass_gasification
- mv_biomass_pricing
- mv_usda_county_production
Previous individual migrations cleaned up (now deleted):
- 9e8f7a6b5c4d_drop_incumbent_data_portal_views.py
- 9e8f7a6b5c4d_recreate_mv_biomass_search_with_modular_approach.py
- 9e8f7a6b5c4e_recreate_mv_biomass_search_with_raw_sql.py
Changed production_energy_content_unit_id to energy_content_unit_id to match the actual database schema in billion_ton2023_record table.
PostgreSQL GRANT syntax updated to explicitly grant SELECT on each materialized view individually rather than using bulk ALL syntax.
Views granted permissions:
- mv_biomass_availability
- mv_biomass_composition
- mv_biomass_county_production
- mv_biomass_sample_stats
- mv_biomass_fermentation
- mv_biomass_gasification
- mv_biomass_pricing
- mv_usda_county_production
Migration 9e8f7a6b5c4f now applies successfully.
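One way to produce the per-view grants is to generate a statement for each of the eight views named in this commit. This is a sketch only: the role name `readonly_user` is an assumption, and the migration itself may simply list the statements by hand.

```python
from typing import Iterable, List

# The eight materialized views named in the commit message.
MATERIALIZED_VIEWS = [
    "mv_biomass_availability",
    "mv_biomass_composition",
    "mv_biomass_county_production",
    "mv_biomass_sample_stats",
    "mv_biomass_fermentation",
    "mv_biomass_gasification",
    "mv_biomass_pricing",
    "mv_usda_county_production",
]


def grant_select_statements(views: Iterable[str], role: str) -> List[str]:
    """Build one explicit GRANT SELECT statement per view."""
    return [f"GRANT SELECT ON {view} TO {role}" for view in views]


# "readonly_user" is a hypothetical role name used for illustration.
GRANTS = grant_select_statements(MATERIALIZED_VIEWS, "readonly_user")
```

Listing the views explicitly keeps the migration's effect visible in the file itself, which fits the snapshot approach used for the view SQL.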
- Added TZ=UTC environment variable to prefect-server and prefect-worker
- Added /etc/timezone and /etc/localtime volume mounts for timezone support
- Fixes 'whenever.TimeZoneNotFoundError: No time zone found at path /etc/localtime' when running Prefect flows
This resolves the issue when attempting to run ETL flows via Prefect CLI.
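To confirm the timezone fix inside a container, a small diagnostic along these lines could be run from the Prefect worker. It is an illustrative helper, not part of the PR; it only reports whether `TZ` is set and whether `/etc/localtime` is present.

```python
import os
from typing import Dict, Mapping, Optional


def timezone_report(
    environ: Optional[Mapping[str, str]] = None,
    localtime_path: str = "/etc/localtime",
) -> Dict[str, object]:
    """Report the TZ variable and whether the localtime file exists."""
    env = os.environ if environ is None else environ
    return {
        "TZ": env.get("TZ"),
        "localtime_exists": os.path.exists(localtime_path),
    }


if __name__ == "__main__":
    # After the docker-compose change, TZ should read "UTC" and the
    # localtime file should be present in both Prefect containers.
    print(timezone_report())
```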
…te f98d1a9fe9a7 parent
…/ca-biositing into fix-views_alembic_refactor
- Create comprehensive integration test suite (18 tests covering extract, transform, load)
- Add pytest fixtures with realistic mock data (137, 104, 130, 64 rows)
- Register flow with run_prefect_flow.py orchestrator
- Execute flow with real Google Sheets data - all extractors and transforms successful
- Fix critical provider_id population bug: normalize column name 'providercode' (no underscore)
- Pass all pre-commit quality checks (linting, formatting, spell check, YAML validation)
- Test validation: multi-way join preserves all 137 base records, LocationAddress deduplication working, field extraction quality verified
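The two ideas in this commit, normalizing headers so that `providercode` is found and joining without losing base rows, can be sketched in plain Python as below. The real transform most likely works on DataFrames; the helper names and the `sampleid` join key here are assumptions made for the example.

```python
import re
from typing import Dict, Iterable, List

Row = Dict[str, str]


def normalize_header(name: str) -> str:
    """Lowercase and drop non-alphanumerics: 'Provider_Code' -> 'providercode'."""
    return re.sub(r"[^a-z0-9]", "", name.lower())


def normalize_rows(rows: Iterable[Row]) -> List[Row]:
    """Apply header normalization to every row."""
    return [{normalize_header(k): v for k, v in row.items()} for row in rows]


def left_join(base: Iterable[Row], other: Iterable[Row], key: str) -> List[Row]:
    """Left join that keeps every base row exactly once.

    If `other` repeats a key, the first match wins, so the join cannot
    fan out and inflate the base row count. Unmatched base rows are kept
    without the extra columns.
    """
    lookup: Dict[str, Row] = {}
    for row in other:
        lookup.setdefault(row[key], row)
    joined = []
    for row in base:
        match = lookup.get(row[key], {})
        joined.append({**match, **row})  # base values take precedence
    return joined
```

A row-count assertion after each join (for example, that the output still has as many rows as the base sheet) is a cheap guard against accidental fan-out.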
- Remove deprecated src/ca_biositing/pipeline/etl/extract/samplemetadata.py
- Remove old v01/v02 transform files:
  - src/ca_biositing/pipeline/etl/transform/field_sampling/field_sample.py
  - src/ca_biositing/pipeline/etl/transform/field_sampling/location_address.py
- Remove associated old unit tests:
  - src/ca_biositing/pipeline/tests/test_field_sample_transform.py
  - src/ca_biositing/pipeline/tests/test_location_address_transform.py
v03 extractors and transforms are now the canonical implementation:
- sample_ids, sample_desc, qty_field_storage, producers extractors
- field_sample_v03, location_address_v03 transforms
- Comprehensive integration test suite in tests/pipeline/
- Create resource_images extract module using factory pattern
- Create resource_image transform module with normalization and lineage tracking
- Create resource_image load module with upsert pattern
- Update resource_information flow with proper dependency ordering
- Add ResourceImage to models __init__ exports
- Add comprehensive test suite (16 tests, all passing)
- All pre-commit checks passed
Implements Phase 2 of etl_improvements_plan.md with:
- Extract from Google Sheets worksheet '08.0_Resource_images'
- Transform with resource name normalization to resource_id
- Load with upsert on (resource_id, image_url) unique constraint
- Proper ETL lineage tracking and dependency ordering
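The upsert on the (resource_id, image_url) unique constraint can be illustrated with a self-contained sketch. SQLite stands in for PostgreSQL here so the example runs anywhere; the table layout and the `caption` column are assumptions, and the project's load module may well use SQLAlchemy rather than raw SQL.

```python
import sqlite3
from typing import Iterable, Tuple


def create_schema(conn: sqlite3.Connection) -> None:
    """Create a minimal stand-in for the resource_image table."""
    conn.execute(
        """
        CREATE TABLE resource_image (
            id INTEGER PRIMARY KEY,
            resource_id INTEGER NOT NULL,
            image_url TEXT NOT NULL,
            caption TEXT,  -- hypothetical payload column
            UNIQUE (resource_id, image_url)
        )
        """
    )


# Insert a row, or update the payload if the unique pair already exists.
UPSERT_SQL = """
INSERT INTO resource_image (resource_id, image_url, caption)
VALUES (?, ?, ?)
ON CONFLICT (resource_id, image_url)
DO UPDATE SET caption = excluded.caption
"""


def upsert_images(
    conn: sqlite3.Connection, rows: Iterable[Tuple[int, str, str]]
) -> None:
    """Upsert rows so re-running the load never duplicates an image."""
    conn.executemany(UPSERT_SQL, rows)
    conn.commit()
```

Because the conflict target matches the unique constraint, running the load twice with the same sheet leaves the row count unchanged and only refreshes the non-key columns.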
…nd populate correctly
33 files changed
Hey @petercarbsmith Peter! The ETL seems to be failing for me at multiple junctures:
County Ag Report tests also fail due to missing modules (county_ag_report_record / county_ag_report_observation in transform/load analysis). Could it be an incomplete commit?
I'll take another look tomorrow. Let me know if you want me to resolve the error and upload a commit on my end, or let you add another commit and try again. Thanks!
Hey @mglbleta. I am guessing the first one is failing because you will have to share your credentials.json email with the Samplemetadata V.03 sheet. Not sure what is up with that county ag report record though. Looks like it is having trouble finding the modules. It's all working for me locally, but I can try to push again and see if it fixes it.
Do we want to add these to DB or are they kind of excess? @petercarbsmith
Oh these were me exploring the data with the AI. Actually I should probably just remove these from the repo also!
same with this exploration file
is "explore" part of the ETL or a one off exploration you did?
Yep one off type thing.
Hey Peter! Curious: for other analyses that add to the data_source table, do we anticipate adding them all to this one transform/analysis/data_source file?
This is a good flag and honestly yea this crossed my mind while doing this that it seems like the less scalable way of handling data source. I think we will at some point want to revisit this.
Will we eventually remove the _v03 suffix? I think it's a little unintuitive for the database long term. Not a major issue, just a thought.
Hmmm yea I should probably go in and clean this up. Thanks!



📄 Description
This PR consolidates several critical ETL improvements, bug fixes, and refactors that were previously developed in the etl_fixes branch. It has been rebased onto the latest upstream/main (commit d961724) to ensure compatibility with recent structural changes to the data portal views and Alembic migration workflow.
Key Workstreams Included:
SampleMetadata_v03-BioCirV).
inoculum_volume_l), duplicate record prevention, and Strain Normalization.
✅ Checklist
pre-commit run --all-files and all checks pass
🔗 Related Issues
Resolves # (Multiple internal work items for ETL cleanup and refactoring)
💡 Type of change
🧪 How to test
1. Run Integration Tests
Expected Results: All 42 tests should pass.
2. Run Quality Checks
Expected Results: All checks should pass.
📝 Notes to reviewers
Implementation Highlights:
Sample_IDs, Sample_Desc, Qty_FieldStorage, Producers). Uses multi-way left-joins to ensure all 137 base sample records are preserved.
Place.geoid and handles unit normalizations for crop production.
docker-compose.yml (TZ=UTC) and corrected volume mounts.
BioConv_Method source column, populates a unique strain lookup table, and correctly links FermentationRecord.strain_id during transformation.
Migration Notes:
The PR includes alembic/versions/bd227e99e006_add_fermentation_method_fields_resource_.py which:
strain.name to support idempotent upserts.
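The strain normalization and lookup described in the notes above might be sketched as follows. The real parsing of the BioConv_Method column is not shown in this PR text, so the example assumes the strain name has already been extracted into a hypothetical `strain` field; the class and function names are likewise illustrative.

```python
from typing import Dict, List, Optional


def normalize_strain(raw: Optional[str]) -> Optional[str]:
    """Collapse whitespace; treat empty or missing values as None."""
    if raw is None:
        return None
    cleaned = " ".join(raw.split())
    return cleaned or None


class StrainLookup:
    """In-memory stand-in for a strain table with a unique name."""

    def __init__(self) -> None:
        self._ids: Dict[str, int] = {}

    def get_or_create(self, raw: Optional[str]) -> Optional[int]:
        """Return the id for a strain, creating it on first sight."""
        name = normalize_strain(raw)
        if name is None:
            return None
        key = name.casefold()  # case-insensitive uniqueness (an assumption)
        if key not in self._ids:
            self._ids[key] = len(self._ids) + 1
        return self._ids[key]


def link_strains(
    records: List[dict], lookup: StrainLookup, source_column: str = "strain"
) -> List[dict]:
    """Attach strain_id to each record without mutating the input."""
    return [
        {**record, "strain_id": lookup.get_or_create(record.get(source_column))}
        for record in records
    ]
```

Running `link_strains` twice over the same records yields the same ids, which is the property a unique `strain.name` gives the database-backed version.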