feat: Multiple ETL refinements and fixes by petercarbsmith · Pull Request #260 · sustainability-software-lab/ca-biositing

petercarbsmith · 2026-04-13T02:50:10Z

📄 Description

This PR consolidates several critical ETL improvements, bug fixes, and refactors that were previously developed in the etl_fixes branch. It has been rebased onto the latest upstream/main (commit d961724) to ensure compatibility with recent structural changes to the data portal views and Alembic migration workflow.

Key Workstreams Included:

Field Sample ETL v03 Refactor: Migration from the legacy single-worksheet architecture to a robust, multi-worksheet strategy (SampleMetadata_v03-BioCirV).
County Ag Report ETL: Complete pipeline for extracting and loading California county agricultural production data.
Resource Images ETL: New infrastructure for managing and linking images to resource records.
Fermentation & Pretreatment Fixes: Improved normalization, unit handling (e.g., inoculum_volume_l), duplicate record prevention, and Strain Normalization.
Flow Orchestration: Updated master ETL flow to include new pipelines.

✅ Checklist

I ran pre-commit run --all-files and all checks pass
Tests added/updated where needed (42 tests passed)
Docs added/updated if applicable
I have linked the issue this PR closes (if any)

🔗 Related Issues

Resolves # (Multiple internal work items for ETL cleanup and refactoring)

💡 Type of change

Type	Checked?
🐞 Bug fix	[x]
✨ New feature	[x]
📝 Documentation	[ ]
♻️ Refactor	[x]
🛠️ Build/CI	[ ]
Other (explain)	[ ]

🧪 How to test

1. Run Integration Tests

pixi run test tests/pipeline/test_field_sample_v03_integration.py
pixi run test tests/pipeline/test_county_ag_report_etl.py
pixi run test tests/pipeline/test_fermentation_record_etl.py
pixi run test tests/pipeline/test_resource_images_etl.py

Expected Results: All 42 tests should pass.

2. Run Quality Checks

pixi run pre-commit-all

Expected Results: All checks should pass.

📝 Notes to reviewers

Implementation Highlights:

Field Sample v03: Normalizes data across four worksheets (Sample_IDs, Sample_Desc, Qty_FieldStorage, Producers). Uses multi-way left-joins to ensure all 137 base sample records are preserved.
County Ag Report: Maps county names to Place.geoid and handles unit normalizations for crop production.
Prefect Infrastructure: Fixed timezone errors in docker-compose.yml (TZ=UTC) and corrected volume mounts.
Data Integrity: Added empty-dataframe guards in pretreatment transforms and fixed unit normalization case mismatches in fermentation tests.
Strain Normalization: Implemented a robust mapping for fermentation strains. The pipeline now extracts strain names from the BioConv_Method source column, populates a unique strain lookup table, and correctly links FermentationRecord.strain_id during transformation.
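
The multi-way left-join described for Field Sample v03 can be sketched roughly as below. This is a minimal illustration, not the PR's actual transform: the `sample_id` key, the other column names, and the tiny data frames are all assumptions standing in for the four worksheets.

```python
import pandas as pd

# Hypothetical miniature versions of the four worksheets. The real column
# names are not shown in this PR, so "sample_id" and friends are assumptions.
sample_ids = pd.DataFrame({"sample_id": ["S1", "S2", "S3"]})
sample_desc = pd.DataFrame(
    {"sample_id": ["S1", "S2"], "description": ["almond hulls", "rice straw"]}
)
qty_field_storage = pd.DataFrame({"sample_id": ["S1", "S3"], "qty_kg": [2.5, 1.0]})
producers = pd.DataFrame({"sample_id": ["S2"], "producer": ["Farm A"]})


def join_worksheets(
    base: pd.DataFrame, others: list[pd.DataFrame], key: str = "sample_id"
) -> pd.DataFrame:
    """Left-join each worksheet onto the base so no base row is dropped."""
    merged = base
    for other in others:
        # validate="one_to_one" raises if a worksheet repeats a key, which
        # would otherwise silently multiply base rows.
        merged = merged.merge(other, on=key, how="left", validate="one_to_one")
    # Guard: a chain of left joins must keep the base row count.
    assert len(merged) == len(base)
    return merged


combined = join_worksheets(sample_ids, [sample_desc, qty_field_storage, producers])
print(combined)
```

Rows missing from a secondary worksheet keep their base record and get `NaN` in the joined columns, which is what lets the base record count (137 in the real data) survive the join.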

Migration Notes:

The PR includes alembic/versions/bd227e99e006_add_fermentation_method_fields_resource_.py which:

Adds the necessary fields for fermentation and resource images.
Adds a unique constraint on strain.name to support idempotent upserts.
Is compatible with the current database head.
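
To illustrate why the unique constraint on strain.name matters for idempotent upserts, here is a small sketch using the standard-library `sqlite3` module as a stand-in for the project's PostgreSQL database. The table layout, the `upsert_strains` helper, and the sample strain names are assumptions for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# The UNIQUE constraint on name is what makes ON CONFLICT(name) usable.
conn.execute(
    "CREATE TABLE strain (id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE)"
)


def upsert_strains(conn: sqlite3.Connection, names: list[str]) -> dict[str, int]:
    """Insert unseen strain names and return a name -> id mapping."""
    cleaned = sorted({n.strip() for n in names if n and n.strip()})
    conn.executemany(
        "INSERT INTO strain (name) VALUES (?) ON CONFLICT(name) DO NOTHING",
        [(n,) for n in cleaned],
    )
    return dict(conn.execute("SELECT name, id FROM strain").fetchall())


# Hypothetical values standing in for strains parsed out of BioConv_Method.
first = upsert_strains(conn, ["S. cerevisiae", "E. coli ", "S. cerevisiae"])
second = upsert_strains(conn, ["E. coli", "Z. mobilis"])  # rerun with overlap

# Linking a record to its strain is then a dictionary lookup.
record = {"strain_name": "E. coli"}
record["strain_id"] = second[record["strain_name"]]
```

Running the load twice leaves existing rows and their ids untouched, so foreign keys such as FermentationRecord.strain_id stay stable across ETL reruns.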

- Add two actual migrations (drop incumbents, recreate test view)
- Create alembic/AGENTS.md with migration template patterns
- Create DATA_PORTAL_VIEWS_REFACTOR.md comprehensive guide
- Create Phase 5 next steps plan documenting remaining tasks
- All views ready for one-by-one recreation with new modular approach
- Readonly user permissions and indexes documented

This commit addresses the fragility of SQLAlchemy-generated migrations when replaying from scratch (teardown→rebuild scenarios).

Problem: When SQLAlchemy models are imported at migration replay time, if schema has changed since the migration was created, the view fails to build and breaks the entire migration chain.

Solution: Embed raw SQL as immutable strings in migration files. This is the industry-standard pattern (Liquibase, Flyway, major Alembic projects).

Changes:

1. alembic/AGENTS.md - UPDATED
   - Clarified that raw SQL snapshots are the recommended approach
   - Added section explaining why (teardown→rebuild safety)
   - Documented both recommended pattern (raw SQL) and legacy pattern (imports)
   - Updated key patterns section
2. alembic/versions/9e8f7a6b5c4d_drop_incumbent_data_portal_views.py - FIXED
   - Changed down_revision from '63c0fedd3446' to '60b08397200f'
   - Resolved alembic multiple heads issue
3. alembic/versions/9e8f7a6b5c4e_recreate_mv_biomass_search_with_raw_sql.py - NEW
   - Example migration showing raw SQL snapshot pattern
   - Demonstrates DROP → COMPILE → CREATE → INDEX → GRANT pattern
   - SQL is embedded as immutable string, not runtime-evaluated
4. alembic/VIEW_SQL_REFERENCE.md - NEW
   - Reference documentation for all compiled view SQL
   - Copy from here when creating new migrations
   - Includes indexes for each view
5. scripts/extract_view_sql.py - NEW
   - Utility to extract compiled SQL from SQLAlchemy view definitions
   - Run this when view definitions change and you need to update migrations
6. scripts/generate_raw_sql_migration.py - NEW
   - Helper script for generating migration templates with raw SQL

Key Benefits:

- Migrations work on any replay, even with future schema changes
- Full audit trail via migration history
- Industry-standard approach
- No runtime dependency on current SQLAlchemy definitions
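
The raw SQL snapshot pattern from this commit might look roughly like the sketch below. It is not the contents of the real migration file: the view body, the index column, and the `readonly_user` role name are placeholders, and the compile step is assumed to have happened before the migration was written, leaving only a frozen string. The statements are built as plain strings so the ordering can be checked without a database.

```python
# Frozen at the time the migration is written; never imported from models.
# The view body is a placeholder, not the project's real definition.
MV_NAME = "mv_biomass_search"
MV_SQL = "SELECT 1 AS id"
INDEX_COLUMN = "id"  # assumption: hypothetical unique column
READONLY_ROLE = "readonly_user"  # assumption: hypothetical role name


def upgrade_statements() -> list[str]:
    """Return the DROP -> CREATE -> INDEX -> GRANT statements in order."""
    return [
        f"DROP MATERIALIZED VIEW IF EXISTS {MV_NAME}",
        f"CREATE MATERIALIZED VIEW {MV_NAME} AS {MV_SQL}",
        f"CREATE UNIQUE INDEX IF NOT EXISTS idx_{MV_NAME}_{INDEX_COLUMN} "
        f"ON {MV_NAME} ({INDEX_COLUMN})",
        f"GRANT SELECT ON {MV_NAME} TO {READONLY_ROLE}",
    ]


def upgrade(execute=print) -> None:
    """Run each statement; in an Alembic migration, pass op.execute here."""
    for stmt in upgrade_statements():
        execute(stmt)


upgrade()  # prints the four statements instead of touching a database
```

Because the SQL lives in the migration as a literal string, replaying the chain on a fresh database does not depend on what the SQLAlchemy models look like at replay time.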

- Consolidated migration: 9e8f7a6b5c4f_recreate_remaining_8_views_with_raw_sql.py
  - Recreates all 8 remaining materialized views with raw SQL snapshots
  - Single atomic operation (safer than 8 individual migrations)
  - Follows pattern: DROP → CREATE → INDEX → GRANT
  - Syntax verified and ready for application
- Generator script: scripts/generate_view_migrations.py
  - Demonstrates automated migration generation approach
  - Reference for future view migrations if needed

All 8 views included in consolidation:

- mv_biomass_availability
- mv_biomass_composition
- mv_biomass_county_production
- mv_biomass_sample_stats
- mv_biomass_fermentation
- mv_biomass_gasification
- mv_biomass_pricing
- mv_usda_county_production

Previous individual migrations cleaned up (now deleted):

- 9e8f7a6b5c4d_drop_incumbent_data_portal_views.py
- 9e8f7a6b5c4d_recreate_mv_biomass_search_with_modular_approach.py
- 9e8f7a6b5c4e_recreate_mv_biomass_search_with_raw_sql.py

Changed production_energy_content_unit_id to energy_content_unit_id to match the actual database schema in billion_ton2023_record table.

PostgreSQL GRANT syntax updated to explicitly grant SELECT on each materialized view individually rather than using bulk ALL syntax.

Views granted permissions:

- mv_biomass_availability
- mv_biomass_composition
- mv_biomass_county_production
- mv_biomass_sample_stats
- mv_biomass_fermentation
- mv_biomass_gasification
- mv_biomass_pricing
- mv_usda_county_production

Migration 9e8f7a6b5c4f now applies successfully.

- Added TZ=UTC environment variable to prefect-server and prefect-worker
- Added /etc/timezone and /etc/localtime volume mounts for timezone support
- Fixes 'whenever.TimeZoneNotFoundError: No time zone found at path /etc/localtime' when running Prefect flows

This resolves the issue when attempting to run ETL flows via Prefect CLI.

- Create comprehensive integration test suite (18 tests covering extract, transform, load)
- Add pytest fixtures with realistic mock data (137, 104, 130, 64 rows)
- Register flow with run_prefect_flow.py orchestrator
- Execute flow with real Google Sheets data - all extractors and transforms successful
- Fix critical provider_id population bug: normalize column name 'providercode' (no underscore)
- Pass all pre-commit quality checks (linting, formatting, spell check, YAML validation)
- Test validation: multi-way join preserves all 137 base records, LocationAddress deduplication working, field extraction quality verified

- Remove deprecated src/ca_biositing/pipeline/etl/extract/samplemetadata.py
- Remove old v01/v02 transform files:
  - src/ca_biositing/pipeline/etl/transform/field_sampling/field_sample.py
  - src/ca_biositing/pipeline/etl/transform/field_sampling/location_address.py
- Remove associated old unit tests:
  - src/ca_biositing/pipeline/tests/test_field_sample_transform.py
  - src/ca_biositing/pipeline/tests/test_location_address_transform.py

v03 extractors and transforms are now the canonical implementation:

- sample_ids, sample_desc, qty_field_storage, producers extractors
- field_sample_v03, location_address_v03 transforms
- Comprehensive integration test suite in tests/pipeline/

- Create resource_images extract module using factory pattern
- Create resource_image transform module with normalization and lineage tracking
- Create resource_image load module with upsert pattern
- Update resource_information flow with proper dependency ordering
- Add ResourceImage to models __init__ exports
- Add comprehensive test suite (16 tests, all passing)
- All pre-commit checks passed

Implements Phase 2 of etl_improvements_plan.md with:

- Extract from Google Sheets worksheet '08.0_Resource_images'
- Transform with resource name normalization to resource_id
- Load with upsert on (resource_id, image_url) unique constraint
- Proper ETL lineage tracking and dependency ordering
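
A rough sketch of the transform step described in this commit follows, assuming a raw frame with `resource` and `image_url` columns and a name-to-id lookup built from the resource table. The function names, column names, and normalization rule are illustrative assumptions, not the module's real API. It also shows an empty-dataframe guard of the kind mentioned in the PR notes.

```python
import pandas as pd


def normalize_name(value: object) -> str:
    """Lower-case and collapse whitespace so names compare reliably."""
    return " ".join(str(value).lower().split())


def transform_resource_images(
    raw: pd.DataFrame, resource_ids: dict[str, int]
) -> pd.DataFrame:
    """Map resource names to ids and dedupe on the upsert key."""
    columns = ["resource_id", "image_url"]
    if raw.empty:
        # Guard: return a correctly shaped empty frame instead of failing.
        return pd.DataFrame(columns=columns)
    out = raw.copy()
    out["resource_id"] = out["resource"].map(normalize_name).map(resource_ids)
    # Rows whose resource name has no match cannot be linked; drop them.
    out = out.dropna(subset=["resource_id", "image_url"])
    out["resource_id"] = out["resource_id"].astype(int)
    # Dedupe on the same key the load step upserts on.
    return out[columns].drop_duplicates(subset=columns).reset_index(drop=True)


# Hypothetical input rows and lookup, for illustration only.
raw = pd.DataFrame(
    {
        "resource": ["Almond Hulls", " almond  hulls", "Unknown"],
        "image_url": [
            "https://example.org/a.jpg",
            "https://example.org/a.jpg",
            "https://example.org/b.jpg",
        ],
    }
)
result = transform_resource_images(raw, {"almond hulls": 7})
print(result)
```

Deduplicating on (resource_id, image_url) before the load keeps a single batch from containing two rows that would hit the same unique constraint.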

read-the-docs-community · 2026-04-13T02:53:20Z

Documentation build overview

📚 ca-biositing | 🛠️ Build #32264608 | 📁 Comparing 3a320cb against latest (035a366)

🔍 Preview build

33 files changed · ± 29 modified · - 4 deleted

mglbleta · 2026-04-14T01:55:38Z

Hey @petercarbsmith Peter!

The ETL seems to be failing for me at multiple junctures:

County Ag Report tests also fail due to missing modules (county_ag_report_record / county_ag_report_observation in transform/load analysis). Could it be an incomplete commit?

I'll take another look tomorrow. Let me know if you want me to resolve the error and upload commit on my end or let you add another commit and try again. Thanks!

petercarbsmith · 2026-04-14T02:57:58Z

Hey @mglbleta. I am guessing the first one is failing because you will have to share your credentials.json email with the Samplemetadata V.03 sheet. Not sure what is up with that county ag report record though. Looks like it is having trouble finding the modules. It's all working for me locally, but I can try to push again and see if it fixes it.

mglbleta

Hey Peter! Finished review. Nothing critical, just some questions from me. You or I can merge after you see my notes.

mglbleta · 2026-04-14T21:33:55Z

exports/sample_metadata_v03_exploration_20260407_165121.json

Do we want to add these to DB or are they kind of excess? @petercarbsmith

Oh these were me exploring the data with the AI. Actually I should probably just remove these from the repo also!

mglbleta · 2026-04-14T21:34:12Z

exports/sample_metadata_v03_exploration_20260407_165121.txt

same with this exploration file

mglbleta · 2026-04-14T21:35:04Z

scripts/explore_sample_metadata_v03.py

is "explore" part of the ETL or a one off exploration you did?

Yep one off type thing.

mglbleta · 2026-04-14T21:50:29Z

src/ca_biositing/pipeline/ca_biositing/pipeline/etl/transform/analysis/data_source.py

Hey Peter! Curious for other analyses adding to the data_source table, do we anticipate adding to this one transform/analysis/data_source file?

This is a good flag and honestly yea this crossed my mind while doing this that it seems like the less scalable way of handling data source. I think we will at some point want to revisit this.

mglbleta · 2026-04-14T21:51:33Z

...iositing/pipeline/ca_biositing/pipeline/etl/transform/field_sampling/location_address_v03.py

Will we eventually remove the _v03 suffix? I think it's a little unintuitive long term for the database. Not a major issue, just a thought.

Hmmm yea I should probably go in and clean this up. Thanks!

petercarbsmith added 30 commits April 3, 2026 19:53

phase 1 refactor creating individual modules for each data_portal_view

74c511e

phase 2 all imports are verified working

73f8623

fix: Correct column name in mv_biomass_county_production view

90bb531

Changed production_energy_content_unit_id to energy_content_unit_id to match the actual database schema in billion_ton2023_record table.

finally have immutable view and index creation. New tables from Mei P…

967f810

…R incorporated

cleaning up some documentation

d550641

adding qc filtering to views to not include fail results

2f19df1

fixing migration issue with squashed data_portal stuff

c72e37e

Add b3f2d1c8e9a0 api_key table migration in correct sequence and upda…

cc11e75

…te f98d1a9fe9a7 parent

Merge branch 'main' of https://github.com/sustainability-software-lab…

82a28db

…/ca-biositing into fix-views_alembic_refactor

fix: Apply pre-commit formatting corrections

36c5a47

fixing refresh_views issue with no unique constraint on some views

ab72cd9

fixing up some pretreatment etl problems.

e4e753f

phase one of new etl plan. Creates sql models and migrations

e8788b6

final fix to fermentation_record and resource_image. Flows now work a…

109f510

…nd populate correctly

feat: etl pipeline for county ag report record built and working well

9565352

bug: fixing dataset in observation to populate for county reports

268c55a

adding ag report test and turning all the flows back on

4320bd6

fix-fermentation record duplicate issue and mounting volumes to docke…

6743407

…r container

turning back on all flows, fixing county_ag_report

ecd888c

fixing tests for test_fermenetation

2ab2525

feat: consolidate etl_fixes and rebase onto upstream main

0f86863

implementing strain normalization for fermentation_record

fdd7570

petercarbsmith requested a review from mglbleta April 13, 2026 02:50

petercarbsmith temporarily deployed to testpypi April 13, 2026 02:52 — with GitHub Actions Inactive

bug: attempting to fix migrations CI failure

bf884c8

petercarbsmith temporarily deployed to testpypi April 13, 2026 03:16 — with GitHub Actions Inactive

bug: it was a gitignore problem! Sorry about that. Everything should b…

1a03b6c

…e present now!

petercarbsmith temporarily deployed to testpypi April 14, 2026 21:08 — with GitHub Actions Inactive

mglbleta approved these changes Apr 14, 2026

addressing reviewer comments to clean up

3a320cb

petercarbsmith temporarily deployed to testpypi April 14, 2026 22:57 — with GitHub Actions Inactive

petercarbsmith merged commit bd0ec5e into main Apr 14, 2026
22 checks passed
