feat: ETL Pipeline Notebooks and Fix LinkML SQLAlchemy generation by petercarbsmith · Pull Request #18 · petercarbsmith/ca-biositing

petercarbsmith · 2026-01-07T21:35:11Z

📄 Description

Key Changes

🔧 LinkML/SQLAlchemy Model Generation Fixes

Fixed Base metadata inheritance: Resolved issues where generated SQLAlchemy models weren't properly inheriting from the shared Base class, preventing Alembic from detecting table changes
Corrected ForeignKey references: Updated post-processing to convert class name references (e.g., 'Resource.id') to proper table name references (e.g., 'resource.id') in snake_case format
Enhanced migration compatibility: Models now properly register with SQLAlchemy metadata, enabling automatic Alembic migration generation
Improved schema generation: Modified generate_sqla.py to handle any ForeignKey column references, not just .id columns
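
The ForeignKey rewrite described above can be sketched as a small post-processing pass over the generated source text. This is a minimal illustration, not the actual code in generate_sqla.py: the function names `to_snake_case` and `fix_foreign_keys` and the regular expressions are assumptions, and it assumes references are written as `ForeignKey('ClassName.column')`.

```python
import re

# Matches ForeignKey('ClassName.column') with single or double quotes.
# Hypothetical pattern; the real generator output may differ.
_FK_PATTERN = re.compile(r"""ForeignKey\((['"])(\w+)\.(\w+)\1""")


def to_snake_case(name: str) -> str:
    """Convert a CamelCase class name to a snake_case table name."""
    # Underscore before a capital that follows a lowercase letter or digit.
    partial = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name)
    # Underscore before the last capital of an acronym run (LUTable -> LU_Table).
    partial = re.sub(r"(?<=[A-Z])(?=[A-Z][a-z])", "_", partial)
    return partial.lower()


def fix_foreign_keys(source: str) -> str:
    """Rewrite class-name ForeignKey targets to table-name targets."""

    def _replace(match: re.Match) -> str:
        quote, class_name, column = match.groups()
        # Any column is handled, not only `.id`.
        return f"ForeignKey({quote}{to_snake_case(class_name)}.{column}{quote}"

    return _FK_PATTERN.sub(_replace, source)


generated = "resource_id = Column(Integer(), ForeignKey('Resource.id'))"
fixed = fix_foreign_keys(generated)
```

Because the pattern captures the column name generically, a reference such as `'PrimaryAgProduct.code'` is rewritten the same way as one ending in `.id`.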

📊 ETL Pipeline Improvements & New Notebooks

New Interactive ETL Notebooks

etl_notebook.ipynb: Clean, streamlined version of the Google Sheets extraction workflow with improved error handling and data validation
gsheet_extraction_notebook.ipynb: Comprehensive notebook containing additional helper functions for:
- Advanced Google Sheets API interactions
- Data transformation utilities
- Error recovery mechanisms
- Batch processing capabilities

ETL Pipeline Enhancements

Primary agricultural product pipeline: Fixed and enhanced the core ETL workflow for primary_ag_product
Data extraction workflows: Improved Google Sheets integration with better authentication handling
Database interaction utilities: New functions for efficient data insertion and lookup operations
Development testing tools: Added notebooks for testing ETL components in isolation

The most useful bit: @mglbleta and @avi9664, please have a look at etl_notebook.ipynb. In essence it does four things:

Extracts gsheet data into pandas. For other sheets you will need to write an extract.py script and import it; see the extract template for how to do this.
Cleans the data frames: it makes column names lower_case, coerces data types (this may need to be modified for other sheets), and drops missing data.
Replaces names with ids: for normalization, it takes a value and checks whether it is already in the lookup table. If it is, it returns the existing id; if not, it creates an entry in the table and returns the new id. This can populate many tables at once.
Returns the normalized data frames.
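
The "replace names with id" step can be sketched as a get-or-create lookup. This is only an illustration under stated assumptions: the notebook works on pandas data frames against the project database, whereas this sketch uses plain dicts and an in-memory SQLite table to stay self-contained. The names `get_or_create_id`, `replace_names_with_ids`, and the `crop` table are hypothetical.

```python
import sqlite3


def get_or_create_id(conn: sqlite3.Connection, table: str, name: str) -> int:
    """Return the id for `name` in a lookup table, inserting a row if needed."""
    # Table names cannot be bound as parameters, so pass only trusted names.
    row = conn.execute(
        f"SELECT id FROM {table} WHERE name = ?", (name,)
    ).fetchone()
    if row is not None:
        return row[0]  # already in the lookup table
    cursor = conn.execute(f"INSERT INTO {table} (name) VALUES (?)", (name,))
    return cursor.lastrowid  # id of the newly created entry


def replace_names_with_ids(conn, rows, column, table):
    """Swap the text in `column` for a lookup id stored in `<column>_id`."""
    normalized = []
    for row in rows:
        new_row = {key: value for key, value in row.items() if key != column}
        new_row[f"{column}_id"] = get_or_create_id(conn, table, row[column])
        normalized.append(new_row)
    return normalized


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE crop (id INTEGER PRIMARY KEY, name TEXT UNIQUE)")
rows = [
    {"crop": "almond", "tons": 10},
    {"crop": "rice", "tons": 4},
    {"crop": "almond", "tons": 7},
]
normalized = replace_names_with_ids(conn, rows, "crop", "crop")
lookup_count = conn.execute("SELECT COUNT(*) FROM crop").fetchone()[0]
```

The repeated "almond" value reuses the id created on its first appearance, so the lookup table ends up with one row per distinct name.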

My hope is that this can be used as a model for how to handle everything from extract to transform. It also serves as an example of how to do imports and work within a Jupyter notebook, which will be the most efficient way to see the results of your code without having to rebuild the containers constantly. Eventually, we will transition code out of the notebooks and into .py modules for production.

🏗️ Infrastructure & Database Configuration

Database configuration: Added database.py and config.py to the datamodels package for centralized database management
Schema expansion: Extended LinkML schema to include infrastructure models (processing facilities, energy systems, etc.)
Local development setup: Configured database connections for localhost development environment

📝 Development Experience

Interactive data exploration: New notebooks serve as both development tools and documentation for ETL processes
Testing utilities: Enhanced lookup function testing and validation
Container development: Attempted improvements to development container setup (kernel issues noted for future resolution)

Technical Implementation

Model Generation Fixes

```python
# Before: Models used separate Base instances
Base = declarative_base()  # Created new metadata instance

# After: All models share the same Base
from ...database import Base  # Uses shared metadata instance
```

## ✅ Checklist

- [x] I ran `pre-commit run --all-files` and all checks pass
- [ ] Tests added/updated where needed
- [ ] Docs added/updated if applicable (not yet!)
- [ ] I have linked the issue this PR closes (if any)

## 🔗 Related Issues

Resolves #\<issue-number>

## 💡 Type of change

| Type             | Checked? |
| ---------------- | -------- |
| 🐞 Bug fix       | [ ]      |
| ✨ New feature   | [ ]      |
| 📝 Documentation | [ ]      |
| ♻️ Refactor      | [ ]      |
| 🛠️ Build/CI      | [ ]      |
| Other (explain)  | [ ]      |

## 🧪 How to test

See the README, then try running the pipeline; it should work now.

## 📝 Notes to reviewers

Have a look, but we can discuss more next week!

avi9664 · 2026-01-08T01:03:31Z

Hi! When I'm trying to migrate I'm unable to access these:

(I changed a couple of them to test)

avi9664 · 2026-01-09T07:03:29Z

Hi! When I'm trying to migrate I'm unable to access these:

(I changed a couple of them to test)

Fixed! Just had to comment out `from ca_biositing.datamodels.database import Base`

avi9664 · 2026-01-09T07:06:16Z

-# --- Import generated models and their metadata ---
-# from ca_biositing.datamodels.schemas.generated.census_survey import metadata as census_metadata
-# from ca_biositing.datamodels.schemas.generated.geography import metadata as geography_metadata
+from ca_biositing.datamodels.database import Base


note: migration only works for me if this is commented out. that's just a me thing though, don't know if this applies to others

avi9664

I checked the ETL pipeline part of it and it's good!

avi9664

Actually, I'm encountering problems accessing the database in the .ipynb files - I'll test them out within the next few days and let you know what's happening

avi9664 · 2026-01-11T04:46:08Z

Hi! I ran the pipeline notebooks and they couldn't access the database because the DATABASE_URL that was used didn't work with mine, so I changed engine.py and a bit of gsheet_extraction_notebook.ipynb to get credentials from your .env file. So I think it should work for everyone now regardless of what you put in your .env? Please let me know if it doesn't.
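
A minimal sketch of reading credentials from a .env file instead of hard-coding a URL is shown below. It is not the actual engine.py change: the variable names (`POSTGRES_USER`, `POSTGRES_PASSWORD`, `POSTGRES_HOST`, `POSTGRES_PORT`, `POSTGRES_DB`), the PostgreSQL URL scheme, and all function names are assumptions for illustration, and the tiny parser stands in for a library such as python-dotenv.

```python
import os
from urllib.parse import quote_plus


def load_env_file(path):
    """Parse simple KEY=VALUE lines from a .env file into a dict."""
    values = {}
    if not os.path.exists(path):
        return values
    with open(path, encoding="utf-8") as handle:
        for raw in handle:
            line = raw.strip()
            # Skip blanks, comments, and lines without an assignment.
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            values[key.strip()] = value.strip().strip("'\"")
    return values


def build_database_url(env):
    """Prefer an explicit DATABASE_URL, else assemble one from its parts."""
    if env.get("DATABASE_URL"):
        return env["DATABASE_URL"]
    # Hypothetical variable names; adjust to whatever the .env defines.
    user = quote_plus(env["POSTGRES_USER"])
    password = quote_plus(env["POSTGRES_PASSWORD"])
    host = env.get("POSTGRES_HOST", "localhost")
    port = env.get("POSTGRES_PORT", "5432")
    return f"postgresql://{user}:{password}@{host}:{port}/{env['POSTGRES_DB']}"


def resolve_database_url(env_path=".env"):
    """Merge the .env file with the process environment and build the URL."""
    # Real environment variables take precedence over values in the file.
    env = {**load_env_file(env_path), **os.environ}
    return build_database_url(env)
```

URL-quoting the user and password keeps characters such as `@` in a password from breaking the connection string.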

Some of the code in the latter half of gsheet_extraction_notebook.ipynb threw some errors (which I included in the latest commit), but I assumed that they were helper functions and left them alone. Let me know if you want me to fix them though! I'd be down :)

petercarbsmith added 9 commits December 22, 2025 14:55

adding comment to init.py to try to solve CI CD issues

5574630

fixed primary_ag_product etl pipeline, modified master flow script and docs

e0947fc

holiday work. New notebook for importing gsheeets to pandas and playing around

fd235be

feat: notebook for gsheets extraction, data playground, and pk id lookup insertion.

907686b

modifying name_id_swap_function

5e961aa

was trying to fix dev container kernel issue. Did not succeed. Also made new ETL notebook and modified some of the gsheet extraction notebook

6d7990b

modified etl_notebook. Db is now running on localhost

f4b06ec

"did the module refactor but may need to mess with the alembic env.py. Committing before that"

5f1e0a4

fixed the alembic import problems. models now have fk

234a235

petercarbsmith requested review from avi9664 and mglbleta January 7, 2026 21:35

avi9664 reviewed Jan 9, 2026

avi9664 approved these changes Jan 9, 2026

avi9664 requested changes Jan 9, 2026

modified engine.py to get correct database url from env

b08f15e

petercarbsmith merged commit 7ab3102 into main Jan 16, 2026
1 of 7 checks passed

petercarbsmith deleted the Peter-linkml-refactor branch January 16, 2026 03:08

petercarbsmith merged 10 commits into main from Peter-linkml-refactor
