
feat: ETL Pipeline Notebooks and Fix LinkML SQLAlchemy generation #18

Merged
petercarbsmith merged 10 commits into main from Peter-linkml-refactor
Jan 16, 2026

@petercarbsmith
Owner

📄 Description

Key Changes

🔧 LinkML/SQLAlchemy Model Generation Fixes

  • Fixed Base metadata inheritance: Resolved issues where generated SQLAlchemy models weren't properly inheriting from the shared Base class, preventing Alembic from detecting table changes
  • Corrected ForeignKey references: Updated post-processing to convert class name references (e.g., 'Resource.id') to proper table name references (e.g., 'resource.id') in snake_case format
  • Enhanced migration compatibility: Models now properly register with SQLAlchemy metadata, enabling automatic Alembic migration generation
  • Improved schema generation: Modified generate_sqla.py to handle any ForeignKey column references, not just .id columns
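
The ForeignKey post-processing described above can be sketched roughly as follows. This is a minimal illustration, not the actual contents of generate_sqla.py: the names `fix_foreign_keys` and `to_snake_case` are hypothetical, and it assumes table names are simply the snake_case form of the class names.

```python
import re

# Matches ForeignKey('ClassName.column' with single or double quotes.
# The column part is arbitrary, so this is not limited to ".id".
FK_PATTERN = re.compile(r"""ForeignKey\((['"])([A-Za-z_]\w*)\.(\w+)\1""")


def to_snake_case(name: str) -> str:
    """Convert CamelCase to snake_case (hypothetical helper)."""
    # Underscore between a lowercase letter or digit and a following capital.
    partial = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name)
    # Underscore before the last capital of an acronym run.
    partial = re.sub(r"(?<=[A-Z])(?=[A-Z][a-z])", "_", partial)
    return partial.lower()


def fix_foreign_keys(source: str) -> str:
    """Rewrite class-name ForeignKey targets to table-name targets."""

    def _replace(match: re.Match) -> str:
        quote, class_name, column = match.groups()
        return f"ForeignKey({quote}{to_snake_case(class_name)}.{column}{quote}"

    return FK_PATTERN.sub(_replace, source)


generated = "resource_id = Column(Integer(), ForeignKey('Resource.id'))"
print(fix_foreign_keys(generated))
```

References that are already in snake_case pass through unchanged, so the step is safe to run more than once on the same file.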

📊 ETL Pipeline Improvements & New Notebooks

New Interactive ETL Notebooks

  • etl_notebook.ipynb: Clean, streamlined version of the Google Sheets extraction workflow with improved error handling and data validation
  • gsheet_extraction_notebook.ipynb: Comprehensive notebook containing additional helper functions for:
    • Advanced Google Sheets API interactions
    • Data transformation utilities
    • Error recovery mechanisms
    • Batch processing capabilities

ETL Pipeline Enhancements

  • Primary agricultural product pipeline: Fixed and enhanced the core ETL workflow for primary_ag_product
  • Data extraction workflows: Improved Google Sheets integration with better authentication handling
  • Database interaction utilities: New functions for efficient data insertion and lookup operations
  • Development testing tools: Added notebooks for testing ETL components in isolation

The most useful bit: @mglbleta and @avi9664, please have a look at etl_notebook.ipynb. In essence, it performs a few operations:

  1. extracts gsheet data into pandas. For other sheets, however, you will need to write an extract.py script and import it. See the extract template for how to do this.
  2. cleans the data frames - it makes column names lower_case, coerces data types (this may need to be modified for other sheets), and drops missing data.
  3. replaces names with ids - for normalization, this takes a value and checks whether it is already in the lookup table. If yes, it returns the id; if no, it creates an entry in the table and returns the new id. This lets you populate many tables at once.
  4. returns the normalized data frames.
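
The cleaning step (step 2) might look something like the sketch below. It is only an illustration under assumptions: `clean_frame`, the sample columns, and the choice to replace spaces with underscores are all hypothetical and not taken from the notebook.

```python
import pandas as pd


def clean_frame(df: pd.DataFrame, numeric_columns: list[str]) -> pd.DataFrame:
    """Lower-case column names, coerce numeric columns, drop incomplete rows."""
    cleaned = df.copy()
    # Normalize column names: strip, lower-case, spaces to underscores (assumed convention).
    cleaned.columns = [str(c).strip().lower().replace(" ", "_") for c in cleaned.columns]
    # Coerce the listed columns to numbers; values that cannot be parsed become NaN.
    for col in numeric_columns:
        cleaned[col] = pd.to_numeric(cleaned[col], errors="coerce")
    # Drop any row that still has a missing value.
    return cleaned.dropna().reset_index(drop=True)


raw = pd.DataFrame(
    {"Product Name": ["almond", "walnut", None], "Yield": ["1.5", "n/a", "2.0"]}
)
tidy = clean_frame(raw, numeric_columns=["yield"])
print(tidy)
```

The list of columns to coerce is passed in explicitly because, as noted above, the type coercion will probably differ from sheet to sheet.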

My hope is that this can be used as a model for how to handle everything from extract to transform. It also serves as an example of how to do imports and work within a Jupyter notebook, which will be the most efficient way to see the results of your code without having to rebuild the containers constantly. Eventually, we will transition code out of the notebooks and into .py modules for production.
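
The "replace names with ids" step in the list above is a get-or-create lookup. A minimal sketch of the idea follows, using the standard-library sqlite3 module as a stand-in for the project's database. The function name `get_or_create_id` and the two-column table layout are assumptions made for illustration.

```python
import sqlite3


def get_or_create_id(conn: sqlite3.Connection, table: str, name: str) -> int:
    """Return the id for `name` in a lookup table, inserting a row if it is missing."""
    # `table` is interpolated into the SQL text, so it must come from trusted code.
    row = conn.execute(f"SELECT id FROM {table} WHERE name = ?", (name,)).fetchone()
    if row is not None:
        return row[0]
    cursor = conn.execute(f"INSERT INTO {table} (name) VALUES (?)", (name,))
    return cursor.lastrowid


conn = sqlite3.connect(":memory:")
# Hypothetical lookup table: an integer key plus a unique name.
conn.execute("CREATE TABLE primary_ag_product (id INTEGER PRIMARY KEY, name TEXT UNIQUE)")

names = ["almond", "walnut", "almond"]
ids = [get_or_create_id(conn, "primary_ag_product", n) for n in names]
print(ids)  # the first and third ids are equal because "almond" is reused
```

Applying such a function to a name column of a data frame yields the id column for the normalized table while filling the lookup table as a side effect.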

🏗️ Infrastructure & Database Configuration

  • Database configuration: Added database.py and config.py to the datamodels package for centralized database management
  • Schema expansion: Extended LinkML schema to include infrastructure models (processing facilities, energy systems, etc.)
  • Local development setup: Configured database connections for localhost development environment

📝 Development Experience

  • Interactive data exploration: New notebooks serve as both development tools and documentation for ETL processes
  • Testing utilities: Enhanced lookup function testing and validation
  • Container development: Attempted improvements to development container setup (kernel issues noted for future resolution)

Technical Implementation

Model Generation Fixes

# Before: Models used separate Base instances
Base = declarative_base()  # Created new metadata instance

# After: All models share the same Base
from ...database import Base  # Uses shared metadata instance
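
To illustrate why the shared Base matters: Alembic autogenerate compares the database against a single MetaData object (`target_metadata` in env.py), and each `declarative_base()` call creates its own MetaData. The sketch below uses hypothetical model names and assumes SQLAlchemy 1.4 or later.

```python
from sqlalchemy import Column, Integer
from sqlalchemy.orm import declarative_base

# Stand-in for the shared Base exported by the database module.
SharedBase = declarative_base()
# Stand-in for a Base created separately inside a generated module.
SeparateBase = declarative_base()


class Resource(SharedBase):
    __tablename__ = "resource"
    id = Column(Integer, primary_key=True)


class Orphan(SeparateBase):
    __tablename__ = "orphan"
    id = Column(Integer, primary_key=True)


# Only tables registered on the shared metadata would be visible to
# Alembic when env.py sets target_metadata = SharedBase.metadata.
print(sorted(SharedBase.metadata.tables))
print(sorted(SeparateBase.metadata.tables))
```

The `Orphan` table never appears in `SharedBase.metadata`, which is the kind of mismatch that would keep Alembic from detecting table changes.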

## ✅ Checklist

- [x] I ran `pre-commit run --all-files` and all checks pass
- [ ] Tests added/updated where needed
- [ ] Docs added/updated if applicable (not yet!)
- [ ] I have linked the issue this PR closes (if any)

## 🔗 Related Issues

Resolves #\<issue-number>

## 💡 Type of change

| Type             | Checked? |
| ---------------- | -------- |
| 🐞 Bug fix       | [ ]      |
| New feature      | [ ]      |
| 📝 Documentation | [ ]      |
| ♻️ Refactor      | [ ]      |
| 🛠️ Build/CI      | [ ]      |
| Other (explain)  | [ ]      |

## 🧪 How to test

See the README, then try running the pipeline; it should work now.

## 📝 Notes to reviewers

Have a look, but we can discuss more next week!

@avi9664
Collaborator

avi9664 commented Jan 8, 2026

Hi! When I'm trying to migrate I'm unable to access these:
image

(I changed a couple of them to test)

@avi9664
Collaborator

avi9664 commented Jan 9, 2026

> Hi! When I'm trying to migrate I'm unable to access these: image
>
> (I changed a couple of them to test)

Fixed! I just had to comment out `from ca_biositing.datamodels.database import Base`.

Comment thread alembic/env.py Outdated
# --- Import generated models and their metadata ---
# from ca_biositing.datamodels.schemas.generated.census_survey import metadata as census_metadata
# from ca_biositing.datamodels.schemas.generated.geography import metadata as geography_metadata
from ca_biositing.datamodels.database import Base
Collaborator

@avi9664 avi9664 Jan 9, 2026


Note: migration only works for me if this is commented out. That may just be my setup, though; I don't know if it applies to others.

Collaborator

@avi9664 avi9664 left a comment


I checked the ETL pipeline part of it and it's good!

Collaborator

@avi9664 avi9664 left a comment


Actually, I'm encountering problems accessing the database in the .ipynb files - I'll test them out within the next few days and let you know what's happening

@avi9664
Collaborator

avi9664 commented Jan 11, 2026

Hi! I ran the pipeline notebooks and they couldn't access the database because the DATABASE_URL that was used didn't work with mine, so I changed engine.py and a bit of gsheet_extraction_notebook.ipynb to get credentials from your .env file. So I think it should work for everyone now regardless of what you put in your .env? Please let me know if it doesn't.

Some of the code in the latter half of gsheet_extraction_notebook.ipynb threw some errors (which I included in the latest commit), but I assumed that they were helper functions and left them alone. Let me know if you want me to fix them though! I'd be down :)
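
For readers following along, the idea of deriving the connection URL from .env values instead of a hard-coded DATABASE_URL can be sketched as below. This is not the actual engine.py change: the function name, the `POSTGRES_*` variable names, the defaults, and the PostgreSQL-style URL are all assumptions, and the sketch expects the .env values to already be loaded into the environment.

```python
import os
from typing import Mapping
from urllib.parse import quote_plus


def build_database_url(env: Mapping[str, str] = os.environ) -> str:
    """Build a connection URL from environment-style settings (hypothetical names)."""
    # An explicit DATABASE_URL takes precedence if one is provided.
    explicit = env.get("DATABASE_URL")
    if explicit:
        return explicit
    user = env.get("POSTGRES_USER", "postgres")
    # Percent-encode the password so characters such as "@" do not break the URL.
    password = quote_plus(env.get("POSTGRES_PASSWORD", ""))
    host = env.get("POSTGRES_HOST", "localhost")
    port = env.get("POSTGRES_PORT", "5432")
    name = env.get("POSTGRES_DB", "postgres")
    return f"postgresql://{user}:{password}@{host}:{port}/{name}"


print(build_database_url({"POSTGRES_USER": "dev", "POSTGRES_PASSWORD": "p@ss", "POSTGRES_DB": "biositing"}))
```

Passing the settings as a mapping makes the function easy to exercise in a notebook without touching the real environment.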

@petercarbsmith petercarbsmith merged commit 7ab3102 into main Jan 16, 2026
1 of 7 checks passed
@petercarbsmith petercarbsmith deleted the Peter-linkml-refactor branch January 16, 2026 03:08
