diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
index 4f210ccd..3a17e040 100644
--- a/CONTRIBUTING.md
+++ b/CONTRIBUTING.md
@@ -39,8 +39,8 @@ Your contributions make this project betterβ€”thank you for your support! πŸš€
 
 1. Set up your development environment with `pixi install`.
 2. Install pre-commit hooks with `pixi run pre-commit-install`.
 3. Create a feature branch.
-4. Make your changes and ensure tests and pre-commit checks pass. . Submit a
-   pull request.
+4. Make your changes and ensure tests and pre-commit checks pass. Submit a pull
+   request.
 
 ### Configuring Pre-commit
diff --git a/ERD_VIEW.md b/ERD_VIEW.md
deleted file mode 100644
index 05b1930f..00000000
--- a/ERD_VIEW.md
+++ /dev/null
@@ -1,61 +0,0 @@
-# ERD for USDA Census and Survey Data
-
-To view this diagram, open the Command Palette (`Cmd+Shift+P` on Mac or
-`Ctrl+Shift+P` on Windows/Linux) and run **"Markdown: Open Preview to the
-Side"**.
-
-```mermaid
-erDiagram
-CensusRecord {
-    integer year
-    CropEnum crop
-    VariableEnum variable
-    UnitEnum unit
-    float value
-    BearingStatusEnum bearing_status
-    string class_desc
-    string domain_desc
-    string source
-    string notes
-}
-Geography {
-    string state_name
-    string state_fips
-    string county_name
-    string county_fips
-    string geoid
-    string region_name
-    string agg_level_desc
-}
-SurveyRecord {
-    string period_desc
-    string freq_desc
-    string program_desc
-    integer year
-    CropEnum crop
-    VariableEnum variable
-    UnitEnum unit
-    float value
-    BearingStatusEnum bearing_status
-    string class_desc
-    string domain_desc
-    string source
-    string notes
-}
-UsdaRecord {
-    integer year
-    CropEnum crop
-    VariableEnum variable
-    UnitEnum unit
-    float value
-    BearingStatusEnum bearing_status
-    string class_desc
-    string domain_desc
-    string source
-    string notes
-}
-
-CensusRecord ||--|o Geography : "geography"
-SurveyRecord ||--|o Geography : "geography"
-UsdaRecord ||--|o Geography : "geography"
-```
diff --git a/anaconda_projects/db/project_filebrowser.db b/anaconda_projects/db/project_filebrowser.db
new file mode 100644
index 00000000..3fa3a4a0
Binary files /dev/null and b/anaconda_projects/db/project_filebrowser.db differ
diff --git a/docs/ERD_VIEW.md b/docs/ERD_VIEW.md
deleted file mode 120000
index 1f7c4c28..00000000
--- a/docs/ERD_VIEW.md
+++ /dev/null
@@ -1 +0,0 @@
-../ERD_VIEW.md
\ No newline at end of file
diff --git a/docs/architecture.md b/docs/architecture.md
index 03c43411..cd802e57 100644
--- a/docs/architecture.md
+++ b/docs/architecture.md
@@ -2,12 +2,7 @@
 
 ## Overview
 
-CA-Biositing is a comprehensive geospatial bioeconomy platform for biodiversity
-data management and analysis, specifically focused on California biositing
-activities. The project combines ETL data pipelines, REST APIs, geospatial
-analysis tools, and web interfaces to support biodiversity research and
-conservation efforts. It processes data from Google Sheets into PostgreSQL
-databases and provides both programmatic and visual access to the data.
+The CA-Biositing system ingests agricultural and geospatial data from multiple external sources to support biomass siting analysis and related decision-making workflows. This architecture document describes how data flows through ETL pipelines, is validated and stored in relational and geospatial databases, and is orchestrated using workflow tooling. The diagram below provides a high-level view of the core services, data stores, and integrations that make up the platform.
 
 ## System Architecture Diagram
 
@@ -95,7 +90,7 @@ end
 ### Backend Infrastructure
 
 - **Programming Language**: Python 3.12+
-- **Database**: PostgreSQL 13+ with PostGIS extension
+- **Database**: PostgreSQL 13+ (17 in dev/staging) with PostGIS extension
 - **Database Migrations**: Alembic for schema versioning
 - **Data Models**: SQLModel (combining SQLAlchemy + Pydantic)
 - **API Framework**: FastAPI with automatic OpenAPI documentation
@@ -127,13 +122,15 @@ end
 ### Cloud Infrastructure & Services
 
-- **Google Cloud Platform**:
+- **Google Cloud Platform (GCP):**
   - Google Sheets API for data ingestion
-  - Google Cloud credentials management
-  - Potential cloud deployment target
-- **Database Hosting**: Containerized PostgreSQL (development), cloud SQL
-  (production)
-- **Container Registry**: For Docker image distribution
+  - Google Cloud Secret Manager for credentials
+- **Production deployment:** All core infrastructure (database, application
+  containers, orchestration, and secrets) runs on GCP using Cloud SQL, Cloud
+  Run, Artifact Registry, and Secret Manager
+- **Database Hosting:** PostgreSQL 17+ with PostGIS (Cloud SQL on GCP for
+  production, local PostGIS for development)
+- **Container Registry:** GCP Artifact Registry for Docker images
 
 ## Detailed Project Structure
 
@@ -319,6 +316,10 @@ subdirectories (91 models total). Four base mixins (`BaseEntity`, `LookupBase`,
 
 #### Resource & Biomass Models (`resource_information/`)
 
+
+
 - **Resource**: Core biomass resource definitions
 - **ResourceClass**, **ResourceSubclass**: Hierarchical resource classification
 - **ResourceAvailability**: Seasonal and quantitative availability data
@@ -509,6 +510,10 @@ Environments:
 
 ## Deployment & Operations
 
+
+
 ### Container Orchestration
 
 - **Development**: Docker Compose for local services
diff --git a/docs/notebook_setup.md b/docs/notebook_setup.md
index d65d6b2d..ab2f8936 100644
--- a/docs/notebook_setup.md
+++ b/docs/notebook_setup.md
@@ -1,6 +1,6 @@
 # Notebook Setup Guide for **ca-biositing**
 
-**Purpose** -- Set up Jupyter notebooks with correct imports for the PEP 420
+**Purpose**: Set up Jupyter notebooks with correct imports for the PEP 420
 namespace packages used in this repository.
 
 ---
diff --git a/docs/pipeline/ALEMBIC_WORKFLOW.md b/docs/pipeline/ALEMBIC_WORKFLOW.md
index ff048753..dd290239 100644
--- a/docs/pipeline/ALEMBIC_WORKFLOW.md
+++ b/docs/pipeline/ALEMBIC_WORKFLOW.md
@@ -12,9 +12,11 @@ systematic and version-controlled way.
   allows you to modify your database schema (e.g., add a new table or column)
   and keep a versioned history of those changes.
 - **Why use it?** It prevents you from having to manually write SQL
-  `ALTER TABLE` statements. It automatically compares your SQLModel classes to
-  the current state of the database and generates the necessary migration
-  scripts.
+  `ALTER TABLE` statements which are not tracked in version control. Alembic
+  generates SQL code from the Python SQLModel schema to prevent manual errors.
+  It also automatically compares your SQLModel classes to the current state of
+  the database and generates the necessary migration scripts. This reduces
+  database drift.
 
 ---
diff --git a/docs/pipeline/ETL_WORKFLOW.md b/docs/pipeline/ETL_WORKFLOW.md
index b04f34e8..b6e15b63 100644
--- a/docs/pipeline/ETL_WORKFLOW.md
+++ b/docs/pipeline/ETL_WORKFLOW.md
@@ -17,8 +17,9 @@ and loads it into the PostgreSQL database.
   - `load`: Functions to insert the transformed data into the database using
     SQLAlchemy.
 
-- **Hierarchical Pipelines:** Individual pipelines are nested within
-  subdirectories reflecting the data they handle (e.g., `products`, `biomass`).
+- **Hierarchical Pipelines:** Transform and load logic are organized into
+  subdirectories reflecting the data they handle (e.g., `products`, `usda`,
+  `analysis`).
 
 ---
@@ -32,19 +33,25 @@ The ETL system runs in a containerized Prefect environment.
 pixi run start-services
 ```
 
-**Step 2: Deploy Flows**
+**Step 2: Apply Datamodel**
+
+```bash
+pixi run migrate
+```
+
+**Step 3: Deploy Flows**
 
 ```bash
 pixi run deploy
 ```
 
-**Step 3: Run the Master Pipeline**
+**Step 4: Run the Master Pipeline**
 
 ```bash
 pixi run run-etl
 ```
 
-**Step 4: Monitor** Access the Prefect UI at
+**Step 5: Monitor** Access the Prefect UI at
 [http://localhost:4200](http://localhost:4200).
@@ -52,21 +59,23 @@ pixi run run-etl
 ### How to Add a New ETL Flow
 
 **Step 1: Create the Task Files** Create the three Python files for your
-extract, transform, and load logic in the appropriate subdirectories under
-`src/ca_biositing/pipeline/ca_biositing/pipeline/etl/`. Decorate each function
-with `@task`.
+extract, transform, and load logic under
+`src/ca_biositing/pipeline/ca_biositing/pipeline/etl/`. Extract tasks go
+directly in `extract/`; transform and load tasks go in appropriately named
+subdirectories (e.g., `transform/products/`, `load/products/`). Decorate each
+function with `@task`.
 
 **Step 2: Create the Pipeline Flow** Create a new file in
 `src/ca_biositing/pipeline/ca_biositing/pipeline/flows/` to define the flow.
 
 ```python
 from prefect import flow
-from ca_biositing.pipeline.etl.extract.samples.new_type import extract
-from ca_biositing.pipeline.etl.transform.samples.new_type import transform
-from ca_biositing.pipeline.etl.load.samples.new_type import load
+from ca_biositing.pipeline.etl.extract.my_source import extract
+from ca_biositing.pipeline.etl.transform.products.my_product import transform
+from ca_biositing.pipeline.etl.load.products.my_product import load
 
 @flow
-def new_type_flow():
+def my_product_flow():
     raw_data = extract()
     transformed_data = transform(raw_data)
     load(transformed_data)
diff --git a/mkdocs.yml b/mkdocs.yml
index c96fcb90..03db49b3 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -1,5 +1,5 @@
 site_name: CA-BioSiting
-repo_url: https://github.com/uw-ssec/ca-biositing
+repo_url: https://github.com/sustainability-software-lab/ca-biositing
 
 theme:
   name: material
@@ -51,11 +51,11 @@ nav:
       - Notebook Setup: notebook_setup.md
   - Pipeline:
      - Overview: pipeline/README.md
+      - GCP Setup: pipeline/GCP_SETUP.md
       - ETL Workflow: pipeline/ETL_WORKFLOW.md
       - Alembic Workflow: pipeline/ALEMBIC_WORKFLOW.md
       - Docker Workflow: pipeline/DOCKER_WORKFLOW.md
       - Prefect Workflow: pipeline/PREFECT_WORKFLOW.md
-      - GCP Setup: pipeline/GCP_SETUP.md
       - USDA ETL Guide: pipeline/USDA/USDA_ETL_GUIDE.md
   - Datamodels:
       - Overview: datamodels/README.md
@@ -71,4 +71,3 @@ nav:
   - Deployment: deployment/README.md
   - Contributing: CONTRIBUTING.md
   - Code of Conduct: CODE_OF_CONDUCT.md
-  - ERD View: ERD_VIEW.md
diff --git a/pixi.lock b/pixi.lock
index cff1b773..eb952c35 100644
--- a/pixi.lock
+++ b/pixi.lock
@@ -5,8 +5,6 @@ environments:
     - url: https://conda.anaconda.org/conda-forge/
     indexes:
     - https://pypi.org/simple
-    options:
-      pypi-prerelease-mode: if-necessary-or-explicit
     packages:
       linux-64:
       - conda: https://conda.anaconda.org/conda-forge/linux-64/_libgcc_mutex-0.1-conda_forge.tar.bz2
@@ -1709,8 +1707,6 @@ environments:
     - url: https://conda.anaconda.org/conda-forge/
     indexes:
     - https://pypi.org/simple
-    options:
-      pypi-prerelease-mode: if-necessary-or-explicit
     packages:
       linux-64:
       - conda: https://conda.anaconda.org/conda-forge/linux-64/_libgcc_mutex-0.1-conda_forge.tar.bz2
@@ -3071,8 +3067,6 @@ environments:
     - url: https://conda.anaconda.org/conda-forge/
     indexes:
     - https://pypi.org/simple
-    options:
-      pypi-prerelease-mode: if-necessary-or-explicit
     packages:
       linux-64:
       - conda: https://conda.anaconda.org/conda-forge/linux-64/_libgcc_mutex-0.1-conda_forge.tar.bz2
@@ -3466,8 +3460,6 @@ environments:
     - url: https://conda.anaconda.org/conda-forge/
     indexes:
     - https://pypi.org/simple
-    options:
-      pypi-prerelease-mode: if-necessary-or-explicit
     packages:
       linux-64:
       - conda: https://conda.anaconda.org/conda-forge/linux-64/_libgcc_mutex-0.1-conda_forge.tar.bz2
@@ -4860,8 +4852,6 @@ environments:
   frontend:
     channels:
     - url: https://conda.anaconda.org/conda-forge/
-    options:
-      pypi-prerelease-mode: if-necessary-or-explicit
     packages:
       linux-64:
       - conda: https://conda.anaconda.org/conda-forge/linux-64/_libgcc_mutex-0.1-conda_forge.tar.bz2
@@ -4917,8 +4907,6 @@ environments:
     - url: https://conda.anaconda.org/conda-forge/
     indexes:
     - https://pypi.org/simple
-    options:
-      pypi-prerelease-mode: if-necessary-or-explicit
     packages:
       linux-64:
       - conda: https://conda.anaconda.org/conda-forge/linux-64/_libgcc_mutex-0.1-conda_forge.tar.bz2
@@ -7217,8 +7205,6 @@ environments:
     - url: https://conda.anaconda.org/conda-forge/
     indexes:
     - https://pypi.org/simple
-    options:
-      pypi-prerelease-mode: if-necessary-or-explicit
     packages:
       linux-64:
       - conda: https://conda.anaconda.org/conda-forge/linux-64/_libgcc_mutex-0.1-conda_forge.tar.bz2
@@ -8654,8 +8640,6 @@ environments:
     - url: https://conda.anaconda.org/conda-forge/
     indexes:
     - https://pypi.org/simple
-    options:
-      pypi-prerelease-mode: if-necessary-or-explicit
     packages:
       linux-64:
       - conda: https://conda.anaconda.org/conda-forge/linux-64/_libgcc_mutex-0.1-conda_forge.tar.bz2
@@ -10084,8 +10068,6 @@ environments:
     - url: https://conda.anaconda.org/conda-forge/
     indexes:
     - https://pypi.org/simple
-    options:
-      pypi-prerelease-mode: if-necessary-or-explicit
     packages:
       linux-64:
       - conda: https://conda.anaconda.org/conda-forge/linux-64/_libgcc_mutex-0.1-conda_forge.tar.bz2
@@ -14518,6 +14500,7 @@ packages:
   - sqlalchemy>=2.0.0
   - sqlmodel>=0.0.19,<0.1
   requires_python: '>=3.12'
+  editable: true
 - pypi: ./src/ca_biositing/pipeline
   name: ca-biositing-pipeline
   version: 0.1.0
@@ -14535,6 +14518,7 @@ packages:
   - pyogrio
   - python-dotenv>=1.0.1,<2
   requires_python: '>=3.12'
+  editable: true
 - pypi: ./src/ca_biositing/webservice
   name: ca-biositing-webservice
   version: 0.1.0
@@ -14548,6 +14532,7 @@ packages:
   - python-multipart>=0.0.9
   - uvicorn>=0.30.0,<1
   requires_python: '>=3.12'
+  editable: true
 - conda: https://conda.anaconda.org/conda-forge/noarch/ca-certificates-2025.10.5-hbd8a1cb_0.conda
   sha256: 3b5ad78b8bb61b6cdc0978a6a99f8dfb2cc789a451378d054698441005ecbdb6
   md5: f9e5fbc24009179e8b0409624691758a
diff --git a/src/ca_biositing/datamodels/README.md b/src/ca_biositing/datamodels/README.md
index b5480e8d..532114b5 100644
--- a/src/ca_biositing/datamodels/README.md
+++ b/src/ca_biositing/datamodels/README.md
@@ -9,11 +9,12 @@ etc.).
 
 The `ca_biositing.datamodels` package provides:
 
-- **Hand-Written SQLModel Classes**: 91 models organized across 15 domain
+- **Hand-Written SQLModel Classes**: Models organized across domain
   subdirectories, combining SQLAlchemy ORM and Pydantic validation in a single
   class hierarchy.
-- **Materialized Views**: 7 analytical views defined as SQLAlchemy Core
-  `select()` expressions, managed via Alembic migrations.
+- **Materialized Views**: Analytical views defined as SQLAlchemy Core `select()`
+  expressions (plus one SQL-based aggregate view), managed via Alembic
+  migrations.
 - **Database Configuration**: SQLModel-based engine and session management with
   Docker-aware URL adjustment.
 - **Model Configuration**: Shared configuration for model behavior using
@@ -79,12 +80,13 @@ src/ca_biositing/datamodels/
 β”‚   β”œβ”€β”€ __init__.py                # Package initialization and version
 β”‚   β”œβ”€β”€ config.py                  # Model configuration (Pydantic Settings)
 β”‚   β”œβ”€β”€ database.py                # SQLModel engine and session management
-β”‚   β”œβ”€β”€ views.py                   # Materialized view definitions (7 views)
+β”‚   β”œβ”€β”€ views.py                   # Materialized view definitions
 β”‚   β”œβ”€β”€ models/                    # Hand-written SQLModel classes
-β”‚   β”‚   β”œβ”€β”€ __init__.py            # Central re-export of all 91 models
+β”‚   β”‚   β”œβ”€β”€ __init__.py            # Central re-export of models
 β”‚   β”‚   β”œβ”€β”€ base.py                # Base classes (BaseEntity, LookupBase, etc.)
 β”‚   β”‚   β”œβ”€β”€ aim1_records/          # Aim 1 analytical records
 β”‚   β”‚   β”œβ”€β”€ aim2_records/          # Aim 2 processing records
+β”‚   β”‚   β”œβ”€β”€ auth/                  # API users and authentication models
 β”‚   β”‚   β”œβ”€β”€ core/                  # ETL lineage and run tracking
 β”‚   β”‚   β”œβ”€β”€ data_sources_metadata/ # Data source and dataset metadata
 β”‚   β”‚   β”œβ”€β”€ experiment_equipment/  # Experiments and equipment
@@ -93,7 +95,6 @@ src/ca_biositing/datamodels/
 β”‚   β”‚   β”œβ”€β”€ general_analysis/      # Observations and analysis types
 β”‚   β”‚   β”œβ”€β”€ infrastructure/        # Infrastructure facility records
 β”‚   β”‚   β”œβ”€β”€ methods_parameters_units/ # Methods, parameters, units
-β”‚   β”‚   β”œβ”€β”€ misc/                  # Additional infrastructure models
 β”‚   β”‚   β”œβ”€β”€ people/                # Contacts and providers
 β”‚   β”‚   β”œβ”€β”€ places/                # Location and address models
 β”‚   β”‚   β”œβ”€β”€ resource_information/  # Resources, availability, strains
@@ -102,8 +103,6 @@ src/ca_biositing/datamodels/
 β”œβ”€β”€ tests/
 β”‚   β”œβ”€β”€ __init__.py
 β”‚   β”œβ”€β”€ conftest.py                # Pytest fixtures and configuration
-β”‚   β”œβ”€β”€ test_biomass.py            # Tests for biomass models
-β”‚   β”œβ”€β”€ test_geographic_locations.py # Tests for location models
 β”‚   β”œβ”€β”€ test_package.py            # Tests for package metadata
 β”‚   └── README.md                  # Test documentation
 β”œβ”€β”€ LICENSE                        # BSD License
@@ -208,7 +207,7 @@ pixi run pytest src/ca_biositing/datamodels -v
 ### Run specific test files
 
 ```bash
-pixi run pytest src/ca_biositing/datamodels/tests/test_biomass.py -v
+pixi run pytest src/ca_biositing/datamodels/tests/test_package.py -v
 ```
 
 ### Run with coverage
@@ -221,18 +220,17 @@ See `tests/README.md` for detailed information about the test suite.
 
 ## Model Categories
 
-The models are organized into 15 domain subdirectories under `models/`:
+The models are organized into domain subdirectories under `models/`:
 
 ### Core and Infrastructure
 
 - **`base.py`**: Base classes shared across all models (`BaseEntity`,
   `LookupBase`, `Aim1RecordBase`, `Aim2RecordBase`).
+- **`auth/`**: API authentication models (`ApiUser`).
 - **`core/`**: ETL run tracking and lineage (`EtlRun`, `EntityLineage`,
   `LineageGroup`).
 - **`infrastructure/`**: Infrastructure facility records (biodiesel plants,
   landfills, ethanol biorefineries, etc.).
-- **`misc/`**: Additional infrastructure models (MSW digesters, SAF plants,
-  wastewater treatment).
 - **`places/`**: Location and address models (`Place`, `LocationAddress`,
   `LocationResolution`).
 - **`people/`**: Contact and provider information (`Contact`, `Provider`).
@@ -312,7 +310,7 @@ pixi run pre-commit run --files src/ca_biositing/datamodels/**/*
 
 - **Version**: 0.1.0
 - **Python**: >= 3.12
 - **License**: BSD License
-- **Repository**:
+- **Repository**:
 
 ## Contributing
diff --git a/src/ca_biositing/pipeline/README.md b/src/ca_biositing/pipeline/README.md
index 56b8bc31..444ac111 100644
--- a/src/ca_biositing/pipeline/README.md
+++ b/src/ca_biositing/pipeline/README.md
@@ -326,7 +326,7 @@ following the project conventions.
 
 - **Version**: 0.1.0
 - **Python**: >= 3.12
 - **License**: BSD License
-- **Repository**:
+- **Repository**:
 
 ## Contributing