4 changes: 2 additions & 2 deletions CONTRIBUTING.md
@@ -39,8 +39,8 @@ Your contributions make this project better—thank you for your support! 🚀
1. Set up your development environment with `pixi install`.
2. Install pre-commit hooks with `pixi run pre-commit-install`.
3. Create a feature branch.
4. Make your changes and ensure tests and pre-commit checks pass. . Submit a
pull request.
4. Make your changes and ensure tests and pre-commit checks pass. Submit a pull
request.

### Configuring Pre-commit

61 changes: 0 additions & 61 deletions ERD_VIEW.md

This file was deleted.

Binary file added anaconda_projects/db/project_filebrowser.db
Binary file not shown.
1 change: 0 additions & 1 deletion docs/ERD_VIEW.md

This file was deleted.

31 changes: 18 additions & 13 deletions docs/architecture.md
@@ -2,12 +2,7 @@

mglbleta marked this conversation as resolved.
## Overview

CA-Biositing is a comprehensive geospatial bioeconomy platform for biodiversity
data management and analysis, specifically focused on California biositing
activities. The project combines ETL data pipelines, REST APIs, geospatial
analysis tools, and web interfaces to support biodiversity research and
conservation efforts. It processes data from Google Sheets into PostgreSQL
databases and provides both programmatic and visual access to the data.
The CA-Biositing system ingests agricultural and geospatial data from multiple external sources to support biomass siting analysis and related decision-making workflows. This architecture document describes how data flows through ETL pipelines, is validated and stored in relational and geospatial databases, and is orchestrated using workflow tooling. The diagram below provides a high-level view of the core services, data stores, and integrations that make up the platform.

## System Architecture Diagram

@@ -95,7 +90,7 @@ end
### Backend Infrastructure

- **Programming Language**: Python 3.12+
- **Database**: PostgreSQL 13+ with PostGIS extension
- **Database**: PostgreSQL 13+ (17 in dev/staging) with PostGIS extension
mglbleta marked this conversation as resolved.
⚠️ Potential issue | 🟡 Minor

Database version inconsistency with line 127-128.

Line 89 states PostgreSQL 17 is used in "dev/staging", but lines 127-128 state "PostgreSQL 17+ with PostGIS (Cloud SQL on GCP for production, local PostGIS for development)". This creates confusion about which environments use which PostgreSQL versions.

Please clarify:

  • Is PostgreSQL 17+ used in production (as line 127 suggests)?
  • Is PostgreSQL 13+ still the minimum for development?
  • What version is used in dev/staging vs production?
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/architecture.md` at line 89, Update the inconsistent database version
wording so the "Database" bullet (the line starting with "**Database**:
PostgreSQL 13+ (17 in dev/staging) with PostGIS extension") and the later
sentence that reads "PostgreSQL 17+ with PostGIS (Cloud SQL on GCP for
production, local PostGIS for development)" convey the same environment mapping;
explicitly state which PostgreSQL major version is used for production, staging,
and development (e.g., "Production: PostgreSQL 17+ on Cloud SQL with PostGIS;
Staging: PostgreSQL 17; Development: PostgreSQL 13+ with local PostGIS") and
replace one of the conflicting lines with that definitive mapping so both the
"**Database**" bullet and the "PostgreSQL 17+ with PostGIS..." sentence match.

- **Database Migrations**: Alembic for schema versioning
- **Data Models**: SQLModel (combining SQLAlchemy + Pydantic)
- **API Framework**: FastAPI with automatic OpenAPI documentation
@@ -127,13 +122,15 @@ end

### Cloud Infrastructure & Services

- **Google Cloud Platform**:
- **Google Cloud Platform (GCP):**
- Google Sheets API for data ingestion
- Google Cloud credentials management
- Potential cloud deployment target
- **Database Hosting**: Containerized PostgreSQL (development), cloud SQL
(production)
- **Container Registry**: For Docker image distribution
- Google Cloud Secret Manager for credentials
- **Production deployment:** All core infrastructure (database, application
containers, orchestration, and secrets) runs on GCP using Cloud SQL, Cloud
Run, Artifact Registry, and Secret Manager
- **Database Hosting:** PostgreSQL 17+ with PostGIS (Cloud SQL on GCP for
production, local PostGIS for development)
- **Container Registry:** GCP Artifact Registry for Docker images

## Detailed Project Structure

@@ -319,6 +316,10 @@ subdirectories (91 models total). Four base mixins (`BaseEntity`, `LookupBase`,

#### Resource & Biomass Models (`resource_information/`)

<!--
TODO (2026-03-12): The "Core Domain Models" section below may be outdated. Review for accuracy in the next documentation update.
-->

- **Resource**: Core biomass resource definitions
- **ResourceClass**, **ResourceSubclass**: Hierarchical resource classification
- **ResourceAvailability**: Seasonal and quantitative availability data
@@ -509,6 +510,10 @@ Environments:

## Deployment & Operations

<!--
TODO (2026-03-12): Could change section to be less heavily bullet point reliant and use more descriptive language, with greater explanation of future architecture considerations as well.
-->

### Container Orchestration

- **Development**: Docker Compose for local services
2 changes: 1 addition & 1 deletion docs/notebook_setup.md
@@ -1,6 +1,6 @@
# Notebook Setup Guide for **ca-biositing**

**Purpose** -- Set up Jupyter notebooks with correct imports for the PEP 420
**Purpose**: Set up Jupyter notebooks with correct imports for the PEP 420
namespace packages used in this repository.

---
8 changes: 5 additions & 3 deletions docs/pipeline/ALEMBIC_WORKFLOW.md
@@ -12,9 +12,11 @@ systematic and version-controlled way.
allows you to modify your database schema (e.g., add a new table or column)
and keep a versioned history of those changes.
- **Why use it?** It prevents you from having to manually write SQL
`ALTER TABLE` statements. It automatically compares your SQLModel classes to
the current state of the database and generates the necessary migration
scripts.
`ALTER TABLE` statements which are not tracked in version control. Alembic
generates SQL code from the Python SQLModel schema to prevent manual errors.
It also automatically compares your SQLModel classes to the current state of
the database and generates the necessary migration scripts. This reduces
database drift.

---

33 changes: 21 additions & 12 deletions docs/pipeline/ETL_WORKFLOW.md
@@ -17,8 +17,9 @@ and loads it into the PostgreSQL database.
- `load`: Functions to insert the transformed data into the database using
SQLAlchemy.

- **Hierarchical Pipelines:** Individual pipelines are nested within
subdirectories reflecting the data they handle (e.g., `products`, `biomass`).
- **Hierarchical Pipelines:** Transform and load logic are organized into
subdirectories reflecting the data they handle (e.g., `products`, `usda`,
`analysis`).

---
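
As an illustrative sketch of the extract/transform/load split described above (data and function bodies are hypothetical, and plain functions stand in for Prefect `@task`-decorated functions):

```python
# Hypothetical sketch of a single pipeline's three stages.
# In the real codebase each stage lives in its own module and is a Prefect task.

def extract():
    # Pull raw rows from a source (stubbed here with static data).
    return [{"name": "almond shells", "tons": "120"}]

def transform(raw_rows):
    # Normalize types so rows match the SQLModel schema.
    return [{"name": r["name"], "tons": int(r["tons"])} for r in raw_rows]

def load(rows):
    # Insert rows into the database; here we just return the row count.
    return len(rows)

processed = transform(extract())
print(load(processed))  # prints 1
```

A flow in `flows/` then simply chains these three calls, which is what the flow example later in this file shows.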

@@ -32,41 +33,49 @@ The ETL system runs in a containerized Prefect environment.
pixi run start-services
```

**Step 2: Deploy Flows**
**Step 2: Apply Datamodel**

```bash
pixi run migrate
```

**Step 3: Deploy Flows**

```bash
pixi run deploy
```

**Step 3: Run the Master Pipeline**
**Step 4: Run the Master Pipeline**

```bash
pixi run run-etl
```

**Step 4: Monitor** Access the Prefect UI at
**Step 5: Monitor** Access the Prefect UI at
[http://localhost:4200](http://localhost:4200).

---

### How to Add a New ETL Flow

**Step 1: Create the Task Files** Create the three Python files for your
extract, transform, and load logic in the appropriate subdirectories under
`src/ca_biositing/pipeline/ca_biositing/pipeline/etl/`. Decorate each function
with `@task`.
extract, transform, and load logic under
`src/ca_biositing/pipeline/ca_biositing/pipeline/etl/`. Extract tasks go
directly in `extract/`; transform and load tasks go in appropriately named
subdirectories (e.g., `transform/products/`, `load/products/`). Decorate each
function with `@task`.

**Step 2: Create the Pipeline Flow** Create a new file in
`src/ca_biositing/pipeline/ca_biositing/pipeline/flows/` to define the flow.

```python
from prefect import flow
from ca_biositing.pipeline.etl.extract.samples.new_type import extract
from ca_biositing.pipeline.etl.transform.samples.new_type import transform
from ca_biositing.pipeline.etl.load.samples.new_type import load
from ca_biositing.pipeline.etl.extract.my_source import extract
from ca_biositing.pipeline.etl.transform.products.my_product import transform
from ca_biositing.pipeline.etl.load.products.my_product import load

@flow
def new_type_flow():
def my_product_flow():
raw_data = extract()
transformed_data = transform(raw_data)
load(transformed_data)
```
5 changes: 2 additions & 3 deletions mkdocs.yml
@@ -1,5 +1,5 @@
site_name: CA-BioSiting
repo_url: https://github.com/uw-ssec/ca-biositing
repo_url: https://github.com/sustainability-software-lab/ca-biositing

theme:
name: material
@@ -51,11 +51,11 @@ nav:
- Notebook Setup: notebook_setup.md
- Pipeline:
- Overview: pipeline/README.md
- GCP Setup: pipeline/GCP_SETUP.md
- ETL Workflow: pipeline/ETL_WORKFLOW.md
- Alembic Workflow: pipeline/ALEMBIC_WORKFLOW.md
- Docker Workflow: pipeline/DOCKER_WORKFLOW.md
- Prefect Workflow: pipeline/PREFECT_WORKFLOW.md
- GCP Setup: pipeline/GCP_SETUP.md
- USDA ETL Guide: pipeline/USDA/USDA_ETL_GUIDE.md
- Datamodels:
- Overview: datamodels/README.md
@@ -71,4 +71,3 @@ nav:
- Deployment: deployment/README.md
- Contributing: CONTRIBUTING.md
- Code of Conduct: CODE_OF_CONDUCT.md
- ERD View: ERD_VIEW.md
21 changes: 3 additions & 18 deletions pixi.lock
