Closed

84 commits
ee13d63
adding USDA yaml schemas to test LinkML
petercarbsmith Oct 29, 2025
fe0312b
feat:Adding linkml and modifying alembic .env to ensure compatiblity.…
petercarbsmith Oct 30, 2025
f5eed11
adding win-64 to platforms in pixi toml
petercarbsmith Oct 30, 2025
584b1f7
total test
mglbleta Oct 30, 2025
a000cae
test is our commit message
mglbleta Oct 30, 2025
d0373aa
Removing test file and revising docker setup instructions in README.md
mglbleta Nov 2, 2025
d53ade3
Adding notes to env.example which would've explained points I got con…
mglbleta Nov 2, 2025
04743bc
potentially fixed whitespace
mglbleta Nov 2, 2025
cbccecd
Added explation for @task instruction
mglbleta Nov 3, 2025
603e071
trying to commit again
petercarbsmith Nov 14, 2025
8bd307d
feat: attempting to add schemas folder to datamodels module and impor…
petercarbsmith Nov 14, 2025
ade4694
added a jupyter notebook for playing around with connecting to db
petercarbsmith Nov 14, 2025
dca4ca6
fixing alembic migration issues. Commenting out all SQLalchemy stuff …
petercarbsmith Nov 18, 2025
b9c0560
needed to clear alembic migrations first
petercarbsmith Nov 18, 2025
120d608
alembic issue fixed. sqlalchemy commented out for now
petercarbsmith Nov 18, 2025
6e4efce
adding asyncpg to root pixi.toml
petercarbsmith Nov 19, 2025
eee7e86
rebasing to upstream main. Hopefully everythign still works
petercarbsmith Dec 8, 2025
50687aa
feat: first pass at putting in new linkml schemas. They need to be ch…
petercarbsmith Dec 8, 2025
cbf8242
AI hallucinations are messing up yaml files. Attempting to fix.
petercarbsmith Dec 8, 2025
a4758c8
generated sqla from linkml.STILL NEEDS A CHECK. Next up alembic.
petercarbsmith Dec 8, 2025
cecc4d9
feat: generated sqlA now is being read by Alembic and schema was succ…
petercarbsmith Dec 8, 2025
7987cda
feat: sqla to alembic complete. new pixi task. docs updated.
petercarbsmith Dec 8, 2025
1a4e470
quick clean up before push to upstream
petercarbsmith Dec 9, 2025
6e2a1c9
changes to pixi.toml and some comments on linkml schema
mglbleta Dec 12, 2025
2fcbf18
added some comments to organize and ask questions
mglbleta Dec 13, 2025
7558f39
Add and edit revisions to schema (untested)
mglbleta Dec 16, 2025
d5ea8e1
Making people group full of base entity
mglbleta Dec 16, 2025
21baf81
made changes to try to make an alembic migration to represent the cha…
mglbleta Dec 16, 2025
10ff5c0
hallucinations removed working update
mglbleta Dec 17, 2025
032cf1e
removed useless alembic versions and fixed the revision path
mglbleta Dec 17, 2025
2915797
removed duplicate .yaml files
mglbleta Dec 17, 2025
2b5f9ff
Merge pull request #14 from mglbleta/Mei_LinkML
petercarbsmith Dec 19, 2025
2d5bc26
cleaning up a last few things before merge
petercarbsmith Dec 19, 2025
8665c0e
Merge pull request #15 from petercarbsmith/Peter_LinkML
petercarbsmith Dec 19, 2025
536f209
adding USDA yaml schemas to test LinkML
petercarbsmith Oct 29, 2025
98af1b2
feat:Adding linkml and modifying alembic .env to ensure compatiblity.…
petercarbsmith Oct 30, 2025
5f84c4e
adding win-64 to platforms in pixi toml
petercarbsmith Oct 30, 2025
44417f0
total test
mglbleta Oct 30, 2025
719fe29
test is our commit message
mglbleta Oct 30, 2025
a00253f
Removing test file and revising docker setup instructions in README.md
mglbleta Nov 2, 2025
f307b99
Adding notes to env.example which would've explained points I got con…
mglbleta Nov 2, 2025
5e9fd3e
potentially fixed whitespace
mglbleta Nov 2, 2025
9852781
Added explation for @task instruction
mglbleta Nov 3, 2025
719aa94
trying to commit again
petercarbsmith Nov 14, 2025
883a01a
feat: attempting to add schemas folder to datamodels module and impor…
petercarbsmith Nov 14, 2025
d53f226
added a jupyter notebook for playing around with connecting to db
petercarbsmith Nov 14, 2025
3490faa
fixing alembic migration issues. Commenting out all SQLalchemy stuff …
petercarbsmith Nov 18, 2025
8458b18
needed to clear alembic migrations first
petercarbsmith Nov 18, 2025
4773404
alembic issue fixed. sqlalchemy commented out for now
petercarbsmith Nov 18, 2025
8048a0c
adding asyncpg to root pixi.toml
petercarbsmith Nov 19, 2025
3404501
rebasing to upstream main. Hopefully everythign still works
petercarbsmith Dec 8, 2025
b7f5df7
feat: first pass at putting in new linkml schemas. They need to be ch…
petercarbsmith Dec 8, 2025
de92cb0
AI hallucinations are messing up yaml files. Attempting to fix.
petercarbsmith Dec 8, 2025
269d11d
generated sqla from linkml.STILL NEEDS A CHECK. Next up alembic.
petercarbsmith Dec 8, 2025
e81650d
feat: generated sqlA now is being read by Alembic and schema was succ…
petercarbsmith Dec 8, 2025
2e0b778
feat: sqla to alembic complete. new pixi task. docs updated.
petercarbsmith Dec 8, 2025
8934429
quick clean up before push to upstream
petercarbsmith Dec 9, 2025
ec6f741
changes to pixi.toml and some comments on linkml schema
mglbleta Dec 12, 2025
9a53350
added some comments to organize and ask questions
mglbleta Dec 13, 2025
7c62775
Add and edit revisions to schema (untested)
mglbleta Dec 16, 2025
61fcf80
Making people group full of base entity
mglbleta Dec 16, 2025
942b4d4
made changes to try to make an alembic migration to represent the cha…
mglbleta Dec 16, 2025
4c7d9d7
hallucinations removed working update
mglbleta Dec 17, 2025
f78d9bb
removed useless alembic versions and fixed the revision path
mglbleta Dec 17, 2025
e074d3c
removed duplicate .yaml files
mglbleta Dec 17, 2025
4f9f759
cleaning up a last few things before merge
petercarbsmith Dec 19, 2025
eb90e0f
adding database.py and config.py to datamodels folder. modifying ca_b…
petercarbsmith Dec 22, 2025
5574630
adding comment to init.py to try to solve CI CD issues
petercarbsmith Dec 22, 2025
e0947fc
fixed primary_ag_product etl pipeline, modified master flow script an…
petercarbsmith Dec 23, 2025
fd235be
holiday work. New notebook for importing gsheeets to pandas and playi…
petercarbsmith Dec 30, 2025
907686b
feat: notebook for gsheets extraction, data playground, and pk id loo…
petercarbsmith Jan 3, 2026
5e961aa
modifying name_id_swap_function
petercarbsmith Jan 3, 2026
6d7990b
was trying to fix dev container kernel issue. Did not succeed. Also m…
petercarbsmith Jan 5, 2026
f4b06ec
modified etl_notebook. Db is now running on localhost
petercarbsmith Jan 6, 2026
5f1e0a4
"did the module refactor but may need to mess with the alembic env.py…
petercarbsmith Jan 7, 2026
234a235
fixed the alembic import problems. models now have fk
petercarbsmith Jan 7, 2026
b08f15e
modified engine.py to get correct database url from env
avi9664 Jan 11, 2026
1c0d681
feat (pipeline): extract data from infrastructure gdrive folder
avi9664 Jan 14, 2026
daa285b
feat (pipeline): extract data from infrastructure gdrive folder (push…
avi9664 Jan 14, 2026
c34df14
feat (pipeline): extract .zip and .geojson files from gdrive
avi9664 Jan 14, 2026
b467d73
moved drive extraction scripts to their own files
avi9664 Jan 14, 2026
e063399
added template scripts for extracting from gdrive
avi9664 Jan 15, 2026
bb28100
docs: add style guide
avi9664 Jan 15, 2026
512678f
Merge remote-tracking branch 'upstream/main' into avi-external-etl
avi9664 Apr 2, 2026
2 changes: 1 addition & 1 deletion AGENTS.md
@@ -404,7 +404,7 @@ ca-biositing/
β”‚ β”‚ β”‚ β”‚ β”œβ”€β”€ transform/ # Transform tasks
β”‚ β”‚ β”‚ β”‚ └── load/ # Load tasks
β”‚ β”‚ β”‚ β”œβ”€β”€ flows/ # Prefect flows
-β”‚ β”‚ β”‚ β”‚ β”œβ”€β”€ primary_product.py
+β”‚ β”‚ β”‚ β”‚ β”œβ”€β”€ primary_ag_product.py
β”‚ β”‚ β”‚ β”‚ β”œβ”€β”€ analysis_type.py
β”‚ β”‚ β”‚ β”‚ └── ...
β”‚ β”‚ β”‚ └── utils/ # Utilities
61 changes: 61 additions & 0 deletions ERD_VIEW.md
@@ -0,0 +1,61 @@
# ERD for USDA Census and Survey Data

To view this diagram, open the Command Palette (`Cmd+Shift+P` on Mac or
`Ctrl+Shift+P` on Windows/Linux) and run **"Markdown: Open Preview to the
Side"**.

```mermaid
erDiagram
CensusRecord {
integer year
CropEnum crop
VariableEnum variable
UnitEnum unit
float value
BearingStatusEnum bearing_status
string class_desc
string domain_desc
string source
string notes
}
Geography {
string state_name
string state_fips
string county_name
string county_fips
string geoid
string region_name
string agg_level_desc
}
SurveyRecord {
string period_desc
string freq_desc
string program_desc
integer year
CropEnum crop
VariableEnum variable
UnitEnum unit
float value
BearingStatusEnum bearing_status
string class_desc
string domain_desc
string source
string notes
}
UsdaRecord {
integer year
CropEnum crop
VariableEnum variable
UnitEnum unit
float value
BearingStatusEnum bearing_status
string class_desc
string domain_desc
string source
string notes
}

CensusRecord ||--o| Geography : "geography"
SurveyRecord ||--o| Geography : "geography"
UsdaRecord ||--o| Geography : "geography"
```
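The relationships above can be exercised with a minimal SQL sketch. SQLite stands in for PostgreSQL here, the enum-typed columns (`CropEnum`, `VariableEnum`, ...) are stored as plain text, and all sample values are invented for illustration:

```python
import sqlite3

# In-memory stand-in for the real schema; only CensusRecord is shown,
# but SurveyRecord and UsdaRecord would link to geography the same way.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE geography (
    id INTEGER PRIMARY KEY,
    state_name TEXT, state_fips TEXT,
    county_name TEXT, county_fips TEXT,
    geoid TEXT, region_name TEXT, agg_level_desc TEXT
);
CREATE TABLE census_record (
    id INTEGER PRIMARY KEY,
    year INTEGER, crop TEXT, variable TEXT, unit TEXT,
    value REAL, bearing_status TEXT,
    class_desc TEXT, domain_desc TEXT, source TEXT, notes TEXT,
    geography_id INTEGER REFERENCES geography(id)  -- the "geography" link
);
""")
conn.execute(
    "INSERT INTO geography VALUES "
    "(1, 'California', '06', 'Yolo', '113', '06113', NULL, 'COUNTY')"
)
conn.execute(
    "INSERT INTO census_record VALUES "
    "(1, 2022, 'ALMONDS', 'AREA_HARVESTED', 'ACRES', 1234.0, "
    "NULL, NULL, NULL, 'USDA Census', NULL, 1)"
)
# Join a record to its geography, as the ERD relationship implies.
row = conn.execute("""
    SELECT c.year, c.crop, g.county_name
    FROM census_record c JOIN geography g ON c.geography_id = g.id
""").fetchone()
print(row)  # (2022, 'ALMONDS', 'Yolo')
```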
304 changes: 279 additions & 25 deletions README.md
@@ -1,45 +1,299 @@
# ca-biositing

-Discussion of general issues related to the project and protyping or research
A geospatial bioeconomy project for biositing analysis in California. This
repository provides tools for ETL pipelines to process data from Google Sheets
into PostgreSQL databases, geospatial analysis using QGIS, and a REST API for
data access.

## Project Structure

This project uses a **PEP 420 namespace package** structure with three main
components:

- **`ca_biositing.datamodels`**: Shared LinkML/SQLModel database models and
database configuration
- **`ca_biositing.pipeline`**: ETL pipelines orchestrated with Prefect, deployed
via Docker
- **`ca_biositing.webservice`**: FastAPI REST API for data access

### Directory Layout

```text
ca-biositing/
β”œβ”€β”€ src/ca_biositing/ # Namespace package root
β”‚ β”œβ”€β”€ datamodels/ # Database models (SQLModel)
β”‚ β”œβ”€β”€ pipeline/ # ETL pipelines (Prefect)
β”‚ └── webservice/ # REST API (FastAPI)
β”œβ”€β”€ resources/ # Deployment resources
β”‚ β”œβ”€β”€ docker/ # Docker Compose configuration
β”‚ └── prefect/ # Prefect deployment files
β”œβ”€β”€ tests/ # Integration tests
β”œβ”€β”€ pixi.toml # Pixi dependencies and tasks
└── pixi.lock # Dependency lock file
```

## Quick Start

### Prerequisites

- **Pixi** (v0.55.0+):
[Installation Guide](https://pixi.sh/latest/#installation)
- **Docker**: For running the ETL pipeline
- **Google Cloud credentials**: For Google Sheets access (optional)

### Installation

```bash
# Clone the repository
git clone https://github.com/uw-ssec/ca-biositing.git
cd ca-biositing

# Install dependencies with Pixi
pixi install

# Install pre-commit hooks
pixi run pre-commit-install
```

-## Relevant Links for project documentations and context
### Running Components

-- eScience Slack channel: πŸ”’
-  [#ssec-ca-biositing](https://escience-institute.slack.com/archives/C098GJCTTFE)
-- SSEC Sharepoint (**INTERNAL SSEC ONLY**): πŸ”’
-  [Projects/GeospatialBioeconomy](https://uwnetid.sharepoint.com/:f:/r/sites/og_ssec_escience/Shared%20Documents/Projects/GeospatialBioeconomy?csf=1&web=1&e=VBUGQG)
-- Shared Sharepoint Directory: πŸ”’
-  [SSEC CA Biositing Shared Folder](https://uwnetid.sharepoint.com/:f:/r/sites/og_ssec_escience/Shared%20Documents/Projects/GeospatialBioeconomy/SSEC%20CA%20Biositing%20Shared%20Folder?csf=1&web=1&e=p5wBel)
#### ETL Pipeline (Prefect + Docker)

-## General Discussions
**Note**: Before starting the services for the first time, create the required
environment file from the template:

-For general discussion, ideas, and resources please use the
-[GitHub Discussions](https://github.com/uw-ssec/ca-biositing/discussions).
-However, if there's an internal discussion that need to happen, please use the
-slack channel provided.
```bash
cp resources/docker/.env.example resources/docker/.env
```

-- Meeting Notes in GitHub:
-  [discussions/meetings](https://github.com/uw-ssec/ca-biositing/discussions/categories/meetings)
Then start and use the services:

```bash
# 1. Create the initial database migration script
# (This is only needed once for a new database)
pixi run initial-migration

# 2. Start all services (PostgreSQL, Prefect server, worker)
# This will also automatically apply any pending database migrations.
pixi run start-services

# 3. Deploy flows to Prefect
pixi run deploy

# Run the ETL pipeline
pixi run run-etl

# Monitor via Prefect UI: http://localhost:4200

# To apply new migrations after the initial setup
pixi run migrate

# Stop services
pixi run teardown-services
```

See [`resources/README.md`](resources/README.md) for detailed pipeline
documentation.

-## Questions
#### Web Service (FastAPI)

-If you have any questions about our process, or locations of SSEC resources,
-please ask [Anshul Tambay](https://github.com/atambay37).
```bash
# Start the web service
pixi run start-webservice

-## QGIS
# Access API docs: http://localhost:8000/docs
```

-This project includes QGIS for geospatial analysis and visualization. You can
-run QGIS using pixi with the following command:
#### QGIS (Geospatial Analysis)

```bash
pixi run qgis
```

-This will launch QGIS in the `gis` environment with all necessary dependencies
-installed.
**Note**: On macOS, you may see a Python faulthandler error - this is expected
and can be ignored. See
[QGIS Issue #52987](https://github.com/qgis/QGIS/issues/52987).

## Development

### Running Tests

```bash
# Run all tests
pixi run test

# Run tests with coverage
pixi run test-cov
```

### Code Quality

```bash
# Run pre-commit checks on staged files
pixi run pre-commit

# Run pre-commit on all files (before PR)
pixi run pre-commit-all
```

### Available Pixi Tasks

View all available tasks:

```bash
pixi task list
```

Key tasks:

- **Service Management**: `start-services`, `teardown-services`,
`service-status`
- **ETL Operations**: `deploy`, `run-etl`
- **Development**: `test`, `test-cov`, `pre-commit`, `pre-commit-all`
- **Applications**: `start-webservice`, `qgis`
- **Database**: `access-db`, `check-db-health`
- **Datamodels**: `update-schema`, `migrate`

## Architecture

### Namespace Packages

This project uses **PEP 420 namespace packages** to organize code into
independently installable components that share a common namespace:

- Each component has its own `pyproject.toml` and can be installed separately
- Shared models in `datamodels` are used by both `pipeline` and `webservice`
- Clear separation of concerns while maintaining type consistency
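The PEP 420 behavior described above can be demonstrated in a small, self-contained sketch: two independent source trees each contribute modules to one shared namespace, the way `datamodels`, `pipeline`, and `webservice` each ship part of `ca_biositing`. The directory and module names below are invented, not the project's real layout:

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Two separate source trees, each with an `nsdemo/` directory but no
# __init__.py: that absence is what makes nsdemo a namespace package.
root = Path(tempfile.mkdtemp())
for tree, mod in [("dist_a", "models"), ("dist_b", "flows")]:
    pkg = root / tree / "nsdemo"
    pkg.mkdir(parents=True)
    (pkg / f"{mod}.py").write_text(f"NAME = '{mod}'\n")
    sys.path.insert(0, str(root / tree))
importlib.invalidate_caches()

# Both modules resolve under the same namespace, even though they live
# in different source trees.
models = importlib.import_module("nsdemo.models")
flows = importlib.import_module("nsdemo.flows")

import nsdemo
# The namespace package's __path__ spans both contributing directories.
print(models.NAME, flows.NAME, len(list(nsdemo.__path__)))  # models flows 2
```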

### ETL Pipeline

The ETL pipeline uses:

- **Prefect**: Workflow orchestration and monitoring
- **Docker**: Containerized execution environment
- **PostgreSQL**: Data persistence
- **Google Sheets API**: Primary data source

Pipeline architecture:

1. **Extract**: Pull data from Google Sheets
2. **Transform**: Clean and normalize data with pandas
3. **Load**: Insert/update records in PostgreSQL via SQLModel
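The three stages can be sketched end to end as a toy illustration: a list of dicts stands in for a Google Sheets extract, SQLite stands in for PostgreSQL, and the field names are invented:

```python
import sqlite3

def extract():
    # Stand-in for the Google Sheets pull: raw, messy rows.
    return [{"sample": " Almond Hulls ", "mass_kg": "12.5"},
            {"sample": "rice straw", "mass_kg": "3.0"}]

def transform(rows):
    # Normalize whitespace/case and coerce types, as the pandas step would.
    return [(r["sample"].strip().lower(), float(r["mass_kg"])) for r in rows]

def load(records, conn):
    # Stand-in for the SQLModel/PostgreSQL load.
    conn.execute("CREATE TABLE IF NOT EXISTS biomass (sample TEXT, mass_kg REAL)")
    conn.executemany("INSERT INTO biomass VALUES (?, ?)", records)

conn = sqlite3.connect(":memory:")
load(transform(extract()), conn)
total = conn.execute("SELECT SUM(mass_kg) FROM biomass").fetchone()[0]
print(total)  # 15.5
```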

### Database Models

We use a **LinkML-first approach** for defining our data schema. The workflow
is:

1. **LinkML Schema**: The schema is defined in YAML files (source of truth).
2. **SQLAlchemy Generation**: Python classes are automatically generated from
LinkML.
3. **Alembic Migrations**: Database migrations are generated from the Python
classes.
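The generation idea can be illustrated with a toy sketch: a plain dict plays the role of the LinkML YAML source of truth, and DDL is derived from it, the way the real workflow derives SQLAlchemy classes (step 2) before Alembic diffs them (step 3). The real project uses the LinkML generators; the class and slot names below are invented:

```python
import sqlite3

# Toy "schema as source of truth": one class with typed slots.
schema = {
    "Resource": {"name": "string", "quantity": "float", "year": "integer"},
}
SQL_TYPES = {"string": "TEXT", "float": "REAL", "integer": "INTEGER"}

def to_ddl(schema):
    # Derive one CREATE TABLE statement per class from the declarative schema.
    stmts = []
    for cls, slots in schema.items():
        cols = ", ".join(f"{s} {SQL_TYPES[t]}" for s, t in slots.items())
        stmts.append(f"CREATE TABLE {cls.lower()} (id INTEGER PRIMARY KEY, {cols});")
    return "\n".join(stmts)

ddl = to_ddl(schema)
conn = sqlite3.connect(":memory:")
conn.executescript(ddl)  # the generated DDL is valid SQL
print(ddl)
```

Editing the `schema` dict and regenerating mirrors the project's rule that schema changes happen in the LinkML YAML, never in the generated Python.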

SQLModel-based models provide:

- Type-safe database operations
- Automatic schema generation (via Alembic)
- Shared models across ETL and API components
- Pydantic validation

## Project Components

### 1. Data Models (`ca_biositing.datamodels`)

Database models for:

- Biomass data (field samples, measurements)
- Geographic locations
- Experiments and analysis
- Metadata and samples
- Organizations and contacts

**Documentation**:
[`src/ca_biositing/datamodels/README.md`](src/ca_biositing/datamodels/README.md)

### 2. ETL Pipeline (`ca_biositing.pipeline`)

Prefect-orchestrated workflows for:

- Data extraction from Google Sheets
- Data transformation and validation
- Database loading and updates
- Lookup table management

**Documentation**:
[`src/ca_biositing/pipeline/README.md`](src/ca_biositing/pipeline/README.md)

**Guides**:

- [Docker Workflow](src/ca_biositing/pipeline/docs/DOCKER_WORKFLOW.md)
- [Prefect Workflow](src/ca_biositing/pipeline/docs/PREFECT_WORKFLOW.md)
- [ETL Development](src/ca_biositing/pipeline/docs/ETL_WORKFLOW.md)
- [Database Migrations](src/ca_biositing/pipeline/docs/ALEMBIC_WORKFLOW.md)

### 3. Web Service (`ca_biositing.webservice`)

FastAPI REST API providing:

- Read access to database records
- Interactive API documentation (Swagger/OpenAPI)
- Type-safe endpoints using Pydantic

**Documentation**:
[`src/ca_biositing/webservice/README.md`](src/ca_biositing/webservice/README.md)

### 4. Deployment Resources (`resources/`)

Docker and Prefect configuration for:

- Service orchestration (Docker Compose)
- Prefect deployments
- Database initialization

**Documentation**: [`resources/README.md`](resources/README.md)

## Adding Dependencies

### For Local Development (Pixi)

```bash
# Add conda package to default environment
pixi add <package-name>

# Add PyPI package to default environment
pixi add --pypi <package-name>

# Add to specific feature (e.g., pipeline)
pixi add --feature pipeline --pypi <package-name>
```

### For ETL Pipeline (Docker)

The pipeline dependencies are managed by Pixi's `etl` environment feature in
`pixi.toml`. When you add dependencies and rebuild Docker images, they are
automatically included:

```bash
# Add dependency to pipeline feature
pixi add --feature pipeline --pypi <package-name>

# Rebuild Docker images
pixi run rebuild-services

# Restart services
pixi run start-services
```

## Environment Management

This project uses **Pixi environments** for different workflows:

-For MacOS, there will be a Python error about faulthandler, which is expected
-and can be ignored, see https://github.com/qgis/QGIS/issues/52987.
- **`default`**: General development, testing, pre-commit hooks
- **`gis`**: QGIS and geospatial analysis tools
- **`etl`**: ETL pipeline (used in Docker containers)
- **`webservice`**: FastAPI web service
- **`frontend`**: Node.js/npm for frontend development

## Frontend Integration
