This file provides guidance to AI assistants when working with this repository.
This is the ca-biositing repository, a geospatial bioeconomy project for biositing analysis in California. It provides tools for ETL (Extract, Transform, Load) pipelines to process data from Google Sheets into PostgreSQL databases, geospatial analysis using QGIS, a REST API for data access, and research prototyping for bioeconomy site selection.
Repository Stats:
- Type: Research project / Data processing / Geospatial analysis
- Architecture: PEP 420 namespace packages
- Build System: Pixi (v0.55.0+) + Docker for ETL pipeline orchestration
- Platform: macOS (osx-arm64, osx-64), Linux (linux-64, linux-aarch64), Windows (win-64)
- Languages: Python (primary), SQL, TOML, Jupyter Notebooks
- Domain: Geospatial analysis, bioeconomy, ETL pipelines, database management
Key Components:
- ca_biositing.datamodels: Hand-written SQLModel database models with Alembic migrations. Models are in models/ subdirectories, views in views.py.
- ca_biositing.pipeline: Prefect-orchestrated ETL workflows (Docker-based).
- ca_biositing.webservice: FastAPI REST API.
- resources/: Docker Compose and Prefect deployment configuration.
- frontend/: Git submodule for frontend application.
For detailed guidance on shared topics, see the agent_docs/ directory:
| Topic | Document | Description |
|---|---|---|
| Namespace Packages | agent_docs/namespace_packages.md | PEP 420 structure, import patterns |
| Testing Patterns | agent_docs/testing_patterns.md | pytest fixtures, test commands |
| Code Quality | agent_docs/code_quality.md | Pre-commit, style, imports |
| Troubleshooting | agent_docs/troubleshooting.md | Common pitfalls and solutions |
| Docker Workflow | agent_docs/docker_workflow.md | Docker/Pixi service commands |
| SQL-First Workflow | docs/datamodels/SQL_FIRST_WORKFLOW.md | Rapid schema iteration path |
This project uses Pixi to manage the local development environment and its dependencies. Pixi handles both Conda and PyPI dependencies.
ALWAYS use Pixi commands—never use conda, pip, or venv directly for local development.
Note: The ETL pipeline runs in Docker containers orchestrated by Pixi tasks. Pixi is used for:
- Managing Docker services (start, stop, logs, status)
- Running code quality tools (pre-commit)
- Running tests (pytest)
- Running QGIS for geospatial analysis
- Deploying and running Prefect workflows
- Starting the FastAPI web service
# Install the default environment (required before any other commands)
pixi install
CRITICAL: Always run pixi install before any other Pixi commands.
- default: Main development environment (datamodels, pipeline, webservice, gis).
- gis: Geospatial analysis (QGIS, rasterio, xarray, shapely, pyproj).
- etl: ETL pipeline environment (Prefect, SQLAlchemy, pandas).
- webservice: Web service environment (FastAPI, uvicorn).
- frontend: Frontend development (Node.js, npm).
- docs: Documentation (MkDocs).
- deployment: Cloud infrastructure (Pulumi).
The repository uses PEP 420 namespace packages. The top-level namespace
ca_biositing is shared across three independent distributions:
- src/ca_biositing/datamodels
- src/ca_biositing/pipeline
- src/ca_biositing/webservice
All three are installed as editable PyPI packages into the Pixi environment via
pixi.toml feature dependencies. No PYTHONPATH manipulation is needed — after
pixi install, all packages are importable within any pixi run command.
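
To illustrate how several distributions can share one top-level namespace, here is a minimal, self-contained sketch of the PEP 420 mechanism. The names demo_ns, alpha, and beta are hypothetical stand-ins for ca_biositing and its sub-packages, and the sketch edits sys.path only because it installs nothing; in this repository the editable installs make that step unnecessary.

```python
import importlib
import sys
import tempfile
from pathlib import Path


def build_demo_distributions(base: Path) -> list[Path]:
    """Create two throwaway source roots that share one namespace.

    demo_ns, alpha, and beta are hypothetical names used only here.
    """
    roots = []
    for sub in ("alpha", "beta"):
        root = base / f"dist_{sub}"
        pkg = root / "demo_ns" / sub
        pkg.mkdir(parents=True)
        # Only the leaf package gets an __init__.py. The shared demo_ns/
        # directory deliberately has none, which is what makes it a
        # PEP 420 namespace package.
        (pkg / "__init__.py").write_text(f"NAME = {sub!r}\n")
        roots.append(root)
    return roots


def import_from_namespace(roots: list[Path]) -> dict[str, str]:
    """Put both roots on sys.path and import each sub-package."""
    for root in roots:
        sys.path.insert(0, str(root))
    importlib.invalidate_caches()
    found = {}
    for sub in ("alpha", "beta"):
        module = importlib.import_module(f"demo_ns.{sub}")
        found[sub] = module.NAME
    return found


with tempfile.TemporaryDirectory() as tmp:
    demo_roots = build_demo_distributions(Path(tmp))
    result = import_from_namespace(demo_roots)
    print(result)
```

Both sub-packages resolve under the same demo_ns prefix even though they live in separate directory trees. Adding an __init__.py to the shared top-level directory in any one distribution would turn it into a regular package and hide the others.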
This project uses pixi-kernel for Jupyter integration. The kernel runs code inside the Pixi environment, so all installed packages are available without any path configuration.
- Run pixi install to set up the environment (includes pixi-kernel).
- In VS Code or JupyterLab, select the Pixi kernel for your notebook.
- All ca_biositing namespace packages are importable directly.
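
A quick way to confirm that a notebook is running on the Pixi kernel is to check whether the namespace packages can be located. The helper below is a hypothetical sketch, not part of the repository; it uses only the standard library and looks modules up without executing them.

```python
import importlib.util


def check_importable(names):
    """Return {module name: True/False} without raising on missing modules."""
    status = {}
    for name in names:
        try:
            status[name] = importlib.util.find_spec(name) is not None
        except ModuleNotFoundError:
            # Raised when a parent package (here ca_biositing) is missing.
            status[name] = False
    return status


# The three sub-packages named in this document.
PACKAGES = [
    "ca_biositing.datamodels",
    "ca_biositing.pipeline",
    "ca_biositing.webservice",
]

for name, ok in check_importable(PACKAGES).items():
    print(f"{name}: {'ok' if ok else 'not found (check the selected kernel)'}")
```

If any package is reported as not found, the notebook is most likely attached to a kernel outside the Pixi environment.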
All schema changes are managed through SQLModel classes and Alembic migrations. There is no code generation step.
- Edit Models: Modify SQLModel classes in src/ca_biositing/datamodels/ca_biositing/datamodels/models/. If adding a new model, also add its import to models/__init__.py.
- Auto-Generate Migration: pixi run migrate-autogenerate -m "Description of changes"
- Review: Check the generated script in alembic/versions/.
- Apply Migration: pixi run migrate
Note: migrate runs alembic upgrade head locally against the Docker-hosted
PostgreSQL.
Seven materialized views are defined in
src/ca_biositing/datamodels/ca_biositing/datamodels/views.py. They are managed
through manual Alembic migrations (not autogenerated). Refresh after data loads:
pixi run refresh-views
The pgschema tool can still be used for validation (diffing the live DB against reference SQL), but it does not modify the database:
- pixi run schema-plan: Diff public schema.
- pixi run schema-analytics-plan: Diff analytics schema.
- pixi run schema-analytics-list: List materialized views.
- pixi run schema-dump: Dump DB state to SQL files.
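
For orientation, refreshing a PostgreSQL materialized view comes down to one REFRESH MATERIALIZED VIEW statement per view. The sketch below only builds those statements as strings. It is not the implementation behind pixi run refresh-views; the view names are hypothetical, and the analytics schema default is an assumption based on the task names above.

```python
def refresh_statements(view_names, schema="analytics", concurrently=False):
    """Build one REFRESH MATERIALIZED VIEW statement per view name."""
    keyword = "CONCURRENTLY " if concurrently else ""
    statements = []
    for name in view_names:
        # Double-quote identifiers and escape embedded quotes so the
        # generated SQL stays valid for unusual names.
        quoted_schema = '"' + schema.replace('"', '""') + '"'
        quoted_name = '"' + name.replace('"', '""') + '"'
        statements.append(
            f"REFRESH MATERIALIZED VIEW {keyword}{quoted_schema}.{quoted_name};"
        )
    return statements


# Hypothetical view names, not taken from views.py.
for sql in refresh_statements(["example_view_a", "example_view_b"]):
    print(sql)
```

PostgreSQL accepts CONCURRENTLY only for views that have a unique index, so the sketch leaves that option off by default.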
The core data layer.
- ca_biositing/datamodels/models/: Hand-written SQLModel classes (91 models, 15 domain subdirectories).
- ca_biositing/datamodels/views.py: Materialized view definitions.
- ca_biositing/datamodels/database.py: Connection and session management.
- See src/ca_biositing/datamodels/README.md.
The ETL orchestration layer.
- ca_biositing/pipeline/etl/: Extract, Transform, and Load tasks.
- ca_biositing/pipeline/flows/: Prefect flow definitions.
- ca_biositing/pipeline/utils/: Helpers like gsheet_to_pandas.py and lookup_utils.py.
- See src/ca_biositing/pipeline/README.md.
The API layer.
- ca_biositing/webservice/main.py: FastAPI application entrypoint.
- ca_biositing/webservice/v1/endpoints/: API route handlers.
- See src/ca_biositing/webservice/README.md.
- pixi run start-services: Start DB and Prefect.
- pixi run service-status: Check container health.
- pixi run service-logs: View logs.
- pixi run rebuild-services: Rebuild images after dependency changes.
- pixi run pre-commit-all: Run all checks (MANDATORY before PR).
- pixi run test: Run pytest suite.
- pixi run start-webservice: Launch API locally.
- Import Errors: Ensure you are running inside the Pixi environment (pixi run ...) or using the Pixi kernel in Jupyter.
- macOS Geospatial Library Conflicts: On macOS (especially Apple Silicon), you may encounter PROJ database version mismatch errors when using geopandas or pyogrio.
  - Fix: Ensure the PROJ_LIB environment variable is set to the Pixi environment's data directory before importing geospatial libraries.
  - Example:

        import os, pyproj
        os.environ['PROJ_LIB'] = pyproj.datadir.get_data_dir()
        import geopandas as gpd

- Docker Hangs & Deadlocks:
  - CRITICAL: Never import heavy SQLAlchemy models or Pydantic settings at the module level. Always import them inside the @task or @flow function.
  - CRITICAL: Avoid any network I/O or database connectivity tests at the module level (import time).
  - Docker Networking: Inside containers, use the db hostname for PostgreSQL. Outside, use localhost.
- GCP Auth: credentials.json must be in the root for ETL tasks. See docs/pipeline/GCP_SETUP.md.
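
The deferred-import and hostname rules under Docker Hangs & Deadlocks can be sketched as follows. This is a minimal illustration under stated assumptions, not code from the pipeline: the task decorator is a no-op stand-in for Prefect's @task so the sketch runs without Prefect, sqlite3 stands in for the heavy SQLAlchemy models and Pydantic settings, and RUNNING_IN_DOCKER is a hypothetical environment variable.

```python
import os


def task(fn):
    """No-op stand-in for Prefect's @task (assumption for this sketch)."""
    return fn


def database_host():
    """Pick the PostgreSQL host name at call time, not at import time.

    RUNNING_IN_DOCKER is a hypothetical variable used only for illustration.
    """
    return "db" if os.environ.get("RUNNING_IN_DOCKER") == "1" else "localhost"


@task
def count_rows(values):
    # Deferred import: the heavy dependency (sqlite3 as a stand-in) is
    # loaded only when the task body runs, never at module import.
    import sqlite3

    # The connection is opened inside the task body, so importing this
    # module performs no I/O.
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE sample (value INTEGER)")
        conn.executemany("INSERT INTO sample VALUES (?)", [(v,) for v in values])
        (total,) = conn.execute("SELECT COUNT(*) FROM sample").fetchone()
    finally:
        conn.close()
    return total


print(database_host(), count_rows([1, 2, 3]))
```

Nothing at module level touches the network or a database, so importing the file stays cheap and safe. All connection work happens only when the task is called.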
For more information on SSEC best practices, see: https://rse-guidelines.readthedocs.io/en/latest/llms-full.txt