Salmon Water Risk Backend (Hackathon MVP)

FastAPI backend MVP for detecting water-related environmental risk proxies along salmon migratory paths using Sentinel-2-style inputs.

This repository is optimized for an end-to-end local demo:

local scene ingestion (fully implemented)
raster preprocessing and spectral index calculations
water-only risk scoring (0..1)
heatmap-ready GeoJSON grid output
migration-path-aware risk summaries
3-layer deterministic cache with size cap + eviction
Prithvi model training scaffold (non-blocking)

What Is Implemented Now

Real MVP functionality

Local Sentinel-like ingestion via POST /ingest/sentinel (provider=local)
Band validation (B3, B4, B5, B8)
Raster loading with reprojection to a reference grid when needed
Optional AOI clipping
Spectral proxies:
- chlorophyll/algal proxy: (B5 - B4) / (B5 + B4)
- turbidity proxy: B4 / B3
- NDWI water mask: (B3 - B8) / (B3 + B8)
Water detection modes:
- spectral (default): NDWI + NIR/green + NDVI gates (+ SWIR refinement when B11/B12 are present)
- auto: deterministic local spectral mode (no remote model call)
- pretrained: explicit pretrained attempt, with spectral fallback metadata if unavailable
Water-only risk scoring:
- risk_raw = 0.5*chlorophyll + 0.3*turbidity + 0.2*temperature_proxy
- temperature_proxy is currently a stub (0.0) with an explicit TODO, so its 0.2 weight contributes nothing yet
- normalized risk in 0..1 on water pixels
GeoJSON grid aggregation with risk category metadata for frontend heatmaps
Migration path loading from static GeoJSON + buffered intersection summary
Artifact persistence (GeoTIFF + JSON + GeoJSON) and local registry tracking
Cache policy:
- tiny metadata cache (artifacts/cache/metadata)
- clipped-band cache (artifacts/cache/clipped)
- derived risk cache (artifacts/cache/derived, artifacts/cache/tiles)
- deterministic cache keys and oldest-first eviction
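
The spectral proxies and water-only scoring listed above can be sketched with NumPy. This is a minimal illustration under stated assumptions, not the repository's code: the function names, the NDWI threshold of 0.0, the epsilon guard, and the min-max normalization over water pixels are all assumptions. Only the three index formulas and the 0.5/0.3/0.2 weights come from this README.

```python
import numpy as np

EPS = 1e-6  # assumed guard against division by zero


def normalized_difference(a, b):
    """Return (a - b) / (a + b), using 0 where the denominator is ~0."""
    a = a.astype("float32")
    b = b.astype("float32")
    denom = a + b
    near_zero = np.abs(denom) < EPS
    safe = np.where(near_zero, 1.0, denom)
    return np.where(near_zero, 0.0, (a - b) / safe)


def score_water_risk(b3, b4, b5, b8, ndwi_threshold=0.0):
    """Return (risk, water): risk is 0..1 on water pixels, NaN elsewhere."""
    chlorophyll = normalized_difference(b5, b4)           # (B5 - B4) / (B5 + B4)
    turbidity = b4.astype("float32") / np.maximum(b3.astype("float32"), EPS)  # B4 / B3
    ndwi = normalized_difference(b3, b8)                  # (B3 - B8) / (B3 + B8)
    water = ndwi > ndwi_threshold                         # assumed threshold

    temperature_proxy = 0.0  # stub, as in the MVP
    risk_raw = 0.5 * chlorophyll + 0.3 * turbidity + 0.2 * temperature_proxy

    risk = np.full(risk_raw.shape, np.nan, dtype="float32")
    if water.any():
        values = risk_raw[water]
        low, high = values.min(), values.max()
        span = high - low
        # Assumed normalization: min-max over water pixels only.
        risk[water] = (values - low) / span if span > EPS else 0.0
    return risk, water
```

Land pixels stay NaN so that later aggregation steps can ignore them without a separate mask.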

Scaffolded (future-facing)

provider=sentinel remote ingestion interface exists but is intentionally stubbed
Prithvi integration endpoints exist, with dataset validation and job tracking scaffold
No heavyweight model download/fine-tuning required for MVP run

Architecture

app/
  main.py
  api/
    routes_health.py
    routes_ingest.py
    routes_process.py
    routes_risk.py
    routes_migration.py
    routes_train.py
  core/
    config.py
    constants.py
    logging.py
  models/
    schemas.py
  services/
    container.py
    ingestion/
      base.py
      local_provider.py
      sentinel_provider.py
      service.py
    processing/
      raster.py
      indices.py
      water_mask.py
      risk.py
      grid.py
      service.py
    migration/
      loader.py
      summarizer.py
    training/
      dataset.py
      inference.py
      prithvi.py
      weak_labels.py
    storage/
      metadata_store.py
      cache_manager.py
  utils/
    bands.py
    geospatial.py
scripts/
  build_dataset.py
  tile_dataset.py
  train.py
  infer.py
data/
  raw/
  processed/
  sample/
  migration_paths/
artifacts/
tests/

Cache Strategy (3 Layers)

The backend now uses only these cache layers under artifacts/cache/:

metadata/ (tiny JSON records)

stores scene_id, bbox, date, cloud_cover, source_urls, request_hash
avoids repeated ingestion/search calls for identical requests

clipped/ (clip-first band cache)

stores only corridor/AOI-clipped bands
stores only required Sentinel bands (B3, B4, B5, B8) and optional B2/B11/B12 if present
writes compressed GeoTIFFs (LZW), optional downsample via CLIPPED_CACHE_MAX_DIMENSION
never stores full scenes in clipped cache

derived/ and tiles/ (demo-ready outputs)

derived/*.npz: chlorophyll, turbidity, NDWI, normalized risk arrays
derived/*.summary.json: summary and thresholds
tiles/*.geojson: heatmap-ready polygons
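
The request_hash stored in the metadata layer is what lets identical requests skip ingestion. One way to compute such a hash is sketched below. The function name, the use of SHA-256, and the canonical-JSON approach are assumptions, not the repository's implementation.

```python
import hashlib
import json


def request_hash(payload: dict) -> str:
    """Hash a request payload so that key order does not affect the result."""
    # Sorted keys and fixed separators give one canonical string per payload.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Two payloads with the same fields in a different order map to the same hash, so a lookup in metadata/ can short-circuit a repeated request.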

Deterministic Cache Key

Format:

{dataset}_{date}_{bboxhash}_{resolution}

Example:

sentinel2_2026-03-07_a13f92_native-g32

Model prediction caches append model identity into the resolution fragment, e.g.:

sentinel2_2026-03-07_a13f92_native-g32-m1a2b3c4
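
A key in this format could be built as sketched below. The helper names, the SHA-1 hashing, the six-decimal bbox rounding, and the reading of "g32" as the grid block size are assumptions chosen to mirror the examples above, not the repository's actual code.

```python
import hashlib
from typing import Optional, Sequence


def bbox_hash(bbox: Sequence[float], length: int = 6) -> str:
    """Short hex digest of a bbox; rounding keeps float noise out of the key."""
    canonical = ",".join(f"{value:.6f}" for value in bbox)  # assumed precision
    return hashlib.sha1(canonical.encode("utf-8")).hexdigest()[:length]


def cache_key(
    dataset: str,
    date: str,
    bbox: Sequence[float],
    grid_block_size: int,
    model_id: Optional[str] = None,
) -> str:
    """Build {dataset}_{date}_{bboxhash}_{resolution}."""
    resolution = f"native-g{grid_block_size}"
    if model_id is not None:
        # Model identity is appended to the resolution fragment.
        model_tag = hashlib.sha1(model_id.encode("utf-8")).hexdigest()[:8]
        resolution = f"{resolution}-m{model_tag}"
    return f"{dataset}_{date}_{bbox_hash(bbox)}_{resolution}"
```

Because every input is hashed or formatted deterministically, the same scene, bbox, and grid size always resolve to the same cache entry.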

Eviction Policy

configurable max cache size (CACHE_MAX_SIZE_GB, default 10)
oldest files removed first from the clipped/, derived/, and tiles/ layers
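
A minimal sketch of oldest-first eviction follows. It assumes that file modification time defines "oldest" and that only the evictable layers count toward the cap; the function name and signature are hypothetical.

```python
from pathlib import Path


def evict_oldest(
    cache_root: Path,
    max_bytes: int,
    layers: tuple[str, ...] = ("clipped", "derived", "tiles"),
) -> list[Path]:
    """Delete the oldest files until the evictable layers fit in max_bytes."""
    files = []
    for layer in layers:
        layer_dir = cache_root / layer
        if layer_dir.is_dir():
            files.extend(p for p in layer_dir.rglob("*") if p.is_file())

    files.sort(key=lambda p: p.stat().st_mtime)  # oldest first (assumed: mtime)
    total = sum(p.stat().st_size for p in files)

    removed = []
    for path in files:
        if total <= max_bytes:
            break
        total -= path.stat().st_size
        path.unlink()
        removed.append(path)
    return removed
```

With the default CACHE_MAX_SIZE_GB of 10, the cap would be passed as 10 * 1024**3 bytes. The metadata/ layer is left untouched here because its records are tiny.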

Data Flow

Scene ingestion (/ingest/sentinel)
Scene metadata registration in local JSON registry
Band validation and loading
Raster alignment/reprojection and optional AOI clipping
Chlorophyll proxy computation
Turbidity proxy computation
NDWI and water mask generation
Water-only filtering
Raw + normalized risk computation (0..1)
Grid aggregation to heatmap-ready GeoJSON features
Optional migration-path buffered intersection summary
API delivery of summaries/artifact references/grid features
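
The grid aggregation step can be sketched as a block mean over the normalized risk raster, with one GeoJSON polygon per block. This is an illustration only: the function name, the north-up geotransform arguments, and the property names (mean_risk, water_pixels) are assumptions.

```python
import numpy as np


def grid_to_geojson(risk, block, origin_x, origin_y, pixel_size):
    """Aggregate a risk raster (NaN = non-water) into block polygons.

    Assumes a north-up raster whose top-left corner is (origin_x, origin_y).
    """
    features = []
    rows, cols = risk.shape
    for r0 in range(0, rows, block):
        for c0 in range(0, cols, block):
            r1, c1 = min(r0 + block, rows), min(c0 + block, cols)
            cell = risk[r0:r1, c0:c1]
            valid = cell[~np.isnan(cell)]
            if valid.size == 0:
                continue  # no water pixels in this block

            west = origin_x + c0 * pixel_size
            east = origin_x + c1 * pixel_size
            north = origin_y - r0 * pixel_size
            south = origin_y - r1 * pixel_size
            # Counterclockwise exterior ring, closed on the first vertex.
            ring = [[west, south], [east, south], [east, north],
                    [west, north], [west, south]]
            features.append({
                "type": "Feature",
                "geometry": {"type": "Polygon", "coordinates": [ring]},
                "properties": {
                    "mean_risk": float(valid.mean()),
                    "water_pixels": int(valid.size),
                },
            })
    return {"type": "FeatureCollection", "features": features}
```

Blocks without any water pixel are skipped, which keeps the output small for scenes that are mostly land.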

Dataset-First Training Flow

Order used in this repo:

scripts/build_dataset.py

input: local scene assets + bbox
output: scene-level arrays (4-band, risk_norm, water_mask, binary_label)
writes manifests and metadata for repeatability

scripts/tile_dataset.py

converts scene arrays into train/val tiles (default 256x256)
keeps tiles with sufficient water and a balanced positive/negative mix
writes tiles_manifest.jsonl

scripts/train.py

trains a lightweight binary segmentation baseline on tiled weak labels
target is label = (risk_norm >= 0.65) (water-aware)
--model prithvi-head is scaffolded and currently falls back to the baseline model
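
The weak-label target and the water-based tile filter can be sketched as follows. The function names and the min_water_fraction parameter are hypothetical, and the positive/negative balancing done by scripts/tile_dataset.py is not reproduced here.

```python
import numpy as np


def weak_label(risk_norm, water_mask, threshold=0.65):
    """label = (risk_norm >= threshold), restricted to water pixels."""
    return ((risk_norm >= threshold) & water_mask).astype(np.uint8)


def select_tiles(water_mask, tile=256, stride=256, min_water_fraction=0.1):
    """Return (row, col) offsets of tiles with enough water.

    min_water_fraction is an assumed cutoff, not a documented default.
    """
    offsets = []
    rows, cols = water_mask.shape
    for r in range(0, rows - tile + 1, stride):
        for c in range(0, cols - tile + 1, stride):
            window = water_mask[r:r + tile, c:c + tile]
            if window.mean() >= min_water_fraction:
                offsets.append((r, c))
    return offsets
```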

Risk Thresholds

Configured in environment/config (not hardcoded throughout code):

blue: baseline/low
yellow: risk >= 0.30
red: risk >= 0.65
infrared: risk >= 0.85
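
Mapping a normalized risk value to one of these categories could look like the sketch below. The function name and the way thresholds are passed in are assumptions; in the backend the values come from environment/config.

```python
# Thresholds from the list above, checked from highest to lowest.
DEFAULT_THRESHOLDS = (("infrared", 0.85), ("red", 0.65), ("yellow", 0.30))


def risk_category(risk: float, thresholds=DEFAULT_THRESHOLDS) -> str:
    """Return the highest category whose minimum the risk value reaches."""
    for name, minimum in thresholds:
        if risk >= minimum:
            return name
    return "blue"  # baseline/low
```

For example, risk_category(0.65) returns "red" and risk_category(0.1) returns "blue".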

Setup

Create and activate a Python 3.11+ environment.
Install dependencies:

pip install -r requirements.txt

For Python 3.13 environments, use:

pip install -r requirements-py313.txt

Copy env defaults:

cp .env.example .env

Run API:

uvicorn app.main:app --reload --host 0.0.0.0 --port 8000

Optional for local training script:

python -m pip install torch

Optional for pretrained water detector (6-band Sentinel scenes):

python -m pip install geoai-py

Then set in .env:

WATER_DETECTOR_MODE=pretrained
PRETRAINED_WATER_MODEL_REPO_ID=geoai4cities/sentinel2-water-segmentation
HUGGINGFACE_TOKEN=hf_xxx_your_read_token

API Endpoints

GET /health
GET /migration-paths
POST /ingest/sentinel
POST /process/scene
POST /risk/score
POST /risk/predict
GET /risk/tiles
POST /train/prithvi
GET /train/status

Example Calls

1) Ingest local Sentinel-like scene

curl -X POST http://localhost:8000/ingest/sentinel \
  -H "Content-Type: application/json" \
  -d '{
    "provider": "local",
    "local": {
      "scene_name": "demo-scene",
      "assets": {
        "B3": "data/sample/B3.tif",
        "B4": "data/sample/B4.tif",
        "B5": "data/sample/B5.tif",
        "B8": "data/sample/B8.tif"
      }
    }
  }'

2) Process scene to produce artifacts and grid

curl -X POST http://localhost:8000/process/scene \
  -H "Content-Type: application/json" \
  -d '{
    "scene_id": "scene-abc123",
    "include_grid": true,
    "grid_block_size": 32,
    "migration_path_id": "columbia-mainstem"
  }'
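
When migration_path_id is set, the response includes a buffered intersection summary. The idea can be sketched in pure Python as selecting grid cells whose centroid lies within a buffer distance of the path and summarizing their risk. This assumes projected coordinates in consistent units; the function names and the summary fields are hypothetical, and the backend may use a geometry library instead.

```python
import math


def point_segment_distance(px, py, ax, ay, bx, by):
    """Shortest distance from point P to segment AB."""
    dx, dy = bx - ax, by - ay
    seg_len_sq = dx * dx + dy * dy
    if seg_len_sq == 0.0:
        return math.hypot(px - ax, py - ay)
    # Project P onto AB and clamp to the segment.
    t = max(0.0, min(1.0, ((px - ax) * dx + (py - ay) * dy) / seg_len_sq))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def summarize_along_path(path, cells, buffer_distance):
    """path: list of (x, y); cells: list of (x, y, risk) centroids."""
    hits = []
    for x, y, risk in cells:
        nearest = min(
            point_segment_distance(x, y, *a, *b)
            for a, b in zip(path, path[1:])
        )
        if nearest <= buffer_distance:
            hits.append(risk)
    if not hits:
        return {"cell_count": 0, "mean_risk": None, "max_risk": None}
    return {
        "cell_count": len(hits),
        "mean_risk": sum(hits) / len(hits),
        "max_risk": max(hits),
    }
```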

3) Fetch heatmap-ready GeoJSON

curl "http://localhost:8000/risk/tiles?processed_scene_id=proc-abc123"

3b) Predict with trained model and then fetch model tiles

curl -X POST http://localhost:8000/risk/predict \
  -H "Content-Type: application/json" \
  -d '{
    "scene_id": "scene-abc123",
    "model_checkpoint": "artifacts/models/weakrisk_baseline/best.pt",
    "grid_block_size": 32
  }'

Then request tiles using model source (same frontend shape):

curl "http://localhost:8000/risk/tiles?source=model&prediction_id=pred-abc123"

4) Build weak-label dataset sample

python scripts/build_dataset.py \
  --scene-dir data/sample \
  --bbox -123.2,49.1,-122.8,49.4

5) Tile dataset for training

python scripts/tile_dataset.py \
  --input-root data/processed \
  --tile-size 256 \
  --stride 256

6) Train baseline segmentation model

python scripts/train.py \
  --manifest data/processed/tiles_manifest.jsonl \
  --epochs 10 \
  --batch-size 8

7) CLI inference with cache-aware model prediction

python scripts/infer.py \
  --scene-id scene-abc123 \
  --model-checkpoint artifacts/models/weakrisk_baseline/best.pt

Environment Variables

See .env.example for full list. Core settings include:

app runtime (APP_NAME, ENVIRONMENT, HOST, PORT)
storage and registry paths
cache cap and clipping controls (CACHE_MAX_SIZE_GB, CLIPPED_CACHE_MAX_DIMENSION)
migration path file path
NDWI and heatmap thresholds
water detector mode and gates (WATER_DETECTOR_MODE, WATER_NIR_TO_GREEN_RATIO_MAX, WATER_NDVI_MAX)
optional pretrained water model repo id (PRETRAINED_WATER_MODEL_REPO_ID)
Sentinel API placeholders (stub integration)
Prithvi model name/flags (scaffold)

Local Data Assumptions

Local ingestion is the primary MVP path.
Input assets are GeoTIFF bands with keys mappable to B3, B4, B5, B8.
Bands should cover the same area; mismatch is handled with reprojection to a reference grid.
For pretrained water detection, provide additional Sentinel bands B2, B11, and B12.
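
The band requirements above can be checked before any raster is opened. The sketch below is an assumption about how keys might be mapped (case-insensitive, leading zeros dropped, so "b03" becomes "B3"); the function names are hypothetical.

```python
import re

REQUIRED_BANDS = ("B3", "B4", "B5", "B8")
PRETRAINED_EXTRA_BANDS = ("B2", "B11", "B12")


def normalize_band_key(key: str) -> str:
    """Map keys such as 'b03' or 'B8' to canonical names ('B3', 'B8')."""
    match = re.fullmatch(r"B0*(\d+)(A?)", key.strip().upper())
    if match is None:
        raise ValueError(f"unrecognized band key: {key!r}")
    return f"B{match.group(1)}{match.group(2)}"


def validate_assets(assets: dict[str, str], pretrained: bool = False) -> dict[str, str]:
    """Return assets keyed by canonical band name, or raise if bands are missing."""
    normalized = {normalize_band_key(k): v for k, v in assets.items()}
    needed = REQUIRED_BANDS + (PRETRAINED_EXTRA_BANDS if pretrained else ())
    missing = [band for band in needed if band not in normalized]
    if missing:
        raise ValueError(f"missing required bands: {', '.join(missing)}")
    return normalized
```

Failing early with the list of missing bands gives a clearer error than a raster read failure later in processing.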

Testing

Run:

pytest -q

Tests include:

health endpoint
index behavior
risk normalization
migration path loading
ingestion + processing API smoke test with synthetic GeoTIFFs

Limitations / TODOs

Remote Sentinel ingestion (provider=sentinel) is scaffolded, not production-integrated.
Temperature proxy is a stub (0.0) to keep MVP deterministic and lightweight.
Prithvi training endpoint is scaffolded; script-level --model prithvi-head currently falls back to baseline until backbone integration is completed.
/risk/tiles returns GeoJSON grid features (not an XYZ tile server), which is intentional for fast frontend integration.
Pretrained water detection requires the optional dependency geoai-py and 6-band Sentinel input (B2, B3, B4, B8, B11, B12).
Metadata cache uses JSON files for hackathon simplicity; SQLite is an easy future swap if concurrency grows.
