Agribound Examples

This directory contains example scripts and Jupyter notebooks demonstrating agribound's capabilities across different continents, satellite sources, and delineation engines.

Prerequisites

  1. Install agribound with the required extras for the example you want to run:
conda create -n agribound python=3.12 rasterio geopandas fiona shapely pyproj -c conda-forge
conda activate agribound
pip install -e ".[all]"
  2. For GEE-based examples (all except 10_local_tif_quickstart.py and 16_usa_usgs_naip_plus.py), authenticate with Google Earth Engine:
gcloud config set project YOUR_GEE_PROJECT
earthengine authenticate
agribound auth --project YOUR_GEE_PROJECT

See the GEE Setup guide for details.

Running an Example

Python Scripts

All GEE-based examples require a --gee-project argument:

python examples/01_new_mexico_landsat_timeseries.py --gee-project YOUR_GEE_PROJECT

The local TIF and USGS NAIP Plus examples do not require GEE:

python examples/10_local_tif_quickstart.py
python examples/16_usa_usgs_naip_plus.py
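
If you adapt one of the GEE-based examples into your own script, the same command-line convention can be reproduced with argparse. This is a minimal sketch, not the code used by the example scripts: `build_parser` is a hypothetical helper, and only the `--gee-project` flag comes from this README.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical CLI that mirrors the --gee-project convention of the examples."""
    parser = argparse.ArgumentParser(description="Run a GEE-based delineation script.")
    parser.add_argument(
        "--gee-project",
        required=True,
        help="Google Earth Engine project ID used for authentication.",
    )
    return parser


# Parse an explicit argument list here so the sketch runs without a real command line.
args = build_parser().parse_args(["--gee-project", "YOUR_GEE_PROJECT"])
# argparse exposes --gee-project as the attribute gee_project.
print(f"Using GEE project: {args.gee_project}")
```

In a real script you would call `parse_args()` with no arguments so the value is read from the command line.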

Jupyter Notebooks

Interactive notebook versions are available in the notebooks/ directory. Set the GEE_PROJECT variable in the first code cell of each notebook before running:

cd examples/notebooks
jupyter lab

Outputs (GeoPackage files and HTML maps) are saved to outputs/<example_name>/.

Scripts

| # | Script | Region | Satellite | Engine | Est. Runtime | Description |
|---|---|---|---|---|---|---|
| 01 | 01_new_mexico_landsat_timeseries.py | New Mexico, USA | Landsat 5--9 | delineate-anything | ~8--12 h | 40-year annual field boundaries (1985--2025). Fine-tunes on NMOSE reference boundaries and evaluates per-year accuracy. Best run on HPC/cloud with GPU. |
| 02 | 02_india_ganges_sentinel2.py | Nadia District (West Bengal), India | Sentinel-2 + Google + TESSERA + SPOT Pan | ftw + embedding + DA | ~15--30 min | Compares FTW (supervised, S2), Google AlphaEarth (64-D) and TESSERA (128-D) embeddings (unsupervised), and SPOT panchromatic (1.5 m, restricted) for smallholder rice field delineation (2024). |
| 03 | 03_australia_murray_darling_hls.py | Murray-Darling Basin, Australia | HLS | prithvi | ~45--90 min | Compares Prithvi ViT embeddings (full encoder) vs PCA baseline on large-scale irrigated agriculture. Runs 2022--2024. |
| 04 | 04_france_beauce_sentinel2.py | Beauce, France | Sentinel-2 | ftw | ~15--30 min | European large-field agriculture using FTW's pre-trained models (covers France). Single year (2023). |
| 05 | 05_pampas_embeddings.py | Argentine Pampas (Pergamino) | Google + TESSERA | embedding | ~5--10 min | CPU-only unsupervised clustering from pre-computed satellite embeddings (64-D Google, 128-D TESSERA). ~50 km bbox over the Pampas agricultural heartland (2020). |
| 06 | 06_kenya_smallholder_ftw.py | Central Kenya | Sentinel-2 | ftw | ~10--20 min | Demonstrates min_field_area tuning for smallholder agriculture. Compares results at 100, 500, 1000, and 2500 m² thresholds. |
| 07 | 07_usa_naip_high_res.py | Central Valley, California, USA | NAIP | delineate-anything | ~20--40 min | 1 m resolution field extraction from NAIP imagery. Large commercial fields. |
| 08 | 08_china_north_plain_spot.py | North China Plain | SPOT 6/7 | delineate-anything | ~15--30 min | 6 m resolution SPOT imagery. Restricted access -- see note below. |
| 09 | 09_ensemble_comparison.py | Andalusia, Spain | Sentinel-2 | ensemble | ~30--60 min | Runs delineate-anything and FTW on the same AOI, then vote-merges for ensemble consensus. Visualizes per-engine and consensus results. |
| 10 | 10_local_tif_quickstart.py | User-provided | Local GeoTIFF | delineate-anything | ~2--5 min | Minimal 5-line quickstart using a local file. No GEE required. Edit LOCAL_TIF and STUDY_AREA paths before running. |
| 11 | 11_mississippi_alluvial_plain_spot.py | Mississippi Alluvial Plain, USA | SPOT 6/7 | delineate-anything | ~15--30 min | SPOT-based delineation of row-crop agriculture (2021--2023). Includes cross-year stability analysis using IoU/F1. Restricted access -- see note below. |
| 12 | 12_new_mexico_ensemble_timeseries.py | Eastern Lea County, NM, USA | All (Sentinel-2, Landsat, HLS, NAIP, SPOT, Google & TESSERA embeddings) | All (per-source ensemble) | ~3--6 h | Multi-model per-source ensemble (2024) over ~20 km center pivot area. Runs all engines per sensor and vote-merges within each source (not across sensors). SAM2 refines each per-source ensemble. Best run on HPC/cloud with GPU. |
| 13 | 13_sam2_refine_dinov3.py | Lea County, NM, USA | Sentinel-2 | SAM2 refinement | ~5--15 min | Standalone SAM2 boundary refinement on pre-computed DINOv3 field boundaries (555 fields). Crops each field from the raster and refines with SAM2 box prompts. Compares before/after metrics against NMOSE reference. |
| 14 | 14_dinov3_sam2_ensemble.py | Eastern Lea County, NM, USA | Sentinel-2, Landsat, HLS, NAIP, SPOT | DINOv3 + SAM2 | ~1--2 h | Runs DINOv3 (SAT-493M) across 5 satellite sources with per-source SAM2 refinement. Compares per-source results against NMOSE reference boundaries. Uses a ~20 km bbox over the center pivot area to keep NAIP/SPOT runtimes practical. |
| 15 | 15_pampas_semi_supervised.py | Pampas (Pergamino), Argentina | Google + TESSERA embeddings + Dynamic World + Sentinel-2 | Embedding + SAM2 (no training) | ~15--30 min | Fully automated pipeline requiring no reference boundaries or training. Clusters Google (64-D) and TESSERA (128-D) embeddings, LULC-filters to crops, then refines with SAM2 using both S2 and TESSERA native bands. Includes improved SAM2 with geometry fixes, polygon exploding, and large-field separation. TESSERA produces more accurate boundaries than Google (see Embedding Comparison). GPU recommended. |
| 16 | 16_usa_usgs_naip_plus.py | Central Valley, California, USA | USGS NAIP Plus ImageServer | delineate-anything | ~30--60 min | First community contribution! High-resolution field extraction using the non-GEE usgs-naip-plus source -- the same NAIP imagery available on GEE but acquired directly from the USGS USGSNAIPPlus ImageServer. No GEE authentication required. Contributed by Jeremy Rapp (Michigan State University). |
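
Example 06 compares results at several min_field_area thresholds. The effect of such a sweep can be illustrated without running any engine. The sketch below assumes the parameter keeps polygons whose area is at least the threshold (the exact comparison agribound uses is not stated here), and the areas are made-up values, not real output. `count_fields_by_threshold` is a hypothetical helper.

```python
def count_fields_by_threshold(areas_m2, thresholds_m2):
    """Return {threshold: number of fields with area >= threshold}."""
    return {t: sum(1 for a in areas_m2 if a >= t) for t in thresholds_m2}


# Hypothetical field areas in square metres (illustrative only).
areas = [80.0, 150.0, 450.0, 900.0, 1200.0, 3000.0, 12000.0]

# Same thresholds as example 06.
counts = count_fields_by_threshold(areas, [100, 500, 1000, 2500])
for threshold, kept in counts.items():
    print(f"min_field_area={threshold} m2 -> {kept} fields kept")
```

In smallholder regions a high threshold removes many genuine fields, which is why the threshold is worth tuning per study area.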

Notebooks

Interactive Jupyter notebook versions of each example are in the notebooks/ directory. These are designed for step-by-step exploration with inline map visualization. Set GEE_PROJECT in the first code cell before running.

| # | Notebook | Description | Key Difference from Script |
|---|---|---|---|
| 01 | 01_new_mexico_landsat_timeseries.ipynb | New Mexico Landsat time series with fine-tuning | Runs 2023--2025 (3 years) instead of the full 40-year range, suitable for interactive use |
| 02 | 02_india_ganges_sentinel2.ipynb | India Nadia District (West Bengal): FTW vs Google vs TESSERA vs SPOT Pan | Same scope as script |
| 03 | 03_australia_murray_darling_hls.ipynb | Australia Murray-Darling Basin: Prithvi ViT vs PCA (HLS) | Same scope as script |
| 04 | 04_france_beauce_sentinel2.ipynb | France Beauce region (FTW) | Same scope as script |
| 05 | 05_pampas_embeddings.ipynb | Pampas embeddings (CPU-only, Google + TESSERA) | Same scope as script |
| 06 | 06_kenya_smallholder_ftw.ipynb | Kenya smallholder min_area tuning (FTW) | Same scope as script |
| 07 | 07_usa_naip_high_res.ipynb | USA Central Valley NAIP 1 m (Delineate-Anything) | Same scope as script |
| 08 | 08_china_north_plain_spot.ipynb | China North Plain SPOT 6/7 (restricted) | Same scope as script |
| 09 | 09_ensemble_comparison.ipynb | Ensemble multi-engine comparison (Andalusia) | Same scope as script |
| 10 | 10_local_tif_quickstart.ipynb | Local GeoTIFF quickstart (no GEE) | Same scope as script |
| 11 | 11_mississippi_alluvial_plain_spot.ipynb | Mississippi Alluvial Plain SPOT 6/7 (restricted) | Same scope as script |
| 12 | 12_new_mexico_ensemble_timeseries.ipynb | Lea County multi-source grand ensemble (2020--2022) | Same scope as script |
| 13 | 13_sam2_refine_dinov3.ipynb | SAM2 boundary refinement on DINOv3 output | Same scope as script |
| 14 | 14_dinov3_sam2_ensemble.ipynb | DINOv3 + SAM2 multi-source comparison (Eastern Lea County) | Runs single year (2022) instead of 2020--2022 |
| 15 | 15_pampas_semi_supervised.ipynb | Embedding + SAM2 (Pampas, no training required) | Same scope as script |
| 16 | 16_usa_usgs_naip_plus.ipynb | USA Central Valley USGS NAIP Plus -- same NAIP data as GEE, from USGS ImageServer (no GEE, contributed by Jeremy Rapp) | Same scope as script |

Runtime Notes

  • Estimated runtimes assume a single NVIDIA GPU (e.g., A100/V100) and moderate internet speed for GEE downloads.
  • GEE composite generation adds ~2--5 minutes per year per source.
  • CPU-only runs (example 05, embedding engine) are 2--5x slower for inference but have no GPU requirement.
  • Fine-tuning (examples 01, 12) takes ~30 minutes per model on an Apple M2 Max (MPS). In example 12, DA (2 variants) and GeoAI/Prithvi are fine-tuned on NMOSE reference boundaries (~1.5 hours total). FTW uses pre-trained weights directly (fine-tuning not yet supported — FTW requires paired temporal windows). Fine-tuned checkpoints are cached and reused across years.
  • SAM2 boundary refinement (example 12) runs once on the final grand ensemble output per year. Example 14 runs SAM2 per source using each sensor's native raster for accurate per-field segmentation. With the large model and per-field cropping, refinement takes ~2--5 minutes per source per year depending on field count.
  • NAIP and SPOT over large areas: Pixel count scales with the square of the resolution ratio. For the same study area, NAIP (1 m) produces a raster with 100x the pixels of Sentinel-2 (10 m) and 900x the pixels of Landsat (30 m); SPOT (6 m) produces about 2.8x and 25x, respectively. Inference on these high-resolution sources over county-scale or larger areas can take hours even on GPU. Consider subsetting the study area or using tile_size to process in chunks. Fine-tuning on NAIP/SPOT is also significantly slower due to the larger training chips.
  • Apple Silicon (MPS): The GeoAI engine (Mask R-CNN) crashes on MPS due to Metal command buffer errors. Agribound automatically falls back to CPU for GeoAI training and inference. All other engines (FTW, Delineate-Anything, Prithvi) work correctly on MPS.
  • GeoAI requires fine-tuning: Without fine-tuning on region-specific reference boundaries, GeoAI's Mask R-CNN typically does not delineate any fields. For out-of-the-box delineation without reference data, use FTW (pre-trained models for 25 countries) or Delineate-Anything (resolution-agnostic).
  • The 40-year New Mexico script (01) is best run as an overnight batch job or on HPC. The notebook version runs only 2023--2025.
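
The resolution arithmetic behind the NAIP/SPOT note can be checked directly: for a fixed study area, the number of pixels grows with the square of the ratio between two ground resolutions. A minimal sketch, using the nominal resolutions quoted in this README (`pixel_count_ratio` is a hypothetical helper):

```python
def pixel_count_ratio(fine_res_m: float, coarse_res_m: float) -> float:
    """Number of fine-resolution pixels that cover one coarse-resolution pixel."""
    return (coarse_res_m / fine_res_m) ** 2


# Nominal ground resolutions in metres, as quoted in this README.
pairs = [
    ("NAIP vs Sentinel-2", 1.0, 10.0),
    ("NAIP vs Landsat", 1.0, 30.0),
    ("SPOT vs Sentinel-2", 6.0, 10.0),
    ("SPOT vs Landsat", 6.0, 30.0),
]
for label, fine, coarse in pairs:
    print(f"{label}: {pixel_count_ratio(fine, coarse):.1f}x more pixels")
```

Runtime does not necessarily scale linearly with pixel count, so treat these ratios as a rough guide when deciding whether to subset the study area.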

LULC Crop Filtering

Agribound automatically filters detected field boundaries to agricultural areas using land-use/land-cover (LULC) data. This is enabled by default (lulc_filter=True) and removes non-agricultural polygons (roads, water, forest, urban areas, etc.) from the output.

The appropriate LULC dataset is selected automatically based on the study area location and target year:

| Region | Dataset | Years | Resolution | Crop Classes |
|---|---|---|---|---|
| CONUS | USGS Annual NLCD | 1985–2024 (nearest year) | 30 m | 81 (Pasture/Hay), 82 (Cultivated Crops) |
| Global, ≥2015 | Google Dynamic World | 2015–present (nearest year) | 10 m | crops probability band |
| Global, <2015 | Copernicus C3S Land Cover | 1992–2022 (nearest year) | 300 m | 10, 20, 30 (Cropland classes) |
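
The selection rule in the table above can be written as a small function. This is an illustrative sketch, not agribound's internal code: `select_lulc_dataset` is a hypothetical name, the CONUS test is reduced to a boolean flag, and it assumes the CONUS row takes priority over the two global rows.

```python
def select_lulc_dataset(in_conus: bool, year: int) -> tuple[str, int]:
    """Pick a LULC dataset and the nearest available year for a target year."""
    if in_conus:
        name, first, last = "USGS Annual NLCD", 1985, 2024
    elif year >= 2015:
        # Dynamic World is listed as "2015-present", so there is no upper bound here.
        name, first, last = "Google Dynamic World", 2015, None
    else:
        name, first, last = "Copernicus C3S Land Cover", 1992, 2022

    # Clamp the requested year into the dataset's available range ("nearest year").
    nearest = max(year, first)
    if last is not None:
        nearest = min(nearest, last)
    return name, nearest


print(select_lulc_dataset(in_conus=True, year=2025))
print(select_lulc_dataset(in_conus=False, year=2020))
print(select_lulc_dataset(in_conus=False, year=1990))
```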

Configuration:

  • lulc_filter=True (default) — enable crop filtering
  • lulc_filter=False — disable (used for local files without GEE, or unsupervised embedding clusters)
  • lulc_crop_threshold=0.3 (default) — minimum fraction of crop pixels to keep a polygon
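
The threshold rule can be sketched as follows. The function name and the pixel counts are hypothetical, and the sketch assumes a polygon is kept when its crop fraction is greater than or equal to lulc_crop_threshold; whether agribound uses a strict or inclusive comparison is not stated here.

```python
def keep_polygon(crop_pixels: int, total_pixels: int, lulc_crop_threshold: float = 0.3) -> bool:
    """Keep a polygon when its fraction of crop pixels meets the threshold."""
    if total_pixels == 0:
        # No LULC pixels fall inside the polygon, so there is no evidence of cropland.
        return False
    return crop_pixels / total_pixels >= lulc_crop_threshold


# Hypothetical polygons: (crop pixels, total pixels inside the polygon).
print(keep_polygon(40, 100))  # 40% crop cover
print(keep_polygon(20, 100))  # 20% crop cover
```

Raising the threshold removes more mixed-cover polygons (field edges, orchards next to forest); lowering it keeps more of them.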

The filter is turned off (lulc_filter=False) in these examples:

  • Example 05 (unsupervised embedding clusters — no semantic meaning)
  • Example 10 (local GeoTIFF — no GEE access)
  • Example 16 (USGS NAIP Plus — purely non-GEE workflow)

SPOT Access

Examples 08 and 11 use SPOT 6/7 imagery, which is restricted to select GEE users under a data-sharing agreement. This source is primarily for internal DRI use. If you receive an access error, contact the agribound author (sayantan.majumdar@dri.edu) to request field boundary processing for your study area.

When to Use Ensembles

Ensembles work best when multiple models are run on the same sensor data. Each model architecture (DA, FTW, GeoAI, DINOv3, Prithvi) has different biases — vote-merging across models cancels out individual errors because every model sees the same pixels but interprets them differently.

Ensembles across different sensors (e.g., Sentinel-2 + Landsat + NAIP) do not work well because:

  • Resolution mismatch — a 1 m NAIP polygon and a 30 m Landsat polygon for the same field have different shapes, producing poor vote overlap
  • Temporal mismatch — each sensor captures different dates, so field states (bare vs cropped) may differ
  • Spatial alignment — sub-pixel registration errors between sensors create artificial disagreements at boundaries

For multi-sensor analysis, compare per-source results independently (example 14) rather than merging them. The multi-model ensemble (example 12) runs all engines on the same eastern Lea County area for this reason.
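
The idea behind vote-merging can be illustrated on rasterized masks. The sketch below is a pixel-wise majority vote over binary field masks from several models run on the same sensor grid. It is an assumption-laden illustration: agribound's own merge may operate on polygons rather than pixels, and `majority_vote` is a hypothetical helper.

```python
def majority_vote(masks):
    """Pixel-wise majority vote over equally shaped binary masks (lists of rows)."""
    n_models = len(masks)
    n_rows, n_cols = len(masks[0]), len(masks[0][0])
    consensus = []
    for r in range(n_rows):
        row = []
        for c in range(n_cols):
            votes = sum(mask[r][c] for mask in masks)
            # Keep the pixel only when more than half of the models mark it as field.
            row.append(1 if 2 * votes > n_models else 0)
        consensus.append(row)
    return consensus


# Three toy masks on the same 2 x 3 grid (1 = field, 0 = background).
model_a = [[1, 1, 0], [0, 1, 0]]
model_b = [[1, 0, 0], [1, 1, 0]]
model_c = [[1, 1, 1], [0, 0, 0]]

consensus = majority_vote([model_a, model_b, model_c])
print(consensus)
```

The vote only makes sense because all three masks share one grid. Masks derived from sensors with different resolutions or acquisition dates would disagree at field edges for reasons unrelated to model error.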

Recommended Approaches

  • With reference boundaries: DINOv3 + SAM2 per source (example 14). DINOv3's SAT-493M backbone fine-tunes well on each sensor with just 10--30 epochs.
  • Without reference boundaries: Embedding clustering + LULC filter + SAM2 (example 15). TESSERA embeddings produce more accurate boundaries than Google (see below). No training required.
  • Multi-model ensemble: Example 12 runs all engines on the same sensor and merges via majority vote. Best accuracy but slowest.

Embedding Comparison: Google vs TESSERA (Example 15)

Testing over the Argentine Pampas shows that TESSERA embeddings produce more accurate field boundaries than Google AlphaEarth Embeddings when used with the automated pipeline (embedding clustering + LULC filter + SAM2). The two embedding products differ fundamentally in architecture and input data:

  • TESSERA (Feng et al., 2025) is a pixel-wise foundation model trained on multi-modal Sentinel-1/2 time series using Barlow Twins self-supervision. It processes "d-pixels" — full temporal sequences of all spectral bands (S2) and SAR backscatter (S1) at each pixel — learning 128-D embeddings that are invariant to cloud-induced temporal gaps. Because it encodes the complete phenological trajectory (planting, growth, senescence) rather than a single composite, adjacent fields with different crop types, planting dates, or irrigation schedules produce distinct embeddings even when they appear spectrally similar in any single image.

  • Google AlphaEarth (Brown et al., 2025) uses a video summarization architecture with a "Space Time Precision" encoder that assimilates multiple EO sources into 64-D annual embeddings on the unit sphere S^63. While it also incorporates temporal information through its support/valid period design, the released annual embedding fields are static temporal summaries that compress a full year into a single vector. The architecture prioritizes generality across diverse mapping tasks (land cover, biomass, evapotranspiration) rather than fine-grained agricultural phenology.

  • Why TESSERA produces better field boundaries: TESSERA's explicit modeling of temporal sampling invariance — training on random 40-observation subsets from the annual S1/S2 time series — makes it particularly sensitive to within-season crop dynamics. Two soybean fields planted two weeks apart produce different temporal profiles that TESSERA preserves in its embeddings. Google's annual summary tends to average out these intra-seasonal differences, causing adjacent fields with similar average reflectance to merge into single clusters.

  • Trade-offs: Google AlphaEarth has global coverage for 2017--2024 and is available as a GEE ImageCollection. TESSERA coverage varies by region/year (2017--2025) and requires the geotessera library for tile download and mosaicking.
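
Because the Google embeddings are described as unit-length vectors, the similarity between two pixels reduces to a dot product. The sketch below uses short toy vectors instead of real 64-D embeddings, and the helper names are hypothetical. It illustrates the geometry only, not the clustering step used in the examples.

```python
import math


def normalize(vector):
    """Scale a vector to unit length so it lies on the unit sphere."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vector]


def cosine_similarity(unit_a, unit_b):
    """For unit-length vectors, the dot product equals the cosine similarity."""
    return sum(x * y for x, y in zip(unit_a, unit_b))


# Toy 2-D stand-ins for per-pixel embeddings (real ones are 64-D or 128-D).
pixel_a = normalize([3.0, 4.0])
pixel_b = normalize([4.0, 3.0])
pixel_c = normalize([-4.0, 3.0])

print(f"a vs b: {cosine_similarity(pixel_a, pixel_b):.2f}")
print(f"a vs c: {cosine_similarity(pixel_a, pixel_c):.2f}")
```

Pixels from the same field are expected to have similarity close to 1. When two adjacent fields also score close to 1, a clustering step is likely to merge them, which is the failure mode described above for annual summaries.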

For new study areas without reference boundaries, we recommend the example 15 pipeline with TESSERA embeddings where available, falling back to Google embeddings for global coverage.

NMOSE Reference Data

Examples 01, 12, 13, and 14 use NMOSE (New Mexico Office of the State Engineer) WUCB agricultural polygon boundaries for fine-tuning and/or evaluation. Examples 12 and 14 filter to eastern Lea County (County 25). Example 13 uses pre-computed DINOv3 boundaries from Lea County for standalone SAM2 refinement. The NMOSE shapefile is not included in the public repository — contact the agribound author (sayantan.majumdar@dri.edu) for access.
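
Evaluation against reference boundaries is commonly summarized with IoU and F1. A minimal sketch on rasterized masks follows, where each mask is a set of (row, column) pixel indices. It assumes a pixel-based comparison. The metrics reported by the examples may be computed differently (for instance per polygon), and `iou_and_f1` is a hypothetical helper.

```python
def iou_and_f1(predicted: set, reference: set) -> tuple[float, float]:
    """Intersection-over-union and F1 between two sets of pixel indices."""
    true_pos = len(predicted & reference)
    false_pos = len(predicted - reference)
    false_neg = len(reference - predicted)
    union = true_pos + false_pos + false_neg
    if union == 0:
        # Both masks are empty: treat them as a perfect match.
        return 1.0, 1.0
    iou = true_pos / union
    f1 = 2 * true_pos / (2 * true_pos + false_pos + false_neg)
    return iou, f1


# Toy masks: two overlapping 2 x 2 blocks shifted by one column.
predicted = {(0, 0), (0, 1), (1, 0), (1, 1)}
reference = {(0, 1), (1, 1), (0, 2), (1, 2)}

iou, f1 = iou_and_f1(predicted, reference)
print(f"IoU={iou:.3f}, F1={f1:.3f}")
```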

Output Structure

Each example creates an output directory under outputs/:

outputs/
├── new_mexico_timeseries/
│   ├── fields_landsat_1985.gpkg
│   ├── fields_landsat_1986.gpkg
│   ├── ...
│   ├── map_predicted_vs_reference.html
│   ├── map_timeseries_comparison.html
│   └── map_latest.html
├── india_nadia/
│   ├── fields_ftw_s2_2024.gpkg
│   ├── fields_google_2024.gpkg
│   ├── fields_tessera_2024.gpkg
│   ├── fields_spot_pan_2023.gpkg
│   └── map_ftw_vs_tessera.html
└── ...
  • .gpkg files contain field boundary polygons with area, perimeter, and provenance metadata.
  • .html files are standalone interactive maps (open in any browser) showing field boundaries overlaid on satellite basemaps.
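
To get an overview of what a run produced, the output tree can be walked with the standard library. This sketch assumes only the layout shown above (one directory per example containing .gpkg and .html files); `list_outputs` is a hypothetical helper, not part of agribound.

```python
from pathlib import Path


def list_outputs(root="outputs"):
    """Map each example directory under root to its sorted .gpkg and .html file names."""
    summary = {}
    root_path = Path(root)
    if not root_path.is_dir():
        # Nothing has been run yet, or root points somewhere else.
        return summary
    for example_dir in sorted(p for p in root_path.iterdir() if p.is_dir()):
        summary[example_dir.name] = {
            "gpkg": sorted(f.name for f in example_dir.glob("*.gpkg")),
            "html": sorted(f.name for f in example_dir.glob("*.html")),
        }
    return summary


for example, files in list_outputs().items():
    print(f"{example}: {len(files['gpkg'])} GeoPackage(s), {len(files['html'])} map(s)")
```

Each listed .gpkg file can then be opened with a GIS tool or with `geopandas.read_file`.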