Elasticsearch Loader Script

Overview

The loader.py script loads tabular data from CSV/TSV files into Elasticsearch. It validates fields against the rules defined in data/columns.csv and checks that trait values appear in data/traits.csv.

The script supports three loading modes:

  • machine: machine observation data
  • in_situ: in situ observations
  • herbarium: herbarium record data

Each mode has its own set of required fields, defined in columns.csv.
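
A minimal sketch of that lookup, assuming the mode-to-column mapping below (in particular, that the in_situ mode reads the inat_required column, which is an inference from the column names listed under "Under the Hood"):

import pandas as pd

REQUIRED_FLAG = {
    "machine": "machine_required",
    "in_situ": "inat_required",        # assumption: in_situ reads inat_required
    "herbarium": "herbarium_required",
}

def required_fields(mode):
    # Hypothetical helper: derive the required fields for a mode from data/columns.csv.
    schema = pd.read_csv("data/columns.csv")
    required = schema[REQUIRED_FLAG[mode]].astype(str).str.lower().isin(["true", "1", "yes"])
    return schema.loc[required, "field"].tolist()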


Usage

usage: loader.py [-h] --mode {machine,in_situ,herbarium} [--drop-existing | --no-drop-existing] [--test] [--strict] [--batch-size BATCH_SIZE] [--progress-every PROGRESS_EVERY] data_dir
loader.py: error: the following arguments are required: data_dir, --mode

Options

Positional

data_dir Directory containing the CSV/TSV files to load.

Optional

--mode {machine,in_situ,herbarium} (required)
--drop-existing / --no-drop-existing Drop and re-create the index before loading (default: --no-drop-existing).
--test Test mode (validate only, no Elasticsearch inserts).
--strict Reject rows whose field values are still invalid after type coercion.
--batch-size N Documents per bulk request (default: 5000).
--progress-every N Print progress every N rows (default: 50000).

Example

# Example loads for the machine and in_situ modes
python loader.py --mode=machine data/annotations.07.25.2025/ --no-drop-existing --batch-size 5000 --progress-every 50000
python loader.py --mode=in_situ data/npn.1956.01.01-2025.08.31/ --no-drop-existing --batch-size 5000 --progress-every 50000
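
Under the hood, --batch-size typically maps onto the chunk size of the client's bulk helper. A minimal sketch of that pattern with the elasticsearch Python client (the endpoint, index name, and documents are placeholders, not the loader's actual values):

from elasticsearch import Elasticsearch
from elasticsearch.helpers import streaming_bulk

es = Elasticsearch("http://localhost:9200")
docs = [{"trait": "flowers present"}]  # placeholder documents

# Each validated row becomes one bulk index action.
actions = ({"_index": "machine-records", "_source": doc} for doc in docs)

# chunk_size plays the role of --batch-size (default 5000).
for ok, item in streaming_bulk(es, actions, chunk_size=5000):
    if not ok:
        print("failed:", item)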

Under the Hood

traits.csv

  • This file maps ontology trait terms to a pipe-delimited list of parent terms.
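
Hypothetical rows illustrating the shape of the file (the header and terms are invented for this example):

trait,parents
flowers present,reproductive structures present|flowering
fruits present,reproductive structures present|fruiting

Splitting the parents column on | recovers the list of parent terms for each trait.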

columns.csv

  • Defines schema for all fields that can be used in the Elasticsearch index.
  • Contains columns:
    • field: The name of the field in the data.
    • datatype: The expected type (text, integer, float, boolean, date, keyword, etc.).
    • machine_required, inat_required, herbarium_required: Indicate whether the field is required for the corresponding mode.
  • Used for two purposes:
    1. Validating presence of required fields.
    2. Building Elasticsearch index mappings dynamically, as sketched below.
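
A sketch of the mapping-generation step, assuming the datatype column holds Elasticsearch type names directly (text, integer, float, boolean, date, keyword); the helper name is illustrative:

import pandas as pd

def build_mapping(columns_csv="data/columns.csv"):
    # One Elasticsearch property per row of columns.csv.
    schema = pd.read_csv(columns_csv)
    properties = {row["field"]: {"type": row["datatype"]}
                  for _, row in schema.iterrows()}
    return {"properties": properties}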

transform.yaml (Optional)

A per-dataset YAML file for applying simple value transformations before ingestion. If transform.yaml is present in the data_dir, it is loaded automatically. Only the trait field is currently transformed using this mechanism.

Format:

trait_mappings:
  green leaves present: non-senescing unfolded true leaves present
  senescent leaves: senescing leaves present
  red leaves: colored leaves (non-green)

If a value in the trait column matches a key in trait_mappings (case-insensitive), it is replaced by the corresponding value before validation or Elasticsearch indexing.

This allows for normalizing heterogeneous trait values across datasets without modifying the main loader script.
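
A minimal sketch of that case-insensitive lookup (reading the YAML assumes PyYAML, which is not listed under Requirements below):

import yaml  # PyYAML: an assumed dependency for parsing transform.yaml

with open("transform.yaml") as fh:
    config = yaml.safe_load(fh)

# Lower-case the keys once so lookups are case-insensitive.
trait_mappings = {k.lower(): v for k, v in config.get("trait_mappings", {}).items()}

def normalize_trait(value):
    return trait_mappings.get(value.lower(), value)

print(normalize_trait("Green Leaves Present"))
# -> non-senescing unfolded true leaves present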

Elasticsearch Mapping

  • The script uses columns.csv to generate the index mapping.
  • If --drop-existing is passed, the script deletes the existing index and re-creates it from the generated mapping, as sketched below.
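
With the elasticsearch Python client, that drop-and-recreate step might look like the following, re-using the hypothetical build_mapping sketch from above (the endpoint and index name are placeholders):

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")
index = "machine-records"

# Delete the index if it exists, then re-create it with the generated mapping.
es.indices.delete(index=index, ignore_unavailable=True)
es.indices.create(index=index, mappings=build_mapping("data/columns.csv"))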

Error Reporting

  • Rows missing required fields or containing invalid values are logged.
  • A summary count of invalid rows is displayed after loading.

Requirements

  • Python 3.8+
  • Elasticsearch running locally or remotely (endpoint configured in script or via .env file)
  • pandas, elasticsearch, python-dotenv

Install dependencies:

pip install -r requirements.txt
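
If the endpoint is read from a .env file, python-dotenv (listed above) can load it at startup. The variable name here is hypothetical:

# .env (variable name is an assumption)
ES_URL=http://localhost:9200

# loader startup
from dotenv import load_dotenv
import os

load_dotenv()                      # reads .env from the working directory
es_url = os.environ.get("ES_URL")  # hypothetical variable name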

Notes

  • The index name is derived from the mode (e.g., machine-records, inat-records).
  • Validation logic may be extended by modifying the script.
  • Ensure that columns.csv and traits.csv are present under data/, as described in the Overview.

Author

PhenoBase Project | Biocode, LLC
