The loader.py script loads tabular data into Elasticsearch from CSV/TSV files. It validates rows using rules defined in data/columns.csv and checks trait values against data/traits.csv.
The script supports three loading modes:
- machine: for loading machine observation data
- in_situ: for in situ observations
- herbarium: for herbarium record data
Each mode uses specific required fields defined in columns.csv.
usage: loader.py [-h] --mode {machine,in_situ,herbarium} [--drop-existing | --no-drop-existing] [--test] [--strict] [--batch-size BATCH_SIZE] [--progress-every PROGRESS_EVERY] data_dir
loader.py: error: the following arguments are required: data_dir, --mode
Positional arguments
data_dir Directory containing CSV files to load.
Options
--mode {machine,in_situ,herbarium} (required)
--drop-existing / --no-drop-existing (default: --no-drop-existing)
--test Test mode (no ES insert).
--strict Reject rows with invalid field values after coercion.
--batch-size N Docs per bulk request (default: 5000).
--progress-every N Print progress every N rows (default: 50000).
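For context, --batch-size sets how many documents go into each Elasticsearch bulk request. Below is a minimal sketch of that pattern using the elasticsearch Python client; the endpoint URL, index name, and function name are placeholders, not loader.py's actual internals:

```python
from elasticsearch import Elasticsearch, helpers

# Minimal sketch only; loader.py's actual internals may differ.
es = Elasticsearch("http://localhost:9200")  # placeholder endpoint

def index_rows(rows, index="machine-records", batch_size=5000):
    # One bulk action per document; chunk_size mirrors --batch-size.
    actions = ({"_index": index, "_source": row} for row in rows)
    helpers.bulk(es, actions, chunk_size=batch_size)
```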
# Example load commands
python loader.py --mode=machine data/annotations.07.25.2025/ --no-drop-existing --batch-size 5000 --progress-every 50000
python loader.py --mode=in_situ data/npn.1956.01.01-2025.08.31/ --no-drop-existing --batch-size 5000 --progress-every 50000
data/traits.csv
- Contains trait mappings from ontology trait terms to a pipe-delimited list of parent terms.
data/columns.csv
- Defines schema for all fields that can be used in the Elasticsearch index.
- Contains columns:
  - field: the name of the field in the data.
  - datatype: the expected type (e.g., text, integer, float, boolean, date, keyword).
  - machine_required, inat_required, herbarium_required: whether the field is required for a given mode.
- Used for two purposes:
  - Validating presence of required fields.
  - Building Elasticsearch mappings dynamically.
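As an illustration of the second purpose, here is a minimal sketch of building an index mapping from columns.csv, assuming each datatype value is a valid Elasticsearch field type; the function name is hypothetical:

```python
import csv

def build_mapping(columns_csv="data/columns.csv"):
    # Map each field to its declared Elasticsearch type.
    properties = {}
    with open(columns_csv, newline="") as f:
        for row in csv.DictReader(f):
            properties[row["field"]] = {"type": row["datatype"]}
    return {"mappings": {"properties": properties}}
```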
transform.yaml
A per-dataset YAML file for applying simple value transformations before ingestion. If transform.yaml is present in the data_dir, it is loaded automatically. Only the trait field is currently transformed using this mechanism.
Format:
trait_mappings:
  green leaves present: non-senescing unfolded true leaves present
  senescent leaves: senescing leaves present
  red leaves: colored leaves (non-green)
If a value in the trait column matches a key in trait_mappings (case-insensitive), it is replaced by the corresponding value before validation or Elasticsearch indexing.
This allows for normalizing heterogeneous trait values across datasets without modifying the main loader script.
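A minimal sketch of that lookup, assuming PyYAML for parsing (not among the dependencies listed below); the function names are hypothetical:

```python
import yaml  # assumes PyYAML is available

def load_trait_mappings(path="transform.yaml"):
    with open(path) as f:
        raw = yaml.safe_load(f) or {}
    # Lower-case the keys once so lookups are case-insensitive.
    return {k.lower(): v for k, v in raw.get("trait_mappings", {}).items()}

def apply_trait_mapping(trait, mappings):
    # Fall back to the original value when no mapping matches.
    return mappings.get(trait.lower(), trait)
```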
- The script uses columns.csv to generate the index mapping.
- If --drop-existing is passed, the script deletes the existing index and re-creates it using the generated mapping.
- Rows missing required fields or containing invalid values are logged.
- A summary count of invalid rows is displayed after loading.
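A minimal sketch of the required-field check behind that logging, assuming in_situ mode reads the inat_required column; the names here are hypothetical:

```python
# Assumed mode-to-column mapping; in_situ -> inat_required is a guess.
REQUIRED_FLAG = {
    "machine": "machine_required",
    "in_situ": "inat_required",
    "herbarium": "herbarium_required",
}

def missing_required(row, columns, mode):
    """Return required fields that are empty or absent in `row`.

    `columns` is columns.csv parsed into a list of dicts, one per field.
    """
    flag = REQUIRED_FLAG[mode]
    return [
        col["field"]
        for col in columns
        if str(col.get(flag, "")).strip().lower() in ("true", "1", "yes")
        and not row.get(col["field"])
    ]
```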
- Python 3.8+
- Elasticsearch running locally or remotely (endpoint configured in the script or via a .env file)
- Python packages: pandas, elasticsearch, python-dotenv
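For example, the endpoint could be read like this with python-dotenv; the ES_HOST variable name is an assumption, not necessarily what loader.py reads:

```python
import os
from dotenv import load_dotenv
from elasticsearch import Elasticsearch

load_dotenv()  # reads .env from the current directory
es = Elasticsearch(os.environ.get("ES_HOST", "http://localhost:9200"))
```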
Install dependencies:
pip install -r requirements.txt
- The index name is determined by mode (e.g., inat-records, machine-records).
- Validation logic may be extended by modifying the script.
- Ensure that columns.csv and traits.csv are present in the working directory or in the directory passed as data_dir.
PhenoBase Project | Biocode, LLC