TROWEL - a Tool for Retrieving, Organizing, and Wrangling Ecological science data and Labels
TROWEL is a command-line toolkit for working with ecological science data and ontologies, particularly BERVO. It provides tools for retrieving datasets, extracting variables, matching terms, and analyzing term relationships using LLM embeddings.
- Dataset Metadata Retrieval - Fetch metadata from ESS-DIVE datasets
- Variable Extraction - Extract variable names and definitions from datasets
- Term Matching - Match terms between files using exact and fuzzy matching
- Embedding-Based Analysis - Analyze term relationships using LLM embeddings
- Clustering & Visualization - Visualize term clusters using PCA and t-SNE
- Cross-Dataset Comparison - Find similar terms across different ontologies and datasets
- Python 3.10+
- Poetry (for dependency management)
```
git clone https://github.com/bioepic-data/trowel.git
cd trowel
poetry install
```

For full embedding analysis capabilities, install optional dependencies:

```
pip install matplotlib seaborn scikit-learn scipy
```

For DuckDB support and CurateGPT integration:

```
pip install duckdb curategpt
```

All commands follow the pattern:

```
trowel COMMAND [OPTIONS] [ARGUMENTS]
```

All commands include help text:
```
trowel --help           # Show all available commands
trowel COMMAND --help   # Show help for a specific command
```

Commands for downloading ontology and dataset data.
Download the BERVO (Biogeochemical and Ecological Processes Ontology) ontology.
```
trowel get-bervo -o bervo.csv
```

Options:

- `-o, --output TEXT` - Output file path (default: `bervo.csv`)

Output:

- CSV file containing the complete BERVO ontology with all terms, definitions, units, and category information
This command downloads the official BERVO ontology from Google Sheets and is useful for preparing the ontology for downstream analysis with other trowel commands.
Commands for working with ESS-DIVE environmental science datasets.
Requires: the `ESSDIVE_TOKEN` environment variable (see https://docs.ess-dive.lbl.gov/programmatic-tools/ess-dive-dataset-api#get-access for how to get access)
Retrieve metadata from ESS-DIVE datasets by DOI.
```
trowel get-essdive-metadata \
  -p dois.txt \
  -o ./output
```

Options:

- `-p, --path TEXT` - Path to file with DOIs (one per line)
- `-o, --outpath TEXT` - Output directory (default: current directory)

Output:

- `results.tsv` - Dataset metadata
- `frequencies.txt` - Variable frequency statistics
- `filetable.tsv` - List of files from all datasets
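Under the hood, each DOI lookup is an authenticated call to the ESS-DIVE Dataset API. Below is a minimal sketch of a single lookup; the endpoint follows the public ESS-DIVE API documentation, but the query parameter and response handling are illustrative assumptions, not trowel's actual implementation:

```python
import os
import requests

# Endpoint per the public ESS-DIVE Dataset API docs; trowel's internals may differ.
API_BASE = "https://api.ess-dive.lbl.gov/packages"

def fetch_metadata(doi: str) -> dict:
    """Fetch dataset metadata for a single DOI from ESS-DIVE."""
    token = os.environ["ESSDIVE_TOKEN"]  # same token the CLI requires
    resp = requests.get(
        API_BASE,
        headers={"Authorization": f"Bearer {token}"},
        params={"text": doi},  # hypothetical: search the catalog by DOI string
    )
    resp.raise_for_status()
    return resp.json()
```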
Extract variable names from dataset files and data dictionaries.
```
trowel get-essdive-variables \
  -p filetable.tsv \
  -o ./output \
  -w 10
```

Options:

- `-p, --path TEXT` - Path to `filetable.tsv` (auto-detected if not provided)
- `-o, --outpath TEXT` - Output directory
- `-w, --workers INTEGER` - Number of parallel workers (default: 10)

Output:

- `variable_names.tsv` - Extracted variable names and metadata
- `data_dictionaries.tsv` - Compiled data dictionary entries
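The `-w/--workers` option controls how many files are processed in parallel. Conceptually the fan-out looks like the sketch below (a stdlib illustration with a hypothetical per-file extractor, not trowel's actual code):

```python
import csv
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

def extract_variables(path: str) -> list[str]:
    """Hypothetical per-file worker: read one tabular file and
    return its header row as candidate variable names."""
    with open(path, newline="") as f:
        return next(csv.reader(f, delimiter="\t"), [])

paths = [str(p) for p in Path("./data").glob("*.tsv")]

# Mirrors -w 10: up to ten files processed concurrently
with ThreadPoolExecutor(max_workers=10) as pool:
    for path, variables in zip(paths, pool.map(extract_variables, paths)):
        print(path, variables)
```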
Commands for matching terms between files.
Match terms from a TSV file against a list of terms.
```
trowel match-term-lists \
  -t terms.tsv \
  -l target_list.txt \
  -o results.tsv \
  -f \
  -s 80
```

Options:

- `-t, --terms-file TEXT` - TSV file with terms in first column (required)
- `-l, --list-file TEXT` - Text file with terms, one per line (required)
- `-o, --output TEXT` - Output file path
- `-f, --fuzzy` - Enable fuzzy matching with Levenshtein distance
- `-s, --similarity-threshold FLOAT` - Similarity threshold 0-100 (default: 80)
Output: TSV file with original terms plus a new column indicating matches.
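For reference, fuzzy mode scores candidate pairs by Levenshtein (edit) distance normalized to the 0-100 scale used by `-s`; pairs scoring at or above the threshold count as matches. A minimal illustration of that scoring (not trowel's actual implementation):

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def similarity(a: str, b: str) -> float:
    """Normalized similarity on the 0-100 scale used by -s."""
    longest = max(len(a), len(b)) or 1
    return 100.0 * (1 - levenshtein(a.lower(), b.lower()) / longest)

print(similarity("soil_temperature", "soil temperature"))  # 93.75, a match at -s 80
```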
Commands for analyzing term relationships using LLM embeddings. These commands work with embeddings generated by CurateGPT.
Prepare a CSV file for embedding by selecting specific columns.
```
trowel embeddings prepare-embeddings \
  -i bervo.csv \
  -o bervo_prepared.csv \
  -c 0,1,6,12 \
  --skip-rows 1
```

Options:

- `-i, --input TEXT` - Input CSV file (required)
- `-o, --output TEXT` - Output CSV file (required)
- `-c, --columns TEXT` - Comma-separated column indices (0-indexed, required)
- `--skip-rows INTEGER` - Number of header rows to skip (default: 0)
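This is equivalent to slicing the CSV down to the chosen columns after dropping the skipped header rows. A stdlib sketch of the same transformation, using the column indices from the example above:

```python
import csv

COLUMNS = [0, 1, 6, 12]  # matches -c 0,1,6,12
SKIP_ROWS = 1            # matches --skip-rows 1

with open("bervo.csv", newline="") as src, \
     open("bervo_prepared.csv", "w", newline="") as dst:
    writer = csv.writer(dst)
    for i, row in enumerate(csv.reader(src)):
        if i < SKIP_ROWS:
            continue  # drop header rows
        writer.writerow([row[c] for c in COLUMNS])
```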
Generate vector embeddings for CSV data using CurateGPT and OpenAI's embedding API.
This command handles the complete embedding pipeline: reading your prepared data, calling OpenAI's text-embedding-ada-002 model for each row, storing embeddings in a DuckDB database, and optionally exporting results to CSV for downstream analysis.
Requirements:

- `OPENAI_API_KEY` environment variable must be set
- CurateGPT installed: `pip install curategpt`
- DuckDB installed: `pip install duckdb`
```
# Basic usage - embed a prepared file
trowel embeddings generate-embeddings -i bervo_prepared.csv

# Specify collection and database paths
trowel embeddings generate-embeddings \
  -i bervo_prepared.csv \
  -c bervo \
  -d backup/bervo.duckdb

# Test with a subset of rows
trowel embeddings generate-embeddings \
  -i bervo_prepared.csv \
  -l 1000

# Specify which columns to use for embeddings
trowel embeddings generate-embeddings \
  -i bervo_prepared.csv \
  -f "id,label,definition"

# Generate and export embeddings for use with other commands
trowel embeddings generate-embeddings \
  -i bervo_prepared.csv \
  -e backup/bervo_embeds.csv
```

Options:

- `-i, --input TEXT` - Path to prepared CSV file (required)
- `-c, --collection TEXT` - Collection name in DuckDB (default: "embeddings")
- `-d, --db-path TEXT` - Path to DuckDB file for storage (default: `./backup/db.duckdb`)
- `-f, --text-fields TEXT` - Comma-separated column names to use for embeddings; if not specified, uses all columns
- `-l, --limit INTEGER` - Maximum rows to embed (useful for testing large files)
- `-s, --skip INTEGER` - Number of rows to skip from the beginning
- `-e, --export TEXT` - Optional: export embeddings to a CSV file after generation
Output:

- DuckDB database stored at the `--db-path` location (default: `./backup/db.duckdb`)
- If `--export` is specified: CSV file with embeddings for use with other commands
Note on Costs:
Each embedding call uses OpenAI's API and incurs a small cost. The text-embedding-ada-002 model is one of the most affordable OpenAI models. For BERVO (5000+ terms), expect costs of a few dollars.
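Conceptually, the pipeline reads each prepared row, concatenates its text fields, requests one embedding per row, and persists the vector. The sketch below approximates that loop with the OpenAI Python client and DuckDB directly; trowel routes this through CurateGPT, so the real code and storage schema differ (the table layout here is illustrative only):

```python
import csv
import duckdb
from openai import OpenAI  # reads OPENAI_API_KEY from the environment

client = OpenAI()
con = duckdb.connect("backup/bervo.duckdb")
# Illustrative schema; CurateGPT manages its own storage layout.
con.execute("CREATE TABLE IF NOT EXISTS embeddings (id TEXT, text TEXT, vector FLOAT[])")

with open("bervo_prepared.csv", newline="") as f:
    for row in csv.DictReader(f):
        text = " ".join(v for v in row.values() if v)  # concatenate text fields
        vector = client.embeddings.create(
            model="text-embedding-ada-002", input=text
        ).data[0].embedding
        con.execute("INSERT INTO embeddings VALUES (?, ?, ?)",
                    [row.get("id", ""), text, vector])
```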
Load embeddings and compute statistics.
```
trowel embeddings load-embeddings \
  -e bervo_embeds.csv \
  -o ./analysis
```

Options:

- `-e, --embeddings TEXT` - Embedding CSV file from CurateGPT (required)
- `-o, --output TEXT` - Output directory (default: current directory)

Output:

- `embedding_stats.json` - Statistics (count, dimension, similarity metrics)
Note: This command generates new embeddings. To reuse pre-computed embeddings stored in the backup/ directory or elsewhere, use the other embedding commands (find-similar, visualize-clusters, etc.) which have built-in support for loading from saved files.
Find the most similar terms to a query term.
```
# Load from current directory
trowel embeddings find-similar \
  -e bervo_embeds.csv \
  -q BERVO:0000026 \
  -n 20 \
  -o results.txt

# Load from backup/ directory (auto-discovered)
trowel embeddings find-similar \
  -e bervo_embeds.csv \
  -q BERVO:0000026 \
  -n 20

# Load only the first 5000 terms from a large file
trowel embeddings find-similar \
  -e bervo_embeds.csv \
  -q BERVO:0000026 \
  -n 20 \
  -l 5000
```

Options:
- `-e, --embeddings TEXT` - Path to embedding CSV file. Can be:
  - Full path: `/path/to/embeddings.csv`
  - Filename: `bervo_embeds.csv` (searches current dir and `backup/`)
  - Without extension: `bervo_embeds` (auto-adds `.csv` and checks `backup/`)
- `-q, --query TEXT` - Query term ID or label (required)
- `-n, --top-n INTEGER` - Number of results (default: 10)
- `-l, --limit INTEGER` - Maximum number of terms to load (optional, useful for large files)
- `-s, --skip INTEGER` - Number of terms to skip from the beginning (default: 0)
- `-o, --output TEXT` - Output file for results (optional)
Embedding File Locations:

- Current working directory: `bervo_embeds.csv`
- Backup directory: `backup/bervo_embeds.csv` (auto-discovered)
- You can also pre-generate and store embeddings in `backup/` for reuse across sessions
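The search itself is a nearest-neighbor lookup under cosine similarity. A numpy sketch of the core ranking step, assuming the term IDs and vectors have already been parsed out of the embedding CSV:

```python
import numpy as np

def top_n_similar(ids: list[str], vectors: np.ndarray, query_id: str, n: int = 10):
    """Rank all terms by cosine similarity to the query term."""
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    query = unit[ids.index(query_id)]
    scores = unit @ query                # cosine similarity to the query
    order = np.argsort(scores)[::-1]     # best matches first
    hits = [(ids[i], float(scores[i])) for i in order if ids[i] != query_id]
    return hits[:n]
```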
Create 2D visualizations of term clusters.
```
# Fast PCA visualization
trowel embeddings visualize-clusters \
  -e bervo_embeds.csv \
  -m pca \
  -o clusters_pca.png

# Better-quality t-SNE visualization (slower)
trowel embeddings visualize-clusters \
  -e bervo_embeds.csv \
  -m tsne \
  -o clusters_tsne.png

# Load from backup/ and use only the first 5000 terms
trowel embeddings visualize-clusters \
  -e bervo_embeds.csv \
  -m pca \
  -o clusters_pca.png \
  -l 5000
```

Options:
- `-e, --embeddings TEXT` - Path to embedding CSV file. Can be:
  - Full path: `/path/to/embeddings.csv`
  - Filename: `bervo_embeds.csv` (searches current dir and `backup/`)
  - Without extension: `bervo_embeds` (auto-adds `.csv` and checks `backup/`)
- `-m, --method [pca|tsne]` - Dimensionality reduction method (default: pca)
- `-l, --limit INTEGER` - Maximum number of terms to load (optional, useful for large files)
- `-s, --skip INTEGER` - Number of terms to skip from the beginning (default: 0)
- `-o, --output TEXT` - Output PNG file
- `--label-interval INTEGER` - Interval for labeling points (default: 100)
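Both methods project the high-dimensional embedding vectors down to 2D before plotting: PCA is a fast linear projection, while t-SNE is a slower nonlinear method that better preserves local neighborhoods. A scikit-learn sketch of the projection step (random vectors stand in for real embeddings):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

def project(vectors: np.ndarray, method: str = "pca") -> np.ndarray:
    """Reduce embedding vectors to 2D, mirroring -m pca|tsne."""
    if method == "pca":
        return PCA(n_components=2).fit_transform(vectors)
    return TSNE(n_components=2, init="pca").fit_transform(vectors)

vectors = np.random.rand(200, 1536)  # stand-in for ada-002 embeddings
xy = project(vectors, "pca")
plt.scatter(xy[:, 0], xy[:, 1], s=5)
plt.savefig("clusters_pca.png", dpi=150)
```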
Visualize clusters colored by ontology category.
```
trowel embeddings visualize-by-category \
  -s bervo.csv \
  -e bervo_embeds.csv \
  -o clusters_categorical.png
```

Options:

- `-s, --source-csv TEXT` - Source CSV with category information (required)
- `-e, --embeddings TEXT` - Embedding CSV file (required)
- `-o, --output TEXT` - Output PNG file
- `--label-interval INTEGER` - Interval for labeling points (default: 100)
Create similarity heatmap for a subset of terms.
```
trowel embeddings visualize-heatmap \
  -e bervo_embeds.csv \
  -n 50 \
  -o similarity_heatmap.png
```

Options:

- `-e, --embeddings TEXT` - Embedding CSV file (required)
- `-n, --num-terms INTEGER` - Number of terms to include (default: 50)
- `-o, --output TEXT` - Output PNG file
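The heatmap is the pairwise cosine-similarity matrix of the first `-n` terms. A seaborn sketch of that computation (again with random stand-in vectors):

```python
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

vectors = np.random.rand(50, 1536)  # the first -n 50 terms
unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
sim = unit @ unit.T                 # pairwise cosine similarity

sns.heatmap(sim, cmap="viridis", square=True)
plt.savefig("similarity_heatmap.png", dpi=150)
```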
Find similar term pairs between two collections.
```
trowel embeddings cross-collection-similarity \
  -b bervo_embeds.csv \
  -n new_vars_embeds.csv \
  -t 25 \
  -o matches.txt
```

Options:

- `-b, --bervo-embeddings TEXT` - First embedding file (required)
- `-n, --new-embeddings TEXT` - Second embedding file (required)
- `-t, --top-n INTEGER` - Number of top pairs (default: 25)
- `-o, --output TEXT` - Output file for results (optional)
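This is the same cosine computation applied across two collections: every term in the first set is scored against every term in the second, and the globally highest-scoring pairs are reported. A numpy sketch of the pair ranking:

```python
import numpy as np

def top_pairs(ids_a, vecs_a, ids_b, vecs_b, top_n=25):
    """Highest-cosine pairs between two embedding collections."""
    a = vecs_a / np.linalg.norm(vecs_a, axis=1, keepdims=True)
    b = vecs_b / np.linalg.norm(vecs_b, axis=1, keepdims=True)
    sim = a @ b.T                                    # all pairwise similarities
    best = np.argsort(sim, axis=None)[::-1][:top_n]  # flatten, sort, take top
    rows, cols = np.unravel_index(best, sim.shape)
    return [(ids_a[i], ids_b[j], float(sim[i, j])) for i, j in zip(rows, cols)]
```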
Example workflow: dataset retrieval and term matching.

```
# 0. Download BERVO ontology
trowel get-bervo -o bervo.csv

# 1. Get dataset metadata
trowel get-essdive-metadata -p dois.txt -o ./data

# 2. Extract variables from datasets
trowel get-essdive-variables -p ./data/filetable.tsv -o ./data

# 3. Match variables to BERVO terms
trowel match-term-lists \
  -t bervo.csv \
  -l ./data/variable_names.txt \
  -o matched_variables.tsv \
  -f
```

Example workflow: embedding-based ontology analysis.

```
# 1. Download BERVO ontology
trowel get-bervo -o bervo.csv

# 2. Prepare data for embedding
trowel embeddings prepare-embeddings \
  -i bervo.csv \
  -o bervo_prepared.csv \
  -c 0,1,6,12 \
  --skip-rows 1

# 3. Generate embeddings (using OpenAI API via CurateGPT)
# This generates embeddings and saves them to backup/ for reuse
trowel embeddings generate-embeddings \
  -i bervo_prepared.csv \
  -c bervo \
  -d backup/bervo.duckdb \
  -e backup/bervo_embeds.csv

# 4. Find similar terms
trowel embeddings find-similar \
  -e bervo_embeds.csv \
  -q BERVO:0000026 \
  -n 20 \
  -o similar_terms.txt

# 5. Create visualizations
trowel embeddings visualize-clusters \
  -e bervo_embeds.csv \
  -m pca \
  -o clusters_pca.png

trowel embeddings visualize-by-category \
  -s bervo.csv \
  -e bervo_embeds.csv \
  -o clusters_by_category.png
```

Example workflow: cross-ontology comparison.

```
# 1. Download BERVO for comparison (optional)
trowel get-bervo -o ontology1.csv
# 2. Prepare both ontologies
trowel embeddings prepare-embeddings \
-i ontology1.csv \
-o ontology1_prepared.csv \
-c 0,1,6 --skip-rows 1
trowel embeddings prepare-embeddings \
-i ontology2.csv \
-o ontology2_prepared.csv \
-c 0,1,6 --skip-rows 1
# 3. Generate embeddings for both (using OpenAI API via CurateGPT)
trowel embeddings generate-embeddings \
-i ontology1_prepared.csv \
-c ontology1 \
-d backup/ontology1.duckdb \
-e backup/ont1_embeds.csv
trowel embeddings generate-embeddings \
-i ontology2_prepared.csv \
-c ontology2 \
-d backup/ontology2.duckdb \
-e backup/ont2_embeds.csv
# 4. Compare the ontologies
trowel embeddings cross-collection-similarity \
-b backup/ont1_embeds.csv \
-n backup/ont2_embeds.csv \
-t 50 \
  -o ontology_comparison.txt
```

Pre-computed embeddings can be stored in the `backup/` directory and automatically discovered by TROWEL's embedding commands. This allows you to generate embeddings once and reuse them across many analysis sessions without regenerating them.
To set this up:

- Generate embeddings once using the integrated command and save to `backup/`:

```
# Generate embeddings using trowel (handles OpenAI API calls internally)
# This saves both the DuckDB database AND exports to CSV
trowel embeddings generate-embeddings \
  -i bervo_prepared.csv \
  -c bervo \
  -d backup/bervo.duckdb \
  -e backup/bervo_embeds.csv
```

- Reuse embeddings for analysis (no regeneration needed):

```
# Auto-finds in backup/
trowel embeddings find-similar -e bervo_embeds.csv -q BERVO:0000026

# Work with subsets of large files
trowel embeddings visualize-clusters -e bervo_embeds.csv -m pca -l 5000 -o viz.png

# All searches check: current dir → backup/ → auto-add .csv → backup/ with .csv
```

Benefits:

- Separation: Keeps generated embeddings separate from source data
- Reusability: Multiple workflows can reference the same embeddings
- Auto-discovery: Commands automatically search `backup/` for files
- Backup: Easy to version control and back up important embedding results
When you specify an embedding file with `-e bervo_embeds.csv`, TROWEL searches, in order:

- Current working directory: `bervo_embeds.csv`
- Backup directory: `backup/bervo_embeds.csv`
- Current dir with `.csv` added: `bervo_embeds.csv` (if you specify `bervo_embeds`)
- Backup with `.csv` added: `backup/bervo_embeds.csv` (if you specify `bervo_embeds`)

This allows you to:

- Pre-compute embeddings and store them once in `backup/`
- Reference them by simple filename from anywhere in your workflow
- Work with multiple embedding versions without specifying full paths
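A small sketch of that lookup order (an approximation of the behavior described above, not trowel's actual resolver):

```python
from pathlib import Path

def resolve_embeddings(name: str) -> Path:
    """Try the documented locations in order; return the first that exists."""
    candidates = [Path(name), Path("backup") / name]
    if not name.endswith(".csv"):
        candidates += [Path(name + ".csv"), Path("backup") / (name + ".csv")]
    for candidate in candidates:
        if candidate.is_file():
            return candidate
    raise FileNotFoundError(f"No embedding file found for {name!r}")

# e.g. resolve_embeddings("bervo_embeds") -> backup/bervo_embeds.csv
```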
TROWEL is specifically designed to work with the BERVO ontology, a comprehensive biogeochemical ontology for environmental science variables. While TROWEL may work with other ontologies, optimal results are achieved with BERVO due to its well-structured hierarchy and comprehensive variable definitions.
BERVO includes:
- Standardized variable definitions
- Clear hierarchical organization (categories starting with BERVO:9)
- Units and measurement specifications
- Integration with ESS-DIVE datasets
Download the latest BERVO ontology directly from the official source:
```
trowel get-bervo -o bervo.csv
```

This downloads the latest BERVO from Google Sheets, making it easy to keep your ontology current.
```
# Required for ESS-DIVE commands
export ESSDIVE_TOKEN=your_token_here

# Required for embedding generation (CurateGPT)
export OPENAI_API_KEY=your_api_key_here
```

To get an ESS-DIVE token:

- Visit https://docs.ess-dive.lbl.gov/programmatic-tools/ess-dive-dataset-api#get-access
- Follow the authentication instructions
- Set your token as shown above

To get an OpenAI API key:

- Sign up for an OpenAI account at https://openai.com
- Navigate to the API keys section
- Create a new API key
- Set it as shown above
- Python 3.10+
- click (CLI framework)
- requests (HTTP client)
- polars (data processing)
- tqdm (progress bars)
- openpyxl, xlrd (spreadsheet support)
- numpy, scipy (numerical operations)
- matplotlib, seaborn (visualization)
- scikit-learn (dimensionality reduction)
- duckdb (database access)
- curategpt (embedding generation)
Install optional dependencies:

```
pip install matplotlib seaborn scikit-learn scipy duckdb
```

Troubleshooting tips:

- If ESS-DIVE commands fail to authenticate, make sure your token is set: `export ESSDIVE_TOKEN=your_token_here`
- If a query term is not found: verify the exact spelling of the query term, and check that the term exists in your embedding file
- If visualization is slow: use PCA instead of t-SNE for quick previews, and reduce the number of labeled points with `--label-interval`
- If large datasets exhaust memory: process datasets in batches, use smaller subsets for visualization, and consider using PCA for dimensionality reduction
Contributions are welcome! Please:
- Fork the repository
- Create a feature branch
- Submit a pull request
BSD-3-Clause - See LICENSE file for details
If you use TROWEL in your research, please cite:
```
@software{trowel,
  title = {TROWEL: Tool for Retrieving, Organizing, and Wrangling Ecological Labels},
  url = {https://github.com/bioepic-data/trowel},
  year = {2024}
}
```

For issues, questions, or suggestions, please open an issue on GitHub: https://github.com/bioepic-data/trowel/issues
