TROWEL


TROWEL - a Tool for Retrieving, Organizing, and Wrangling Ecological science data and Labels

TROWEL is a command-line toolkit for working with ecological science data and ontologies, particularly BERVO. It provides tools for retrieving datasets, extracting variables, matching terms, and analyzing term relationships using LLM embeddings.

Features

  • Dataset Metadata Retrieval - Fetch metadata from ESS-DIVE datasets
  • Variable Extraction - Extract variable names and definitions from datasets
  • Term Matching - Match terms between files using exact and fuzzy matching
  • Embedding-Based Analysis - Analyze term relationships using LLM embeddings
  • Clustering & Visualization - Visualize term clusters using PCA and t-SNE
  • Cross-Dataset Comparison - Find similar terms across different ontologies and datasets

Installation

Requirements

  • Python 3.10+
  • Poetry (for dependency management)

Quick Setup

git clone https://github.com/bioepic-data/trowel.git
cd trowel
poetry install

For full embedding analysis capabilities, install optional dependencies:

pip install matplotlib seaborn scikit-learn scipy

For DuckDB support and CurateGPT integration:

pip install duckdb curategpt

Usage

Command Structure

trowel COMMAND [OPTIONS] [ARGUMENTS]

All commands include help text:

trowel --help              # Show all available commands
trowel COMMAND --help      # Show help for a specific command

Commands Overview

1. Data Download

Commands for downloading ontology and dataset data.

get-bervo

Download BERVO, the Biogeochemical and Ecological Processes Ontology.

trowel get-bervo -o bervo.csv

Options:

  • -o, --output TEXT - Output file path (default: bervo.csv)

Output:

  • CSV file containing the complete BERVO ontology with all terms, definitions, units, and category information

This command downloads the official BERVO ontology from Google Sheets and is useful for preparing the ontology for downstream analysis with other trowel commands.

2. ESS-DIVE Integration

Commands for working with ESS-DIVE environmental science datasets.

Requires: the ESSDIVE_TOKEN environment variable (see Getting ESS-DIVE Access below)

get-essdive-metadata

Retrieve metadata from ESS-DIVE datasets by DOI.

trowel get-essdive-metadata \
  -p dois.txt \
  -o ./output

Options:

  • -p, --path TEXT - Path to file with DOIs (one per line)
  • -o, --outpath TEXT - Output directory (default: current directory)

Output:

  • results.tsv - Dataset metadata
  • frequencies.txt - Variable frequency statistics
  • filetable.tsv - List of files from all datasets

get-essdive-variables

Extract variable names from dataset files and data dictionaries.

trowel get-essdive-variables \
  -p filetable.tsv \
  -o ./output \
  -w 10

Options:

  • -p, --path TEXT - Path to filetable.tsv (auto-detected if not provided)
  • -o, --outpath TEXT - Output directory
  • -w, --workers INTEGER - Number of parallel workers (default: 10)

Output:

  • variable_names.tsv - Extracted variable names and metadata
  • data_dictionaries.tsv - Compiled data dictionary entries
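
As a rough illustration of what the -w workers option controls, the sketch below runs a per-file extraction helper across a thread pool. It is not TROWEL's implementation; fetch_variables and the example URLs are hypothetical placeholders.

# Illustrative only: parallel per-file extraction with a thread pool.
# fetch_variables is a hypothetical placeholder, not a TROWEL function.
from concurrent.futures import ThreadPoolExecutor

def fetch_variables(file_url: str) -> list[str]:
    ...  # download the file and pull out column headers / data dictionary terms

file_urls = ["https://example.org/data1.csv", "https://example.org/data2.csv"]
with ThreadPoolExecutor(max_workers=10) as pool:   # corresponds to -w 10
    variable_lists = list(pool.map(fetch_variables, file_urls))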

3. Term Matching

Commands for matching terms between files.

match-term-lists

Match terms from a TSV file against a list of terms.

trowel match-term-lists \
  -t terms.tsv \
  -l target_list.txt \
  -o results.tsv \
  -f \
  -s 80

Options:

  • -t, --terms-file TEXT - TSV file with terms in first column (required)
  • -l, --list-file TEXT - Text file with terms, one per line (required)
  • -o, --output TEXT - Output file path
  • -f, --fuzzy - Enable fuzzy matching with Levenshtein distance
  • -s, --similarity-threshold FLOAT - Similarity threshold 0-100 (default: 80)

Output: TSV file with original terms plus a new column indicating matches.
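
To make the fuzzy option concrete, here is a minimal sketch of threshold-based fuzzy matching using the rapidfuzz package; TROWEL's internal matching may use a different library, and the example terms are invented.

# Minimal sketch of fuzzy matching against a similarity threshold (0-100).
# Assumes the rapidfuzz package; TROWEL's internals may differ.
from rapidfuzz import fuzz

def best_match(term: str, candidates: list[str], threshold: float = 80.0):
    scored = [(c, fuzz.ratio(term.lower(), c.lower())) for c in candidates]
    best, score = max(scored, key=lambda pair: pair[1])
    return best if score >= threshold else None

print(best_match("soil_moisture_pct", ["Soil moisture", "Air temperature"]))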

4. Embedding-Based Analysis

Commands for analyzing term relationships using LLM embeddings. These commands work with embeddings generated by CurateGPT.

prepare-embeddings

Prepare a CSV file for embedding by selecting specific columns.

trowel embeddings prepare-embeddings \
  -i bervo.csv \
  -o bervo_prepared.csv \
  -c 0,1,6,12 \
  --skip-rows 1

Options:

  • -i, --input TEXT - Input CSV file (required)
  • -o, --output TEXT - Output CSV file (required)
  • -c, --columns TEXT - Comma-separated column indices (0-indexed, required)
  • --skip-rows INTEGER - Number of header rows to skip (default: 0)

generate-embeddings

Generate vector embeddings for CSV data using CurateGPT and OpenAI's embedding API.

This command handles the complete embedding pipeline: reading your prepared data, calling OpenAI's text-embedding-ada-002 model for each row, storing embeddings in a DuckDB database, and optionally exporting results to CSV for downstream analysis.

Requirements:

  • OPENAI_API_KEY environment variable must be set
  • CurateGPT installed: pip install curategpt
  • DuckDB installed: pip install duckdb

# Basic usage - embed a prepared file
trowel embeddings generate-embeddings -i bervo_prepared.csv

# Specify collection and database paths
trowel embeddings generate-embeddings \
  -i bervo_prepared.csv \
  -c bervo \
  -d backup/bervo.duckdb

# Test with subset of rows
trowel embeddings generate-embeddings \
  -i bervo_prepared.csv \
  -l 1000

# Specify which columns to use for embeddings
trowel embeddings generate-embeddings \
  -i bervo_prepared.csv \
  -f "id,label,definition"

# Generate and export embeddings for use with other commands
trowel embeddings generate-embeddings \
  -i bervo_prepared.csv \
  -e backup/bervo_embeds.csv

Options:

  • -i, --input TEXT - Path to prepared CSV file (required)
  • -c, --collection TEXT - Collection name in DuckDB (default: "embeddings")
  • -d, --db-path TEXT - Path to DuckDB file for storage (default: "./backup/db.duckdb")
  • -f, --text-fields TEXT - Comma-separated column names to use for embeddings. If not specified, uses all columns
  • -l, --limit INTEGER - Maximum rows to embed (useful for testing large files)
  • -s, --skip INTEGER - Number of rows to skip from beginning
  • -e, --export TEXT - Optional: export embeddings to CSV file after generation

Output:

  • DuckDB database stored at --db-path location (default: ./backup/db.duckdb)
  • If --export specified: CSV file with embeddings for use with other commands

Note on Costs: Each embedding call uses OpenAI's API and incurs a small cost. The text-embedding-ada-002 model is one of the most affordable OpenAI models. For BERVO (5000+ terms), expect costs of a few dollars.
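
For orientation, the following is a minimal sketch of what this pipeline does conceptually, written directly against the openai and duckdb Python packages rather than through CurateGPT. The table schema and the assumption of an "id" column are illustrative, not TROWEL's actual storage layout.

# Conceptual sketch of the embed-and-store loop; not TROWEL's implementation.
# Assumes OPENAI_API_KEY is set and the openai + duckdb packages are installed.
import csv
import duckdb
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
con = duckdb.connect("backup/db.duckdb")
con.execute("CREATE TABLE IF NOT EXISTS embeddings (id VARCHAR, text VARCHAR, vector DOUBLE[])")

with open("bervo_prepared.csv", newline="") as f:
    for row in csv.DictReader(f):
        text = " ".join(v for v in row.values() if v)            # concatenate the text fields
        resp = client.embeddings.create(model="text-embedding-ada-002", input=text)
        con.execute("INSERT INTO embeddings VALUES (?, ?, ?)",
                    [row.get("id", ""), text, resp.data[0].embedding])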

load-embeddings

Load embeddings and compute statistics.

trowel embeddings load-embeddings \
  -e bervo_embeds.csv \
  -o ./analysis

Options:

  • -e, --embeddings TEXT - Embedding CSV file from CurateGPT (required)
  • -o, --output TEXT - Output directory (default: current directory)

Output:

  • embedding_stats.json - Statistics (count, dimension, similarity metrics)

Note: This command generates new embeddings. To reuse pre-computed embeddings stored in the backup/ directory or elsewhere, use the other embedding commands (find-similar, visualize-clusters, etc.) which have built-in support for loading from saved files.

find-similar

Find the most similar terms to a query term.

# Load from current directory
trowel embeddings find-similar \
  -e bervo_embeds.csv \
  -q BERVO:0000026 \
  -n 20 \
  -o results.txt

# Load from backup/ directory (auto-discovered)
trowel embeddings find-similar \
  -e bervo_embeds.csv \
  -q BERVO:0000026 \
  -n 20

# Load only first 5000 terms from large file
trowel embeddings find-similar \
  -e bervo_embeds.csv \
  -q BERVO:0000026 \
  -n 20 \
  -l 5000

Options:

  • -e, --embeddings TEXT - Path to embedding CSV file. Can be:
    • Full path: /path/to/embeddings.csv
    • Filename: bervo_embeds.csv (searches current dir and backup/)
    • Without extension: bervo_embeds (auto-adds .csv and checks backup/)
  • -q, --query TEXT - Query term ID or label (required)
  • -n, --top-n INTEGER - Number of results (default: 10)
  • -l, --limit INTEGER - Maximum number of terms to load (optional, useful for large files)
  • -s, --skip INTEGER - Number of terms to skip from beginning (default: 0)
  • -o, --output TEXT - Output file for results (optional)

Embedding File Locations:

  • Current working directory: bervo_embeds.csv
  • Backup directory: backup/bervo_embeds.csv (auto-discovered)
  • Can also pre-generate and store embeddings in backup/ for reuse across sessions
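
Under the hood this amounts to ranking terms by cosine similarity against the query's vector. The sketch below shows the idea with polars and numpy; the column names (id, label), the CSV layout, and the query label are assumptions, not a documented export schema.

# Sketch of cosine-similarity ranking over an exported embedding CSV.
# Column names and layout are assumed; TROWEL's exports may differ.
import numpy as np
import polars as pl

df = pl.read_csv("bervo_embeds.csv")
labels = df["label"].to_list()
vectors = df.drop(["id", "label"]).to_numpy().astype(float)

query_idx = labels.index("soil temperature")        # hypothetical query label
norms = np.linalg.norm(vectors, axis=1)
sims = vectors @ vectors[query_idx] / (norms * norms[query_idx])

for i in np.argsort(-sims)[1:11]:                   # top 10, skipping the query itself
    print(f"{labels[i]}\t{sims[i]:.3f}")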

visualize-clusters

Create 2D visualizations of term clusters.

# Fast PCA visualization
trowel embeddings visualize-clusters \
  -e bervo_embeds.csv \
  -m pca \
  -o clusters_pca.png

# Better quality t-SNE visualization (slower)
trowel embeddings visualize-clusters \
  -e bervo_embeds.csv \
  -m tsne \
  -o clusters_tsne.png

# Load from backup/ and use only first 5000 terms
trowel embeddings visualize-clusters \
  -e bervo_embeds.csv \
  -m pca \
  -o clusters_pca.png \
  -l 5000

Options:

  • -e, --embeddings TEXT - Path to embedding CSV file. Can be:
    • Full path: /path/to/embeddings.csv
    • Filename: bervo_embeds.csv (searches current dir and backup/)
    • Without extension: bervo_embeds (auto-adds .csv and checks backup/)
  • -m, --method [pca|tsne] - Dimensionality reduction method (default: pca)
  • -l, --limit INTEGER - Maximum number of terms to load (optional, useful for large files)
  • -s, --skip INTEGER - Number of terms to skip from beginning (default: 0)
  • -o, --output TEXT - Output PNG file
  • --label-interval INTEGER - Interval for labeling points (default: 100)
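
The projection step itself is standard scikit-learn. A minimal sketch follows, using random vectors as a stand-in for loaded embeddings (matplotlib and scikit-learn are the optional dependencies listed under Installation); swapping PCA for sklearn.manifold.TSNE gives the slower t-SNE layout.

# Minimal PCA projection and scatter plot; random vectors stand in for embeddings.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
vectors = rng.normal(size=(500, 1536))              # stand-in for 500 embedding vectors

coords = PCA(n_components=2).fit_transform(vectors)
plt.figure(figsize=(8, 6))
plt.scatter(coords[:, 0], coords[:, 1], s=5)
plt.title("Embedding clusters (PCA)")
plt.savefig("clusters_pca.png", dpi=150)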

visualize-by-category

Visualize clusters colored by ontology category.

trowel embeddings visualize-by-category \
  -s bervo.csv \
  -e bervo_embeds.csv \
  -o clusters_categorical.png

Options:

  • -s, --source-csv TEXT - Source CSV with category information (required)
  • -e, --embeddings TEXT - Embedding CSV file (required)
  • -o, --output TEXT - Output PNG file
  • --label-interval INTEGER - Interval for labeling points (default: 100)

visualize-heatmap

Create similarity heatmap for a subset of terms.

trowel embeddings visualize-heatmap \
  -e bervo_embeds.csv \
  -n 50 \
  -o similarity_heatmap.png

Options:

  • -e, --embeddings TEXT - Embedding CSV file (required)
  • -n, --num-terms INTEGER - Number of terms to include (default: 50)
  • -o, --output TEXT - Output PNG file
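
Conceptually this is a pairwise cosine-similarity matrix over the first N terms, rendered as a heatmap. A sketch with seaborn, again using random stand-in vectors rather than real embeddings:

# Pairwise cosine-similarity heatmap for a small subset of terms (sketch only).
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
vectors = rng.normal(size=(50, 1536))               # stand-in for the first 50 term vectors
unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
sim = unit @ unit.T                                 # 50 x 50 cosine similarity matrix

plt.figure(figsize=(10, 8))
sns.heatmap(sim, cmap="viridis")
plt.title("Pairwise term similarity")
plt.savefig("similarity_heatmap.png", dpi=150)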

cross-collection-similarity

Find similar term pairs between two collections.

trowel embeddings cross-collection-similarity \
  -b bervo_embeds.csv \
  -n new_vars_embeds.csv \
  -t 25 \
  -o matches.txt

Options:

  • -b, --bervo-embeddings TEXT - First embedding file (required)
  • -n, --new-embeddings TEXT - Second embedding file (required)
  • -t, --top-n INTEGER - Number of top pairs (default: 25)
  • -o, --output TEXT - Output file for results (optional)
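
The comparison boils down to scoring every pair of terms across the two collections by cosine similarity and keeping the top-scoring pairs. A numpy sketch of that idea (function name, matrix shapes, and labels are illustrative):

# Sketch of cross-collection matching: top term pairs by cosine similarity.
import numpy as np

def top_pairs(a, b, labels_a, labels_b, top_n=25):
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    sims = a @ b.T                                  # one score per (a-term, b-term) pair
    flat = np.argsort(-sims, axis=None)[:top_n]
    rows, cols = np.unravel_index(flat, sims.shape)
    return [(labels_a[r], labels_b[c], float(sims[r, c])) for r, c in zip(rows, cols)]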

Common Workflows

Workflow 1: Extract and Match Variables

# 0. Download BERVO ontology
trowel get-bervo -o bervo.csv

# 1. Get dataset metadata
trowel get-essdive-metadata -p dois.txt -o ./data

# 2. Extract variables from datasets
trowel get-essdive-variables -p ./data/filetable.tsv -o ./data

# 3. Match variables to BERVO terms
trowel match-term-lists \
  -t bervo.csv \
  -l ./data/variable_names.txt \
  -o matched_variables.tsv \
  -f

Workflow 2: Analyze Term Relationships

# 1. Download BERVO ontology
trowel get-bervo -o bervo.csv

# 2. Prepare data for embedding
trowel embeddings prepare-embeddings \
  -i bervo.csv \
  -o bervo_prepared.csv \
  -c 0,1,6,12 \
  --skip-rows 1

# 3. Generate embeddings (using OpenAI API via CurateGPT)
# This generates embeddings and saves them to backup/ for reuse
trowel embeddings generate-embeddings \
  -i bervo_prepared.csv \
  -c bervo \
  -d backup/bervo.duckdb \
  -e backup/bervo_embeds.csv

# 4. Find similar terms
trowel embeddings find-similar \
  -e bervo_embeds.csv \
  -q BERVO:0000026 \
  -n 20 \
  -o similar_terms.txt

# 5. Create visualizations
trowel embeddings visualize-clusters \
  -e bervo_embeds.csv \
  -m pca \
  -o clusters_pca.png

trowel embeddings visualize-by-category \
  -s bervo.csv \
  -e bervo_embeds.csv \
  -o clusters_by_category.png

Workflow 3: Compare Ontologies

# 1. Download BERVO for comparison (optional)
trowel get-bervo -o ontology1.csv

# 2. Prepare both ontologies
trowel embeddings prepare-embeddings \
  -i ontology1.csv \
  -o ontology1_prepared.csv \
  -c 0,1,6 --skip-rows 1

trowel embeddings prepare-embeddings \
  -i ontology2.csv \
  -o ontology2_prepared.csv \
  -c 0,1,6 --skip-rows 1

# 3. Generate embeddings for both (using OpenAI API via CurateGPT)
trowel embeddings generate-embeddings \
  -i ontology1_prepared.csv \
  -c ontology1 \
  -d backup/ontology1.duckdb \
  -e backup/ont1_embeds.csv

trowel embeddings generate-embeddings \
  -i ontology2_prepared.csv \
  -c ontology2 \
  -d backup/ontology2.duckdb \
  -e backup/ont2_embeds.csv

# 4. Compare the ontologies
trowel embeddings cross-collection-similarity \
  -b backup/ont1_embeds.csv \
  -n backup/ont2_embeds.csv \
  -t 50 \
  -o ontology_comparison.txt

Embedding Storage and Reuse

Pre-computed embeddings can be stored in the backup/ directory and automatically discovered by TROWEL's embedding commands. This allows you to generate embeddings once and reuse them across many analysis sessions without regenerating them.

Recommended Workflow

  1. Generate embeddings once using the integrated command and save to backup/:
# Generate embeddings using trowel (handles OpenAI API calls internally)
# This saves both the DuckDB database AND exports to CSV
trowel embeddings generate-embeddings \
  -i bervo_prepared.csv \
  -c bervo \
  -d backup/bervo.duckdb \
  -e backup/bervo_embeds.csv

  2. Reuse embeddings for analysis (no regeneration needed):
# Auto-finds in backup/
trowel embeddings find-similar -e bervo_embeds.csv -q BERVO:0000026

# Work with subsets of large files
trowel embeddings visualize-clusters -e bervo_embeds.csv -m pca -l 5000 -o viz.png

# All searches check: current dir → backup/ → auto-add .csv → backup/ with .csv

Why Store in backup/?

  • Separation: Keeps generated embeddings separate from source data
  • Reusability: Multiple workflows can reference the same embeddings
  • Auto-discovery: Commands automatically search backup/ for files
  • Backup: Easy to version control and backup important embedding results

File Resolution

When you specify an embedding file with -e bervo_embeds.csv, TROWEL searches:

  1. Current working directory: bervo_embeds.csv
  2. Backup directory: backup/bervo_embeds.csv
  3. Current dir with .csv: bervo_embeds.csv (if you specify bervo_embeds)
  4. Backup with .csv: backup/bervo_embeds.csv (if you specify bervo_embeds)

This allows you to:

  • Pre-compute embeddings and store them once in backup/
  • Reference them by simple filename from anywhere in your workflow
  • Work with multiple embedding versions without specifying full paths
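
The lookup order above can be written down as a short helper. This is a simplified reimplementation for illustration, not TROWEL's actual resolution code.

# Simplified sketch of the embedding-file lookup order described above.
from pathlib import Path

def resolve_embeddings(name: str) -> Path | None:
    candidates = [Path(name), Path("backup") / name]            # 1. cwd   2. backup/
    if not name.endswith(".csv"):
        candidates += [Path(f"{name}.csv"),                     # 3. cwd, .csv added
                       Path("backup") / f"{name}.csv"]          # 4. backup/, .csv added
    return next((p for p in candidates if p.is_file()), None)

print(resolve_embeddings("bervo_embeds"))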

About BERVO

TROWEL is specifically designed to work with the BERVO ontology, a comprehensive biogeochemical ontology for environmental science variables. While TROWEL may work with other ontologies, optimal results are achieved with BERVO due to its well-structured hierarchy and comprehensive variable definitions.

BERVO includes:

  • Standardized variable definitions
  • Clear hierarchical organization (categories starting with BERVO:9)
  • Units and measurement specifications
  • Integration with ESS-DIVE datasets

Downloading BERVO

Download the latest BERVO ontology directly from the official source:

trowel get-bervo -o bervo.csv

This downloads the latest BERVO from Google Sheets, making it easy to keep your ontology current.

Setup & Configuration

Environment Variables

# Required for ESS-DIVE commands
export ESSDIVE_TOKEN=your_token_here

# Required for embedding generation (CurateGPT)
export OPENAI_API_KEY=your_api_key_here

Getting ESS-DIVE Access

  1. Visit https://docs.ess-dive.lbl.gov/programmatic-tools/ess-dive-dataset-api#get-access
  2. Follow the authentication instructions
  3. Set your token as shown above

Getting OpenAI API Key

  1. Sign up for an OpenAI account at https://openai.com
  2. Navigate to the API keys section
  3. Create a new API key
  4. Set as shown above

Requirements & Dependencies

Core

  • Python 3.10+
  • click (CLI framework)
  • requests (HTTP client)
  • polars (data processing)
  • tqdm (progress bars)
  • openpyxl, xlrd (spreadsheet support)

Optional (for embeddings)

  • numpy, scipy (numerical operations)
  • matplotlib, seaborn (visualization)
  • scikit-learn (dimensionality reduction)
  • duckdb (database access)
  • curategpt (embedding generation)

Install optional dependencies:

pip install matplotlib seaborn scikit-learn scipy duckdb curategpt

Troubleshooting

"ESSDIVE_TOKEN is not set"

export ESSDIVE_TOKEN=your_token_here

"Term not found" error

  • Verify the exact spelling of the query term
  • Check that the term exists in your embedding file

Visualization takes too long

  • Use PCA instead of t-SNE for quick previews
  • Reduce the number of labeled points with --label-interval

Out of memory with large datasets

  • Process datasets in batches
  • Use smaller subsets for visualization
  • Consider using PCA for dimensionality reduction

Contributing

Contributions are welcome! Please:

  1. Fork the repository
  2. Create a feature branch
  3. Submit a pull request

License

BSD-3-Clause - See LICENSE file for details

Citation

If you use TROWEL in your research, please cite:

@software{trowel,
  title = {TROWEL: Tool for Retrieving, Organizing, and Wrangling Ecological Labels},
  url = {https://github.com/bioepic-data/trowel},
  year = {2024}
}

Contact

For issues, questions, or suggestions, please open an issue on GitHub: https://github.com/bioepic-data/trowel/issues
