GVMAG Genome Database

High-performance viral and microbial genome repository with Parquet storage, DuckDB querying, and integrated search tooling.

Overview

Source layout: installable package under src/gvmagdb, CLI entry point gvmagdb.
Generated data: written to artifacts/ (parquet/, products/, duckdb/) to keep the repo clean.
Documentation: full contributor and operations guides live in docs/ (start with docs/README.md).

Quick Start

pixi install                     # create/solve environment
pixi run install-package         # editable install with console script
# ingest genomes + metadata from FASTA/TSV
pixi run ingest \
  --fna ingestion_data/gvmagsV2all.fna \
  --faa ingestion_data/gvmagsV2all.faa \
  --metadata ingestion_data/Updated_naming_Sept2025.tsv
pixi run dashboard-cache         # pre-compute dashboard summaries (optional but recommended)
pixi run stats                   # report high-level database metrics
pixi run dashboard               # launch the interactive analytics UI
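
The ingest step fails late if one of its three inputs is missing, so a quick preflight check can save a round trip. The following is a minimal sketch, not part of gvmagdb: the helper names `missing_inputs` and `ingest_command` are hypothetical, and the paths are simply the ones shown in the ingest command above.

```python
from pathlib import Path

# Flags and paths copied from the Quick Start ingest command.
# Assumption: you run from the repository root; adjust if your inputs live elsewhere.
INGEST_INPUTS = {
    "--fna": "ingestion_data/gvmagsV2all.fna",
    "--faa": "ingestion_data/gvmagsV2all.faa",
    "--metadata": "ingestion_data/Updated_naming_Sept2025.tsv",
}


def missing_inputs(root=".", inputs=INGEST_INPUTS):
    """Return 'flag path' strings for every ingest input that is not a file under root."""
    root = Path(root)
    return [
        f"{flag} {rel}"
        for flag, rel in inputs.items()
        if not (root / rel).is_file()
    ]


def ingest_command(inputs=INGEST_INPUTS):
    """Build the argv list for `pixi run ingest` with the configured inputs."""
    cmd = ["pixi", "run", "ingest"]
    for flag, rel in inputs.items():
        cmd += [flag, rel]
    return cmd
```

If `missing_inputs()` returns an empty list, the result of `ingest_command()` can be handed to `subprocess.run(..., check=True)`; otherwise the returned strings name the files to fetch first.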

Daily development helpers:

pixi run fmt, pixi run lint, pixi run typecheck (or pixi run check for the full bundle).
pixi run build-diamond, pixi run export-fna, pixi run export-faa to refresh search products.
pixi run search-diamond, pixi run search-hmm, pixi run skani for query workflows.

Database & Workflow Map

flowchart LR
    subgraph Source Data
        A[Tarball: GVMAGS_V2_data.tar.gz]
    end
    subgraph Ingestion
        B[CLI: gvmagdb ingest-cmd]
        C[Annotations Loader]
    end
    subgraph Storage
        D[artifacts/parquet<br/>DuckDB views]
        E["artifacts/products<br/>(FASTA, DMND, FNA)"]
        F[artifacts/duckdb<br/>local catalog]
    end
    subgraph Workflows
        G[Query: gvmagdb diamond/hmmsearch/skani]
        H[Analytics: docs/Database_Workflows<br/>Plotly Dash app]
    end

    A --> B
    B --> C --> D
    B --> E
    D --> F
    D --> G
    E --> G
    G --> H
    D --> H

Ingestion pulls FASTA/metadata from ingestion_data/, enriches with ProteinOrtho & eggNOG annotations, and writes Parquet partitions plus search products into artifacts/.
Query commands read DuckDB views (no duplication) and reuse generated FASTA/DMND artifacts.
The Plotly Dash dashboards consume the same artifacts via read-only DuckDB connections (see docs/Database_Workflows.md and docs/Repository_Guidelines.md).
For background deployment, install the systemd unit in dashboard/gvmagdb-dashboard.service and enable Tailscale Funnel (see docs/Deployment.md).
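
To illustrate what "DuckDB views (no duplication)" can mean in practice, here is a minimal sketch of a view defined directly over Parquet files, so queries read the files in place instead of copying rows into a table. It is not the gvmagdb implementation: the function names, the view name `genomes`, and the directory `artifacts/parquet/genomes` are hypothetical, and the actual view definitions in the project may differ. It requires the third-party `duckdb` package only when a connection is opened.

```python
from pathlib import Path


def parquet_view_sql(view_name, parquet_dir):
    """Build a CREATE VIEW statement that reads Parquet files in place."""
    if not view_name.isidentifier():
        raise ValueError(f"not a simple SQL identifier: {view_name!r}")
    # Recursive glob so partition subdirectories are picked up as well.
    pattern = (Path(parquet_dir) / "**" / "*.parquet").as_posix()
    pattern = pattern.replace("'", "''")  # escape single quotes for the SQL literal
    return (
        f"CREATE OR REPLACE VIEW {view_name} AS "
        f"SELECT * FROM read_parquet('{pattern}')"
    )


def open_parquet_views(views):
    """Open an in-memory DuckDB connection and register one view per directory.

    `views` maps view names to Parquet directories (hypothetical layout).
    """
    import duckdb  # third-party; imported lazily so the SQL helper works without it

    con = duckdb.connect(":memory:")
    for name, directory in views.items():
        con.execute(parquet_view_sql(name, directory))
    return con
```

For example, `open_parquet_views({"genomes": "artifacts/parquet/genomes"})` would return a connection on which `SELECT count(*) FROM genomes` scans the Parquet files directly, assuming that directory exists and contains Parquet data.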

Analytics Cache

The dashboards scan ~5M sequences; precomputing aggregates keeps page loads responsive.

Run pixi run dashboard-cache after ingestion or whenever the database changes. Results live under artifacts/analytics/ (override via GVMAGDB_ANALYTICS_CACHE_DIR).
At runtime the data-access layer automatically reads cached Parquet files. Set GVMAGDB_ANALYTICS_CACHE=false to bypass caching during development.
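
The two environment variables above can be pictured with the following sketch. It is an assumption about how such settings are typically resolved, not the project's actual code: the function names are hypothetical, and the set of values treated as "off" is a guess, since the README only documents `false`.

```python
import os
from pathlib import Path

# Default location taken from the README (artifacts/analytics/).
DEFAULT_CACHE_DIR = Path("artifacts") / "analytics"

# Assumption: only "false" is documented; the other spellings are a convenience guess.
_FALSE_VALUES = {"0", "false", "no", "off"}


def analytics_cache_enabled(env=None):
    """Return False only when GVMAGDB_ANALYTICS_CACHE is set to a false-like value."""
    env = os.environ if env is None else env
    raw = env.get("GVMAGDB_ANALYTICS_CACHE", "true")
    return raw.strip().lower() not in _FALSE_VALUES


def analytics_cache_dir(env=None):
    """Return the cache directory, honoring GVMAGDB_ANALYTICS_CACHE_DIR if set."""
    env = os.environ if env is None else env
    override = env.get("GVMAGDB_ANALYTICS_CACHE_DIR")
    return Path(override) if override else DEFAULT_CACHE_DIR
```

Passing a plain dict as `env` makes the behavior easy to check without touching the real environment, for example `analytics_cache_enabled({"GVMAGDB_ANALYTICS_CACHE": "false"})`.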
