High-performance viral and microbial genome repository with Parquet storage, DuckDB querying, and integrated search tooling.
- Source layout: installable package under src/gvmagdb, CLI entry point gvmagdb.
- Generated data: written to artifacts/ (parquet/, products/, duckdb/) to keep the repo clean.
- Documentation: full contributor and operations guides live in docs/ (start with docs/README.md).
pixi install # create/solve environment
pixi run install-package # editable install with console script
# ingest genomes + metadata from FASTA/TSV
pixi run ingest \
--fna ingestion_data/gvmagsV2all.fna \
--faa ingestion_data/gvmagsV2all.faa \
--metadata ingestion_data/Updated_naming_Sept2025.tsv
pixi run dashboard-cache # pre-compute dashboard summaries (optional but recommended)
pixi run stats # report high-level database metrics
pixi run dashboard # launch the interactive analytics UI

Daily development helpers:
- pixi run fmt, pixi run lint, pixi run typecheck (or pixi run check for the full bundle).
- pixi run build-diamond, pixi run export-fna, pixi run export-faa to refresh search products.
- pixi run search-diamond, search-hmm, skani for query workflows.
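
The ingest step above can also be driven from a script. The following is a minimal sketch, assuming only the three flags shown in the quick start (`--fna`, `--faa`, `--metadata`); the helper name `build_ingest_command` is hypothetical and not part of the package. It checks that the input files exist before any work starts and returns the argument list for the documented command.

```python
from pathlib import Path
from typing import List


def build_ingest_command(fna: Path, faa: Path, metadata: Path) -> List[str]:
    """Return the argv for `pixi run ingest`, failing early if an input is missing.

    Hypothetical helper: only the flag names come from the quick start above.
    """
    inputs = (fna, faa, metadata)
    missing = [str(p) for p in inputs if not p.is_file()]
    if missing:
        raise FileNotFoundError("missing ingestion inputs: " + ", ".join(missing))
    return [
        "pixi", "run", "ingest",
        "--fna", str(fna),
        "--faa", str(faa),
        "--metadata", str(metadata),
    ]
```

The returned list can be passed to `subprocess.run(cmd, check=True)`; passing a list rather than a shell string avoids quoting problems with unusual file names.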
flowchart LR
subgraph Source Data
A[Tarball: GVMAGS_V2_data.tar.gz]
end
subgraph Ingestion
B[CLI: gvmagdb ingest-cmd]
C[Annotations Loader]
end
subgraph Storage
D[artifacts/parquet<br/>DuckDB views]
E["artifacts/products<br/>(FASTA, DMND, FNA)"]
F[artifacts/duckdb<br/>local catalog]
end
subgraph Workflows
G[Query: gvmagdb diamond/hmmsearch/skani]
H[Analytics: docs/Database_Workflows<br/>Plotly Dash app]
end
A --> B
B --> C --> D
B --> E
D --> F
D --> G
E --> G
G --> H
D --> H
- Ingestion pulls FASTA/metadata from ingestion_data/, enriches with ProteinOrtho & eggNOG annotations, and writes Parquet partitions plus search products into artifacts/.
- Query commands read DuckDB views (no duplication) and reuse generated FASTA/DMND artifacts.
- The Plotly Dash dashboards consume the same artifacts via read-only DuckDB connections (see docs/Database_Workflows.md and docs/Repository_Guidelines.md).
- For background deployment, install the systemd unit in dashboard/gvmagdb-dashboard.service and enable Tailscale Funnel (see docs/Deployment.md).
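
To illustrate how a DuckDB view can expose Parquet partitions without duplicating them, here is a minimal sketch that builds the view definition. The view name `genomes` and the directory `artifacts/parquet/genomes` are assumptions for illustration; the real table names are listed in the schema reference. `read_parquet` is DuckDB's built-in Parquet reader.

```python
from pathlib import Path


def parquet_view_sql(view_name: str, parquet_dir: Path) -> str:
    """Build a CREATE VIEW statement that exposes a Parquet directory without copying it."""
    if not view_name.isidentifier():
        raise ValueError(f"not a safe SQL identifier: {view_name!r}")
    # Recursive glob so that files in partition subdirectories are included.
    pattern = (parquet_dir / "**" / "*.parquet").as_posix()
    # Double any single quotes so the path stays a valid SQL string literal.
    escaped = pattern.replace("'", "''")
    return (
        f"CREATE OR REPLACE VIEW {view_name} AS "
        f"SELECT * FROM read_parquet('{escaped}')"
    )


# Hypothetical view name and partition directory, for illustration only.
sql = parquet_view_sql("genomes", Path("artifacts/parquet/genomes"))
print(sql)
```

With the `duckdb` Python package, a statement like this would be executed once on a writable connection to the catalog file; readers such as the dashboards could then open the same file with `duckdb.connect(path, read_only=True)` and query the view directly.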
The dashboards scan ~5M sequences; precomputing aggregates keeps page loads responsive.
- Run pixi run dashboard-cache after ingestion or whenever the database changes. Results live under artifacts/analytics/ (override via GVMAGDB_ANALYTICS_CACHE_DIR).
- At runtime the data-access layer automatically reads cached Parquet files. Set GVMAGDB_ANALYTICS_CACHE=false to bypass caching during development.
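
The cache behaviour described above can be sketched as follows. This is an illustration under stated assumptions, not the package's actual data-access layer: only the two environment variable names, the `false` value and the default directory come from this README. The function names, the extra falsy spellings and the one-file-per-summary naming (`<name>.parquet`) are hypothetical.

```python
import os
from pathlib import Path
from typing import Mapping, Optional

DEFAULT_CACHE_DIR = Path("artifacts/analytics")
# Only "false" is documented; the other spellings are an assumption.
_FALSE_VALUES = {"0", "false", "no", "off"}


def cache_enabled(env: Mapping[str, str] = os.environ) -> bool:
    """Caching is on unless GVMAGDB_ANALYTICS_CACHE is set to a falsy value."""
    raw = env.get("GVMAGDB_ANALYTICS_CACHE", "true")
    return raw.strip().lower() not in _FALSE_VALUES


def cache_dir(env: Mapping[str, str] = os.environ) -> Path:
    """Use GVMAGDB_ANALYTICS_CACHE_DIR when set and non-empty, else the default."""
    override = env.get("GVMAGDB_ANALYTICS_CACHE_DIR")
    return Path(override) if override else DEFAULT_CACHE_DIR


def cached_summary_path(
    name: str, env: Mapping[str, str] = os.environ
) -> Optional[Path]:
    """Return the cached Parquet file for a summary, or None to signal a live query."""
    if not cache_enabled(env):
        return None
    path = cache_dir(env) / f"{name}.parquet"  # hypothetical file naming
    return path if path.is_file() else None
```

A caller would fall back to querying the database whenever `cached_summary_path` returns `None`, which covers both a disabled cache and a summary that has not been precomputed yet.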
- Repository Guidelines – coding style, testing, PR hygiene.
- Database Workflows – provisioning, analytics, remote querying, dashboard roadmap.
- Database Schema Reference – Parquet partitions and table descriptions.
- Distribution Workflow – release checklist and artifact publishing.