GVClass assigns taxonomy to giant virus contigs and metagenome-assembled genomes (GVMAGs). It uses phylogenetic analysis based on giant virus orthologous groups (GVOGs) to provide accurate classification from domain to species level.
Best for: Contributing code, modifying the pipeline, or running locally
# 1. Install Pixi (one-time)
curl -fsSL https://pixi.sh/install.sh | bash
# 2. Clone repository
git clone https://github.com/NeLLi-team/gvclass.git
cd gvclass
# 3. Install dependencies (pixi handles everything)
pixi install
# 4. Run GVClass (must run from repo directory)
pixi run gvclass <input_directory> -t 16
# Optional: install CLI wrappers into ~/bin
pixi run install-cli
# Test installation with example data
pixi run run-example
Best for: Running on HPC clusters, no installation needed
# Download the wrapper script
wget https://raw.githubusercontent.com/NeLLi-team/gvclass/main/gvclass-a
chmod +x gvclass-a
# Run from anywhere (image auto-downloads on first use)
./gvclass-a /path/to/query_genomes /path/to/results -t 32
# With options
./gvclass-a my_data my_results -t 32 --tree-method iqtree --mode-fast
The Apptainer image includes the database (~700 MB) and all dependencies. No setup needed!
- Directory containing `.fna` (nucleic acid) or `.faa` (protein) files
- Minimum size: 20 kb recommended (50 kb+ preferred)
- Clean filenames: Avoid special characters (`.;:`), use `_` or `-` instead
- Protein headers: Format as `filename|proteinid` for best results
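The naming guidelines above can be checked before a run. The following is a minimal Python sketch and not part of GVClass: the helper names `clean_stem` and `format_protein_header` are hypothetical, and it assumes that the `filename` part of a protein header is the file name without its extension.

```python
import re
from pathlib import PurePosixPath

# Characters the guidelines above say to avoid in filenames.
SPECIAL_CHARS = re.compile(r"[.;:]")


def clean_stem(filename: str) -> str:
    """Replace '.', ';' and ':' in the file stem with '_'.

    The final extension (.fna or .faa) is kept unchanged.
    """
    path = PurePosixPath(filename)
    return SPECIAL_CHARS.sub("_", path.stem) + path.suffix


def format_protein_header(filename: str, protein_id: str) -> str:
    """Build a FASTA header in the recommended filename|proteinid form.

    Assumption: 'filename' means the file name without its extension.
    """
    return f">{PurePosixPath(filename).stem}|{protein_id}"


if __name__ == "__main__":
    print(clean_stem("sample.v2:binA.fna"))
    print(format_protein_header("binA.faa", "p001"))
```

Renaming files on disk is left out on purpose; the sketch only computes the cleaned names so they can be reviewed first.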
# Basic usage
pixi run gvclass my_genomes -o my_results -t 32
# With options
pixi run gvclass my_genomes -t 32 --mode-fast --tree-method iqtree -j 4
# Classify each contig separately (useful for metagenome contigs)
pixi run gvclass --contigs my_genome.fna -o results -t 32
# Basic usage
./gvclass-a my_genomes my_results -t 32
# Fast mode (skip order-level markers for 2-3x speedup)
./gvclass-a my_genomes my_results -t 32 --mode-fast
# Use IQ-TREE for more accurate phylogeny (slower)
./gvclass-a my_genomes my_results -t 32 --tree-method iqtree
# Control parallelization (4 workers × 8 threads = 32 total)
./gvclass-a my_genomes my_results -t 32 -j 4
# Classify each contig in a single FNA file separately
./gvclass-a --contigs my_genome.fna -o results -t 32
Results are saved to `<input_name>_results/` containing:
- `gvclass_summary.tsv` - Main results with taxonomy assignments (legacy format)
- `gvclass_summary.csv` - Main results with taxonomy assignments (CSV for spreadsheets)
- Individual query subdirectories with detailed analysis
| Column | Description |
|---|---|
| query | Input filename |
| taxonomy_majority | Full taxonomy based on majority rule |
| taxonomy_strict | Conservative taxonomy (100% agreement) |
| species → domain | Individual taxonomic levels with taxon counts |
| avgdist | Average tree distance to references |
| order_dup | Duplication factor indicating contamination level |
| order_completeness | Order-specific completeness (% unique markers found) |
| gvog4_unique | Count of unique GVOG4 markers found |
| gvog8_unique/total/dup | GVOG8 marker counts and duplication |
| ncldv_mcp_total | NCLDV-specific MCP marker count |
| mcp_total | All MCP marker count (NCLDV + Mirus) |
| vp_completeness | Virophage completeness (n/4 core markers: MCP, Penton, ATPase, Protease) |
| vp_mcp | Count of proteins with VP MCP marker hits |
| plv | Count of proteins with PLV marker hits (single PLV marker; values can be 0..N) |
| vp_df | Virophage duplication factor (total VP hits / 4) |
| mirus_completeness | Mirusviricota completeness (n/4 core markers: MCP, ATPase, Portal, Triplex) |
| mirus_df | Mirusviricota duplication factor |
| mrya_unique/total | Mryavirus-specific marker counts |
| phage_unique/total | Phage marker counts |
| cellular_unique/total/dup | Cellular contamination markers |
| contigs | Number of contigs |
| LENbp | Total length in base pairs |
| GCperc | GC content percentage |
| genecount | Number of predicted genes |
| CODINGperc | Coding density percentage |
| ttable | Genetic code used |
| weighted_order_completeness | NEW: Weighted completeness score considering marker importance |
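The summary table can be read with any TSV parser. The sketch below uses only the Python standard library; the function names are hypothetical, the sample rows are made-up placeholders rather than real GVClass output, and it assumes the file starts with a header row that uses the column names listed above.

```python
import csv
import io

# Made-up rows for illustration only; real values come from gvclass_summary.tsv.
EXAMPLE_TSV = (
    "query\ttaxonomy_majority\torder_completeness\torder_dup\n"
    "bin_01\tplaceholder_taxonomy_A\t85.0\t1.1\n"
    "bin_02\tplaceholder_taxonomy_B\t25.0\t2.4\n"
)


def load_summary(handle):
    """Read a tab-separated summary into a list of dicts keyed by column name."""
    return list(csv.DictReader(handle, delimiter="\t"))


def column_as_float(rows, column):
    """Map each query name to the numeric value of one column."""
    return {row["query"]: float(row[column]) for row in rows}


if __name__ == "__main__":
    rows = load_summary(io.StringIO(EXAMPLE_TSV))
    print(column_as_float(rows, "order_completeness"))
```

For a real run, pass an open file instead of the in-memory example, for instance `open("my_results/gvclass_summary.tsv", newline="")`.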
Create `gvclass_config.yaml` to set defaults:
database:
  path: resources  # Database location
pipeline:
  tree_method: fasttree  # or 'iqtree' for more accuracy
  mode_fast: false  # Skip order-level marker trees when true (speeds up analysis)
  threads: 16  # Default thread count
- Expanded Database: Added Mirusviricota genomes, virophages (PV), Polinton-like viruses (PLV), and extended phage references from MetaVR
- Updated GA Thresholds: Refined gathering thresholds in HMM models for more accurate marker detection
- Model Annotations: Added functional annotations to HMM models
- Documentation: Comprehensive CLI reference, genetic code selection logic, and contig splitting details
- Code Quality: Bug fixes and cleaner codebase
The gvclass-a wrapper handles container execution automatically. For manual control:
# Pull the image manually
apptainer pull library://nelligroup-jgi/gvclass/gvclass:1.2.0
# Run with manual bind mounts
apptainer run -B /path/to/data:/input -B /path/to/results:/output \
  gvclass_1.2.0.sif /input -o /output -t 32
The wrapper is simpler and handles bind mounts automatically.
To make apptainer pull library://nelligroup-jgi/gvclass/gvclass:1.2.0 work, you must build and push the SIF to the Sylabs library:
# Build the SIF from the definition file
apptainer build gvclass.sif containers/apptainer/gvclass.def
# Authenticate to the Sylabs library (one-time)
apptainer remote login
# Push the image to the library
apptainer push gvclass.sif library://nelligroup-jgi/gvclass/gvclass:1.2.0
| Option | Short | Description | Default |
|---|---|---|---|
| `query_dir` | | Input directory or file | Required |
| `--output-dir` | `-o` | Output directory | `<query>_results` |
| `--threads` | `-t` | Total threads | 16 |
| `--max-workers` | `-j` | Parallel workers | Auto |
| `--threads-per-worker` | | Threads per worker | Auto |
| `--database` | `-d` | Override database path | Auto |
| `--tree-method` | | `fasttree` or `iqtree` | `fasttree` |
| `--mode-fast` | `-f` | Fast mode: core markers only | True |
| `--extended` | `-e` | Extended mode: all marker trees | False |
| `--contigs` | `-C` | Split multi-contig file | False |
| `--resume` | | Resume interrupted run | False |
| `--verbose` | `-v` | Enable debug output | False |
| `--version` | | Show version info | |
| `--cluster-type` | | `local`, `slurm`, `pbs`, `sge` | `local` |
| `--cluster-queue` | | HPC queue/partition name | |
| `--cluster-project` | | HPC project/account | |
| `--cluster-walltime` | | HPC time limit | 04:00:00 |
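When GVClass is launched from a script, it can help to assemble the argument list programmatically. The following sketch covers only a subset of the options in the table; `build_gvclass_command` is a hypothetical helper, not part of the GVClass code base, and it assumes the `gvclass` wrapper is on `PATH`.

```python
from typing import List, Optional


def build_gvclass_command(
    query: str,
    output_dir: Optional[str] = None,
    threads: int = 16,
    max_workers: Optional[int] = None,
    tree_method: str = "fasttree",
    contigs: bool = False,
) -> List[str]:
    """Assemble a gvclass argument list from a subset of the documented options."""
    if tree_method not in ("fasttree", "iqtree"):
        raise ValueError(f"unsupported tree method: {tree_method}")
    cmd = ["gvclass"]
    if contigs:
        # --contigs expects a single FNA file as the query.
        cmd.append("--contigs")
    cmd.append(query)
    if output_dir is not None:
        cmd += ["--output-dir", output_dir]
    cmd += ["--threads", str(threads)]
    if max_workers is not None:
        cmd += ["--max-workers", str(max_workers)]
    cmd += ["--tree-method", tree_method]
    return cmd


if __name__ == "__main__":
    print(" ".join(build_gvclass_command("my_genomes", "my_results", 32, 4)))
```

The returned list can be handed to `subprocess.run(cmd, check=True)`; passing a list rather than a shell string avoids quoting problems with unusual paths.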
Process each contig in a multi-contig FNA file as a separate query:
# Split contigs and classify each independently
gvclass --contigs metagenome_contigs.fna -o results -t 32
How it works:
- Requires a single FNA file (not a directory)
- Splits file into individual contig files in a temporary directory
- Sanitizes contig IDs for filenames (replaces `/\:*?"<>|` and spaces with `_`)
- Processes each contig independently through the full pipeline
- Combines results into `gvclass_summary.tsv`
- Cleans up temporary files automatically
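The splitting and sanitizing steps above can be illustrated with a short standard-library sketch. This is not the GVClass implementation: the function names are hypothetical, and it assumes the contig ID is the header text up to the first whitespace and that each contig is written as `<sanitized_id>.fna`.

```python
import re
import tempfile
from pathlib import Path

# Characters replaced with '_' according to the description above.
UNSAFE = re.compile(r'[/\\:*?"<>| ]')


def sanitize_contig_id(contig_id: str) -> str:
    """Replace filesystem-unsafe characters and spaces with underscores."""
    return UNSAFE.sub("_", contig_id)


def split_contigs(fasta_text: str, out_dir: Path) -> list:
    """Write each FASTA record to its own .fna file and return the paths."""
    paths = []
    handle = None
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            if handle is not None:
                handle.close()
            # Assumption: the contig ID ends at the first whitespace.
            parts = line[1:].split()
            contig_id = parts[0] if parts else "unnamed"
            path = out_dir / f"{sanitize_contig_id(contig_id)}.fna"
            handle = path.open("w")
            paths.append(path)
        if handle is not None:
            handle.write(line + "\n")
    if handle is not None:
        handle.close()
    return paths


if __name__ == "__main__":
    example = ">contig/1 len=8\nACGTACGT\n>contig:2\nGGCC\n"
    with tempfile.TemporaryDirectory() as tmp:
        for p in split_contigs(example, Path(tmp)):
            print(p.name)
```

Two contig IDs that sanitize to the same name would overwrite each other in this sketch; a production version would need to detect such collisions.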
Use cases:
- Classifying giant virus contigs from metagenome assemblies
- Processing binned genomes with multiple contigs
- Screening assembled sequences for giant virus candidates
GVClass tests 9 genetic codes to find optimal gene predictions:
- Code 0: Meta mode using pretrained models (pyrodigal metagenomic mode)
- Codes 1, 4, 11: Standard translation tables
- Codes 6, 15, 29, 106, 129: Additional translation tables
Selection logic:
- Start with meta mode (code 0) as baseline
- Override if another code has:
  - More complete marker hits (>66% HMM coverage), OR
  - Same hits but >5% better average hit score, OR
  - Same hits but >5% better coding density
The selected code is reported in the ttable output column.
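The selection rules can be written down as a small comparison function. The sketch below is one possible reading of them, not the code GVClass runs: the `CodeResult` fields and function names are hypothetical, ">5% better" is interpreted as a relative improvement, and each candidate is compared against the best result so far rather than only against the meta-mode baseline.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CodeResult:
    code: int              # genetic code (0 = meta mode)
    complete_hits: int     # marker hits with >66% HMM coverage
    avg_hit_score: float   # average score of those hits
    coding_density: float  # coding density of the gene prediction


def is_better(candidate: CodeResult, current: CodeResult, margin: float = 0.05) -> bool:
    """Apply the three override rules against the current best result."""
    if candidate.complete_hits > current.complete_hits:
        return True
    if candidate.complete_hits == current.complete_hits:
        # Assumption: ">5% better" means a relative improvement of more than 5%.
        if candidate.avg_hit_score > current.avg_hit_score * (1 + margin):
            return True
        if candidate.coding_density > current.coding_density * (1 + margin):
            return True
    return False


def select_code(results: List[CodeResult]) -> int:
    """Start from meta mode (code 0) and override when a rule is met."""
    best = next(r for r in results if r.code == 0)
    for candidate in results:
        if candidate.code != 0 and is_better(candidate, best):
            best = candidate
    return best.code


if __name__ == "__main__":
    # Made-up numbers for illustration only.
    results = [
        CodeResult(0, 5, 100.0, 80.0),
        CodeResult(1, 5, 103.0, 82.0),   # within 5% on both metrics: no override
        CodeResult(106, 7, 90.0, 70.0),  # more complete hits: overrides
    ]
    print(select_code(results))
```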
- `order_completeness` shows the percentage of order-specific markers detected in single copy. Values above ~70% generally indicate near-complete genomes, 30–70% partial, and below 30% highly fragmented assemblies.
- `weighted_order_completeness` applies conservation-based weights; large gaps here usually point to missing hallmark genes even if raw counts look acceptable.
- `order_dup` and `gvog8_dup` summarize marker duplication. Values above ~2 suggest multiple populations or assembly chimeras; below ~1.5 is typically clean.
- `gvog8_total` and `gvog8_unique` help distinguish true gene expansions (high total, moderate duplication) from assembly artefacts (high duplication, low uniqueness).
- `ncldv_mcp_total`, `mirus_df`, and `mrya_total` provide additional lineage-specific duplication hints.
- `vp_completeness` and `mirus_completeness` show core marker coverage (n/4) for virophages and Mirusviricota respectively.
- The `plv` count helps distinguish PLVs from virophages (PLVs share VP markers but have an additional PLV-specific marker; the count is not binary).
- `uni56_total` (UNI56) counts universal cellular markers; more than ~10 unique hits point to host contamination or bins that include cellular contigs.
- The wrapper also reports `busco_*` fields when available; elevated counts complement UNI56 for spotting non-viral material.
Use these fields together: a high completeness score with low duplication and low UNI56 is characteristic of a high-quality GVMAG; any combination of low completeness plus high duplication or high UNI56 warrants manual curation.
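As a rough illustration of combining these fields, the sketch below turns the approximate thresholds quoted above into a triage label. The function name, the label strings, and the exact cut-offs are assumptions for illustration; the thresholds in the text are guidelines ("~"), not hard limits, and this is not a GVClass feature.

```python
def triage(order_completeness: float, order_dup: float, uni56_unique: int) -> str:
    """Label a query using the approximate thresholds described above.

    Assumed cut-offs: completeness >70% is high and <30% is low,
    duplication <1.5 is low and >2 is high, UNI56 >10 unique hits is high.
    """
    high_dup = order_dup > 2
    high_uni56 = uni56_unique > 10
    if order_completeness > 70 and order_dup < 1.5 and not high_uni56:
        return "high-quality"
    if order_completeness < 30 and (high_dup or high_uni56):
        return "manual curation"
    # Everything in between needs a closer look at the individual fields.
    return "review"


if __name__ == "__main__":
    # Made-up values for illustration only.
    print(triage(85.0, 1.1, 0))
    print(triage(20.0, 2.5, 0))
    print(triage(50.0, 1.8, 3))
```

A label of "review" is not a verdict; it only signals that the fixed cut-offs do not settle the case.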
- Enable Fast Mode - Skip order-level marker trees (OG markers):
    # Command line option
    pixi run gvclass <input_directory> --mode-fast
    # Or in config file
    pipeline:
      mode_fast: true  # Skips ~100 order-specific markers, 2-3x faster
- Use FastTree Instead of IQ-TREE:
    # Default (faster)
    pixi run gvclass <input_directory> --tree-method fasttree
    # IQ-TREE (more accurate but slower)
    pixi run gvclass <input_directory> --tree-method iqtree
- Optimize Thread Usage:
    # Use all available cores
    pixi run gvclass <input_directory> -t 32
    # Control parallelization: 4 parallel workers, 8 threads each (= 32 total)
    pixi run gvclass <input_directory> -t 32 -j 4
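The relationship between `-t` and `-j` in the last example is simple arithmetic: total threads divided across workers. The sketch below shows that calculation under the assumption that threads are split evenly with integer division; how GVClass handles remainders or the "Auto" setting is not documented here, and `threads_per_worker` is a hypothetical helper.

```python
def threads_per_worker(total_threads: int, workers: int) -> int:
    """Split a total thread budget evenly across parallel workers.

    Assumption: integer division, with at least one thread per worker.
    """
    if total_threads < 1 or workers < 1:
        raise ValueError("total_threads and workers must be positive")
    return max(1, total_threads // workers)


if __name__ == "__main__":
    # Matches the example above: -t 32 -j 4 gives 4 workers with 8 threads each.
    print(threads_per_worker(32, 4))
```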
When using IQ-TREE (`--tree-method iqtree`), the pipeline automatically uses:
- Model: `LG+F+G` (fast protein model)
- `-fast` flag for faster tree search
- Single thread per marker (parallelization happens at marker level)
To modify IQ-TREE behavior, edit src/core/marker_processing.py.
- Core markers: Always processed (GVOG4, GVOG8, MCP, etc.)
- Order-level markers: 576 OG markers conserved in different viral orders
  - Processed when `mode_fast: false` (default)
  - Skipped when `mode_fast: true` (faster but less precise order assignment)
Both wrappers can be made globally accessible:
# Copy gvclass-a to your personal bin
mkdir -p "$HOME/bin"
cp gvclass-a "$HOME/bin/"
chmod +x "$HOME/bin/gvclass-a"
# Add to PATH (if not already)
echo 'export PATH="$HOME/bin:$PATH"' >> "$HOME/.bashrc"
source "$HOME/.bashrc"
# Now use from anywhere!
gvclass-a /data/genomes /data/results -t 32
# From the gvclass repo directory
cd /path/to/gvclass # Navigate to your cloned repo first
# Create symlink (REQUIRED - copying won't work!)
mkdir -p "$HOME/bin"
ln -s "$(pwd)/gvclass" "$HOME/bin/gvclass"
# Add to PATH (if not already)
echo 'export PATH="$HOME/bin:$PATH"' >> "$HOME/.bashrc"
source "$HOME/.bashrc"
# Now run from anywhere - symlink allows script to find its repo
cd /anywhere
gvclass my_data -o results -t 32
flowchart TD
subgraph Input
FNA[".fna<br/>nucleic acid"]
FAA[".faa<br/>amino acid"]
end
subgraph Database
DB[(Reference<br/>Database)]
MODELS[GVOG HMMs<br/>+ marker sets]
REF[Reference<br/>sequences]
end
FNA --> OPGC{Optimized<br/>Gene Calling}
FAA --> ID1[Identify Markers]
subgraph "Gene Calling (pyrodigal)"
OPGC --> META[Meta mode<br/>code 0<br/>pretrained models]
OPGC --> STD[Standard codes<br/>1, 4, 11]
OPGC --> ADD[Additional codes<br/>6, 15, 29, 106, 129]
META --> RANK[Select best by<br/>marker hits &<br/>coding density]
STD --> RANK
ADD --> RANK
end
RANK --> ID2[Identify Markers]
subgraph "Marker Detection (pyhmmer)"
ID1 --> HMM1[HMM search<br/>against GVOGs]
ID2 --> HMM2[HMM search<br/>against GVOGs]
HMM1 --> HITS1[Extract hits<br/>E-value cutoffs]
HMM2 --> HITS2[Extract hits<br/>E-value cutoffs]
end
HITS1 --> BLAST[BLAST/pyswrd<br/>top 100 hits]
HITS2 --> BLAST
subgraph "Alignment & Trees"
BLAST --> ALIGN[MAFFT/pyfamsa<br/>alignment]
ALIGN --> TRIM[TrimAl/pytrimal<br/>trimming]
TRIM --> TREE{Tree Building}
TREE -->|FastTree| FT[veryfasttree<br/>LG4X model]
TREE -->|IQ-TREE| IQ[iqtree<br/>LG+F+G -fast]
end
FT --> NN[Get nearest<br/>neighbors]
IQ --> NN
subgraph "Classification"
NN --> TAX[Majority/strict<br/>taxonomy<br/>assignment]
NN --> QC[Quality metrics:<br/>completeness,<br/>contamination]
end
TAX --> OUT[Results:<br/>taxonomy,<br/>QC metrics]
QC --> OUT
MODELS -.-> HMM1
MODELS -.-> HMM2
REF -.-> BLAST
DB -.-> NN
style FNA fill:#e8f4f8
style FAA fill:#e8f4f8
style OUT fill:#d4edda
style DB fill:#f8d7da
style MODELS fill:#f8d7da
style REF fill:#f8d7da
If you use GVClass, please cite:
Pitot et al. (2024): Conservative taxonomy and quality assessment of giant virus genomes with GVClass. npj Viruses. https://www.nature.com/articles/s44298-024-00069-7
The GVClass v1.2.0 reference database includes genomes from the following sources:
Medvedeva S, Guyet U, Pelletier E, et al. (2026): Widespread and intron-rich mirusviruses are predicted to reproduce in nuclei of unicellular eukaryotes. Nature Microbiology 11:228-239. https://doi.org/10.1038/s41564-025-01906-2
Roux S, Fischer MG, Hackl T, Katz LA, Schulz F, Yutin N (2023): Updated Virophage Taxonomy and Distinction from Polinton-like Viruses. Biomolecules 13(2):204. https://doi.org/10.3390/biom13020204
Fiamenghi MB, Camargo AP, Chasapi IN, et al. (2025): Meta-virus resource (MetaVR): expanding the frontiers of viral diversity with 24 million uncultivated virus genomes. Nucleic Acids Research gkaf1283. https://doi.org/10.1093/nar/gkaf1283
Vasquez YM, Nardi T, Terasaki GM, et al. (2025): Genome-resolved expansion of Nucleocytoviricota and Mirusviricota reveals new diversity, functional potential, and biotechnological applications. bioRxiv 2025.09.26.678796. https://doi.org/10.1101/2025.09.26.678796
- Issues: GitHub Issues
- Contact: fschulz@lbl.gov
BSD 3-Clause License - see LICENSE file for details
Version 1.2.0 - January 2026
