Command Reference

Alise Ponsero edited this page Mar 16, 2025 · 2 revisions

This page provides detailed information about the commands available in the AVrC Toolkit, including their options, arguments, and example usage.

Overview

The AVrC Toolkit provides two main commands:

  • download: For retrieving the AVrC database or its subsets
  • filter: For filtering sequences by quality, taxonomy, lifestyle, and host

Download Command

The download command allows you to retrieve the AVrC database or specific subsets.

Usage

avrc download [SUBSET] [OPTIONS]

Arguments

  • SUBSET: The subset of the database to download. Available values:
    • all: Complete dataset with all representative sequences
    • hq: High-quality sequences subset
    • phage: Bacteriophage sequences subset

Options

  • --list: List available subsets and their descriptions
  • -o, --output PATH: Output directory (default: current directory)
  • --no-metadata: Download sequences only (no metadata files)

Examples

List available subsets:

avrc download --list

Download complete dataset:

avrc download all -o data/

Download high-quality subset:

avrc download hq -o high_quality_data/

Download phage subset without metadata:

avrc download phage -o phage_data/ --no-metadata
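
The subset names above can be checked before a download starts. The following is a minimal sketch, not part of the toolkit: is_valid_subset and download_subset are hypothetical helper names, and the wrapper only calls the documented avrc download SUBSET -o PATH form.

```shell
# Return 0 if the name is one of the documented subsets, 1 otherwise.
is_valid_subset() {
  case "$1" in
    all|hq|phage) return 0 ;;
    *) return 1 ;;
  esac
}

# Hypothetical wrapper (not provided by the toolkit): validate the subset,
# create the output directory, then run the documented download command.
# Arguments: subset name, output directory.
download_subset() {
  if ! is_valid_subset "$1"; then
    echo "unknown subset: $1 (expected all, hq, or phage)" >&2
    return 2
  fi
  mkdir -p "$2" && avrc download "$1" -o "$2"
}
```

For example, download_subset hq high_quality_data/ would run the same command as the high-quality example above, while a misspelled subset name is rejected before anything is created or downloaded.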

Filter Command

The filter command selects sequences from a downloaded AVrC database by quality, taxonomy, predicted lifestyle, and predicted host.

Usage

avrc filter PATH [OPTIONS]

Arguments

  • PATH: Path to the directory containing AVrC database files

Options

Quality Filtering

  • --quality TEXT: Filter by CheckV quality category [Complete|High-quality|Medium-quality|Low-quality]
  • --min-length INT: Minimum sequence length
  • --no-plasmids: Exclude sequences classified as potential plasmids

Taxonomy Filtering

  • --realm TEXT: Filter by viral realm
  • --kingdom TEXT: Filter by viral kingdom
  • --phylum TEXT: Filter by viral phylum
  • --class TEXT: Filter by viral class
  • --order TEXT: Filter by viral order
  • --family TEXT: Filter by viral family

Lifestyle Filtering

  • --lifestyle TEXT: Filter by predicted lifestyle [temperate|virulent|uncertain]

Host Filtering

  • --host-domain TEXT: Filter by host domain
  • --host-phylum TEXT: Filter by host phylum
  • --host-class TEXT: Filter by host class
  • --host-order TEXT: Filter by host order
  • --host-family TEXT: Filter by host family
  • --host-genus TEXT: Filter by host genus

Output Options

  • --output [fasta|metadata|both]: Which outputs to write (default: both)
  • --output-dir PATH: Output directory (default: filtered/)

Examples

Basic quality filtering:

avrc filter data/ \
  --quality High-quality \
  --no-plasmids \
  --output fasta

Host-specific filtering:

avrc filter data/ \
  --host-phylum Bacillota \
  --output both \
  --output-dir filtered/

Combined filtering:

avrc filter data/ \
  --min-length 10000 \
  --lifestyle temperate \
  --host-genus Campylobacter \
  --output both \
  --output-dir campylobacter_phages/
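
When filters come from variables in a script, some of them may be empty. The sketch below shows one way to assemble the command so that a flag is added only when its value is set. build_filter_cmd is a hypothetical helper, it covers just three of the documented options, and it prints the command instead of running it. Values containing spaces would need extra quoting.

```shell
# Print an `avrc filter` command line, adding each flag only when its
# value is non-empty.
# Arguments: database dir, min length, lifestyle, host genus ("" = skip).
build_filter_cmd() {
  cmd="avrc filter $1"
  if [ -n "$2" ]; then cmd="$cmd --min-length $2"; fi
  if [ -n "$3" ]; then cmd="$cmd --lifestyle $3"; fi
  if [ -n "$4" ]; then cmd="$cmd --host-genus $4"; fi
  printf '%s\n' "$cmd"
}

# Host genus left empty, so no --host-genus flag is emitted.
build_filter_cmd data/ 10000 temperate ""
# prints: avrc filter data/ --min-length 10000 --lifestyle temperate
```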

Output Files

When using the filter command with --output both, the following files are generated:

  • filtered_sequences.fasta.gz: Filtered sequences in compressed FASTA format
  • filtered_quality.csv: Quality metrics for filtered sequences
  • filtered_viral_desc.csv: Taxonomic information for filtered sequences
  • filtered_hosts.csv: Host predictions for filtered sequences
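
A script that consumes these files may want to confirm they are all present first. This is a minimal sketch that assumes the four files are written directly into the directory given by --output-dir. missing_outputs is a hypothetical helper name.

```shell
# Print the name of each documented output file that is absent from the
# given directory. Prints nothing when all four are present.
missing_outputs() {
  for f in filtered_sequences.fasta.gz filtered_quality.csv \
           filtered_viral_desc.csv filtered_hosts.csv; do
    [ -e "$1/$f" ] || printf '%s\n' "$f"
  done
}
```

For example, missing_outputs filtered/ after a run with --output both should print nothing if the assumption about file locations holds.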

Resource Requirements

Approximate resource requirements:

  • Memory Usage:
    • Basic filtering: 2-4 GB
    • Complex filtering with large datasets: 4-8 GB
  • Disk Space:
    • Complete dataset: ~10 GB
    • High-quality subset: ~5 GB
    • Phage subset: ~3 GB
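
The disk figures above can be turned into a pre-download check. The sketch below is illustrative only: the helper names are hypothetical, the sizes are the approximate values listed here, and free space is read from the POSIX df -Pk output (fourth column of the second line).

```shell
# Approximate download size in GB for each subset, from the list above.
subset_size_gb() {
  case "$1" in
    all) echo 10 ;;
    hq) echo 5 ;;
    phage) echo 3 ;;
    *) return 1 ;;
  esac
}

# Free space, in whole GB, on the filesystem holding the given directory.
free_gb() {
  df -Pk "$1" | awk 'NR == 2 { print int($4 / 1024 / 1024) }'
}

# Succeed only if the directory has room for the subset.
# Arguments: subset name, target directory. Returns 2 for an unknown subset.
has_room_for() {
  need=$(subset_size_gb "$1") || return 2
  [ "$(free_gb "$2")" -ge "$need" ]
}
```

A call such as has_room_for all data/ && avrc download all -o data/ would then skip the download when less than about 10 GB is free. The check ignores any temporary space the download itself might need.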
