Tutorials

Alise Ponsero edited this page Mar 16, 2025 · 1 revision

Tutorials and Examples

This page provides step-by-step tutorials and practical examples for common tasks with the AVrC database and toolkit.

Basic Workflows

Downloading and Setting Up the Database

This tutorial shows how to download the AVrC database and prepare it for analysis.

Step 1: Check available datasets

avrc download --list

Output:

Available subsets:
all: All representative sequences with metadata (5.1GB)
hq: High quality sequences (2.5GB)
phage: Bacteriophage sequences (4.1GB)

Step 2: Download the AVrC database

# Download the complete AVrC dataset (the "all" subset)
avrc download all -o avrc_data

Step 3: Explore the downloaded files

# List the downloaded files
ls -lh avrc_data

Output:

total 5.9G
-rw-rw-r-- 1 ubuntu ubuntu 4.7G Mar 16 07:17 AVrC_allrepresentatives.fasta.gz
-rw-r--r-- 1 ubuntu ubuntu 128M Jun  9  2024 AvRCv1.Merged_PredictedHosts.csv
-rw-r--r-- 1 ubuntu ubuntu  69M May 28  2024 AvRCv1.Merged_Quality.csv
-rw-r--r-- 1 ubuntu ubuntu 131M May 28  2024 AvRCv1.Merged_ViralDesc.csv
-rw-r--r-- 1 ubuntu ubuntu  82M Jun  2  2024 AvRCv1.SequenceTable.csv
-rw-r--r-- 1 ubuntu ubuntu 164M May 28  2024 AvRCv1.Tools_CheckV.csv
-rw-r--r-- 1 ubuntu ubuntu  88M May 28  2024 AvRCv1.Tools_Genomad.plasmid.csv
-rw-r--r-- 1 ubuntu ubuntu 276M May 28  2024 AvRCv1.Tools_Genomad.viral.csv
-rw-r--r-- 1 ubuntu ubuntu  72M May 28  2024 AvRCv1.Tools_PhaGCN.csv
-rw-r--r-- 1 ubuntu ubuntu  86M May 28  2024 AvRCv1.Tools_PhaTyp.csv
-rw-r--r-- 1 ubuntu ubuntu 189M Jun  9  2024 AvRCv1.Tools_iPHoP.csv
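
Before filtering, it can help to see which columns each metadata table contains. The following is a minimal sketch using only standard Unix tools, not part of the avrc toolkit. The helper name `csv_columns` is hypothetical, and the sketch assumes the header row is plain comma-separated text without quoted commas.

```shell
# Hypothetical helper: print the column names of a CSV file, one per line.
# Assumes a simple comma-separated header with no quoted commas.
csv_columns() {
  head -n 1 "$1" | tr -d '\r' | tr ',' '\n'
}

# List the columns of every metadata table, if the download directory exists.
if [ -d avrc_data ]; then
  for f in avrc_data/*.csv; do
    [ -e "$f" ] || continue
    echo "== $f"
    csv_columns "$f"
  done
fi
```

The loop is skipped when `avrc_data` is absent, so the snippet is safe to paste before the download has finished.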

Basic Sequence Filtering

This tutorial demonstrates how to filter sequences based on basic criteria.

Step 1: Filter for complete viral sequences

# Filter for only complete viral sequences
avrc filter avrc_data/ \
  --quality Complete \
  --output both \
  --output-dir complete_viruses/

Step 2: Examine the results

# Check the output directory
ls -lh complete_viruses/

Output:

total 656M
-rw-rw-r-- 1 ubuntu ubuntu 3.1M Mar 16 08:10 filtered_hosts.csv
-rw-rw-r-- 1 ubuntu ubuntu 2.2M Mar 16 08:10 filtered_quality.csv
-rw-rw-r-- 1 ubuntu ubuntu 647M Mar 16 08:12 filtered_sequences.fasta.gz
-rw-rw-r-- 1 ubuntu ubuntu 3.3M Mar 16 08:10 filtered_viral_desc.csv
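
To get a rough sense of how many records each filtered table holds, you can count the data rows with standard tools. This is a sketch rather than a toolkit feature: `csv_row_count` is a hypothetical helper, and it assumes one record per line with a single header row. Because a vOTU can have several predicted hosts, the row count of the hosts table may differ from the number of sequences.

```shell
# Hypothetical helper: count data rows in a CSV, excluding the header line.
# Assumes each record occupies exactly one line.
csv_row_count() {
  tail -n +2 "$1" | wc -l | tr -d ' '
}

# Report the row count of each filtered table, if present.
for f in complete_viruses/*.csv; do
  [ -e "$f" ] || continue
  printf '%s\t%s\n' "$f" "$(csv_row_count "$f")"
done
```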

Step 3: Count the number of sequences

The filter log reports that 36802 complete viral sequences were retrieved, with 41609 predicted hosts (some vOTUs have several predicted hosts assigned, so the two numbers differ). Let's confirm the number of sequences retained by the filtering step.

# Count sequences using seqkit
seqkit stats complete_viruses/filtered_sequences.fasta.gz

Output:

file                                          format  type  num_seqs        sum_len  min_len   avg_len    max_len
complete_viruses/filtered_sequences.fasta.gz  FASTA   DNA     36,802  2,145,039,362    1,604  58,285.9  1,369,938
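
If seqkit is not installed, the record count can be cross-checked by counting FASTA header lines. This is a minimal sketch with standard tools. The helper name `count_fasta_records` is hypothetical, and it assumes every record starts with a `>` header line.

```shell
# Hypothetical helper: count records in a gzipped FASTA file
# by counting lines that start with ">".
count_fasta_records() {
  gzip -cd "$1" | grep -c '^>'
}

fasta=complete_viruses/filtered_sequences.fasta.gz
if [ -f "$fasta" ]; then
  count_fasta_records "$fasta"
fi
```

The number printed should agree with the `num_seqs` column reported by seqkit and with the count in the filter log.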

Use Case Examples

Extracting Phages with Specific Hosts

This example shows how to extract bacteriophages that target specific bacterial genera.

# Filter for Phocaeicola-targeting phages
avrc filter avrc_data \
  --host-genus Phocaeicola \
  --output both \
  --output-dir phocaeicola_phages/

# Filter for Paraprevotella-targeting phages
avrc filter avrc_data \
  --host-genus Paraprevotella \
  --output both \
  --output-dir paraprevotella_phages/

# Compare the number of sequences in each dataset
echo "Phocaeicola phages:"
seqkit stats phocaeicola_phages/filtered_sequences.fasta.gz

echo "Paraprevotella phages:"
seqkit stats paraprevotella_phages/filtered_sequences.fasta.gz
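
When comparing more than a couple of host genera, a small loop keeps the summary compact. The sketch below is an illustration, not part of the toolkit: `summarize_outputs` is a hypothetical helper, and it assumes each output directory contains a `filtered_sequences.fasta.gz` file, as in the listings above.

```shell
# Hypothetical helper: print "<directory><TAB><record count>" for each
# filter output directory passed as an argument.
summarize_outputs() {
  for dir in "$@"; do
    fasta="$dir/filtered_sequences.fasta.gz"
    if [ -f "$fasta" ]; then
      # grep -c exits non-zero when the count is 0, hence "|| true".
      n=$(gzip -cd "$fasta" | grep -c '^>' || true)
      printf '%s\t%s\n' "$dir" "$n"
    else
      printf '%s\tmissing\n' "$dir"
    fi
  done
}

summarize_outputs phocaeicola_phages paraprevotella_phages
```

Directories that have not been created yet are reported as `missing` instead of stopping the loop.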

Combining Multiple Filters

This example shows how to combine multiple filters to create a highly specific dataset.

# Filter for vOTUs longer than 5 kb from temperate Caudoviricetes phages targeting Bacillota (formerly Firmicutes)
avrc filter avrc_data/ \
  --min-length 5000 \
  --class Caudoviricetes \
  --lifestyle temperate \
  --host-phylum Bacillota \
  --output both \
  --output-dir firmicutes_temperate_phages/
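
One way to sanity-check a combined filter is to confirm that the shortest retained sequence respects the length threshold. The following sketch uses gzip and awk only. `min_fasta_length` is a hypothetical helper, and it assumes the output FASTA is named `filtered_sequences.fasta.gz` as in the earlier examples.

```shell
# Hypothetical helper: print the length of the shortest sequence
# in a gzipped FASTA file (prints 0 if the file has no records).
min_fasta_length() {
  gzip -cd "$1" | awk '
    /^>/ {
      # A new header closes the previous record, if there was one.
      if (seen && (min == "" || len < min)) min = len
      len = 0; seen = 1; next
    }
    { len += length($0) }   # accumulate bases across wrapped lines
    END {
      if (seen && (min == "" || len < min)) min = len
      print min + 0
    }
  '
}

fasta=firmicutes_temperate_phages/filtered_sequences.fasta.gz
if [ -f "$fasta" ]; then
  min_fasta_length "$fasta"
fi
```

If the filter behaved as described, the value printed should not fall below the 5000 passed to `--min-length`.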

These examples cover common ways of working with the AVrC database through the toolkit. Adapt the commands to suit your own research needs.