Tutorials
This page provides step-by-step tutorials and practical examples for common tasks with the AVrC database and toolkit.
This tutorial shows how to download the AVrC database and prepare it for analysis.
avrc download --list
Output:
Available subsets:
all: All representative sequences with metadata (5.1GB)
hq: High quality sequences (2.5GB)
phage: Bacteriophage sequences (4.1GB)
# Download the full AVrC dataset (the "all" subset)
avrc download all -o avrc_data
# List the downloaded files
ls -lh avrc_data
Output:
total 5.9G
-rw-rw-r-- 1 ubuntu ubuntu 4.7G Mar 16 07:17 AVrC_allrepresentatives.fasta.gz
-rw-r--r-- 1 ubuntu ubuntu 128M Jun 9 2024 AvRCv1.Merged_PredictedHosts.csv
-rw-r--r-- 1 ubuntu ubuntu 69M May 28 2024 AvRCv1.Merged_Quality.csv
-rw-r--r-- 1 ubuntu ubuntu 131M May 28 2024 AvRCv1.Merged_ViralDesc.csv
-rw-r--r-- 1 ubuntu ubuntu 82M Jun 2 2024 AvRCv1.SequenceTable.csv
-rw-r--r-- 1 ubuntu ubuntu 164M May 28 2024 AvRCv1.Tools_CheckV.csv
-rw-r--r-- 1 ubuntu ubuntu 88M May 28 2024 AvRCv1.Tools_Genomad.plasmid.csv
-rw-r--r-- 1 ubuntu ubuntu 276M May 28 2024 AvRCv1.Tools_Genomad.viral.csv
-rw-r--r-- 1 ubuntu ubuntu 72M May 28 2024 AvRCv1.Tools_PhaGCN.csv
-rw-r--r-- 1 ubuntu ubuntu 86M May 28 2024 AvRCv1.Tools_PhaTyp.csv
-rw-r--r-- 1 ubuntu ubuntu 189M Jun 9 2024 AvRCv1.Tools_iPHoP.csv
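Because the sequence archive is several gigabytes, it can be worth confirming that it downloaded intact before filtering. The following is a minimal sketch that uses only standard tools (`gzip`, `grep`); the helper names `check_gzip` and `count_fasta_records` are illustrative and are not part of the avrc toolkit.

```shell
# Hypothetical helpers (not part of avrc): sanity-check a gzipped FASTA file.

# Succeed only if the gzip stream is intact (catches truncated downloads).
check_gzip() {
    gzip -t "$1" 2>/dev/null
}

# Print the number of FASTA records, counted as lines starting with '>'.
# grep -c exits non-zero when the count is 0, so "|| true" keeps the
# function's exit status clean while still printing the count.
count_fasta_records() {
    gzip -dc "$1" | grep -c '^>' || true
}
```

For example, `check_gzip avrc_data/AVrC_allrepresentatives.fasta.gz && count_fasta_records avrc_data/AVrC_allrepresentatives.fasta.gz` prints the record count only when the archive passes the integrity test. Both steps read the whole file, so expect them to take a while on the full dataset.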
This tutorial demonstrates how to filter sequences based on basic criteria.
# Filter for only complete viral sequences
avrc filter avrc_data/ \
--quality Complete \
--output both \
--output-dir complete_viruses/
# Check the output directory
ls -lh complete_viruses/
Output:
total 656M
-rw-rw-r-- 1 ubuntu ubuntu 3.1M Mar 16 08:10 filtered_hosts.csv
-rw-rw-r-- 1 ubuntu ubuntu 2.2M Mar 16 08:10 filtered_quality.csv
-rw-rw-r-- 1 ubuntu ubuntu 647M Mar 16 08:12 filtered_sequences.fasta.gz
-rw-rw-r-- 1 ubuntu ubuntu 3.3M Mar 16 08:10 filtered_viral_desc.csv
The filter log reports that 36,802 complete viral sequences were retrieved, with 41,609 predicted hosts. The host count is higher because some vOTUs have several predicted hosts assigned. Let's check the number of sequences retained by the filtering step.
# Count sequences using seqkit
seqkit stats complete_viruses/filtered_sequences.fasta.gz
Output:
file                                          format  type  num_seqs        sum_len  min_len   avg_len    max_len
complete_viruses/filtered_sequences.fasta.gz  FASTA   DNA     36,802  2,145,039,362    1,604  58,285.9  1,369,938
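The gap between the sequence count and the predicted-host count can be inspected directly in the host table. The sketch below is a rough check, not part of the avrc toolkit: it assumes that `filtered_hosts.csv` has a header row, holds one row per host prediction, and keeps the sequence identifier in its first column without embedded commas. Check the actual column layout before relying on it.

```shell
# Hypothetical helper (not part of avrc).
# Assumptions: header row present, one row per host prediction,
# sequence/vOTU identifier in column 1 with no embedded commas.
summarize_hosts() {
    awk -F',' '
        NR > 1 && $1 != "" {
            rows++                    # every data row is one host prediction
            if (!seen[$1]++) ids++    # count each identifier once
        }
        END { printf "host_rows=%d unique_ids=%d\n", rows, ids }
    ' "$1"
}
```

Running `summarize_hosts complete_viruses/filtered_hosts.csv` prints both totals on one line. If `host_rows` exceeds `unique_ids`, at least some vOTUs carry more than one predicted host; `unique_ids` may also be lower than the sequence count if not every vOTU has a host prediction.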
This example shows how to extract bacteriophages that target specific bacterial genera.
# Filter for Phocaeicola-targeting phages
avrc filter avrc_data \
--host-genus Phocaeicola \
--output both \
--output-dir phocaeicola_phages/
# Filter for Paraprevotella-targeting phages
avrc filter avrc_data \
--host-genus Paraprevotella \
--output both \
--output-dir paraprevotella_phages/
# Compare the number of sequences in each dataset
echo "Phocaeicola phages:"
seqkit stats phocaeicola_phages/filtered_sequences.fasta.gz
echo "Paraprevotella phages:"
seqkit stats paraprevotella_phages/filtered_sequences.fasta.gz
This example shows how to combine multiple filters to create a highly specific dataset.
# Filter for vOTUs longer than 5 kb from temperate Caudoviricetes phages targeting Bacillota (previously Firmicutes)
avrc filter avrc_data/ \
--min-length 5000 \
--class Caudoviricetes \
--lifestyle temperate \
--host-phylum Bacillota \
--output both \
--output-dir firmicutes_temperate_phages/
Each of these examples provides practical guidance for working with the AVrC database using the toolkit. You can modify the commands to suit your specific research needs.
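As one way of adapting these commands, the per-genus filter shown earlier can be wrapped in a loop so that several host genera are processed in a single call. This is a sketch only: the `filter_genera` wrapper, the `DRY_RUN` switch, and the lowercase `<genus>_phages/` directory naming are assumptions made for illustration, and the `avrc filter` invocation uses only the flags that appear in the tutorials above.

```shell
# Hypothetical wrapper (not part of avrc): run the per-genus filter for
# each genus given on the command line.
# Usage: filter_genera <data_dir> <genus> [<genus> ...]
# Set DRY_RUN=1 to print the commands instead of running them.
filter_genera() {
    local data_dir="$1"
    shift
    local genus out_dir
    for genus in "$@"; do
        # Derive the output directory from the genus name,
        # e.g. Phocaeicola -> phocaeicola_phages/
        out_dir="$(printf '%s' "$genus" | tr '[:upper:]' '[:lower:]')_phages/"
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "avrc filter $data_dir --host-genus $genus --output both --output-dir $out_dir"
        else
            avrc filter "$data_dir" --host-genus "$genus" --output both --output-dir "$out_dir"
        fi
    done
}
```

Calling `DRY_RUN=1 filter_genera avrc_data Phocaeicola Paraprevotella` previews the two commands; dropping `DRY_RUN=1` runs them and writes each result to its own directory.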