GitHub - Yasas1994/Vcat: A tool for adding taxonomic annotations to virus contigs, mapping reads to virus genomes and much more.

The Virus contig annotation tool (Vcat) is a straightforward, homology-based application that provides taxonomic annotations for virus contigs, maps reads to virus contigs, and much more.

Note

Vcat documentation is now online!

Changelog

0.0.3

VMR_40v2 is now available to download
added read annotation workflow

0.0.2

added downloaddb command to download pre-built databases (VMR_40v1 and VMR_39v4)
added a definition file to build Apptainer containers
added subroutines to clean up tmp files generated during the database build

Quick start

git clone https://github.com/Yasas1994/vcat.git
cd vcat

# create a conda env
mamba create -f environment.yml

# install vcat pipeline
pip install .

# test the installation
vcat --help

Usage: vcat [OPTIONS] COMMAND [ARGS]...

  Vcat: a command-line tool-kit for adding ICTV taxonomy annotations to virus
  contigs, mapping reads to virus genomes and much more.
  (https://github.com/Yasas1994/vcat)

Options:
  --version   Show the version and exit.
  -h, --help  Show this message and exit.

Commands:
  contigs    run contig annotation workflow
  downloaddb download pre-built reference databases
  preparedb  download and build reference databases
  reads      run read annotation workflow
  utils      tool chain for calculating ani, aai and visualizations

Singularity (Now Apptainer)

git clone https://github.com/Yasas1994/vcat.git
cd vcat

# build container
apptainer build vcat.sif Apptainer

# test the container build
apptainer run vcat.sif vcat --help

Usage: vcat [OPTIONS] COMMAND [ARGS]...

  Vcat: a command-line tool-kit for adding ICTV taxonomy annotations to virus
  contigs, mapping reads to virus genomes and much more.
  (https://github.com/Yasas1994/vcat)

Options:
  --version   Show the version and exit.
  -h, --help  Show this message and exit.

Commands:
  contigs    run contig annotation workflow
  downloaddb download pre-built reference databases
  preparedb  download sequences and build reference databases
  reads      run read annotation workflow
  utils      tool chain for calculating ani, aai and visualizations

downloading pre-built databases

# pulling pre-built databases from remote server [msl39v and masl40v1]
vcat downloaddb --dbversion masl39v4 -d <path to save the database> --cores 1

downloading and preparing the databases

vcat preparedb -d <path to save the database>

running contig annotation pipeline

vcat contigs -i <input>.fasta -o outdir

Results can be found in the results directory within the output directory:

.
├── logs
├── nuc
│   ├── input_genome_ani.tsv
│   └── input_genome.m8
├── prof
│   ├── input_fasta_prof_api.tsv
│   └── input_fasta_prof.m8
├── prot
│   ├── input_fasta.faa
│   ├── input_fasta.gff
│   ├── input_fasta_prot_aai.tsv
│   └── input_fasta_prot.m8
├── results
│   ├── *input_fasta_ictv.csv
│   └── input_fasta.tsv
└── tmp
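
For a quick look at the final annotations, something like the following may help. It is a minimal sketch that assumes the layout shown above, i.e. that the ICTV-formatted table is a CSV file whose name ends in `_ictv.csv` inside `results/`. The function name `preview_results` is hypothetical and not part of vcat.

```shell
#!/usr/bin/env bash
# preview_results OUTDIR [N]
# Print the first N lines (default 5) of every *_ictv.csv file found in
# OUTDIR/results. The file-name pattern is an assumption taken from the
# directory tree above, not a documented vcat guarantee.
preview_results() {
    local outdir="$1" n="${2:-5}" f found=0
    for f in "$outdir"/results/*_ictv.csv; do
        [ -e "$f" ] || continue      # glob did not match anything
        found=1
        echo "== $f"
        head -n "$n" "$f"
    done
    if [ "$found" -eq 0 ]; then
        echo "no *_ictv.csv file under $outdir/results" >&2
        return 1
    fi
}
```

For example, `preview_results outdir 10` would print the first ten lines of each matching table in `outdir/results`.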

Expected runtime?

Running vcat on the ICTV Taxonomy challenge dataset takes about 4 hours on a laptop computer.
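
Because a single run can take hours, it may be convenient to queue several inputs and leave them running unattended. The loop below is only a sketch: it uses nothing beyond the `-i` and `-o` options shown earlier, the function name `annotate_all` and the `VCAT_CMD` override are hypothetical, and it assumes input files end in `.fasta`.

```shell
#!/usr/bin/env bash
# annotate_all INDIR OUTROOT
# Run the contig workflow once per *.fasta file in INDIR, writing each run
# to OUTROOT/<file name without .fasta>. VCAT_CMD is a hypothetical override
# (not a vcat feature) so the loop can be dry-run before committing hours
# of compute.
annotate_all() {
    local indir="$1" outroot="$2" fasta sample
    for fasta in "$indir"/*.fasta; do
        [ -e "$fasta" ] || continue          # no FASTA files found
        sample=$(basename "$fasta" .fasta)
        ${VCAT_CMD:-vcat} contigs -i "$fasta" -o "$outroot/$sample" \
            || echo "failed: $fasta" >&2     # keep going with the next file
    done
}
```

Setting `VCAT_CMD="echo vcat"` before calling `annotate_all contigs_dir runs` prints the commands instead of executing them, which is a cheap way to check paths first.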

running read annotation pipeline

vcat reads -in <reads1>.fastq [-in2 <reads2>.fastq] -o outdir

running other workflows

# calculate aai of query contigs to ICTV genomes
vcat utils aai [OPTIONS] -i contigs.m8 -g contigs.gff -d [DBDIR]

# calculate ani of query contigs to ICTV genomes
vcat utils ani [OPTIONS] -i contigs.m8

# create genome comparison plots, i.e. query sequence to highly similar ICTV genomes (coming soon)
vcat utils visualize --ani --taxa [taxname] -i contigs.m8 -o outdir

# create phage contig annotation plots (coming soon)
vcat utils visualize --phrogs -i contigs.fasta -o outdir 

# identify provirus (coming soon)
vcat utils provirus -i contigs.fasta -o outdir

Some additional stuff

# to view the contig length distribution of your contigs
seqkit fx2tab -lg {input.fasta} | awk -F "\t" '{print $4}' | tail -n +2 | hist -b 100 -s 10

# to view the length distribution of contigs in the ani calculation output (after applying a filter)
cat testing/gut_jaeger/nuc/gut_jaeger_virus_seqs_fasta_genome_ani.tsv | awk -F "\t" '$8 > 0.1' | awk -F "\t" '{print $3}' | tail -n +2 | hist -b 100 -x
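
If `seqkit` or `hist` is not available, a rough numeric summary of contig lengths can be produced with awk alone. This is an alternative sketch, not something vcat ships: the function name `length_summary` is hypothetical, and it reports count, minimum, maximum and mean rather than a histogram.

```shell
#!/usr/bin/env bash
# length_summary FASTA
# Print the number of sequences plus min, max and mean length, using awk only.
# Sequences may span several lines; header lines start with ">".
length_summary() {
    awk '
        # Accumulate statistics for one finished sequence of length l.
        function record(l) {
            n++; total += l
            if (n == 1 || l < min) min = l
            if (l > max) max = l
        }
        /^>/ { if (seen) record(len); seen = 1; len = 0; next }
             { gsub(/[ \t\r]/, ""); len += length($0) }
        END  {
            if (seen) record(len)
            if (n == 0) { print "no sequences found"; exit 1 }
            printf "n=%d min=%d max=%d mean=%.1f\n", n, min, max, total / n
        }
    ' "$1"
}
```

Calling `length_summary contigs.fasta` prints a single line of the form `n=... min=... max=... mean=...`.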

If you use vcat, please cite:

[MMSEQS2] MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets.
          Steinegger M & Söding J. 2017. Nat. Biotech. 35, 1026–1028. https://doi.org/10.1038/nbt.3988

[PRODIGAL] Prodigal: prokaryotic gene recognition and translation initiation site identification.
           Hyatt et al. 2010. BMC Bioinformatics 11, 119. https://doi.org/10.1186/1471-2105-11-119

[TAXONKIT] TaxonKit: A practical and efficient NCBI taxonomy toolkit.
           Shen, W. & Ren, H. 2021. J. Genet. Genomics. https://doi.org/10.1016/j.jgg.2021.03.006
