Discovery of Novel Endogenous Viruses
This project strives to implement the BUD algorithm from ViruSpy in Python. In short, reads from a Sequence Read Archive (SRA) dataset are screened for putative virus motifs/domains using known virus nucleotide and protein sequences. Sequences containing such motifs or domains serve as queries in subsequent searches against the same SRA dataset to extend the initial sequence until non-virus sequences/domains are encountered. This indicates either that an exogenous virus has been identified or that an endogenous virus lies within the host genome.
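The extension loop described above can be sketched roughly as follows. This is a minimal illustration, not the ViruSpy or EndoVir implementation: `extend_contig` and the three callables `find_reads`, `assemble`, and `looks_viral` are hypothetical names that stand in for the read search, the assembler, and the virus motif/domain check.

```python
from typing import Callable, Iterable, List, Tuple


def extend_contig(
    seed: str,
    find_reads: Callable[[str], Iterable[str]],
    assemble: Callable[[str, List[str]], str],
    looks_viral: Callable[[str], bool],
    max_rounds: int = 10,
) -> Tuple[str, str]:
    """Extend a putative virus contig round by round (hypothetical sketch).

    Returns the last contig that still looked viral and the reason the
    loop stopped: "no_reads", "no_growth", "non_viral", or "max_rounds".
    """
    contig = seed
    for _ in range(max_rounds):
        # Search the same read set again, using the current contig as query.
        reads = list(find_reads(contig))
        if not reads:
            return contig, "no_reads"
        # Let the (external) assembler merge the contig with the new reads.
        candidate = assemble(contig, reads)
        if len(candidate) <= len(contig):
            return contig, "no_growth"
        # Stop as soon as the extension runs into non-virus sequence.
        if not looks_viral(candidate):
            return contig, "non_viral"
        contig = candidate
    return contig, "max_rounds"
```

Passing the search, assembly, and classification steps in as callables is an assumption made here so that the loop itself stays independent of any particular external tool.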
Set up the analysis environment:
    git clone https://github.com/NCBI-Hackathons/EndoVir.git
    cd EndoVir
    mkdir -p work/analysis/dbs
    cd work/analysis/dbs
    wget ftp://ftp.ncbi.nlm.nih.gov/pub/mmdb/cdd/little_endian/Cdd_LE.tar.gz
    tar -xzvf Cdd_LE.tar.gz
All external tools currently have to be in $PATH. Please see the corresponding
README files for installation instructions.
- cd ../../../ (should be in EndoVir/)
    export ENDOVIR=$(pwd)
    mkdir tools
    cd tools
- MagicBLAST 1.3.0 [check for updates]
    wget ftp://ftp.ncbi.nlm.nih.gov/blast/executables/magicblast/LATEST/ncbi-magicblast-1.3.0-x64-linux.tar.gz
    tar -xvzf ncbi-magicblast-1.3.0-x64-linux.tar.gz
    export PATH=$PATH:$ENDOVIR/tools/ncbi-magicblast-1.3.0/bin/
- sra-toolkit [check for updates; you might want the Ubuntu build instead]
    wget https://ftp-trace.ncbi.nlm.nih.gov/sra/sdk/2.8.2-1/sratoolkit.2.8.2-1-centos_linux64.tar.gz
    tar -xvzf sratoolkit.2.8.2-1-centos_linux64.tar.gz
    export PATH=$PATH:$ENDOVIR/tools/sratoolkit.2.8.2-1-centos_linux64/bin/
- MEGAHIT
    git clone https://github.com/voutcn/megahit.git
    cd megahit
    make -j $(cat /proc/cpuinfo | grep processor | wc -l)   # -j n, where n is the number of cores you want to use
    export PATH=$PATH:$ENDOVIR/tools/megahit
- [ABYSS2]
    wget http://www.bcgsc.ca/platform/bioinfo/software/abyss/releases/2.0.2/abyss-2.0.2.tar.gz
- [SOAPdenovo]
- [SPADES]
    echo "export PATH=$PATH" >> ~/.bashrc   # only needed if you switch the console/login after install
    source ~/.bashrc                        # just in case
Change into your working directory, work, and run the pipeline:
    cd $ENDOVIR/work
    python3.6 ../src/endovir.py
The underlying design of endovir will facilitate the use of external tools, e.g.
assemblers or parsers, without changing the BUD routine itself. Further, the
results of the intermediate steps can be parsed and used to set the parameters
for each subsequent step in the analysis pipeline.
STDIN and STDOUT are used where possible to communicate between the external tools, which keeps the number of intermediate files to a minimum. In addition, only the Python standard libraries should be used.
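Chaining two external tools through STDIN/STDOUT with only the standard library could look roughly like this. It is a sketch, not the code in src: `pipe` is a hypothetical helper, and the two Python one-liners in the demo are stand-ins for real tools so the example runs without them installed.

```python
import subprocess
import sys
from typing import List


def pipe(producer_cmd: List[str], consumer_cmd: List[str]) -> str:
    """Run `producer | consumer` without an intermediate file.

    Returns the consumer's STDOUT as text (hypothetical helper).
    """
    producer = subprocess.Popen(producer_cmd, stdout=subprocess.PIPE)
    consumer = subprocess.Popen(
        consumer_cmd,
        stdin=producer.stdout,
        stdout=subprocess.PIPE,
        universal_newlines=True,  # text mode; also works on Python 3.6
    )
    # Close our copy of the pipe so the producer is notified
    # if the consumer exits early.
    producer.stdout.close()
    output, _ = consumer.communicate()
    producer.wait()
    if producer.returncode != 0 or consumer.returncode != 0:
        raise RuntimeError(
            "pipeline failed: producer={}, consumer={}".format(
                producer.returncode, consumer.returncode
            )
        )
    return output


if __name__ == "__main__":
    # Stand-ins: the first command emits two "reads", the second counts lines.
    emit = [sys.executable, "-c", "print('read1'); print('read2')"]
    count = [sys.executable, "-c", "import sys; print(sum(1 for _ in sys.stdin))"]
    print(pipe(emit, count).strip())
```

Connecting the producer's STDOUT directly to the consumer's STDIN means the data never has to be held in Python or written to disk.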
The pipeline has three major steps (in src):
- endovir.Endovir(): Creates the analysis environment and prepares the screening.
- screener.Screener(): Initiates the screen and identifies the initial, putative virus contigs.
- viruscontig.VirusContig(): Each putative virus contig is expanded and analyzed independently.
External screening tools, e.g. MagicBLAST, have wrappers and parsers in their
corresponding namespace in lib.
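A wrapper/parser pair of this kind might be structured as in the following sketch. The class names `ToolWrapper`, `TabularParser`, and `Hit` are hypothetical and do not reflect the actual classes in lib. The parser assumes tab-separated output whose first three columns are query, reference, and percent identity, and that lines starting with `#` are comments; check this against the output format of the tool you wrap.

```python
import subprocess
from typing import Iterator, List, NamedTuple, TextIO


class Hit(NamedTuple):
    """One alignment line (column meaning is an assumption, see above)."""
    query: str
    reference: str
    identity: float


class TabularParser:
    """Parse tab-separated alignment lines from any text stream."""

    def parse(self, stream: TextIO) -> Iterator[Hit]:
        for line in stream:
            line = line.rstrip("\r\n")
            if not line or line.startswith("#"):
                continue  # skip blank lines and comment/header lines
            cols = line.split("\t")
            yield Hit(cols[0], cols[1], float(cols[2]))


class ToolWrapper:
    """Run an external tool and hand its STDOUT to a parser."""

    def __init__(self, executable: str, parser: TabularParser) -> None:
        self.executable = executable
        self.parser = parser

    def run(self, args: List[str]) -> List[Hit]:
        proc = subprocess.Popen(
            [self.executable] + args,
            stdout=subprocess.PIPE,
            universal_newlines=True,
        )
        # Parse while the tool is still writing; no intermediate file needed.
        hits = list(self.parser.parse(proc.stdout))
        proc.stdout.close()
        if proc.wait() != 0:
            raise RuntimeError("{} exited with {}".format(self.executable, proc.returncode))
        return hits
```

Keeping the parser separate from the wrapper lets the same parser be tested on a plain string stream and lets a different tool be swapped in without touching the code that consumes the hits.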
Only the Python standard libraries are used. However, the pipeline depends on
several external tools which are called using the subprocess module:
- MagicBLAST
- BLAST+
- sra-toolkit
- MEGAHIT
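Because every tool has to be found in $PATH, a quick check before starting a run can save a failed analysis. The sketch below uses `shutil.which` from the standard library. The executable names in `REQUIRED_TOOLS` are assumptions about typical installs (for example `blastn` for BLAST+ and `fastq-dump` for sra-toolkit) and may need adjusting.

```python
import shutil
from typing import Dict, Iterable, List, Optional

# Assumed executable names; adjust to the tools actually called.
REQUIRED_TOOLS = ("magicblast", "blastn", "fastq-dump", "megahit")


def locate_tools(names: Iterable[str]) -> Dict[str, Optional[str]]:
    """Map each tool name to its full path, or None if it is not in PATH."""
    return {name: shutil.which(name) for name in names}


def missing_tools(names: Iterable[str]) -> List[str]:
    """Return the names of all tools that could not be found in PATH."""
    return [name for name, path in locate_tools(names).items() if path is None]


if __name__ == "__main__":
    missing = missing_tools(REQUIRED_TOOLS)
    if missing:
        print("Not found in PATH: " + ", ".join(missing))
    else:
        print("All external tools found.")
```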
