Detect viruses (known and novel) in human RNA-seq data.
- Ziheng Chen (zihengc)
- Yian Liao (yian-liao)
- Aidan Place (aidanjplace)
- Yizhou Wang (yizhou0201)
Follow Snakemake's instructions for installation via Conda. Make sure the Miniconda Python 3 distribution is installed as instructed, because it handles all the software dependencies.
Download a local copy of this repository via
git clone https://github.com/CMU-03713/RNA2virus.git
Then cd into the RNA2virus repository via
cd RNA2virus
All the following work should be done in this repository.
Before running the pipeline, please download the following files and put them into the repository:
- Reference human genome annotation gtf file: Required for STAR to build human genome index. We recommend downloading the NCBI RefSeq GTF file through UCSC genome browser via
wget --timestamping 'ftp://hgdownload.cse.ucsc.edu/goldenPath/hg38/bigZips/genes/hg38.ncbiRefSeq.gtf.gz'
gunzip hg38.ncbiRefSeq.gtf.gz   # equivalently: gzip -d hg38.ncbiRefSeq.gtf.gz (run only one of the two)
- RNA-seq fastq files: These are the files in which you want to detect viral sequences. Download the RNA-seq fastq files into the data folder in this repository. If your data are single-end reads, make sure the file is named sample_r1.fastq, where sample is your SRA number. If your data are paired-end reads, make sure you have two files, sample_r1.fastq and sample_r2.fastq.
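The naming convention above can be checked before starting a run. The following is a minimal sketch, not part of the repository: `detect_layout` is a hypothetical helper that reports whether a sample looks single-end (SE) or paired-end (PE), assuming uncompressed files named exactly as described above.

```shell
# Hypothetical helper (not part of RNA2virus): report the read layout of a
# sample from the file names found in the data folder.
# Usage: detect_layout <data_dir> <sample>
detect_layout() {
    data_dir=$1
    sample=$2
    if [ ! -f "$data_dir/${sample}_r1.fastq" ]; then
        echo "missing"      # no <sample>_r1.fastq, so there is no input to run on
        return 1
    fi
    if [ -f "$data_dir/${sample}_r2.fastq" ]; then
        echo "PE"           # both mate files are present
    else
        echo "SE"           # only the _r1 file is present
    fi
}

# Example call (SRR0000000 is a placeholder accession):
# detect_layout data SRR0000000
```

A result of "missing" usually means the file is misnamed, still compressed, or stored outside the data folder.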
Edit the config.yaml according to the instructions in it.
- First, activate Snakemake via
conda activate snakemake
- Then run install.sh to install the necessary software and build a human genome index via
vim config.yaml # add the GTF file directory and change the output directory of STAR hg38 genome index.
bash install.sh {cores}
replacing {cores} with the number of cores you have available.
This step builds a human genome index, which requires at least 40 GB of RAM. If less than 40 GB is available, the step may fail or be killed. It is also expected to take a long time to run: as a reference, it takes around 30 minutes on an interactive RM node on PSC Bridges-2 with 16 cores.
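Because the index build can be killed when memory runs short, it may help to check the machine before calling install.sh. The sketch below is an assumption-laden pre-flight check, not something install.sh does itself: `has_enough_ram` is a hypothetical helper, the check is Linux-only (it reads /proc/meminfo), and it treats 40 GB as 40 x 1024 x 1024 kB. On a cluster, the memory granted to your job may be lower than the node total reported here.

```shell
# Hypothetical pre-flight check (not part of install.sh).
# has_enough_ram <total_kB> <required_GB>: succeeds if total_kB covers required_GB.
has_enough_ram() {
    required_kb=$(( $2 * 1024 * 1024 ))
    [ "$1" -ge "$required_kb" ]
}

# On Linux, the MemTotal line of /proc/meminfo is reported in kB.
if [ -r /proc/meminfo ]; then
    total_kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
    if has_enough_ram "$total_kb" 40; then
        echo "RAM check passed: total memory is ${total_kb} kB"
    else
        echo "Warning: less than 40 GB of RAM; building the STAR index may fail" >&2
    fi
fi
```

MemTotal is typically slightly below the nominal installed memory, so a machine sold as "40 GB" may still trigger the warning.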
- If you have single-end reads, run the pipeline for single-end data via
vim config.yaml # confirm genomeDir is the directory of STAR hg38 genome index. No need to change if the hg38 genome index is built from install.sh.
bash master.sh SE {cores} {sample_r1}
replacing {cores} with the number of cores you have available and {sample_r1} with the name of your RNA-seq fastq file, without the .fastq extension. At this step, the sample fastq file should be in the data folder and named sample_r1.fastq.
- If you have paired-end reads, run the pipeline for paired-end data via
vim config.yaml # confirm genomeDir is the directory of STAR hg38 genome index. No need to change if the hg38 genome index is built from install.sh.
bash master.sh PE {cores} {sample}
replacing {cores} with the number of cores you have available and {sample} with the name of your RNA-seq sample. At this step, the sample fastq files should be in the data folder and named sample_r1.fastq and sample_r2.fastq.
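The two invocations differ in their last argument: as described above, single-end mode takes the file name without the .fastq extension (so it ends in _r1), while paired-end mode takes the bare sample name. The sketch below makes that difference explicit. `master_cmd` is a hypothetical helper that only prints the command it would run, and the SE argument form reflects our reading of the instructions above rather than a check of master.sh itself.

```shell
# Hypothetical helper (not part of RNA2virus): print the master.sh command
# for one sample without running it.
# Usage: master_cmd <SE|PE> <cores> <sample>
master_cmd() {
    case $1 in
        SE) echo "bash master.sh SE $2 ${3}_r1" ;;   # file name minus .fastq
        PE) echo "bash master.sh PE $2 $3" ;;        # bare sample name
        *)  echo "unknown mode: $1 (expected SE or PE)" >&2
            return 1 ;;
    esac
}

# Example calls (SRR0000000 is a placeholder accession):
# master_cmd SE 8 SRR0000000
# master_cmd PE 8 SRR0000000
```

Printing the command first makes it easy to review before pasting it into the terminal.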
All output files for the single-end sample_r1.fastq, or for the paired-end sample_r1.fastq and sample_r2.fastq, will be put into a directory with the same name as your sample, inside the repository folder. Inside this directory, there will be the following:
- Trimmed sequences of the raw sequencing files, named _trimmed.fasta, in the /trimmed_fastq directory.
- Quality control reports of the raw sequence data, named _fastqc.html and _fastqc.zip, in the /fastqc_report directory.
- RNA-seq alignment to the human genome, named Aligned.out.sam, in the /star_aligned directory.
- A summary of the RNA-seq alignment to the human genome, named Log.final.out, in the /star_aligned directory. (For more information on the STAR output in the /star_aligned directory, refer to the STAR User Manual.)
- RNA-seq reads unmapped to the human genome in bam format, named aligned_unmapped.bam, in the /star_unmapped directory.
- RNA-seq reads unmapped to the human genome in fastq format: aligned_unmapped.fq for single-end data, or aligned_unmapped1.fq and aligned_unmapped2.fq for paired-end data, in the /star_unmapped directory.
- Assembled contigs, named final.contigs.fa, in the /assembled_contigs directory.
- BLAST report, named blast_out.txt, in the /blast_result directory.
- Open Reading Frame report, named contigsWithOrf.fasta, in the /ORFfinder directory.
- Secondary RNA structures, named secondary_structure.str, in the /RNAfold directory.
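After a run, the output directory can be checked against the list above. The following sketch is not part of the pipeline: `missing_outputs` is a hypothetical helper that assumes the subdirectories sit directly inside the sample's output directory. It checks only the files with fixed names, since the trimmed reads, FastQC reports, and unmapped fastq files have names that depend on the sample or read layout.

```shell
# Hypothetical helper (not part of RNA2virus): print every expected output
# file that is absent from a sample's output directory.
# Usage: missing_outputs <sample_output_dir>
missing_outputs() {
    out_dir=$1
    for rel in \
        star_aligned/Aligned.out.sam \
        star_aligned/Log.final.out \
        star_unmapped/aligned_unmapped.bam \
        assembled_contigs/final.contigs.fa \
        blast_result/blast_out.txt \
        ORFfinder/contigsWithOrf.fasta \
        RNAfold/secondary_structure.str
    do
        # Print the relative path only when the file does not exist.
        [ -e "$out_dir/$rel" ] || echo "$rel"
    done
}

# Example call (SRR0000000 is a placeholder sample directory):
# missing_outputs SRR0000000
```

Empty output means every fixed-name file was found. The first missing file in the list is usually the best hint as to which step stopped.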