Virus-Detection

Detect viruses (known and novel) in human RNA-seq data.

Authors

Ziheng Chen (zihengc)
Yian Liao (yian-liao)
Aidan Place (aidanjplace)
Yizhou Wang (yizhou0201)

Installation

Install Snakemake via Conda

Follow Snakemake's instructions for installation via Conda. Make sure the Miniconda Python3 distribution is installed as instructed, because it handles all the software dependencies.

Download and `cd` into this repository

Download a local copy of this repository via

git clone https://github.com/CMU-03713/RNA2virus.git

Then cd into the cloned repository via

cd RNA2virus

All of the following steps should be run from inside this repository.

Required input files

Before running the pipeline, please download the following files and put them into the repository:

Reference human genome annotation GTF file: Required for STAR to build the human genome index. We recommend downloading the NCBI RefSeq GTF file from the UCSC Genome Browser via

wget --timestamping 'ftp://hgdownload.cse.ucsc.edu/goldenPath/hg38/bigZips/genes/hg38.ncbiRefSeq.gtf.gz'
gunzip hg38.ncbiRefSeq.gtf.gz

RNA-seq fastq files: These are the files in which you want to detect viral sequences. Download the RNA-seq fastq files into the data folder in this repository. For single-end reads, make sure the file is named sample_r1.fastq, where sample is your SRA number. For paired-end reads, make sure you have two files, sample_r1.fastq and sample_r2.fastq.
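Before starting a long run, it can help to sanity-check the paired-end inputs described above. The following is a minimal sketch, not part of the repository: `fastq_reads` and `check_pair` are hypothetical helper names, and the check assumes unwrapped FASTQ files with exactly four lines per record.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not shipped with this repository).

# fastq_reads: print the number of records in a FASTQ file,
# assuming the common 4-lines-per-record layout.
fastq_reads() {
    local lines
    lines=$(wc -l < "$1")
    echo $(( lines / 4 ))
}

# check_pair: succeed only if sample_r1.fastq and sample_r2.fastq both
# exist in the data directory and contain the same number of records.
check_pair() {
    local sample="$1" data_dir="${2:-data}"
    local r1="$data_dir/${sample}_r1.fastq"
    local r2="$data_dir/${sample}_r2.fastq"
    [ -f "$r1" ] && [ -f "$r2" ] || return 1
    [ "$(fastq_reads "$r1")" -eq "$(fastq_reads "$r2")" ]
}
```

For example, `check_pair sample && echo "pair looks consistent"` checks `data/sample_r1.fastq` against `data/sample_r2.fastq`. A mismatch in record counts usually points to a truncated download.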

Configure the workflow

Edit the config.yaml

Edit the config.yaml according to the instructions in it.

Run the workflow

First, activate Snakemake via

conda activate snakemake

Then run install.sh to install the necessary software and build a human genome index via

vim config.yaml # add the GTF file directory and change the output directory of STAR hg38 genome index.
bash install.sh {cores}

replacing {cores} with the number of cores you have available. This step builds a human genome index, which requires at least 40 GB of RAM; with less, the step may fail or be killed. It also takes a long time to run. As a reference, it takes around 30 minutes on an interactive RM node on PSC Bridges-2 with 16 cores.
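Since the index build needs roughly 40 GB of RAM, a quick check before launching install.sh can save a failed run. The sketch below is an illustration only: `mem_total_gb` and `enough_ram` are hypothetical helpers, and the check assumes a Linux system that exposes `/proc/meminfo`. The kernel reports slightly less than the installed memory, so treat a value near the threshold as borderline rather than as a hard failure.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not shipped with this repository); Linux only.

# mem_total_gb: print total RAM in whole GiB, read from a meminfo-style file.
mem_total_gb() {
    local meminfo="${1:-/proc/meminfo}"
    awk '/^MemTotal:/ { printf "%d\n", $2 / 1024 / 1024 }' "$meminfo"
}

# enough_ram: succeed if the given amount (GiB) meets the threshold
# (default 40, taken from the requirement stated above).
enough_ram() {
    local have_gb="$1" need_gb="${2:-40}"
    [ "$have_gb" -ge "$need_gb" ]
}

# Usage (not executed here):
#   if enough_ram "$(mem_total_gb)"; then
#       bash install.sh 16
#   else
#       echo "Less than 40 GB of RAM detected; the index build may be killed." >&2
#   fi
```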

If you have single-end reads, run the single-end pipeline via

vim config.yaml # confirm genomeDir is the directory of STAR hg38 genome index. No need to change if the hg38 genome index is built from install.sh.
bash master.sh SE {cores} {sample_r1}

replacing {cores} with the number of cores you have available and {sample_r1} with the name of your RNA-seq fastq file, without the .fastq extension. At this step, the sample fastq file should be in the data folder and named sample_r1.fastq.

If you have paired-end reads, run the paired-end pipeline via

vim config.yaml # confirm genomeDir is the directory of STAR hg38 genome index. No need to change if the hg38 genome index is built from install.sh.
bash master.sh PE {cores} {sample}

replacing {cores} with the number of cores you have available and {sample} with the name of your RNA-seq sample. At this step, the sample fastq files should be in the data folder and named sample_r1.fastq and sample_r2.fastq.
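When the data folder holds several samples, the two invocations above can be generated from the file names. The following dry-run sketch only prints the commands and does not run them. `print_master_commands` is a hypothetical helper, and it assumes that a sample with an `_r2.fastq` file is paired-end and any other sample is single-end.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not shipped with this repository).

# print_master_commands: print one master.sh command per sample found in
# the data directory, choosing SE or PE from the naming convention.
print_master_commands() {
    local cores="$1" data_dir="${2:-data}" r1 sample
    for r1 in "$data_dir"/*_r1.fastq; do
        [ -e "$r1" ] || continue   # no matches: the glob stays literal
        sample="$(basename "$r1" _r1.fastq)"
        if [ -f "$data_dir/${sample}_r2.fastq" ]; then
            # Paired end: master.sh takes the bare sample name.
            echo "bash master.sh PE $cores $sample"
        else
            # Single end: master.sh takes the file name without .fastq.
            echo "bash master.sh SE $cores ${sample}_r1"
        fi
    done
}
```

Running `print_master_commands 16` from the repository root lists the commands for review; they can then be run one at a time.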

Output files description

All output files for a single-end sample (sample_r1.fastq) or a paired-end sample (sample_r1.fastq and sample_r2.fastq) are written to a directory named after your sample, inside the repository folder. This directory contains the following:

Trimmed sequences of the raw sequencing files named _trimmed.fasta in /trimmed_fastq directory.
Quality control of the raw sequence data named _fastqc.html and _fastqc.zip in /fastqc_report directory.
RNA-seq alignment to human genome named Aligned.out.sam in /star_aligned directory.
A summary of the RNA-seq alignment to human genome named Log.final.out in /star_aligned directory. (For more information about the STAR output in the /star_aligned directory, refer to the STAR User Manual.)
RNA-seq reads unmapped to human genome in bam format named aligned_unmapped.bam in /star_unmapped directory.
RNA-seq reads unmapped to human genome in fastq format: aligned_unmapped.fq for single-end data, or aligned_unmapped1.fq and aligned_unmapped2.fq for paired-end data, in /star_unmapped directory.
Assembled contigs named final.contigs.fa in /assembled_contigs directory.
BLAST report named blast_out.txt in /blast_result directory.
Open Reading Frame report named contigsWithOrf.fasta in /ORFfinder directory.
Secondary RNA structures named secondary_structure.str in /RNAfold directory.
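After a run finishes, the fixed-name outputs listed above can be checked in one pass. This is a rough sketch under two assumptions: the listed directories sit directly under the sample's output directory, and `missing_outputs` is a hypothetical helper rather than part of the pipeline. Files whose names depend on the sample (the trimmed reads and FastQC reports) and the layout-dependent unmapped fastq files are left out.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not shipped with this repository).

# missing_outputs: print each expected output that is absent from the
# given sample output directory; return non-zero if any are missing.
missing_outputs() {
    local out_dir="$1" f status=0
    for f in \
        star_aligned/Aligned.out.sam \
        star_aligned/Log.final.out \
        star_unmapped/aligned_unmapped.bam \
        assembled_contigs/final.contigs.fa \
        blast_result/blast_out.txt \
        ORFfinder/contigsWithOrf.fasta \
        RNAfold/secondary_structure.str
    do
        if [ ! -e "$out_dir/$f" ]; then
            echo "$f"
            status=1
        fi
    done
    return "$status"
}
```

For example, `missing_outputs sample || echo "run incomplete"` prints the files that were not produced, which indicates the stage where the pipeline stopped.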
