Skip to content

Latest commit

 

History

History
92 lines (63 loc) · 6.14 KB

File metadata and controls

92 lines (63 loc) · 6.14 KB

RNASeq Workflow

NOTICE: This protocol is provided for informational purposes only. Although, originally based off of GATK and others' best practices at the time of development (~4-5 years ago), this may not be the case now. Any re-processing efforts should make use of current tools and best practices.

Those interested in the processed BeatAML data should retrieve the harmonized versions from the GDC (coming soon)

Environment setup

Below is the expected setup for the programs and metadata files:

  • resource_dir
    • pipeline
      • Homo_sapiens.GRCh37.75.gtf--GTF file from Ensembl
      • subread v1.5.0-p2/
        • Subread index files from human_g1k_v37.fasta (see the WES README)
      • tophat
        • ensGene.txt
        • refGene.txt
        • blast/--NCBI BLAST human_genomic and nt database files
    • programs
      • Trimmomatic-0.36/
      • subread-1.5.0-p2/
      • bowtie-1.1.1/
      • ncbi-blast-2.2.30+/
      • samtools-1.1/
      • tophat-2.0.14-build/

This protocol utilizes GATK Queue to manage and parallelize the workflow. The main Queue.jar file (v3.3-3-g3e87c07) can reside in resource_dir/programs above or elsewhere.

1. Stage Fastq data

The following is assumed for the below protocol:

  1. The data are paired-end and are of length 100 bases generated by an Illumina sequencer
  2. The data are stranded with the reads derived from the reverse strand, however in this case the protocol can be modified to account in the alignment step to deal with non-stranded or stranded reads that are derived from the forward strand.

Create an appropriate folder indicating the samples that are to be run for instance SampleGroup10/FlowCell5. In this directory the fastq data should be placed with the top level directory set as the flowcell identifier (e.g. SampleGroup10/FlowCell5/160607_D00735_0122_AC9561ANXX).

The below workflow further assumes that the fastq files are in a subdirectory prefixed by 'RNA' and have the read 1 files named like ^RNA.*_R1_.*fastq.gz$. Samples are assumed to be named like '\d{2}-\d{5}'.

2. Summarize the FastQC results provided by the core.

From within the SampleGroup folder, an HTML summary of the Fastqc files provided by the sequencing core can be produced.

bash$ R
> source("/path/to/beataml_genomics/baml_helpers.R")
> make.fastqc.report("FlowCell1/160607_D00735_0122_AC9561ANXX/FastQC/", report.name="SGX_FC1_FastQC.html")

This generates an HTML document called SGX_FC1_FastQC.html which contains a table of all the FastQC plots and an overall summary.

3. Alignments and Gene-level Counting

For alignments we use the subread suite of tools. Included are the subjunc gapped aligner and the featureCounts utility which can provide read counts relative to specified gene modes. Also, note that trimming of reads at both 3' and 5' end is first done by trimmomatic similar to the WES protocol. Although subjunc can automatically trim the reads prior to aligning, the processed FASTQ files are also needed for the TopHat protocol below. The subjunc/featureCounts version is set inside the run_gatk.sh script and is used to find the appropriate indices and executables. An example of how to run the protocol given below:

bash$ (time sh run_gatk.sh subjunc_rnaseq /home/exacloud/lustre1/NGSdev/rnaseq_out/SampleGroup9 fr 2 trim FlowCell1/160706_D00735_0130_AC9FC3ANXX/RNA160513JT/RNA160513JT_* ) >> SG9_fc1_subjunc_count.log 2>&1

The fr notation should indicate the orientation of the read pairs as is supplied to subjunc. Similarly 2 indicates that the prep is stranded but on the negative strand.

Note that the protocol assumes that there is something in the absolute path of each FASTQ file that indicates SampleGroup and FlowCell. If not, it will default to SampleGroupX and FlowCellY.

Within the output directory the following are created:

  • subjunc_alignments/ : This contains the subjunc output BAM files & also Sorted BAM files. This folder will also have the Splice junction outputs for each sample as *.bed files.
  • logs/ : All the .err, .out files that contain the stderr outputs from subjunc and featureCounts are captured in this folder. These *.err files are later on used to extract alignment metrics.
  • *.featureCounts.txt & *.featureCounts.txt.summary files. Note that w.r.t to featureCounts outputs they are output per sample and there is also a combined_featureCounts.txt file that will have all the samples combined in a single file.

4. Creation of expression matrices

In order to make the featureCounts output more user friendly gene symbols are added and the ';' delimited columns which are provided per exon are summarized to more useful values at the gene level. A list of paths to the combined featureCounts output from the alignment step is necessary. This can be done per flowcell, but is probably most useful per SampleGroup.

bash$ R
> source("/path/to/beataml_genomics/baml_helpers.R")
> annotate.gene.counts(count.matrices=c("FlowCell1/combined_featureCounts.txt", "FlowCell2/combined_featureCounts.txt"), out.file.base="SampleGroup9_subjunc_v1_5_p2", include.log=T)

Three files are written to the current directory:

  • a raw counts matrix ("rawcounts.csv")
  • a counts per million (CPM) matrix ("cpm.csv")
  • optionally a log2 CPM matrix ("log2_cpm.csv")

5. Tophat-Fusion to identify Gene Fusions

The Tophat-fusion approach is currrently used to detect fusions in RNA-Seq. As this program is based on the tophat spliced realigner as opposed to subjunc, it has to be run from scratch given the 'processed' FASTQ files in the SampleGroupX/FlowCellY folders. The current approach is to run tophat on every sample and then batch samples in groups (as specified to run_gatk.sh). This is due to tophat-fusion having a rather long runtime and requiring that it is to be run in a folder of one or more samples. The resulting output file, summarized from each run group ('tophat_fusion_results_with_blast.txt') is produced in the specified output directory as is the one to be distributed. It can be run as follows:

bash$ (time sh run_gatk.sh tophat_fusion_search TophatFusion 5 FlowCell*/processed_fastq/*.trim.fastq.gz) >> sg9_tophat_fusion.log 2>&1