Home
Turner Duncan 10/30/2017 STAT 555 Project Proposal
The goal of my proposed project is to determine what errors can arise in RNA-Seq differential expression analysis when reference genomes of different quality (high quality vs. lower quality) are used in the RNA-Seq analysis pipeline. I would perform this project using two different Zebra Finch (Taeniopygia guttata) genomes: 1) a high-quality Long Read Reference (LRR) assembly and 2) a lower-quality Short Read Reference (SRR) assembly. The details of the two reference assemblies I would use are below:
Long Read Reference (LRR): Tgut_diploid_1.0 (Pacific Biosciences) https://www.ncbi.nlm.nih.gov/assembly/GCA_002008985.2/
- Total sequence length: 1,982,686,095
- Total assembly gap length: 0
- Number of contigs: 3,347
- Contig N50: 4,297,012
- Contig L50: 119
Short Read Reference (SRR): Taeniopygia_guttata-3.2.4 (Washington University) https://www.ncbi.nlm.nih.gov/assembly/GCF_000151805.1
- Total sequence length: 1,232,135,591
- Total assembly gap length: 9,270,900
- Number of contigs: 124,806
- Contig N50: 38,639
- Contig L50: 8,016
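For readers unfamiliar with the contiguity statistics above, here is a minimal sketch of how contig N50 and L50 are commonly defined: sort contigs from longest to shortest, accumulate lengths until at least half of the total assembly length is covered, then report the length of the last contig added (N50) and how many contigs were needed (L50). The function name `n50_l50` and the toy lengths are illustrative assumptions, not values from either assembly.

```python
def n50_l50(contig_lengths):
    """Return (N50, L50) for a list of contig lengths."""
    if not contig_lengths:
        raise ValueError("need at least one contig length")
    # Longest contigs first.
    ordered = sorted(contig_lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        # Stop at the first contig that brings coverage to >= 50% of the total.
        if running >= half_total:
            return length, count


# Toy lengths for illustration only (hypothetical, not real assembly data).
toy_lengths = [10, 9, 8, 7, 6, 5, 4, 3, 2]
n50, l50 = n50_l50(toy_lengths)
print(f"N50 = {n50}, L50 = {l50}")
```

A larger N50 together with a smaller L50, as reported for the LRR, means that fewer, longer contigs cover half of the assembly.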
Recently published work shows that the most recent LRR for the Zebra Finch is ~150-fold more contiguous than the previous SRR. https://academic.oup.com/gigascience/article/doi/10.1093/gigascience/gix085/4096262/De-Novo-PacBio-long-read-and-phased-avian-genome#97710651
Additionally, the LRR used the PacBio Iso-Seq (full-length isoform) method instead of Illumina-based RNA-Seq for transcriptome annotation and found 5.2% more isoforms than the previous reference genome. http://www.pacb.com/wp-content/uploads/Vierra-G10K-2017-From-RNA-to-Full-Length-Transcripts-1.pdf
Because the greatly improved Zebra Finch reference is a recent release (Oct 1, 2017), I expect that anyone who performed a differential expression analysis with Zebra Finch as a model organism prior to Oct 1, 2017 would get different results with the new reference, because they used a significantly inferior SRR reference for the mapping step of their RNA-Seq analysis. I would like to quantify those differences.
To examine this, I propose to use data from the publication "Transcriptional response to West Nile virus infection in the zebra finch (Taeniopygia guttata)", which performed differential expression analysis using the SRR genome as the reference for the mapping step of RNA-Seq analysis. For my project I will reproduce parts of this publication using the SRR assembly and also perform novel differential expression comparisons using the newly released LRR assembly for the mapping step.
In this publication, the authors compare transcript expression in female Taeniopygia guttata individuals infected with West Nile virus (WNV), using uninfected birds as controls and sampling infected birds at 4 days post infection (DPI). Sequencing was Illumina RNA-Seq on a HiSeq2500 with paired-end 2x100bp reads from transcriptomic cDNA. The samples for this project can be downloaded from SRA:
- Sample 1, Control: SRR5001851 https://www.ncbi.nlm.nih.gov/biosample/SAMN05981661
- Sample 2, Control: SRR5001848 https://www.ncbi.nlm.nih.gov/biosample/SAMN05981662
- Sample 3, Control: SRR5001850 https://www.ncbi.nlm.nih.gov/biosample/SAMN05981663
- Sample 1, 2 DPI: SRR5001849
- Sample 2, 2 DPI: SRR5001845
- Sample 3, 2 DPI: SRR5001843
- Sample 1, 4 DPI: SRR5001844 https://www.ncbi.nlm.nih.gov/biosample/SAMN05981679
- Sample 2, 4 DPI: SRR5001847 https://www.ncbi.nlm.nih.gov/biosample/SAMN05981680
- Sample 3, 4 DPI: SRR5001846 https://www.ncbi.nlm.nih.gov/biosample/SAMN05981681
The pipeline I propose to get expression data for each sample is:
- Alignment via hisat2
- Quantification via featureCounts
- Differential expression via DESeq2
I learned this pipeline in a previous class, and I have the computing power and knowledge to run the alignments for each sample. https://www.biostarhandbook.com/rnaseq/rnaseq-griffith-control.html
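As a rough sketch of the first two pipeline steps, the Python snippet below only builds and prints the hisat2 and featureCounts command lines for each sample against each reference; it does not run them. The index prefixes, annotation file names, output names, and the `<accession>_1.fastq` / `<accession>_2.fastq` naming are all assumptions for illustration. The flags used (`-p`, `-x`, `-1`, `-2`, `-S` for hisat2; `-p`, `-a`, `-o` for featureCounts) are standard options, though newer featureCounts releases may need an additional option to count read pairs. The DESeq2 step runs in R and is not shown.

```python
import shlex

# Accessions and conditions as listed in this proposal.
SAMPLES = {
    "SRR5001851": "control", "SRR5001848": "control", "SRR5001850": "control",
    "SRR5001849": "2dpi", "SRR5001845": "2dpi", "SRR5001843": "2dpi",
    "SRR5001844": "4dpi", "SRR5001847": "4dpi", "SRR5001846": "4dpi",
}

# Hypothetical paths: HISAT2 index prefix and annotation file per reference.
REFERENCES = {
    "SRR": {"index": "index/srr", "gtf": "annotation/srr.gtf"},
    "LRR": {"index": "index/lrr", "gtf": "annotation/lrr.gtf"},
}


def hisat2_cmd(index_prefix, accession, out_sam, threads=4):
    """Build a paired-end hisat2 alignment command (FASTQ names are assumed)."""
    return [
        "hisat2", "-p", str(threads), "-x", index_prefix,
        "-1", f"{accession}_1.fastq", "-2", f"{accession}_2.fastq",
        "-S", out_sam,
    ]


def featurecounts_cmd(annotation_gtf, alignment_files, out_path):
    """Build a featureCounts command; -p marks the input as paired-end."""
    return ["featureCounts", "-p", "-a", annotation_gtf, "-o", out_path,
            *alignment_files]


for ref_name, ref in REFERENCES.items():
    sams = []
    for accession in SAMPLES:
        sam = f"{ref_name}_{accession}.sam"
        sams.append(sam)
        print(shlex.join(hisat2_cmd(ref["index"], accession, sam)))
    print(shlex.join(featurecounts_cmd(ref["gtf"], sams, f"{ref_name}_counts.txt")))
```

Keeping the reference name in every output file name makes it harder to mix up SRR-based and LRR-based counts later in the analysis.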
The differential expression (DE) comparisons I plan to make are outlined below. One of them is the same DE analysis as in the previously published paper; the other three are new comparisons that I am making. I should be able to reproduce the first comparison below. The other comparisons will allow me to characterize the genes that appear differentially expressed depending on the quality of the reference genome used.
- SRR Control vs SRR Infected (this would be the same as Control vs 4DPI in the previously published paper and would show that my scripts are correct)
- LRR Control vs LRR Infected (this would be unique to my project)
- SRR vs LRR (this is the one I am doing differently; it would show which genes are differentially expressed between the SRR and LRR reference genomes)
- Control vs Infected (this would be unique to my project)
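One simple way to summarize how the Control vs Infected results differ between references is to compare the sets of significant genes from each run. The sketch below is only an illustration: the function name `compare_de_sets` and the gene identifiers are placeholders, and it assumes gene identifiers can be matched between the SRR and LRR annotations, which may require a separate mapping step.

```python
def compare_de_sets(srr_genes, lrr_genes):
    """Summarize overlap between two collections of significant gene IDs."""
    srr, lrr = set(srr_genes), set(lrr_genes)
    shared = srr & lrr
    union = srr | lrr
    return {
        "shared": shared,            # called significant with both references
        "srr_only": srr - lrr,       # significant only with the SRR reference
        "lrr_only": lrr - srr,       # significant only with the LRR reference
        # Jaccard index: 1.0 means identical sets, 0.0 means no overlap.
        "jaccard": len(shared) / len(union) if union else 1.0,
    }


# Placeholder gene IDs (hypothetical, not results from any analysis).
summary = compare_de_sets(["geneA", "geneB", "geneC"], ["geneB", "geneC", "geneD"])
print(sorted(summary["shared"]), sorted(summary["srr_only"]),
      sorted(summary["lrr_only"]), summary["jaccard"])
```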
I should also be able to compare the percentage of reads that map to each reference genome for each sample using HISAT2. Here I would expect a higher percentage of reads to map to the LRR assembly than to the SRR assembly.
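HISAT2 writes an alignment summary that ends with a line of the form "NN.NN% overall alignment rate". A minimal sketch for extracting that number from a saved summary is shown below; the function name and the example percentages are hypothetical placeholders, not measured results.

```python
import re

# Matches the final line of a HISAT2 alignment summary.
_RATE_PATTERN = re.compile(r"([\d.]+)% overall alignment rate")


def overall_alignment_rate(summary_text):
    """Return the overall alignment rate (in percent) from a HISAT2 summary."""
    match = _RATE_PATTERN.search(summary_text)
    if match is None:
        raise ValueError("no 'overall alignment rate' line found")
    return float(match.group(1))


# Made-up summary lines for illustration only.
example_summaries = {
    "SRR": "90.00% overall alignment rate",
    "LRR": "95.00% overall alignment rate",
}
for ref_name, text in example_summaries.items():
    print(ref_name, overall_alignment_rate(text))
```

Collecting this value for every sample against both references would give a per-sample paired comparison of mapping rates.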