Description
Chanson found some strange results when processing the same samples in two runs: 2x250bp and 2x150bp. One of the main concerns was the appearance of a second genotype in several samples, but the second genotype had spotty coverage. Perhaps the shorter reads in 2x150bp runs allowed some reads to neatly fit a conserved region.
Maybe we need to look at the coverage levels when choosing which genotypes to keep during remap. Perhaps it would be helpful to split the reads in half and see if both halves still map to the same reference.
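The two ideas above could be sketched roughly as follows. This is only an illustration, not MiCall's remap code: the function names, the `min_depth` and `min_fraction` thresholds, and the reference names in the demo are all assumptions made up for the example, and the actual mapping of each half-read (for example with bowtie2) is left out.

```python
def split_read(seq, qual):
    """Split a read and its quality string into a left and a right half."""
    mid = len(seq) // 2
    return (seq[:mid], qual[:mid]), (seq[mid:], qual[mid:])


def halves_agree(left_ref, right_ref):
    """True when both halves mapped, and mapped to the same reference.

    left_ref and right_ref are the reference names reported by the
    aligner for each half, or None when that half did not map.
    """
    return left_ref is not None and left_ref == right_ref


def covered_fraction(depths, min_depth=10):
    """Fraction of reference positions with at least min_depth reads."""
    if not depths:
        return 0.0
    return sum(1 for depth in depths if depth >= min_depth) / len(depths)


def choose_references(depths_by_ref, min_depth=10, min_fraction=0.9):
    """Keep only references whose coverage is not spotty.

    depths_by_ref maps a reference name to its per-position read depths.
    The thresholds are placeholder values, not tuned ones.
    """
    return sorted(ref
                  for ref, depths in depths_by_ref.items()
                  if covered_fraction(depths, min_depth) >= min_fraction)


if __name__ == '__main__':
    # Hypothetical depths: one reference fully covered, one with a hole.
    demo = {'HCV-1a': [50] * 100,
            'HCV-4r': [50] * 30 + [0] * 40 + [50] * 30}
    print(choose_references(demo))
```

With the demo data, only the fully covered reference is kept, because the second one has 60% of its positions covered, which is below the assumed 90% cutoff. A per-region cutoff or a minimum read count might turn out to be a better criterion.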
Here's the full description of the problems from Chanson:
The three sets of runs are:
Run 1: 9-Sep-2016.M04401 (2x150) and 9-Sep-2016.M01841 (2x250)
Run 2: 9-Dec-2016.M04401 (2x150) and 9-Dec-2016.M01841 (2x250)
Run 3: 16-Dec-2016.M04401 (2x150) and 16-Dec-2016.M01841 (2x250)
- In general there are more cases of some reads mapping to a "wrong" second genotype in the 2x150bp runs. So, more random "black" boxes in micall for the 2x150 that are grey in 2x250.
Run1: 73265A (NS3 - GT1a), 73266A (NS3, NS5a, NS5b - GT1b), 73270A (NS3 - GT1b; here, the 2x250 is the one being stupid)
Run3: 73512A (NS3, NS5a, NS5b - GT1b)
- There is one case where the 2x150bp reads map to the wrong genotype and fail to produce full coverage across the gene, while the 2x250 reads map to the correct genotype and pass.
Run2: E105648 MIDI (NS5b): 2x250 correctly maps to GT1a, 2x150 maps to GT4r and leaves a huge hole in the middle
- One huge difference in AA frequency, possibly due to reads mapping to the wrong seed (see point 1 above)?
Run1: 73265A (NS3-GT4a): 2x150 claims 99% "E" at codon 570, 2x250 says it's 99% "K"
- One huge difference in AA frequency due to indels in one of the sample-specific consensuses (2x150 is right here)
Run1: 73265A (NS5a-GT4a): The 2x250 result claims that there is an insertion after codon 234, and a corresponding (partial) deletion at codon 238. The 2x150 has neither of these.
- There are a few cases of large-ish differences in AA% between runs, but these are mostly due to low coverage overall. One minor exception:
Run2: 73491A (NS5a-GT1a): Codon 219 is 56% A, 44% G in 2x150; but 35% A, 65% G in 2x250. I'm not too worried about this one.