Consider splitting reads when multiple genotypes are present #384

@donkirkby

Chanson found some strange results when processing the same samples in two runs: 2x250bp and 2x150bp. One of the main concerns was the appearance of a second genotype in several samples, but the second genotype had spotty coverage. Perhaps the shorter reads in 2x150bp runs allowed some reads to neatly fit a conserved region.

Maybe we need to look at the coverage levels when choosing which genotypes to keep during remap. Perhaps it would be helpful to split the reads in half and see if both halves still map to the same reference.
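
As a rough sketch of both ideas, assuming the half reads are mapped by some separate step (bowtie2 or otherwise) that is not shown here: the helper names `split_read`, `find_discordant_reads`, and `covered_fraction`, the `/a` and `/b` name suffixes, and the `min_depth` default are all made up for illustration and are not part of MiCall.

```python
from collections import defaultdict


def split_read(name, seq, qual):
    """Split one read into two half reads, tagging the names with /a and /b."""
    mid = len(seq) // 2
    return [(name + '/a', seq[:mid], qual[:mid]),
            (name + '/b', seq[mid:], qual[mid:])]


def find_discordant_reads(half_mappings):
    """Return the names of reads whose halves mapped to different references.

    half_mappings: dict of half-read name -> reference name, with None for
    an unmapped half. A read with one mapped and one unmapped half counts
    as discordant.
    """
    refs_by_read = defaultdict(set)
    for half_name, ref in half_mappings.items():
        # Strip only the trailing /a or /b that split_read() added.
        read_name = half_name.rsplit('/', 1)[0]
        refs_by_read[read_name].add(ref)
    return sorted(name
                  for name, refs in refs_by_read.items()
                  if len(refs) > 1)


def covered_fraction(depths, min_depth=10):
    """Fraction of reference positions with at least min_depth reads.

    depths: list of read depths, one per position. min_depth is an
    arbitrary placeholder threshold.
    """
    if not depths:
        return 0.0
    return sum(1 for depth in depths if depth >= min_depth) / len(depths)
```

A genotype with a low `covered_fraction` or a lot of discordant reads could then be dropped during remap, although suitable thresholds would have to come from comparing the 2x150 and 2x250 runs described below.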

Here's the full description of the problems from Chanson:

The three sets of runs are:
Run 1: 9-Sep-2016.M04401 (2x150) and 9-Sep-2016.M01841 (2x250)
Run 2: 9-Dec-2016.M04401 (2x150) and 9-Dec-2016.M01841 (2x250)
Run 3: 16-Dec-2016.M04401 (2x150) and 16-Dec-2016.M01841 (2x250)

  1. In general, there are more cases of some reads mapping to a "wrong" second genotype in the 2x150bp runs. So, there are more random "black" boxes in MiCall for the 2x150 that are grey in 2x250.
    Run1: 73265A (NS3 - GT1a), 73266A (NS3, NS5a, NS5b - GT1b), 73270A (NS3 - GT1b; here, the 2x250 is the one being stupid)
    Run3: 73512A (NS3, NS5a, NS5b - GT1b)
  2. There is one case where the 2x150bp reads map to the wrong genotype
    and fail to produce full coverage across the gene, while the 2x250
    reads map to the correct genotype and pass
    Run2: E105648 MIDI (NS5b): 2x250 correctly maps to GT1a, 2x150 maps to
    GT4r and leaves a huge hole in the middle
  3. One huge difference in AA frequency, possibly due to reads mapping to
    the wrong seed (see item 1 above)?
    Run1: 73265A (NS3-GT4a): 2x150 claims 99% "E" at codon 570, 2x250 says
    it's 99% "K"
  4. One huge difference in AA frequency due to indels in one of the
    sample-specific consensuses (2x150 is right here)
    Run1: 73265A (NS5a-GT4a): The 2x250 result claims that there is an
    insertion after codon 234, and a corresponding (partial) deletion at
    codon 238. The 2x150 has neither of these
  5. There are a few cases of large-ish differences in AA% between runs,
    but these are mostly due to low coverage overall. One minor exception:
    Run2: 73491A (NS5a-GT1a): Codon 219 is 56% A, 44% G in 2x150, but 35% A,
    65% G in 2x250. I'm not too worried about this one.
