GenotypeQuality
How the HaplotypeCaller's reference confidence model works:
https://software.broadinstitute.org/gatk/documentation/article.php?id=4042
GenotypeLikelihoods:
https://software.broadinstitute.org/gatk/documentation/article.php?id=4442
What is gVCF:
https://software.broadinstitute.org/gatk/documentation/article.php?id=4017
TODO:
- check out the Avocado and Guacamole projects
- check out GATK Spark functionality
AVOCADO: For highest accuracy, Avocado is run as a two-phase tool. In the first phase, we reassemble or realign our reads around INDEL variants. In the second phase, we apply a probabilistic model built around a biallelic model to the reads to identify variants.
Our approach does not rely on the input reads being sorted, and as such, is not unduly impacted by variations in coverage across the genome. This point is critical in a parallel approach, as coverage can vary dramatically across the genome.
We then use Apache Spark’s reduceByKey functionality to compute the number of times each variant was observed with high quality. We do this to discard sequence variants that were observed in a read but represent a sequencing error rather than a true variant. (Why don't they filter out such reads right away?) https://www2.eecs.berkeley.edu/Pubs/TechRpts/2017/EECS-2017-204.pdf [chapter 7]
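To make the counting step above concrete, here is a minimal plain-Python sketch of a reduceByKey-style aggregation: each observed variant is keyed by its site and alleles, high-quality observations are counted per key, and keys with too little support are dropped. This is not Avocado's code. The `Observation` type, the `MIN_QUALITY` and `MIN_OBSERVATIONS` thresholds, and the function names are all hypothetical and chosen only for illustration.

```python
from collections import Counter
from typing import Iterable, NamedTuple, Set, Tuple

# Hypothetical record for one variant observed in one read (not an Avocado type).
class Observation(NamedTuple):
    contig: str
    position: int
    ref: str
    alt: str
    quality: int  # assumed Phred-scaled quality of this observation

VariantKey = Tuple[str, int, str, str]

MIN_QUALITY = 20      # assumed threshold for "high quality"
MIN_OBSERVATIONS = 2  # assumed minimum support needed to keep a variant

def count_high_quality(observations: Iterable[Observation]) -> Counter:
    """Count high-quality observations per variant key.

    Mirrors the shape of map-to-(key, 1) followed by reduceByKey(add).
    """
    counts: Counter = Counter()
    for obs in observations:
        if obs.quality >= MIN_QUALITY:
            counts[(obs.contig, obs.position, obs.ref, obs.alt)] += 1
    return counts

def keep_supported(counts: Counter) -> Set[VariantKey]:
    """Keep only variants seen with high quality often enough."""
    return {key for key, n in counts.items() if n >= MIN_OBSERVATIONS}

observations = [
    Observation("chr1", 100, "A", "G", 30),
    Observation("chr1", 100, "A", "G", 35),
    Observation("chr1", 100, "A", "G", 10),  # low quality, not counted
    Observation("chr1", 200, "C", "T", 40),  # seen only once, discarded
]

counts = count_high_quality(observations)
supported = keep_supported(counts)
```

The sketch also hints at one possible answer to the question in the note: a single read may carry both real variants and isolated errors, so support is judged per variant across reads rather than by discarding whole reads. That reading is an interpretation, not something the report states in the quoted passage.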