This page is a bit disorganized, but it collects links to papers and software relevant to how we call SNPs for population genomic analysis, in particular for pooled data.
Tajima's D in windows along a chromosome. Same population, same data, different SNP calling. #SundayWTF pic.twitter.com/oPcVDoac9X
— Jeffrey Ross-Ibarra (@jrossibarra) February 25, 2018
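To make concrete why the statistic in the tweet is so sensitive to SNP calling, here is a minimal sketch of windowed Tajima's D computed from per-site allele counts. It uses the standard Tajima (1989) constants and assumes individually genotyped chromosomes with the same sample size `n` at every site, so it is not a pooled-data estimator. The function names and the `(position, count)` input layout are illustrative assumptions, not taken from any tool listed on this page.

```python
import math


def tajimas_d(allele_counts, n):
    """Tajima's D for one window.

    allele_counts: derived (or minor) allele count at each called site.
    n: number of sampled chromosomes (assumed identical across sites).
    Returns NaN when the statistic is undefined.
    """
    seg = [k for k in allele_counts if 0 < k < n]
    S = len(seg)
    if S == 0 or n < 4:  # the variance term is zero for n < 4
        return float("nan")

    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / (i * i) for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / (a1 * a1)
    e1 = c1 / a1
    e2 = c2 / (a1 * a1 + a2)

    # Mean number of pairwise differences, summed over sites.
    pi = sum(2.0 * k * (n - k) / (n * (n - 1)) for k in seg)
    variance = e1 * S + e2 * S * (S - 1)
    if variance <= 0:
        return float("nan")
    return (pi - S / a1) / math.sqrt(variance)


def windowed_tajimas_d(sites, n, window):
    """sites: iterable of (position, allele_count). Returns {window_start: D}."""
    bins = {}
    for pos, k in sites:
        bins.setdefault((pos // window) * window, []).append(k)
    return {start: tajimas_d(ks, n) for start, ks in sorted(bins.items())}
```

A caller that drops or adds low-frequency variants changes both the number of segregating sites and the mean pairwise difference, so windows dominated by singletons come out negative while windows dominated by intermediate-frequency sites come out positive.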
Drosophila Genome Nexus and PopFly, i.e. Drosophila genomic data to play with or use for comparison.
Drosophila Genome Nexus. Links to over 1000 Drosophila melanogaster genomes and, importantly, the variants detected among them! Here is the link to the file.
PopFly. The awesome interactive browser for population genomics using the Drosophila Genome Nexus (including subsets).
General tools useful for many large sequencing experiments (i.e. for dealing with .fastq, .SAM, .BAM or .vcf files)
seqtk. Sequence toolkit. Converting file formats, extracting certain reads or parts, etc.
samtools. In addition to examining SAM and BAM files (flagstat, view, stats), useful for converting, filtering, etc.
bcftools. A companion in many ways to samtools.
vcftools. Compare, edit and convert vcf files, etc.
bedtools. For extracting certain regions/intervals, etc.
Also see the markdown file on sequence QC.
NGSCheckMate: software for validating sample identity in next-generation sequencing studies within and across data types. The GitHub repo is here.
Blog post reviewing different alignment viewers:
tview. See samtools.
highlander. Also does variant calling.
ving. In R/Bioconductor.
Evaluation of variant detection software for pooled next-generation sequence data. A couple of important points from this paper. First, GATK works well, but takes a long time even with small pools, and fails with large pools like the ones we have used. While VarScan has low false-positive rates, its sensitivity in detecting true SNPs was VERY LOW. CRISP and LoFreq do pretty well, but CRISP seems to edge out LoFreq for pools with larger numbers of samples. Everything is based on default parameters, though, so be aware.
Best Practices for data preparation for calling SNPs. Some parts are relevant (i.e. map to reference, mark duplicates, realign around small indels etc.). See this
FreeBayes. Seems to have a good option for pooled data.
VarScan. Seems like it has an active community.
CRISP. Comprehensive Read Analysis for Identification of SNVs (and short indels) from Pooled sequencing data. While the original method is from 2010, it seems (based on the GitHub repo) to still be in active use? Not clear. It sounds like compiling it with the new version of SAMtools may not work.
SHORE. Which is apparently a useful pipeline for pooled data as well.
snape. Not such an active community?
SNPeff. Genetic variant annotation and effect prediction toolbox. Not for identifying SNPs, but for annotating them.
I have not included ANGSD as I do not think it works on pooled data. However, it looks like it may be a useful pipeline for variant calling and simple population genomic analysis.
Similarly, while GATK is (?) one of the "industry standards" for variant calling, in our hands we continue to have major issues with variant calling on large pools. However, papers from Molly Burke, Tony Long, etc. all seem to have used it successfully with pooled data, so we need to get in contact with them to see if it is just an issue of some parameter differences.
Suitability of Different Mapping Algorithms for Genome-wide Polymorphism Scans with Pool-Seq Data
PoPoolation. Estimating basic evolutionary parameters from pooled data. Lots of options for doing this in some of the tools above, or in R as well.
CLEAR. This is a new method (and associated software) for longitudinal sequence data from E&R type experiments, designed for the kinds of experiments we employ. The paper is on bioRxiv here.
SNPGenie: estimating evolutionary parameters to detect natural selection using pooled next-generation sequencing data. Link to code on github here
Genotype-Frequency Estimation from High-Throughput Sequencing Data. Link to software is here.
correcting biases in allele frequency estimation with some machine learning approaches
LDx. Estimation of Linkage Disequilibrium from High-Throughput Pooled Resequencing Data.
Pool-hmm. A Python program for estimating the allele frequency spectrum and detecting selective sweeps from next generation sequencing of pooled samples.
Nest. Simulate allele frequency trajectories AND estimate N_e for (from?) Pool-seq time series data.
PoolSeq. Analyze and simulate time series Pool-seq data (E&R). Allows estimation of N_e and quantification of s, as well as dominance.
PoPoolation2. Comparison of allele frequencies among populations or treatments (BSA) from pooled sequence data. This has fairly limited applications as it really just uses a CMH test for multidimensional contingency tables. CLEAR and PoolSeq are better options now, I think.
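For readers unfamiliar with the CMH test mentioned for PoPoolation2, here is a minimal sketch of the Cochran-Mantel-Haenszel statistic for a stack of 2x2 allele-count tables (one table per replicate, rows as two populations or treatments, columns as the two alleles). It omits the continuity correction and is not PoPoolation2's implementation; the function name and table layout are assumptions made for illustration.

```python
import math


def cmh_test(tables):
    """Cochran-Mantel-Haenszel test across replicate 2x2 tables.

    tables: list of ((a, b), (c, d)) read counts, e.g. row 1 is
    (allele 1, allele 2) in the control pool and row 2 the same in
    the selected pool, one table per replicate.
    Returns (chi-square statistic with 1 df, p-value), with no
    continuity correction.
    """
    diff = 0.0
    var = 0.0
    for (a, b), (c, d) in tables:
        n = a + b + c + d
        if n < 2:
            continue  # stratum carries no information
        diff += a - (a + b) * (a + c) / n
        var += (a + b) * (c + d) * (a + c) * (b + d) / (n * n * (n - 1))
    if var == 0:
        return float("nan"), float("nan")
    stat = diff * diff / var
    # Upper tail of a chi-square with 1 df: P(Z^2 > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value
```

The test only asks whether allele counts differ consistently in the same direction across replicates at a single site, which is why it says nothing about allele frequency trajectories over time.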
(https://www.ncbi.nlm.nih.gov/pubmed/23730833)
Next Generation Sequencing of Pooled Samples: Guideline for Variants' Filtering
several library prep strategies give similar allele frequencies
A new (2017) paper showing how number of individuals and among individual contribution to pools influences allele frequencies. Link Here
Population-genetic inference from pooled-sequencing data. The paper "Genotype-Frequency Estimation from High-Throughput Sequencing Data" uses the method discussed here I think.
The Power to Detect Quantitative Trait Loci Using Resequenced, Experimentally Evolved Populations of Diploid, Sexual Organisms. Power analyses for E&R experiments with pooled data.
The Marth Lab seems to have lots of new tools in development for variant detection, etc.
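As a rough illustration of the point above that the number of individuals in a pool influences allele frequency estimates, here is a minimal sketch of the sampling variance of a pooled frequency estimate under two-stage binomial sampling (chromosomes into the pool, then reads from the pool). It assumes every individual contributes equally to the pool and ignores sequencing error, so unequal contributions would inflate the variance beyond what this gives. The function and parameter names are illustrative assumptions.

```python
def pooled_freq_variance(p, n_diploid, coverage):
    """Variance of a Pool-seq allele frequency estimate.

    p: true population allele frequency.
    n_diploid: number of diploid individuals in the pool.
    coverage: number of reads covering the site.
    Assumes equal contribution of individuals and no sequencing error.
    """
    chrom = 2 * n_diploid
    # Var = p(1-p) * [1/(2n) + (1 - 1/(2n)) / C]
    return p * (1 - p) * (1.0 / chrom + (1.0 - 1.0 / chrom) / coverage)


def effective_sample_size(n_diploid, coverage):
    """Number of independent chromosomes giving the same variance."""
    chrom = 2 * n_diploid
    return 1.0 / (1.0 / chrom + (1.0 - 1.0 / chrom) / coverage)
```

Under these assumptions, raising coverage far above the number of chromosomes in the pool buys very little, because the effective sample size can never exceed `2 * n_diploid`.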