Just some links to software, tools etc. to help analyse pooled sequence data.

This page is a bit disorganized, but it contains links to papers and software relevant to how we go about calling SNPs for population genomic analysis, in particular for pooled data.

This is why we need to obsess about how we map reads, filter, indel re-align and call SNPs...

Tajima's D in windows along a chromosome. Same population, same data, different SNP calling. #SundayWTF pic.twitter.com/oPcVDoac9X

— Jeffrey Ross-Ibarra (@jrossibarra) February 25, 2018

Drosophila Genome Nexus and PopFly. i.e. Drosophila genomic data to play with or use for comparison.

Drosophila Genome Nexus. Links to over 1000 Drosophila melanogaster genomes and, importantly, the variants detected among them! Here is the link to the file.

PopFly. The awesome interactive browser for population genomics using the Drosophila Genome Nexus (including subsets).

General tools useful for many large sequencing experiments (i.e. for dealing with .fastq, .SAM, .BAM or .vcf)

seqtk. Sequence tool kit. Converts file formats, extracts certain reads or parts of reads, etc.

samtools. In addition to examining SAM and BAM files (flagstat, view, stats), useful for converting, filtering, etc.

bcftools. A companion in many ways to samtools.

vcftools. Compare, edit, and convert VCF files, etc.

bedtools. For extracting certain regions/intervals, etc.

Generally useful tools for sequence QC

Also see the markdown file on sequence QC.

NGSCheckMate: software for validating sample identity in next-generation sequencing studies within and across data types. github repo is here

Viewing mapped reads, polymorphisms etc (for SAM, BAM, VCF etc).

Blog post reviewing different alignment viewers

tview (see samtools)

IGV

tablet

IGB

highlander. Also does variant calling.

pybamview

readXplorer

ving. In R/Bioconductor.

alview

seqmonk

Identifying polymorphisms (Software that can work well for pooled data).

Evaluation of variant detection software for pooled next-generation sequence data. A couple of important points from this paper. First, GATK works well, but takes a long time even with small pools, and fails with large pools like the ones we have used. While VarScan has low FP rates, its sensitivity in detecting true SNPs was VERY LOW. CRISP and LoFreq do pretty well, but CRISP seems to edge out LoFreq for pools with larger numbers of samples. Everything is based on default parameters though, so be aware.
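When benchmarking callers ourselves (e.g. on simulated pools where the true variants are known), the two quantities discussed above can be computed directly from site lists. This is a minimal sketch, not the evaluation procedure used in the paper: the function name `caller_performance` and the toy site lists are made up for illustration, and it reports the false discovery proportion (false positives among all calls), which may differ from how the paper defines its FP rate.

```python
def caller_performance(truth, called):
    """Compare a set of called sites against a set of known true sites.

    Sites are hashable keys such as (chromosome, position).
    Returns counts plus sensitivity and false discovery proportion.
    """
    truth, called = set(truth), set(called)
    tp = len(truth & called)   # true variants that were called
    fp = len(called - truth)   # calls that are not true variants
    fn = len(truth - called)   # true variants that were missed
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    fdr = fp / (tp + fp) if (tp + fp) else float("nan")
    return {"tp": tp, "fp": fp, "fn": fn,
            "sensitivity": sensitivity, "fdr": fdr}


# Toy example (hypothetical sites, not real data)
truth = {("2L", 100), ("2L", 250), ("2L", 900), ("3R", 42)}
called = {("2L", 100), ("2L", 900), ("3R", 77)}
result = caller_performance(truth, called)
print(result)
```

A caller with a low false discovery proportion but low sensitivity (the VarScan pattern described above) would show up here as few `fp` but many `fn`.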

Best Practices for data preparation for calling SNPs. Some parts are relevant (i.e. map to reference, mark duplicates, realign around small indels etc.). See this

FreeBayes. Seems to have a good option for pooled data.

VarScan. Seems like it has an active community.

CRISP. Comprehensive Read Analysis for Identification of SNVs (and short indels) from Pooled sequencing data. While the original method is from 2010, it seems (based on the github repo) to still be in active use? Not clear. It sounds like compiling it with the new version of SAMtools may not work.

LoFreq

SHORE. Which is apparently a useful pipeline for pooled data as well.

snape. Not such an active community?

SNPeff. Genetic variant annotation and effect prediction toolbox. Not for identifying SNPs, but for annotating them.

I have not included ANGSD as I do not think it works on pooled data. However it looks like it may be a useful pipeline for variant calling and simple population genomic analysis.

Similarly, while GATK is (?) one of the "industry standards" for variant calling, in our hands we continue to have major issues with variant calling on large pools. However, papers from Molly Burke, Tony Long, etc. all seem to have used it successfully with pooled data, so we need to get in contact with them to see if it is just an issue of some parameter differences.

Using two different mapping tools and comparing VCFs

Suitability of Different Mapping Algorithms for Genome-wide Polymorphism Scans with Pool-Seq Data
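As a rough way to see how much two mapping pipelines agree, the variant sites in their VCFs can be compared as sets. This is a minimal sketch, assuming plain uncompressed VCF text; the helper names (`vcf_sites`, `compare_site_sets`) and the toy records are hypothetical. For real data a dedicated tool (e.g. `bcftools isec`) is likely the better choice.

```python
def vcf_sites(lines):
    """Collect (chrom, pos, ref, alt) keys from an iterable of VCF lines."""
    sites = set()
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header lines
        fields = line.rstrip("\n").split("\t")
        chrom, pos, ref, alts = fields[0], int(fields[1]), fields[3], fields[4]
        for alt in alts.split(","):  # one key per ALT allele
            sites.add((chrom, pos, ref, alt))
    return sites


def compare_site_sets(a, b):
    """Summarize overlap between two sets of variant sites."""
    shared, union = a & b, a | b
    return {
        "shared": len(shared),
        "only_a": len(a - b),
        "only_b": len(b - a),
        "jaccard": len(shared) / len(union) if union else float("nan"),
    }


# Toy records standing in for VCFs from two mappers (not real data)
header = ["##fileformat=VCFv4.2",
          "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO"]
mapper_a = header + ["2L\t100\t.\tA\tT\t50\tPASS\t.",
                     "2L\t250\t.\tG\tC\t40\tPASS\t.",
                     "3R\t42\t.\tC\tT,G\t60\tPASS\t."]
mapper_b = header + ["2L\t100\t.\tA\tT\t55\tPASS\t.",
                     "3R\t42\t.\tC\tT\t58\tPASS\t.",
                     "3R\t77\t.\tG\tA\t30\tPASS\t."]

summary = compare_site_sets(vcf_sites(mapper_a), vcf_sites(mapper_b))
print(summary)
```

With files on disk, the same functions can be used as `vcf_sites(open(path))`. Sites found by only one mapper are the ones worth a closer look in a viewer.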

Allele frequencies and basic evolutionary parameters.

PoPoolation. Estimating basic evolutionary parameters from pooled data. Lots of options for doing this in some of the tools above, or in R as well.
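To make "basic evolutionary parameters" concrete, here is a naive sketch of allele frequency and nucleotide diversity (pi) from read counts at biallelic sites. This is not PoPoolation's estimator, which also corrects for pool size and sequencing error; it only applies the usual n/(n-1) correction for the number of reads. The function names and the counts are made up for illustration.

```python
def site_heterozygosity(ref_count, alt_count):
    """Expected heterozygosity at one biallelic site from read counts.

    Applies the n/(n-1) finite-sample correction for read depth n.
    Returns None when depth is too low to estimate anything.
    """
    n = ref_count + alt_count
    if n < 2:
        return None
    p = alt_count / n  # naive allele frequency estimate
    return 2 * p * (1 - p) * n / (n - 1)


def window_pi(counts, window_length):
    """Per-site nucleotide diversity for a window.

    counts: list of (ref_count, alt_count) for sites in the window.
    window_length: number of sites with usable coverage in the window.
    """
    values = (site_heterozygosity(r, a) for r, a in counts)
    return sum(h for h in values if h is not None) / window_length


# Toy window: three covered variant positions in 1000 usable bp
counts = [(30, 10), (20, 20), (40, 0)]
pi = window_pi(counts, window_length=1000)
print(pi)
```

Because the estimate depends directly on which sites make it into `counts`, differences in SNP calling and filtering feed straight through to window-based statistics.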

CLEAR. This is a new method (and associated software) for longitudinal sequence data from E&R type experiments, which is designed for the kinds of experiments we employ. The paper is on bioRxiv here.

SNPGenie: estimating evolutionary parameters to detect natural selection using pooled next-generation sequencing data. Link to code on github here

Genotype-Frequency Estimation from High-Throughput Sequencing Data. Link to software is here.

correcting biases in allele frequency estimation with some machine learning approaches

LDx. Estimation of Linkage Disequilibrium from High-Throughput Pooled Resequencing Data

Pool-hmm. A Python program for estimating the allele frequency spectrum and detecting selective sweeps from next generation sequencing of pooled samples.

Nest. Simulate allele frequency trajectories AND estimate N_e for (from?) Pool-seq time series data.

PoolSeq. Analyze and simulate time series Pool-seq data (E&R). Allows estimation of N_e and quantification of s, as well as dominance.
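For intuition about what these simulators do, here is a minimal Wright-Fisher sketch of an allele frequency trajectory with selection (s), dominance (h), drift (N_e), and binomial read sampling at a given coverage. It is independent of both Nest and PoolSeq and does not reproduce their interfaces or models; the function names and parameter values are assumptions chosen for illustration.

```python
import random


def binomial(n, p, rng):
    """Draw a binomial(n, p) count using n Bernoulli trials."""
    return sum(1 for _ in range(n) if rng.random() < p)


def simulate_trajectory(p0, n_e, generations, s=0.0, h=0.5, seed=None):
    """Simulate true allele frequencies under a diploid Wright-Fisher model.

    Genotype fitnesses are 1+s (AA), 1+h*s (Aa), and 1 (aa).
    Returns a list of frequencies of length generations + 1.
    """
    rng = random.Random(seed)
    p = p0
    trajectory = [p0]
    for _ in range(generations):
        q = 1 - p
        mean_fitness = p * p * (1 + s) + 2 * p * q * (1 + h * s) + q * q
        # Frequency after selection, before drift
        p_sel = (p * p * (1 + s) + p * q * (1 + h * s)) / mean_fitness
        # Drift: sample 2 * N_e gene copies for the next generation
        p = binomial(2 * n_e, p_sel, rng) / (2 * n_e)
        trajectory.append(p)
    return trajectory


def pool_sample(trajectory, coverage, seed=None):
    """Add read-sampling noise: observed frequencies at a fixed coverage."""
    rng = random.Random(seed)
    return [binomial(coverage, p, rng) / coverage for p in trajectory]


true_freqs = simulate_trajectory(p0=0.2, n_e=500, generations=20,
                                 s=0.05, seed=1)
observed = pool_sample(true_freqs, coverage=100, seed=2)
print(true_freqs[-1], observed[-1])
```

Running this for many replicates with `s=0` gives a feel for how much frequency change drift and read sampling alone can produce, which is the baseline any test for selection has to beat.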

PoPoolation2. Comparison of allele frequencies among populations or treatments (BSA) from pooled sequence data. This has fairly limited applications as it really just uses a CMH test for multidimensional contingency tables. CLEAR and PoolSeq are better options now, I think.
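For reference, the Cochran-Mantel-Haenszel (CMH) statistic for a stack of 2x2 tables (one table per replicate, rows = treatments, columns = allele counts) can be sketched as follows. This is a generic textbook implementation, not PoPoolation2's code; the function name `cmh_test` and the counts are hypothetical. The continuity correction is only applied when the summed deviation is at least 0.5, which we believe matches the behaviour of R's `mantelhaen.test`.

```python
import math


def cmh_test(tables, continuity=True):
    """CMH test for K 2x2 tables given as (a, b, c, d) tuples.

    a, b: ref and alt counts in treatment 1
    c, d: ref and alt counts in treatment 2
    Returns (statistic, p_value); the statistic is chi-square with 1 df.
    """
    delta = 0.0
    variance = 0.0
    for a, b, c, d in tables:
        n = a + b + c + d
        if n < 2:
            continue  # stratum carries no information
        row1, row2 = a + b, c + d
        col1, col2 = a + c, b + d
        delta += a - row1 * col1 / n  # observed minus expected for cell a
        variance += row1 * row2 * col1 * col2 / (n * n * (n - 1))
    if variance == 0:
        return float("nan"), float("nan")
    correction = 0.5 if continuity and abs(delta) >= 0.5 else 0.0
    statistic = (abs(delta) - correction) ** 2 / variance
    # Survival function of chi-square with 1 df
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


# Toy allele counts at one SNP for three replicate pairs (not real data)
replicates = [(60, 40, 45, 55), (58, 42, 40, 60), (62, 38, 47, 53)]
statistic, p_value = cmh_test(replicates)
print(statistic, p_value)
```

Note that the test treats read counts as independent observations, so it ignores the extra variance from pooling and drift; that is part of why the time-series methods above are attractive.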

Some reasons to be optimistic (or not) about pooled sequencing.

Validation of SNP allele frequencies determined by pooled next-generation sequencing in natural populations of a non-model plant species

(https://www.ncbi.nlm.nih.gov/pubmed/23730833)

another paper demonstrating pretty good concordance between pool seq and individual sequencing for allele frequencies

Next Generation Sequencing of Pooled Samples: Guideline for Variants' Filtering

Estimation of population allele frequencies from next-generation sequencing data: pool-versus individual-based genotyping

several library prep strategies give similar allele frequencies

A new (2017) paper showing how number of individuals and among individual contribution to pools influences allele frequencies. Link Here

And some reasons not to be

Population-genetic inference from pooled-sequencing data. The paper "Genotype-Frequency Estimation from High-Throughput Sequencing Data" uses the method discussed here I think.

The Power to Detect Quantitative Trait Loci Using Resequenced, Experimentally Evolved Populations of Diploid, Sexual Organisms. Power analyses for E&R experiments with pooled data.

Assorted other things

The Marth Lab seems to have lots of new tools in development for variant detection etc..

A list of file formats that we care about