Skip to content
Bárbara Bitarello edited this page Jul 23, 2021 · 14 revisions

FAQs

Do I really need to run simulations to use NCD?

There are (at least) two ways to use genome-wide scans for natural selection. Generally, you have a statistic of interest (e.g., NCD). You want to know whether your species of interest has any genomic windows that are suggestive of balancing selection. Below I describe two frequently used approaches:

1) Outlier approach.

You calculate the statistic of interest for windows across the genome and decide on an arbitrary cut-off below/above which you consider the value 'significant'. What you are getting with this is an arbitrary set of the 'lowest p-value' windows in your genome of interest, i.e, how uncommon is this pattern you observe in the outlier windows.

Pros:

-no need for simulations. In the absence of demographic information this might be crucial.

-you will always have outlier windows, by definition

-it is a quick approach for, say, gene-set enrichment analyses such as the one done in this MBE paper

Cons:

-the cutoff is arbitrary

-no information of amount of false positives amongst the 'outlier' windows

-because outliers exist by definition, it follows that even in the complete absence of natural selection you would still have outlier windows.

Advice: -if possible, run some simulations (see below)

-if you can't run simulations, use the results carefully. Resist the urge to say that the outlier windows are 'targets of balancing selection'

-as a sanity check, verify that your outlier windows don't have some other property (unrelated to natural selection) that might explain their ranking

2) Simulation-based approach

You calculate the statistic of interest for windows across the genome and run (neutral) simulations using demographic parameters that mimic the demographic history of your species of interest. Then, you use a cutoff based on what you observed in the simulations. For example, if the lowest NCD value you observe in the simulations is X then you could apply a cutoff of X to your empirical scan. What you are getting with this is a set of windows that is unusual, or unlikely to occur in the absence of selection. IF you want to be even more thorough you could also run simulations with selection, and that would allow you to do power analyses for the test in your species of interest.

Pros:

-non-arbitrary

-tells you how unexpected your empirical values are

-allows you to explore and rule out possible confounding factors

Cons:

-highly dependent on the model and some of its parameters

-more time consuming

Advice:

-make sure your model parameters allow you to recreate important patterns used in the statistics. E.g. for NCD2 that's the SFS and the number of fixed differences with a sister species

-usually the most important parameters for NCD2 would be the divergence time from the sister species and a reasonable ballpark for historical effective population size

Bottom line:

Simulations allow for a less arbitrary approach, while outlier approaches are free from demographic model assumptions. If at all possible, you should run neutral simulations to have a sense of the statistic's behavior in your species of interest, and use that to inform your cutoff approach. A combined approach, as described in our paper, is a great choice.

I want to use NCD for my species of interest but I don't know how to run simulations. Could you share the msms commands you used in the paper?

Yes! Here they are.

Is NCD a summary statistic? Why not use a more elaborate model, or machine learning?

Yes, NCD statistics quantify summary aspects of the data. As with every summary statistic, some information is lost. In practice, we have found that for long-term balancing selection, extending hundreds of thousands to millions of years, NCD performs remarkably well without the need for complicated modeling and/or frameworks. For recent balancing selection, however (less than 1 mya), we recommend not using NCD, but instead perhaps a machine learning approach like the one described here.

Do I need to have fixed differences (i.e, number of substitutions between my species of interest and a sister species) to run NCD?

NCD encompasses two tests described in our paper: NCD1 and NCD2. The well-powered test that we recommend and discuss in-depth in the paper is NCD2. NCD2's strength comes from combining two hallmark signatures of long-term balancing selection: an abundance of alleles segregating at intermediate frequency and an excess of polymorphic sites compared to fixed differences (or substitution) sites (see figure 3 here). In the paper, we looked at humans, and the sister species used was chimpanzees. In the absence of fixed differences information, NCD2 becomes NCD1 and looks only at the site-frequency spectrum. For very long-term selection (5 my) and for extremely intermediate frequencies (target frequency=0.5), NCD1 and NCD2 have similar performance, but for 3 my and other target frequencies (0.3, 0.4) NCD2 outperforms NCD1 substantially, which is why we recommend NCD2 if you are able to provide fixed differences info.

What is the target frequency? Can I just use 0.5?

Because NCD focuses on long-term balancing selection, we assume that allele frequencies have achieved an equilibrium. Our simulations of balancing selection in msms reflect that. Assuming, for example, a bi-allelic locus under symmetric overdominance (heterozygote advantage), both homozygotes have equal relative fitness. The frequency equilibrium is then 0.5. Any deviation from such perfect symmetry (i.e, one homozygotes has high relative fitness than the other) will lead to frequency equilibria $\neq$0.5. That is why we also explored other frequency equilibria (0.4, 0.3) and verified if we hadpower to detect those instances when using a target frequency of 0.5 in the NCD equation. Under ideal scenarios (very long-term selection), NCD2(tf=0.5) does a good job at detecting instances where frequency equilibrium is 0.4 or even 0.3. So in principle you can just use 0.5. If you have performed simulations and want to do a more thorough analysis, you could also consider looking at 0.4 or 0.3, because matching target frequency and frequency equilibrium will always lead to less false positives (see table 2 in our paper).

Do I need to use msms? Can I use another simulator?

No, you don't. We used msms because it is a time-efficient coalescent simulator framework that also allows for selection to be included. We needed to be able to include selection for the power analyses. If you want to do this as well, I still recommend it. An alternative is SLIM (1,2,3). But if you just want to run neutral simulations you have many options, including the original ms, upon which msms was developed. If you do decide to use msms, here it is. There is also an R package that runs msms called coala.

Also, msms hasn't been updated in a long time, so you should consider using MSprime (if only simulating neutrality) or SLIM (if you want to include selection). This community-maintained library of population genetic models is amazing!.

I am still having trouble. Can I email you?

Yes, you can. Please be as precise as you can in your question. If there's some kind of R error, please send the script you're using. I cannot guarantee to reply immediately, though. In any case, I will probably add your question (without your name) to this wiki so that it benefits others. bbitarello [at] brynmawr [dot] edu