This is a Snakemake workflow for running SINGER (MCMC sampling of ancestral recombination graphs) in parallel, loosely based on the python wrappers included with the binary. The genome is discretized into chunks, and SINGER is run on each chunk with parameters adjusted to account for missing sequence and recombination rate heterogeneity. Chunks are merged into a single tree sequence per chromosome and MCMC replicate. Some diagnostic plots are produced at the end, that compare summary statistics to their expectations given the ARG topology. Pair coalescence rates are calculated from the tree sequences and plotted.
Please cite SINGER (https://doi.org/10.1101/2024.03.16.585351) and POLEGON (https://doi.org/10.1101/2025.02.14.638385) if you use this pipeline. Note that I'm not one of the authors of SINGER, and put this pipeline together for my own research, so caveat emptor.
Using git and mamba:
git clone https://github.com/nspope/singer-snakemake my-singer-run && cd my-singer-run
mamba env create -f environment.yaml
mamba activate singer-snakemake-env
snakemake --cores=20 --configfile=config/example_config.yamlThe input files for each chromosome are:
- <chromosome_name>.vcf.gz gzip'd VCF that can be used as SINGER input, either diploid and phased or haploid with an even number of samples. This should represent an unbiased sample of segregating variation in the population (no ascertainment bias).
- <chromosome_name>.mask.bed (optional) bed file containing inaccessible intervals. These are intervals that a priori could not be sequenced reliably. For example, these might be structural variants that are missing in some of the samples. Post-hoc filtered SNPs should go in a separate file, described below.
- <chromosome_name>.hapmap (optional) recombination map in the format described in the documentation for
msprime.RateMap.read_hapmap(see here) - <chromosome_name>.meta.csv (optional) csv containing metadata for each sample in the VCF, that will be inserted into the output tree sequences. The first row should be the field names, with subsequent rows for every sample in the VCF. The column pointed to by the
stratify-byargument in the config will be used to fill out population information in the tree sequence. - <chromosome_name>.filter.txt (optional) text file containing 1-based positions of omitted SNPs (one per line). These may be present in the VCF (in which case they are removed by the pipeline) or not (in which case they are still used to calculate an adjustment to mutation rate, see below). For example, these might be SNPs that could not be called accurately in some samples, or where the ancestral state could not be reliably determined.
see the examples in example/* (generated by resources/scripts/simulate-data.py).
It's a good idea to run chromosome arms separately, so that the chunks flanking
the centromere don't have long stretches of missing data.
A template for the configuration file is in configs/example_config.yaml:
# --- example_config.yaml ---
input-dir: "example" # directory with input files per chromosome, see README
output-dir: "results" # directory where to place output
chromosomes: null # a list of chromosomes to use (e.g. "chr1"), if `null` use all `vcf.gz` in `input-dir`
mutation-rate: 1.0e-8 # per base per generation mutation rate
recombination-rate: 1.0e-8 # per base per generation recombination rate, ignored if hapmap is present
polarised: True # are variants polarised so that the reference state is ancestral
stratify-by: "population" # column in metadata to use for population table in tree sequence (or `null`)
random-seed: 1 # random seed
# --- discretization options ---
max-chunk-size: 1.0e6 # target size in base pairs for each chunk run through SINGER
min-chunk-size: 0.1e6 # discard chunks smaller than this (e.g. isolated between gaps)
min-gap-size: 1.0e6 # minimum size in base pairs for a masked interval to be converted to a gap
max-missing: 0.8 # ignore chunks with more than this proportion of masked sequence or filtered variants
min-sites: 500 # ignore chunks with fewer than this number of accessible polymorphic sites
min-branch-length: 1.0e-7 # minimum possible branch length in generations in tree sequence
# --- SINGER options ---
singer-mcmc-samples: 100 # number of MCMC samples (each sample is a tree sequence)
singer-mcmc-thin: 10 # thinning interval between MCMC samples
singer-mcmc-burnin: 0.5 # proportion of initial samples discarded when computing plots of statistics
singer-mcmc-resumes: 1000 # maximum number of times to try to resume MCMC on error
# --- POLEGON options ---
polegon-mcmc-samples: 100 # number of MCMC samples for POLEGON dating
polegon-mcmc-thin: 10 # thinning interval between MCMC samples for POLEGON dating
polegon-mcmc-burnin: 0.5 # proportion of initial samples trimmed from posterior mean for POLEGON dating
polegon-scaling-reps: 5 # number of timescale rescaling reps after MCMC sampling for POLEGON dating
polegon-max-step: 10.0 # step size used in proposal for POLEGON dating
# --- summary statistics ---
stats-time-grid: [1, 7, 25] # [log10 start, log10 stop, number of intervals] for coalescence rates, etc
stats-window-size: 5.0e4 # size of genomic windows in base pairs for comparing expected vs observedTODO: this is out-of-date
Missing data must be taken into account for the ages of nodes in the ARG to be on the correct timescale. For example, if the sequence has a large amount of missingness, an unadjusted molecular clock will result in estimates of node (ancestor) ages that are substantially younger than the truth (because fewer mutations implies less branch area). Hence, one way to account for missing data is to locally adjust the mutation rate.
There are two sorts of missingness that are relevant: inaccessible intervals (e.g. highly repetitive regions where variants cannot be called) and filtered variants (e.g. "real" variants that are removed for one reason or another, such as absence of ancestral state, missing genotypes, etc.). This pipeline adjusts the global mutation rate within each chunk,
mutation_rate[chunk] = (
mutation_rate *
(accessible_sequence_length[chunk] / total_sequence_length[chunk]) *
(retained_variants[chunk] / total_variants[chunk])
)
where the last term is calculated considering only those variants in accessible intervals.
This sort of crude adjustment works best when bases/variants are missing
completely at random (e.g. not too clustered along the sequence). For instance,
the figure below compares the average mutation ages between the true ARG (for
the data in example/*), and inferred ARGs where: (A) there is
no missing data; (B) around 60% of the sequence is masked in intervals of
average length 800bp, but the mutation rate is not adjusted; (C) the mutation
rate is adjusted to account for missing data using the scheme above.
In all three cases, expected and observed summary statistics are a good match;
the problem with (B) is that the bias introduced by missing data has been absorbed
into node ages rather than mutation rate.
An important consequence is that when calculating expectations of (linear) site
statistics from branch statistics in the ARG (e.g. mode='site' and
mode='branch' in tskit's stats methods), then the contribution from each
chunk must be weighted by the adjusted mutation rate and then summed across
chunks to get the correct values. Some examples of this sort of calculation
can be seen in workflow/scripts/tree_statistics.py. The adjusted mutation
rates are saved as an msprime.RateMap object (see below).
Similarly, to calculate sequence-wide distributions of coalescence time or other topological statistics, the contribution of each chunk should be weighted by the proportion of accessible sequence.
The output files for each chromosome will be generated in results/<chromosome_name>:
- trees/<chromosome_name>.<replicate>.tsz : a tree sequence MCMC replicate generated by SINGER, compressed by
tszip(e.g. usetszip.decompressto load) - <chromosome_name>.recomb_rate.p :
msprime.RateMapcontaining fine-scale recombination rates - <chromosome_name>.mut_rate.p :
msprime.RateMapcontaining fine-scale adjusted mutation rates (see description above) - <chromosome_name>.inaccessible.p :
msprime.RateMapcontaining fine-scale proportion of inaccessible bases (see description above) - <chromosome_name>.filtered.p :
msprime.RateMapcontaining proportion of filtered variants (see description above) - <chromosome_name>.chunks.p :
msprime.RateMapcontaining chunk boundaries (rate is zero if filtered) - <chromosome_name>.windows.p :
msprime.RateMapcontaining windowed statistics boundaries (rate is zero if filtered) - <chromosome_name>.vcf.stats.p : "observed values" for summary statistics (e.g. calculated from with
scikit-allel) - <chromosome_name>.vcf : filtered VCF used as input to SINGER
- chunks/* the raw SINGER output and logs
- plots/pair-coalescence-pdf.png, plots/cross-coalescence-pdf.png: pair coalescence time distribution for all samples and between strata (if supplied), with a thin line for each MCMC replicate and a thick line for the posterior mean
- plots/pair-coalescence-rates.png, plots/cross-coalescence-rates.png: pair coalescence rates for all samples and between strata (if supplied), with a thin line for each MCMC replicate and a thick line for the posterior mean
- plots/diversity-trace.png, plots/tajima-d-trace.png : MCMC trace for fitted nucleotide diversity and Tajima's D
- plots/diversity-scatter.png, plots/tajima-d-scatter.png : observed vs fitted nucleotide diversity and Tajima's D, across chunks
- plots/diversity-skyline.png, plots/tajima-d-skyline.png : observed and fitted nucleotide diversity and Tajima's D, across genome position
- plots/frequency-spectrum.png : observed vs fitted site frequency spectra, folded or unfolded depending on whether data is polarised
- plots/repolarised-trace.png : MCMC trace showing proportion of flipped ancestral states
- plots/mutational-load.png : number of derived mutations per haplotype (should be equal in expectation)
- plots/site-density.png : sanity check showing proportion of missing data, proportion variant bases (out of accessible bases), average recombination rate across chunks
- stats/<chromosome_name>.<replicate>.stats.p : "fitted values" for summary statistics (e.g. statistics from mutations simulated conditional on the ARG)
- stats/<chromosome_name>.<replicate>.coalrate.p : pair coalescence rates (e.g. inverse of haploid Ne) within logarithmic time bins, using all samples
- stats/<chromosome_name>.<replicate>.crossrate.p : cross coalescence rates within logarithmic time bins, between and within strata (e.g. populations) according to the
stratify-byoption in the config file
To check that the inference pipeline works (especially with a realistic pattern of missing data), the
Snakemake script workflow/validation.smk (config template at
config/validation_stdpopsim.yaml and config/validation_msprime.yaml) creates some visual comparisons for simulated data:
- Simulate from a given
stdpopsimordemesmodel with intervals and variants randomly masked; - Infer ARGs from the simulated data with SINGER;
- Calculate the distribution of mutation ages for each possible frequency (e.g. AFS bin), averaging over MCMC replicates;
- Smooth by projecting to a subsample (e.g. bins in a smaller AFS) by hypergeometric sampling;
- Plot alongside the true (simulated) mutation age distributions.
For example, from the human demographic model "OutOfAfrica_3G09" projected to a single diploid
in each of YRI, CEU, and CHB:
Where plot labels are (frequency in YRI, in CEU, in CHB), and the monomorphic bins
(0, 0, 0) and (2, 2, 2) exist because of down-projection (i.e. mutations
which have zero frequency in a subsample of two haploids, taken from some larger sample).

