Skip to content

NICHD-BSPC/termseq-peaks

Repository files navigation

Term-seq peak-caller

https://circleci.com/gh/NICHD-BSPC/termseq-peaks.svg?style=svg

Homepage: https://github.com/nichd-bspc/termseq-peaks

This tool was designed to call 3'-end peaks from term-seq data in bacteria, handling replicates in a statistically robust manner.

Installation

Since termseq-peaks relies on packages that are not in PyPI, the recommended way to install is to use conda and the bioconda channel.

If a full conda installation is not an option for you, it is possible to install via pip, but note that will result in the installation of an incompatible idr version. A workaround is to then uninstall idr via pip and reinstall it via conda (thanks to @Gr1m3y [#5](#5)).

First, ensure conda and bioconda are set up -- see the bioconda docs for details. You may want to install mamba into your base environment to speed up environment creation in general, see here <https://bioconda.github.io/faqs.html#how-do-i-speed-up-package-installation>_ for details.

Then, to use the development version:

git clone https://github.com/NICHD-BSPC/termseq-peaks
cd termseq-peaks

# if using mamba
mamba create -n termseq-peaks --file requirements.txt

# if using conda
conda create -n termseq-peaks --file requirements.txt

# activate new env that has all dependencies
conda activate termseq-peaks

# install into this env
python setup.py install

Once the conda package for termseq-peaks is in Bioconda, this README will be updated with instructions for installing everything via conda.

Usage:

termseq_peaks <bedgraphs> --peaks out.bed [additional options]

Background

This tool takes two novel approaches that together yield good results for calling precise peaks in term-seq data with biological replicates. These approaches are 1) a signal processing approach and 2) an implementation of multi-way IDR to handle >2 replicates.

For peak-calling, the venerable macs2 peak-caller [0] is the go-to method. However, macs2 is designed for ChIP-seq data and all the assumptions that go along with that (modeling peaks based on fragment size, controlling for open chromatin, comparing IP against a background input, etc). We found that naively applying macs2 to term-seq data resulted in suboptimal peak calls. Term-seq signal is composed of single-bp positions of read ends and has a very high dynamic range since it is coming from trancribed RNA that can have very high copy numbers per cell (in contrast to ChIP-seq which we have just 1 copy per haploid cell).

Furthermore, many peak-callers have limited support for multiple replicates. One general solution to this is to apply the Irreproducible Discovery Rate method (IDR), originally developed for the ENCODE project. By design, the IDR method only takes two replicates at a time.

Functions for post-processing are available. The functions assign and table_output are add-on functions to peaklib designed to filter and map the 3'-end peaks relative to annotated ORFs.

Usage

Prepare input

termseq-peaks required input is normalized signal bedGraphs for each replicate. If data are stranded, there should be separate files for each strand. Gzipped bedGraphs are supported automatically if the filename ends in .gz.

One way of doing this might be with deepTools bamCoverage. This example makes a bedGraph file out of unique reads on the minus strand (--samFlagInclude 16), uses 1-bp resolution (--binSize 1), only considers unique reads (--minMappingQuality 20), and uses only the first base of each read to build the signal (--Offset 1, as appropriate for a Term-seq library, for example).

bamCoverage \
  --bam rep1.bam \
  -o rep1_minus.bedgraph \
  --outFileFormat bedgraph \
  --binSize 1 \
  --Offset 1 \
  --minMappingQuality 20 \
  --samFlagInclude 16 \

Run

If we do this for each replicate's minus-strand reads, these bedGraphs can then be provided to termseq-peaks:

termseq-peaks rep1_minus.bedgraph rep2_minus.bedgraph peaks_minus --strand -

By default the output peaks_minus.bed will contain peaks falling below an IDR threshold of 0.05. The full oracle file (the leniently-called peaks on a bedgraph that merges all provided bedgraphs) will be output as peaks_minus.bed.oracle.narrowPeak.

These files can be used with IGV or other genome browsers to inspect the peaks alongside the input signals to assess the peak-calling performance.

For more help, run:

termseq-peaks -h

Algorithm

This tool takes multiple normalized bedGraph files representing the normalized signal for each replicate, and calls a set of consistent peaks at a provided IDR [1] cutoff.

  • Peaks are called using scipy.signal.find_peaks [2] with very lenient parameters to intentionally include both real peaks and noise. These peaks are called on each replicate.
  • The score for the peaks is the "prominence" value for each peak; see [2] for details.
  • For each unique pairwise combination of replicates, IDR routines from [1] are run, resulting in an output file containing merged peaks from those two files along with IDR values for each. In practice the tool stores these as temp files. The number of peaks falling below the IDR threshold is counted for each pairwise comparison. The minimum such number, N, across all pairwise combinations of replicates is used as the final number of peaks to select.
  • All bedGraphs are additionally merged together and peaks are similarly called on that merged signal to get the "oracle" peaks.
  • The oracle peaks are then ranked by their score and the top N peaks are selected as the final peaks. The scores in the final peaks are the scores from the oracle peaks, that is, the peak prominences from calling peaks on the merged bedGraphs.

Output

The find_peaks function returns various metrics. Here, we retrieve the prominence and the width. The prominence is the vertical distance between the peak and the lowest contour line, and the width is measured at half the prominence. See these documentation pages for a visualization of these metrics: prominences and widths.

Output files are in the narrowPeak format, which shows the peak width as well as the position of the summit. We report the prominence as the score as well as the signal value. The position of the peak is the 1-bp position of the prominence.

Caveats

The find_peaks function operates on 1-dimensional vectors, and so returns peak positions in terms of indexes into the input vectors. Internally, we interpolate to back-calculate the corresponding genomic coordinates and round to integers. This may potentially have issues where two peaks that are genomically far away have adjacent indexes (for example, if the intervening region has zero reads anywhere). Empirically we do not observe this to be an issue, but a solution would be to pad out the vector to include zeros at every position in the chromosome/plasmid (and increase RAM usage as a result).

The biggest downside currently is speed and RAM. This is not an issue for the small bacterial genomes the tool was designed for; it takes about 30s to run for E. coli data, and pandas DataFrames are used to store the signal. For larger eukaryotic genomes, parallelization across chromosomes may be required and substantial RAM may be required. This tool remains untested on larger genomes, but has worked quite well for term-seq in several bacterial genomes. Furthermore, since we need to perform IDR between all pairwise combinations of replicates, the running time scales as O(nreplicates^2).

Post-processing

Usage

Prepare input

Required assign input are:

  • strand-specific narrowPeak file, where each interval represents the full size of the detected peak. This can be the output of peaklib function.
  • strand-specific bigWigs corresponding to the narrowPeak files. These files might be generated with bamCoverage. I.e. for negative strand bigwig:
bamCoverage \
    --bam rep1.bam \
    -o rep1_minus.bw \
    --binSize 1 \
    --Offset 1 \
    --minMappingQuality 20 \
    --samFlagInclude 16 \
    --normalizeUsing RPKM

For positive strand bigwig, swap --samFlagInclude 16 for --samFlagExclude 16

  • annotation gtf file. The function assumes it contains the mRNAs, sRNAs, tRNAs and rRNAs.
  • genome fasta file
  • file containing the list of tRNAs and rRNAs names in the 1st column of a tab-separated file. Names can be exact or regex.

Required table_output input are:

  • strand-specific curated peaks assigned to ORFs in a tab-separated file. Typically this is the output of the assign function.
  • optional: opposite strand-specific curated peaks assigned to ORFs in a tab-separated file. Typically this is the output of the assign function.
  • optional: Kinefold output file corresponding to the one or both strand(s) curated peaks.

Run

The assign function can be run with:

termseq-peaks assign \
    --sample sample1_minus \
    --narrowPeak sample1_minus.narrowPeak \
    --bw sample1_minus-rep1.bw sample1_minus-rep2.bw \
    --fasta genome.fa \
    --gtf annotation.gtf \
    --trRNA trRNAs.tsv

The curated peaks assigned to an ORF can be found in the output file all.sample1_minus.tsv

The table_output function can be run with:

termseq-peaks table_output \
    --sample sample1 \
    --assigned all.sample1_minus.tsv \
    --assigned2 all.sample1_plus.tsv \
    --kinefold_scores kinefold_output.tsv \

The summary file is saved as sample1_TableS1.tsv.

Algorithm

Function assign:

  • Return a 1bp-coordinate narrowPeak file corresponding to the highest score coordinate within cluster distance
  • Assign peaks to particular classes:
    • primary: within 3'end of any ORF (mRNA, tRNA, rRNA, sRNA) included and param_down-bp downstream on the same strand AND has the highest readcount of all such peaks associated within the same region.
    • secondary: fulfills the above criteria with respect to location BUT is NOT the peak with the highest readcount.
    • antisense: located within param_antisense-bp upstream, downstream or in an ORF of the opposite strand.
    • internal: within an any ORF (mRNA, tRNA, rRNA, sRNA) coordinates, excluding the 3'end coordinate on the same strand.
    • orphan: not associated with any of the above categories. Peaks can have multiple classifications.
  • Also returns lists of peaks within param_upstart-bp upstream of start codon to param_downstart-bp downstream of start codon, and within param_upstop-bp downstream of start codon to the stop codon.

Function table_output:

  • Concatenate results from the function assign and optionally adds the Kinefold scores if provided.

References

About

Peak-calling for term-seq data in bacteria

Resources

License

Stars

Watchers

Forks

Packages

No packages published

Contributors 3

  •  
  •  
  •