This is a Python port of the `find_peaks` program originally written in Perl, and it now runs blazingly fast (~120 times faster than the original).
This program requires no additional Python packages and should, in theory, run on any Python version >= 3.6; however, Python 3.11 or above is highly recommended, as it brings roughly 15% performance improvement.
The program can be run from the command line with the following syntax:
```
usage: find_peaks.py [-h]
                     [--n N] [--fdr FDR] [--frac FRAC]
                     [--min_count MIN_COUNT] [--min_quant MIN_QUANT] [--step STEP]
                     [--unified_peaks {max,min}]
                     [--no_discard_zeros]
                     [--seed SEED]
                     files [files ...]

Simple FDR random permutation peak caller

positional arguments:
  files                 Input files in bedgraph or GFF format

options:
  -h, --help            show this help message and exit
  --n N                 Number of iterations
  --fdr FDR             False discovery rate value
  --frac FRAC           Fraction of random fragments to consider per iteration (0..1)
  --min_count MIN_COUNT
                        Minimum number of fragments to consider as a peak
  --min_quant MIN_QUANT
                        Minimum quantile for considering peaks
  --step STEP           Stepping for quantiles
  --unified_peaks {max,min}
                        Method for calling peak overlaps (two options):
                        'min': call minimum overlapping peak area
                        'max': call maximum overlap as peak
  --no_discard_zeros    Treat zero scores as non-empty reads in raw data
  --seed SEED           Random seed
```
- `N`: The number of random shuffle iterations to perform. The default is `100`. The higher the number, the more accurate the FDR estimate, but the longer the program will take to run.
- `FDR`: The false discovery rate to use. The default is `0.01`. This is calculated as the number of occurrences of peaks in the random shuffles divided by the number of peaks found in the real data.
- `FRAC`: The fraction of the data to be used for calculating random shuffles (where `0` means no truncation). The default is `0`. This is used to speed up the program, as the random shuffles and samples are taken on a subset of the data.
- `MIN_COUNT`: The minimum number of consecutive fragments to consider as a peak. The default is `2`.
- `MIN_QUANT`: The minimum quantile to consider as a peak. The default is `0.95`. Higher values result in fewer peaks being called and thus less noise in the output.
- `STEP`: The step size for quantiles. The default is `0.01`. Smaller steps give finer results at the cost of run speed.
- `UNIFIED_PEAKS`: The method for calling peak overlaps. The default is `max`. If `max`, the maximum overlap is called as the peak, resulting in fewer but larger peaks. If `min`, the minimum overlap is called as one peak, resulting in more but smaller peaks.
- `no_discard_zeros`: If set, treat zero scores as non-empty reads in the raw data. The default is `False`. This is useful for data with many zeros, such as ChIP-seq data.
- `SEED`: The random seed used for shuffling. The default is `0`.
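The interplay between `N`, `FDR`, and `MIN_COUNT` can be sketched in a few lines. The following is a simplified illustration of FDR estimation by random permutation, not the program's actual implementation; the function names and the run-length peak definition are hypothetical stand-ins:

```python
import random

def count_peaks(scores, threshold, min_count):
    """Count runs of at least `min_count` consecutive scores >= `threshold`."""
    peaks, run = 0, 0
    for s in scores:
        run = run + 1 if s >= threshold else 0
        if run == min_count:  # count each run once, when it first reaches min_count
            peaks += 1
    return peaks

def estimate_fdr(scores, threshold, min_count=2, n=100, seed=0):
    """Estimate FDR as mean peak count in shuffled data over peak count in real data."""
    rng = random.Random(seed)  # fixed seed for reproducibility
    real = count_peaks(scores, threshold, min_count)
    if real == 0:
        return float("inf")  # no real peaks: FDR is undefined at this threshold
    shuffled = list(scores)
    total = 0
    for _ in range(n):
        rng.shuffle(shuffled)  # one random permutation per iteration
        total += count_peaks(shuffled, threshold, min_count)
    return (total / n) / real
```

A larger `n` averages over more permutations and stabilizes the estimate, which is why the default of 100 iterations trades accuracy against runtime.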
- Call peaks on a single `gff` file with default parameters:

  ```shell
  find_peaks.py path/to/file.gff
  ```

- Call peaks on multiple `bedgraph` files with default parameters:

  ```shell
  find_peaks.py path/to/file1.bedgraph path/to/file2.bedgraph
  ```

- Shuffle the first 1k reads for 1000 iterations and filter the result by a 0.05 false discovery rate:

  ```shell
  find_peaks.py --n 1000 --frac 1000 --fdr 0.05 path/to/file.bedgraph
  ```

- Set the random seed to 42:

  ```shell
  find_peaks.py --seed 42 path/to/file.bedgraph
  ```

- Call peaks with a minimum of 5 consecutive fragments and a minimum quantile of 0.99:

  ```shell
  find_peaks.py --min_count 5 --min_quant 0.99 path/to/file.bedgraph
  ```
Some modifications and improvements have been made relative to the original Perl program:
| Case | Perl (original) | Python (this) |
|---|---|---|
| No regression result found when calling significant peaks (e.g. when the random shuffles do not have enough variety of peak lengths, so no regression can be made) | Continues using uninitialized values (undefined behavior) | Skips calling peaks for this peak min value and prints a warning message to stderr instead |
| Random seed used for shuffling | Uses a different one for each execution | Fixed, and can be reset by the user for reproducibility |
| Processing probe reads with zero scores | Keeps probe reads with zero scores and skips them later | Discards them by default to increase processing speed, but the user can choose to keep them |
| Generating random shuffles of a fraction of the data (e.g. when the data is too large to shuffle, or when the user wants to speed up the program) | Uses a fixed-size prefix of the whole data and shuffles it in place | Uses `random.sample` instead; the number to sample is calculated through simulated sampling from all reads (including zero-scored ones) |
