Skip to content

Evaluate CLIP-seq and other genomic region data using a comprehensive collection of known RBP binding motifs

License

Notifications You must be signed in to change notification settings

michauhl/RBPBench

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

RBPBench

GitHub Bioconda

RBPBench is multi-function tool to evaluate CLIP-seq and other genomic region data using a comprehensive collection of known RNA-binding protein (RBP) binding motifs. RBPBench can be used for a variety of purposes, from RBP motif search (database or user-supplied RBP motifs) in genomic regions, over motif enrichment and co-occurrence analysis, in-depth comparisons over multiple datasets via sequence and genomic annotation statistics, to benchmarking CLIP-seq peak caller methods as well as comparisons across cell types and CLIP-seq protocols. RBPBench supports both sequence and structure motifs, as well as regular expressions (sequence and structure patterns). Moreover, users can easily provide their own motif collections.

Table of contents

Installation

RBPBench was developed and tested on Linux (Ubuntu 22.04 LTS). Currently only Linux operating systems are supported. To install RBPBench, you first need to install the Conda package manager.

Conda

If you do not have Conda on your system yet, you can e.g. install Miniconda, a free + lightweight Conda installer. Get Miniconda here, choose the latest Miniconda3 Python Linux 64-bit installer and follow the installation instructions. To do it in the command line:

wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh

In the end, Conda should be evocable on the command line via (possibly in a different version):

$ conda --version
conda 24.5.0

Conda package installation

RBPBench is available as a Conda package, which makes installing it a breeze. We simply create a Conda environment and install RBPBench inside the environment:

conda create -n rbpbench -c conda-forge -c bioconda rbpbench
conda activate rbpbench

RBPBench should now be available inside the environment:

rbpbench -h

Manual installation

Manual installation of RBPBench is only slightly more work. First we create the Conda environment with all necessary dependencies:

conda create -n rbpbench -c conda-forge -c bioconda python bedtools goatools infernal logomaker markdown matplotlib-venn meme plotly pybigwig scikit-learn scipy textdistance upsetplot venn

Next we activate the environment, clone the RBPBench repository, and install RBPBench:

conda activate rbpbench
git clone https://github.com/michauhl/RBPBench.git
cd RBPBench
python -m pip install . --ignore-installed --no-deps -vv

RBPBench should now be available inside the environment:

rbpbench -h

RBPBench webserver

RBPBench is also available as a webserver on Galaxy (more infos soon).

Example runs

To run the examples, we change to the cloned repository subfolder RBPBench/test and download the genome FASTA file (details), plus a fitting GTF file, e.g., from Ensembl (details):

cd test/
wget https://hgdownload.cse.ucsc.edu/goldenpath/hg38/bigZips/hg38.fa.gz
gunzip hg38.fa.gz
wget https://ftp.ensembl.org/pub/release-112/gtf/homo_sapiens/Homo_sapiens.GRCh38.112.gtf.gz

We also unzip the collection of ENCODE eCLIP CLIPper IDR peak region BED files (i.e., genomic RBP binding regions determined by CLIPper IDR method on RBP-specific eCLIP data), which we will use as example input files to demonstrate the various RBPBench modes:

unzip eclip_clipper_idr.zip

The folder content looks like this:

ls -1 eclip_clipper_idr
AGGF1_HepG2_IDR_peaks.bed
AGGF1_K562_IDR_peaks.bed
AKAP1_HepG2_IDR_peaks.bed
AKAP1_K562_IDR_peaks.bed
...

We can see that we have a total of 172 genomic region files to work with, comprised of 88 for K562 cell line, 84 for HepG2 cell line, and a total of 110 different RBPs over the two cell lines:

ls -1 eclip_clipper_idr | wc -l
172
ls -1 eclip_clipper_idr | grep -c "K562"
88
ls -1 eclip_clipper_idr | grep -c "HepG2"
84
ls eclip_clipper_idr | cut -d'_' -f1 | sort | uniq | wc -l
110

Motif search with single input dataset

Search with selected RBPs

For motif search in a single set of genomic input regions (typically CLIP-seq peak regions), we can use rbpbench search mode. This mode contains many useful functions and command line options (please check out the help prompt rbpbench search -h as well as the documentation below for more details on the various functionalities of the different modes). In this first example, we take PUM2 eCLIP regions and search for motifs of PUM1, PUM2, RBFOX2 (see rbpbench info for all available RBPs), as well as the regular expression AATAAA (known polyadenylation signal sequence, more on regex use here):

rbpbench search --in eclip_clipper_idr/PUM2_K562_IDR_peaks.bed --genome hg38.fa --gtf Homo_sapiens.GRCh38.112.gtf.gz --out test_search_pum2_ex1_out --rbps PUM2 PUM1 RBFOX2 --ext 10 --regex AATAAA

Here we also extend the input regions by 10 nt up- and downstream (--ext 10, for different up- and downstream use e.g. --ext 0,20) before search. Fig. 1 shows the plots generated in this example (found in HTML report file report.rbpbench_search.html, additional plots can be generated e.g. via --enable-upset-plot, --set-rbp-id). Detailed explanations on plot contents can be found in the generated HTML file. Note that apart from HTML reports, RBPBench also outputs various table files (e.g., motif hit regions, region annotations, RBP + motif hit statistics, more details e.g. here), which can later be used for comparing search results or again as input regions for motif search.

Fig. 1: Example visualizations produced by rbpbench search (PUM2 example). a: RBP motif co-occurrences heat map. b: Input sequence length distribution plot. c: Input sequences by k-mer content plot. d: Region annotations per RBP. e: mRNA region coverage profile. f: Exon-intron overlap statistics. g: Intronic regions intron border distances.

Note that for plotting purposes the interactive plots (here and subsequent figures) appear more cramped than in the HTML reports.

Fig. 1a shows the RBP motif co-occurrences heat map (color scale values are -log10 of Fisher's exact test p-value), visualizing significant RBP motif co-occurrences (details). We can see that PUM2 and PUM1 binding motifs frequently co-occur, which makes sense since their database motifs are very similar (to visualize all selected RBP motifs in HMTL, add --plot-motifs and inspect motif_plots.rbpbench_search.html, example entry shown in Fig. 2). We also see significant co-occurrences of AATAAA regex hits with PUM2 and PUM1, which too is reasonable since PUM2 binding sites tend to be in 3'UTR regions, often also towards their ends (as showcased in Fig. 1e and Fig. 1f). In contrast, RBFOX2 motif hits are less frequent and show no significant co-occurrence (expected since RBFOX2 is known to be involved in splicing regulation). Fig. 1b shows the distribution of input sequence lengths, including the sequences and motif hits for each sequence. Fig. 1c positions input sequences based on their k-mer contents (i.e., the closer two sequences in the plot, the more similar their k-mer contents, default k=3). The hover box shows additional information for each sequence, including its sequence, genomic region annotations, motif hits and nucleotide percentages. This way we can e.g. easily inspect general sequence characteristics of the input regions, or spot outliers. Fig. 1d shows genomic region annotations for all input regions (add --add-all-reg-bar), as well as the regions with motif hits for each specified RBP. Fig. 1e displays the coverage of mRNA regions (5'UTR, CDS, 3'UTR) by the input regions. Fig. 1f provides exon-intron overlap statistics of the input regions, including upstream and downstream proximate and distant intronic regions, exon-intron border regions, and also first, last, and single exon regions. All three plots (Fig. 1d-f) confirm PUM2's known binding preferences (namely spliced exonic RNA, specifically 3'UTR regions, often towards the transcript 3' end, confirmed by high last exon coverage). Fig. 1g displays the distances of intronic regions to their nearest intron border (one plot for upstream, one for downstream border regions). The hover box again provides additional information for each region, including its sequence, annotations, border distances, intron numbers, lengths, and motif hits. As expected for PUM2, we only see a few intronic regions, possibly from missanotations, which based on the hover box information could be further checked in a genome viewer (e.g., IGV).

Fig. 2: Example entry for RBP RBFOX2 from motif_plots.rbpbench_search.html. The --plot-motifs HTML report (add --plot-motifs to the above call to create it) visualizes the selected motifs, provides literature references, as well as motif hit counts (including hit statistics for matched sequences) and genomic region annotations for the motif hit regions themselves.

Search with all RBPs

To search with all database RBP motifs in a single set of genomic input regions, we can use --rbps ALL. The motif database can be changed via --motif-db (details on motif selection here), and the user can also define own motifs or even a motif database (details). Moreover, the selected RBPs can be filtered by annotated functions (--functions, details). For this example we use the SLBP eCLIP regions, select all RBPs, further filter them by some annotated functions, and also conduct a GO term enrichment analysis on the input regions:

rbpbench search --in eclip_clipper_idr/SLBP_K562_IDR_peaks.bed --genome hg38.fa --gtf Homo_sapiens.GRCh38.112.gtf.gz --out test_search_slbp_tr_rsd_tep_out --functions TR RSD TEP --rbps ALL --ext 20 --goa

Here we filter the selected RBPS (i.e., all RBPs) to keep only RBPs with the following annotated functions: translation regulation (TR), RNA stability & decay (RSD), 3' end processing (TEP). This results in 91 RBPs for search (out of 257), with 260 motifs (out of 599).

Fig. 3: Example visualizations and statistics produced by rbpbench search (SLBP example). a: RBP motif co-occurrences heat map. b: RBP region score motif enrichment statistics (top 10). c: GO enrichment analysis results. d: mRNA region coverage profile.

Fig. 3 shows some of the resulting plots and statistics from the HTML report. Fig. 3a again shows the RBP motif co-occurrences heat map, this time between the 91 selected RBPs. Fig. 3b presents the RBP region score motif enrichment statistics table (details). This in short checks whether any of the RBP motifs (on RBP level) tend to occur more often in higher scoring input regions. As default, BED column 5 is used as score (change via --bed-score-col), which in the input examples is the log2 fold change of the CLIP-seq peak region. We can see that SLBP has the lowest p-value here, as expected (assuming that the region score is to some extent indicative of binding site quality or binding affinity), showing that the statistic can be useful e.g. to check for dataset quality, or for co-enriched RBP motifs. Region scores can be further taken advantage of by comparing k-mer distributions of top and bottom scoring input regions (details, plus Fig. 4a example). Fig. 3c shows the GO term enrichment analysis results. As expected for SLBP, we can see many chromatin related terms in the top GO terms (more on GOA settings here). Fig. 3d again depicts the mRNA region coverage profile, this time for SLBP (see Fig. 1e for PUM2 profile). Note that the relative mRNA region lengths in the plot can change, depending on the number of input regions and on which mRNAs these map. Also note that this plot becomes less informative if the percentage of input regions mapping to mRNA regions gets lower (info given in HTML figure legend, here: 82.0%). Most often, this is the case for intron-binding RBPs, where you typically have low percentages of exonic regions and thus also low mRNA coverage.

Additionally, rbpbench search HTML report includes several plots showing k-mer distributions and variations in the input dataset (Fig. 4), which also take into account the region scores to identify enriched k-mers, which can hint at RBP binding preferences.

Fig. 4: Example visualizations produced by rbpbench search (SLBP example). a: Top vs bottom scoring regions k-mer distribution. b: Input sequences k-mer variation plot.

The first plot (Fig. 4a) shows the k-mer distribution of the top and bottom scoring input regions (set k=4, change via --kmer-plot-k). We can see several k-mers that are enriched in the top scoring regions, such as CAAA, CCAA, CTTT, or AAAG. It is known that SLBP binds a conserved stem loop element in histone mRNA 3'UTRs, and the reported k-mers are also highly conserved in and around the stem loop (reported e.g. here). The second plot (Fig. 4b) looks more closely at the k-mer variation in the input regions, plotting for each k-mer the percentage of input regions it occurs in (y-axis) vs the variation of the k-mer over the dataset (x-axis, using the coefficient of variation (CV) of k-mer ratios over all input regions). In short, the lower the CV of a k-mer, the more even the distribution of its ratios (see HTML report for details). Moreover, the k-mers are colored by their correlation (Spearman) with the input region scores (k-mer ratios vs scores). This way we can identify k-mers that appear evenly over the dataset (high site %), that have similar ratios over the dataset (low CV), and that are also enriched in the top scoring regions (higher correlation with scores). For example, a k-mer might have high site % but a lower correlation, which can be interpreted as a sequence element that often occurs in the binding site context, but does not contribute that much to the RBP binding preference or does not have high affinty towards the RBP. These could be repeat elements (unless the RBP binds to repeats), which also often feature a shift in the k-mer variation to the right, as the number of repeats over the input regions varies more. Again, we can see an agreement with reported SLBP binding preferences, showing the known k-mers enriched in top scoring regions (although they are not necessarily the ones with highest site %).

Coming back to our motif hits, to take a closer look at how these are distributed around a specific RBP, we can specify the RBP to focus on via set-rbp-id:

rbpbench search --in eclip_clipper_idr/SLBP_K562_IDR_peaks.bed --genome hg38.fa --gtf Homo_sapiens.GRCh38.112.gtf.gz --out test_search_slbp_tr_rsd_tep_gh_out --functions TR RSD TEP  --rbps ALL --ext 20 --set-rbp-id SLBP --greatest-hits

Here we also enable --greatest-hits, meaning that for each input region and RBP only the motif hit with the lowest p-value gets reported. This has an impact on the co-occurrence heat map statistics, namely through changing the mean minimum motif distance (if we want to filter by this via setting --min-motif-dist to > 0, details), but also on the distance plots that get produced in the example (Fig. 5). This way we can focus for each RBP motif on the best motif hit for each input region (by default all hits are included in the outputs and analysis).

Fig. 5: Example visualizations and statistics produced by rbpbench search (SLBP example with --set-rbp-id). a: Set RBP SLBP motif distances plot. b: Motif distance statistics table.

Fig. 5a shows the distances of other RBP motifs relative to the set RBP (SLBP here), resulting in a coverage plot for each RBP that passes the filter thresholds (default minimum motif pair count 10, set via --rbp-min-pair-count).
This plot is generated both on the RBP level, but also on the single RBP motif level (so for each motif associated with the set RBP, Fig. 5a shows RBP level). Fig. 5b shows the RBP motifs most frequently observed in the neighborhood of the set RBP (here on motif level). These plots give us an idea which motifs occur at which distances in the input regions. To get a more fine-grained picture of co-occurrences on the single motif level, including signifying motif enrichment in input regions, we can use rbpbench enmo or rbpbench nemo (more details in the examples below and here). Alternatively, single motifs can be selected in the various search modes using --motifs followed by the motif ID(s) of interest. For example, to select only the first two PUM2 motifs for search out of all selected RBP motifs:

rbpbench search --in eclip_clipper_idr/PUM2_K562_IDR_peaks.bed --genome hg38.fa --gtf Homo_sapiens.GRCh38.112.gtf.gz --out test_search_pum2_1_2_out --rbps ALL --ext 10 --motifs PUM2_1 PUM2_2

Motif search with multiple input datasets

RBPBench also supports batch processing of multiple input datasets (rbpbench batch), i.e., to run motif search on multiple input datasets. This allows us to learn more about general, group- or set-specific features for a large number of CLIP-seq (or similar methods) datasets, by inspecting the generated HTML batch report plots and statistics. In addition, it can be used to generate the input files for benchmarking peak callers, or to e.g. do CLIP-seq protocol or cell type comparisons, using rbpbench compare.

A batch of input BED regions can be provided to rbpbench batch in several ways, e.g. completely via the command line. However, for a large set of input datasets, the easiest way is to provide a table file where one specifies the RBP used for search, the method ID and the data ID (to define what gets compared in rbpbench compare), and the path to the BED file. For example, the first three lines of our K562 (a cell type) eCLIP example table file look like this:

head -3 eclip_clipper_idr.k562.batch_in.txt
AGGF1	clipper_idr	K562_eclip	eclip_clipper_idr/AGGF1_K562_IDR_peaks.bed
AKAP1	clipper_idr	K562_eclip	eclip_clipper_idr/AKAP1_K562_IDR_peaks.bed
BUD13	clipper_idr	K562_eclip	eclip_clipper_idr/BUD13_K562_IDR_peaks.bed

To run rbpbench batch on all the 88 K562 eCLIP sets (one line per set) specified in the table file:

rbpbench batch --bed eclip_clipper_idr.k562.batch_in.txt --genome hg38.fa --gtf Homo_sapiens.GRCh38.112.gtf.gz --ext 10 --out batch_K562_eclip_clipper_idr_out --regex AATAAA

Note that we again use a regex, which conveniently allows us to check enrichment and co-occurrences of this regex in all input datasets. Also note that many of the generated batch statistics and plots are independent of motif hit statistics, i.e., they rather look at genomic annotations, occupied genes, or sequence content of input regions. This makes rbpbench batch a great solution for comparing a large number of existing CLIP-seq datasets to a specific dataset of your interest (e.g., also derived from CLIP-seq or similar protocols), to quickly check for differences and similarities between the datasets. The output results folder again contains several table files (e.g., the hit statistics files used as inputs to rbpbench compare, details), as well the HTML report file report.rbpbench_batch.html, containing various comparative plots and statistics. For the example call above, the following statistics and plots are included in the HTML report file:

  1. Input datasets sequence length statistics
  2. Input datasets exon-intron overlap statistics
  3. Input datasets exon-intron overlap comparative plot
  4. Input datasets RBP region score motif enrichment statistics
  5. Regular expression motif enrichment statistics
  6. Regular expression RBP motif co-occurrence statistics
  7. Input datasets nucleotide percentages statistics
  8. Input datasets k-mer frequencies comparative plot
  9. Input datasets k-mer variation comparative plot
  10. Input datasets occupied gene regions comparative plot
  11. Input datasets occupied gene regions similarity heat map
  12. Input datasets genomic region annotations comparative plot
  13. Plus a genomic regions annotation plot for each input dataset
  14. Additional region annotation statistics

Detailed explanations can be found in the corresponding table and plot legends in the HTML file. Fig. 6 shows the 2 plots related to k-mer sequence produced by the above call:

Fig. 6: Example sequence k-mer plots produced by rbpbench batch (K562 eCLIP datasets example). a: Input datasets k-mer frequencies comparative plot. b: Input datasets k-mer variation comparative plot.

Fig. 6a plot compares the input datasets by the k-mer frequencies of their site sequences (by default 5-mer frequencies, change via --kmer-size, 3D-PCA dimensionality reduction). In addition, for each dataset, the top 10 5-mer frequencies are given in the hover box (plus more infos like mono-nucleotide percentages). Also, the datasets are colored by their average sequence complexities (the higher the more even the mono-nucleotide percentages in the dataset, change to di-nucleotide percentages via --seq-comp-k). This way we can quickly assess sequence similarities and dissimilarities between input datasets, or e.g. spot interesting outliers. Fig. 6b extends on the k-mer frequencies plot, this time showing the k-mer variation on site level in the input datasets. The datasets are positioned based on the k-mer site percentage profiles. I.e., for each k-mer the percentage of sites in which the k-mer occurs is used as feature. (default 4-mers, change via --seq-var-kmer-size). Dataset coloring is the mean site percentage of all present k-mers in the dataset, and the hover box shows many informative statistics (top and bottom 10 k-mers by site percentage, present k-mer percentage, average k-mer percentage, as well as site length statistics). This gives us additional metrics to compare the datasets: In general, we can assume that k-mers with high site percentages contribute more to the RBP binding (possibly also as direct binding motifs), while low percentage k-mers likely should also have low affinity to the RBP. A higher mean percentage for a dataset means that the present k-mers are more evenly distributed over the dataset sequences. This can stem from a larger sequence lengths, or in general a more diverse set of sequences, possibly reflecting RBP binding preferences.

Fig. 7 shows the 4 plots related to genomic annotations produced by the above call:

Fig. 7: Example genomic annotation plots produced by rbpbench batch (K562 eCLIP datasets example). a: Input datasets exon-intron overlap comparative plot. b: Input datasets occupied gene regions comparative plot. c: Input datasets occupied gene regions similarity heat map. d: Input datasets genomic region annotations comparative plot.

Fig. 7a plot shows the exon-intron overlap statistics for the input datasets, visualized as 2D PCA plot. The closer two datasets (i.e., the dots representing them), the more similar the datasets are w.r.t. their exon-intron overlap statistics. This allows us to quickly identify groups of RBPs with similar exon-intron binding characteristics, or to spot unusual RBPs. Utilized exon-intron overlap categories are: exonic regions, intronic regions, intronic regions close to intron borders (upstream and downstream), intronic regions distant from intron borders (upstream and downstream), exon-intron border regions (+/- 50 nt of exon-intron borders), first exons, last exons, and single exon transcripts. Fig. 7b plot again uses 3D-PCA, this time on the occupied gene regions for each input dataset. Here we can quickly identify e.g. datasets with very low numbers of occupied genes, or in general datasets with similar gene occupancy profiles. Fig. 7c further expands on occupied genes statistics, this time using the cosine similarity to measure the similarity in occupied genes between datasets. In addition, the datasets are ordered based on hierarchical clustering, resulting in similar datasets appearing close together on the axes. This enables us to identify groups of datasets which have similar occupancy profiles. Fig. 7d gives us the genomic region annotations for each input dataset, using 2D-PCA to reduce dimensions. Moreover, datasets are colored by the highest percentage annotation, allowing us to identify groups of input datasets with similar binding characteristics (e.g., the blue dots represent datasets whose regions primarily overlap with 3'UTR regions). If a regex is supplied, this also shows up here, meaning over all datasets, the regex hit regions are annotated and appear in the plot.

Comparisons between search results

To compare peak calling methods or in general conditions based on motif hit results, we can use the RBPBench search modes (typically rbpbench search or rbpbench batch). For example, we can take the following input table for rbpbench batch:

cat batch_compare_test.batch_in.txt
PUM1	clipper_idr	k562_eclip	batch_compare_test/PUM1.k562_eclip.clipper_idr.bed
PUM1	dewseq_w100_s5	k562_eclip	batch_compare_test/PUM1.k562_eclip.dewseq_w100_s5.bed
RBFOX2	clipper_idr	hepg2_eclip	batch_compare_test/RBFOX2.hepg2_eclip.clipper_idr.bed
RBFOX2	clipper_idr	k562_eclip	batch_compare_test/RBFOX2.k562_eclip.clipper_idr.bed

Here we essentially have two comparisons defined for us (details on how to define comparisons here):

  1. compare the two peak calling methods clipper_idr, dewseq_w100_s5 based on common RBP ID PUM1 and condition k562_eclip
  2. compare the two conditions hepg2_eclip, k562_eclip based on common RBP ID RBFOX2 and peak calling method clipper_idr

We thus first run rbpbench batch to search for motifs in these datasets:

rbpbench batch --bed batch_compare_test.batch_in.txt --genome hg38.fa --ext 10 --out test_batch_out

Then we simply provide the batch search output folder to rbpbench compare (multiple folders can be supplied as well), and rbpbench compare will find all compatible combinations in these folders and do the comparisons via:

rbpbench compare --in test_batch_out --out test_compare_out

Each comparison (two in this example) gets its own section in the output HTML report file test_compare_out/report.rbpbench_compare.html. For each comparison, a comparison statistics table and a Venn diagram is created. The table has the following format:

Column name Description
Method/Data ID Method/Data ID that gets compared (typically peak calling method or condition)
# regions Number of input regions used for motif search
# motif hits Number of unique motif hits in input regions (removed double counts)
% regions with motifs Percentage of input regions with motif hits
% motif nucleotides Percentage of unique motif nucleotides over effective input region size (overlapping regions merged)
# motif hits per 1000 nt Number of motif hits over 1000 nt of called input region size (overlapping regions NOT merged)

This information sums up the performance of the peak calling method, using the motif hit statistics as performance metrics. More infos on motif hit statistics can be found here (on unique hits) and here (on motif hit statistics). For the PUM1 comparison:, we get the following table:

Method ID # regions # motif hits % regions with motifs % motif nucleotides # motif hits per 1000 nt
clipper_idr 2661 438 12.33 1.52 2.04
dewseq_w100_s5 1583 664 23.88 1.65 2.31

Note that the number of hits (in rbpbench search, rbpbench batch etc.) is influenced by various search parameters, e.g., the input region extension (here --ext 10), or the motif hit threshold (for FIMO sequence motif search change via --fimo-pval). Longer input regions and more relaxed thresholds naturally lead to more motifs hits, and thus can change the comparison results, although the basic trends typically remain similar. In this example, we can see that for the PUM1 dataset, 23.88% of DEWSeq peak regions contain >= 1 PUM1 motif hit (CLIPper IDR 12.33%). We can also see that the number of motif hits over 1,000 nt called peak region size is higher for DEWSeq (2.31 vs. 2.04).

As for second comparison (RBP ID: RBFOX2, peak calling method: clipper_idr, comparing conditions / cell types HepG2 + K562), we get this table:

Data ID # regions # motif hits % regions with motifs % motif nucleotides # motif hits per 1000 nt
hepg2_eclip 7477 7118 41.25 3.26 10.44
k562_eclip 3525 2768 32.94 2.55 8.27

We can see that the eCLIP data from the HepG2 cell type in general contains more motif hits (absolute and percentage-wise), which can hint at the quality of the experimental data.

As any given genomic motif hit can either be found only by one method, or be identified by any set (>=2) of methods, it makes sense to visualize this information as a Venn diagram. For the two comparisons, the produced Venn diagrams looks like this:

Fig. 8: Venn diagrams produced by rbpbench compare. a: Comparing motif hit overlap between peak calling methods clipper_idr and dewseq_w100_s5 (PUM1 K562 ECLIP). b: Comparing motif hit overlap between conditions hepg2_eclip and k562_eclip (i.e., RBFOX2 eCLIP in two different cell types HepG2, K562). Motif hit numbers and percentages of total motif hits are shown for each Venn diagram area (method exclusive and intersection).

We can see that for the peak calling comparison (Fig. 8a), 31% of each method's unique motif hits overlap. In the second comparison (Fig. 8b, comparing cell types), we get an even lower overlap of 15%. Both of these are valuable insights, which help us to better understand and interpret the data.

Single motif enrichment and co-occurrences

ENMO mode

To get enrichment and co-occurrence statistics on single motif level, we can use rbpbench enmo and rbpbench nemo. The difference is that rbpbench enmo checks for enriched motifs in the input regions, while rbpbench nemo checks for neighboring motifs in the input regions, including differential up- and downstream motif analysis. This makes rbpbench nemo well suited for analysing e.g. motif hit regions that were obtained from one of the other search modes, which optionally can be included or excluded in the motif search space. To check for enriched motifs and their co-occurrences in a set of input regions on a single motif level using rbpbench enmo:

rbpbench enmo --in eclip_clipper_idr/PUM2_K562_IDR_peaks.bed --genome hg38.fa --gtf Homo_sapiens.GRCh38.112.gtf.gz --out test_enmo_pum2_out --rbps ALL --ext 10 --min-motif-dist 10 --motif-sim-thr 2

Note that only the enriched motifs (passing the set p-value threshold --enmo-pval-thr) are reported and then included in the following co-occurrence analysis. Also note that reported p-values can differ between runs (due to random background set generation, to keep results same set a --random-seed). Fig. 9 shows the generated visualizations:

Fig. 9: Example visualizations and statistics produced by rbpbench enmo. a: Motif enrichment statistics, showing significantly enriched motifs in input regions. b: Single motif co-occurrences heat map, showing significant co-occurrences between enriched motifs. c: Sequence motif similarity vs significance PCA plot. d: Sequence 5-mer percentages in the input and background dataset.

As we can see in Fig. 9a (top 10 enriched motifs), the PUM2 motifs (+ one highly similar PUM1 motif) are the most enriched motifs in the input dataset, as expected. Fig. 9b shows the co-occurrence heat map for the enriched motifs. Note that we can further control what motifs are reported as significant here by additional parameters, most importantly --cooc-pval-thr, --min-motif-dist, and --motif-sim-thr. The first is simply the p-value threshold (by default Benjamini-Hochberg corrected, change via --cooc-pval-mode), while --min-motif-dist is the mean minimum motif distance between the pair of motifs in all input regions, and --motif-sim-thr is the maximum motif pair similarity (calculated using TOMTOM). The later two thus allow us to e.g. focus on motif pairs that are on average not too close to each other, and are also not too similar to each other. Note that for pairs that do not meet the set thresholds, their hover box in the heat map informs about the reason for filtering out the pair. Fig. 9c plots the motifs arranged by their similarity and colored by their significance, allowing us to identify groups of motifs with similar motif contents. Fig. 9d compares the k-mer (here 5-mer) distribution between the input and the generated background regions. Here we can also see that 5-mers similar to the PUM2 motifs appear more frequently in the input regions. Generation of the background regions can be controlled with various parameters (see rbpbench enmo -h for more details). For example, --bg-mode allows us to either use shuffled negatives (from input set), or randomly sampled negatives from genomic or transcript regions (depending on the type of input regions). Moreover, the background set size can be increased (--bg-min-size), plus the user can tell RBPBench from which genomic (or transcript) regions to sample (--bg-incl-bed) or to not sample (--bg-mask-bed).

NEMO mode

To do motif enrichment analysis on regions neighboring the input regions (ideally short motif regions or single positions of interest), we can use rbpbench nemo. By default, motif hits overlapping the input regions are not included in the analysis (change with --allow-overlaps). Otherwise, available arguments are very much the same as for rbpbench enmo. In this example, we will use the mRNA region end positions, and look at the adjacent up- and downstream regions. To get these, we will use some helper scripts that come bundled with RBPBench (details on available helper scripts here):

gtf_extract_mpt_region_bed.py --gtf Homo_sapiens.GRCh38.112.gtf.gz --out mrna_regions_hg38_out --mrna-only
bed_print_last_n_pos.py --in mrna_regions_hg38_out/mpt_regions.bed --ext 1 > mrna_region_end_pos.bed

Now let's run rbpbench nemo with an up- and downstream extension of --ext 30 to the supplied mRNA end positions. Furthermore, we allow overlaps and restrict search to RBPs with annotated functions in 3' end processing (TEP), RNA stability & decay (RSD), and translation regulation (TR) (91 RBPs):

rbpbench nemo --in mrna_region_end_pos.bed --genome hg38.fa --gtf Homo_sapiens.GRCh38.112.gtf.gz --out test_nemo_mrna_ends_out --rbps ALL --ext 40 --min-motif-dist 10 --motif-sim-thr 2 --allow-overlaps --functions TEP RSD TR

Fig. 10 shows us the resulting neighboring motif enrichment statistics table (only top 10 enriched motifs shown):

Fig. 10: Neighboring motif enrichment statistics table (top 10 results) produced by rbpbench nemo.

We can see that the table is slightly expanded compared to the rbpbench enmo table. I.e., we now have additional info on whether motif hits tend to occur more in up- or downstream regions relative to the input regions (signified by Wilcoxon rank-sum test p-value, as well as a motif distance plot showing the distribution of motif hit center locations relative to the centered input regions). A low test p-value in this table means that there is a preference for either up- or downstream binding (two-sided test). The top enriched motifs are the motifs essentially describing the polyadenylation signal sequence (AATAAA), which is known to often occur near 3'UTR ends. This preference we can also nicely see in the motif distance plots (plus it is significant w.r.t. test p-value). Looking at the regions downstream the annotated mRNA ends, we see a preference of T and GT-rich motifs (which continues as we would go beyond the top 10). These GT-rich (or actually GU-rich in RNA) elements (GREs) are too known to frequently occur at transcript ends, namely downstream the polyadenylation signal. Both elements have well known roles in the regulation of mRNA stability and degradation. The different up- and downstream sequence preferences can also be seen in the sequence motif similarities vs direction PCA plot (Fig. 11):

Fig. 11: Sequence motif similarity vs direction PCA plot produced by rbpbench nemo.

This plot again shows us the motifs arranged by their similarity, but this time colored by the up- or downstream direction preference (e.g., the more negative the higher the upstream preference). As an example for a significant downstream preference, the hover box shows CSTF2_2 motif (from CSTF2 protein). CSTF2 is known to be a member of the cleavage stimulation factor (CSTF) complex, which recognizes GU-rich elements at mRNA ends and is involved in the 3' end cleavage and polyadenylation of pre-mRNAs.

Visualizing global motif preferences

Motif preferences in spliced full transcripts

To visualize motif preferences on a transcriptome-wide scale, e.g. for all mRNAs, we can use rbpbench searchlongrna. This mode allows us to find motifs specifically on spliced full transcripts (similar to rbpbench searchrna on spliced transcript regions, description of all available modes here). For this first example, we again use the polyadenylation signal sequence, which we can provide as regex (--regex AATAAA), or by defining the sequence motif from the example above. As sequence motif search with FIMO is faster than regex search, we will use the CSTF2_1 motif:

rbpbench searchlongrna --genome hg38.fa --gtf Homo_sapiens.GRCh38.112.gtf.gz --out test_searchlongrna_mrna_pas_out --rbps ALL --motifs CSTF2_1 --mrna-only

The resulting motif hit coverage profile for CSTF2_1 (AATAAA) over all mRNAs is shown in Fig. 12:

Fig. 12: mRNA region motif hit coverage profile for motif CSTF2_1, produced by rbpbench searchlongrna. Number of mRNAs used for plotting: 20,476 mRNAs. Median lengths of mRNA regions: 5'UTR = 127.0, CDS = 1215.0, 3'UTR = 914.0.

By default, all annotated mRNAs are used from the provided GTF file. However, we can also restrict the set of used mRNAs (by supplying our own transcript IDs list via --tr-list). Note that we need to specify --mrna-only, otherwise we will not get the coverage profile plot. The mode is still useful though, as one can filter the resulting motif hits BED file e.g. by 3'UTR, to obtain only 3'UTR hits, and use these hit regions as input to rbpbench searchrna (transcript region search), or any of the single motif enrichment modes (rbpbench enmo, rbpbench nemo). As expected, we can clearly make out the strong prevalence of the AATAAA sequence at the end of 3'UTRs. In addition, we can see a smaller peak at the 3'UTR start, which is due to the partial match with the stop codon UAA (i.e., TAA). As the stop codon is not annotated as part of the CDS in GTF files (unlike the start codon), RBPBench currently annotates it as the first three 3'UTR positions.

As an alternative example, let's use DDX3X which can be seen in Fig. 7d (if we would hover over its data point) as predominantly 5'UTR binding (followed by CDS). This trend in Fig. 7d is based on the DDX3X eCLIP region coverage, but we can also check whether this is true for its motifs:

rbpbench searchlongrna --genome hg38.fa --gtf Homo_sapiens.GRCh38.112.gtf.gz --out test_searchlongrna_mrna_ddx3x_out --rbps DDX3X --mrna-only

Fig. 13: mRNA region motif hit coverage profiles for all DDX3X motifs, produced by rbpbench searchlongrna.

We can see that the DDX3X motifs are also found predominantly in 5'UTRs, as well as decreasingly in frequency along the CDS. Since we have more than one motif for DDX3X (i.e., two: DDX3X_1, DDX3X_2), profiles are plotted for each motif and then also the sum of profiles for the RBP in total. Furthermore, the motif hit coverage profile fits to what is known about DDX3X, namely that it is an RNA helicase involved e.g. in translation initiation. Fig. 14 shows the two motifs:

Fig. 14: DDX3X database motifs together with hit statistics for matched sequences and literature references, produced by rbpbench searchlongrna.

We can see that the motifs (identified from eCLIP data), show a preference for GC-rich sequences, underlining DDX3X's role as an RNA helicase resolving RNA structure.

Motif preferences in long genomic regions

To get motif hits for and look at motif preferences (GTF file needed) in long genomic regions (e.g. whole gene regions), we can utilize rbpbench searchlong. Besides visualizing the genomic region annotations for motif hits on the defined genomic regions, it also visualizes the normalized annotations. This means that the motif hit counts are normalized by the annotation region lengths found in the input regions. In other words, annotation counts are normalized depending on how much the annotation covers the input regions. This removes annotation region length biases introduced in long genomic input regions and can give a better idea of motif prevalences (given a reasonably large input size/number) in certain genomic regions (e.g., the motif tends to occur more often in introns, 3'UTR etc.). Suppose we want to visualize this for two genomic regions (PUM2 + PCAT-1 transcript regions, introns included), together having a region length of 837,078 nt:

cat tr_list.txt
ENST00000361078
ENST00000645463

Let's use these regions to find motifs, again for DDX3X as in the above Fig. 13 + Fig. 14 example, and visualize both annotations with rbpbench searchlong:

rbpbench searchlong --in tr_list.txt --genome hg38.fa --gtf Homo_sapiens.GRCh38.112.gtf.gz --rbps DDX3X --out searchlong_test_ddx3x_out

The resulting visualizations (stored in searchlong_test_ddx3x_out/motif_plots.rbpbench_searchlong.html) look the following:

Fig. 15: Genomic region annotations for DDX3X motif hits (standard vs normalized), produced by rbpbench searchlong. a: Genomic region annotations (standard, i.e., not normalized by annotation region lengths found in the input regions). b: Genomic region annotations (normalized by annotation region lengths found in the input regions).

We can see that almost all DDX3X motif hits are located in intron regions (Fig. 15a). However, this does not tell us much about the prevalence of the motifs in certain regions, as the length of intron annotations is much longer than the length of exon annotations in the input regions. For this we can look at the normalized annotations (Fig. 15b): now we can clearly see that the DDX3X motifs have a strong prevalence in 5'UTR regions, in agreement with the Fig. 13 results.

Additional modes

Plot nucleotide distribution at genomic positions

We can use rbpbench dist to plot the nucleotide distribution at genomic positions. This can be used e.g. to check for potential nucleotide biases at CLIP-seq crosslink positions.

For illustration, we extract genomic stop codon positions from our ENSEMBL GTF file using the helper script gtf_extract_tr_feat_bed.py, and run rbpbench dist on the obtained stop codon regions BED file:

gtf_extract_tr_feat_bed.py --feat stop_codon --gtf Homo_sapiens.GRCh38.112.gtf.gz --out stop_codons.Homo_sapiens.GRCh38.112.bed --uniq-reg
rbpbench dist --in stop_codons.Homo_sapiens.GRCh38.112.bed --genome hg38.fa --out test_dist_out --ext 5

By default the upstream start position of the genomic input regions is used as position zero in the plot (change with --cp-mode option). The amount of up- and downstream context added is defined by --ext. The generated plot (test_dist_out/nt_dist_zero_pos.png) looks the following:

Nucleotide distribution at stop codons

Fig. 16: Nucleotide distribution (position probability matrix) at genomic stop codon positions (human transcript annotations, ENSEMBL GRCh38 release 112).

We can clearly identify the known stop codon triplet sequences (in DNA: TAA, TAG, TGA), starting at position 0.

Comparing conservation scores between two sets of genomic sites

It can be informative to compare the conservation scores of two sets of genomic sites. rbpbench con allows us to do this, supporting both phastCons and phyloP conservation scores for the comparison. Two sets of genomic sites in BED format have to be provided (input sites via --in, control sites via --ctrl-in). For each genomic site the average conservation score is taken (i.e., by averaging over all single position scores), and the distributions of the two sets are compared using the Wilcoxon rank-sum test. A low p-value indicates that input sites have significantly higher conservation scores (effect sizes are provided as well in the HTML report). This mode can thus be used e.g. to compare different sets of motif hit regions, or sets from different peak callers. An HTML report is produced to summarize and visually inspect the results.

Compatible phastCons and phyloP conservation score files (.bigWig format) can be downloaded from the UCSC, e.g. using the conservation scores obtained from alignments of 99 vertebrate genomes to the human genome:

wget https://hgdownload.cse.ucsc.edu/goldenpath/hg38/phyloP100way/hg38.phyloP100way.bw
wget https://hgdownload.cse.ucsc.edu/goldenpath/hg38/phastCons100way/hg38.phastCons100way.bw

Note that rbpbench con also works with just a phastCons or a phyloP file as input.

As an example, we can again use the mRNA region end positions as input sites (mrna_region_end_pos.bed from the example above), and for the control sites we can shift the end positions downstream by 50 nt:

bed_shift_regions.py --in mrna_region_end_pos.bed --num 50 > mrna_region_end_pos.50ds_shift.bed

This should result in a drop in conservation scores in the control sites. We can simply check this with the following example call:

rbpbench con --in mrna_region_end_pos.bed --ctrl-in mrna_region_end_pos.50ds_shift.bed --phylop hg38.phyloP100way.bw --phastcons hg38.phastCons100way.bw --out mrna_region_end_pos_con_sc_out

Inspecting the conservation score distribution plots in test_con_smrna_region_end_pos_con_sc_outc_out/report.rbpbench_con.html confirms the assumption (plus the p-values are highly significant as well):

Conservation score distribution comparison

Fig. 17: Comparison of phastCons and phyloP conservation score distributions between input sites (mRNA region end positions) and control sites (mRNA region end positions shifted downstream by 50 nt).

Searching for sponge transcripts

To search for sponge transcripts (i.e., transcripts that serve a sponges for RBPs or miRNAs by harboring larger numbers of potential binding sites), we can use rbpbench sponge. Transcript sequences can be provided either as FASTA file, or else we can use the representative transcripts from all annotated genes (GTF + genome FASTA file needed). For example, to look for annotated transcripts (>= 1000 nt) with high match counts of the PUM2 consensus motif (Pumilio Response Element (PRE) of PUM2: UGUANAUA), we can run:

rbpbench sponge --regex 'TGTA[ACGT]ATA' --out test_sponge_search_gtf_out --genome hg38.fa --gtf Homo_sapiens.GRCh38.112.gtf.gz  --min-seq-len 1000

The top 10 transcripts ranked by hits per kilo base (kb) nt we get are the following:

Transcript ID Gene Name Transcript Length Hit Count Hits per kb Transcript Biotype
ENST00000572876 - 1514 6 3.9630 lncRNA
ENST00000661959 - 4683 24 5.1249 lncRNA
ENST00000586020 - 1803 7 3.8824 nonsense_mediated_decay
ENST00000651116 - 1287 4 3.1080 lncRNA
ENST00000669865 - 2811 9 3.2017 lncRNA
ENST00000374865 MRPL50 3336 10 2.9976 protein_coding
ENST00000690469 - 1390 4 2.8777 lncRNA
ENST00000641352 - 4068 12 2.9499 lncRNA
ENST00000369489 CD58 1086 3 2.7624 protein_coding
ENST00000565493 NORAD 5401 15 2.7773 lncRNA

It is known that NORAD, a highly conserved cytoplasmic lncRNA, can sequester PUMILIO proteins (PUM1 and PUM2) and modulate the expression of their target genes (Lee et al. 2016). This shows that a simple and fast search approach like this can be used to identify potential sponge transcripts for any given (RBP) motif of interest, which can then be further investigated. Additional options are available, such as defining a minimum spacer length between regex hits (--min-spacer-len), or a minimum regex hit count per transcript (--min-hit-count) for transcripts to be included in the output table. Moreover, if GTF annotated transcripts are used, specific subsets can be selected for search via --select-mode (all, only mRNAs, only 3'UTRs). Also note that regular expressions can describe complex binding patterns, including variable length spacers, such as the IGF2BP3 binding motif GGC.{15,25}CA.{7,20}CA.{15,25}GGC.{2,8}[CA]{4} described in Schneider et al. 2019.

Comparing motif hits between transcript isoforms

To check whether there are differences in motif hits between transcript isoforms, we can use rbpbench isocomp. This will calculate differences in motif hit counts (both absolute and and normalized by transcript lengths) between all annotated transcript isoforms of a gene, and report them in a table. Again it is possible to provide transcript sequences via a FASTA file (--fasta), or to use annotated transcripts from a GTF file (given GTF and genome FASTA file). Moreover, the type of extracted sequences from the GTF file can be further defined: We can use either 3'UTR, 5'UTR, full mRNA, full non-coding, or all full transcripts from the GTF file (choose via --select-mode). For example, choosing all annotated 3'UTR sequences and again using the PUM2 consensus motif TGTA[ACGT]ATA):

rbpbench isocomp --regex "TGTA[ACGT]ATA" --out test_isocomp_gtf_out --genome hg38.fa --gtf Homo_sapiens.GRCh38.112.gtf.gz --select-mode 1

This allows us to quickly check for genes where there are certain isoforms with changes in motif hit occurrences (sorted by magnitude of change, i.e., difference between hits per kb), which can help in the further functional characterization of the isoform and its underlying gene. Note that in case of user-supplied transcript sequences (via --fasta), the FASTA header needs to have the format >transcript_id,gene_id for successful isoform assignment.

Finding similar sequences based on motif hit and k-mer profiles

RBPBench's mode rbpbench searchseq offers the creation of motif hit and k-mer profiles for each input sequence, which in turn allows us to show similarities of input sequences based on these profiles (enable with --profiles). More specifically, PCA plots (one based on motif hits, one based on k-mer occurrences) are generated in the HTML report. Furthermore, a specific input sequence can be selected (--profiles-seq-id), which is then used as a reference, i.e., by highlighting it in the plots, as well as reporting the top n (set via --profiles-top-n) most similar other sequences based on cosine similarity and euclidean distance (using the motif hit and k-mer occurrence vectors for calculation). For example, to generate motif hit + k-mer profiles for a set of input sequences in FASTA format (input_seqs.fa) and use the sequence ENST00000270722 as a reference, we can run:

rbpbench searchseq --in input_seqs.fa --out searchseq_profiles_test_out --rbps ALL --profiles --profiles-level 1 --profiles-k 5 --profiles-top-n 20 --min-seq-len 100 --max-seq-len 5000 --profiles-seq-id 'ENST00000270722'

Here we select all motifs in the database (--rbps ALL) and use some additional options, such as --profiles-level to define on which hit level to generate the motif hit profile (RBP or individual motif level), --profiles-k to set the k-mer size for the k-mer profiles, as well as controlling the minimum and maximum input sequence lengths to consider (--min-seq-len, --max-seq-len).

Documentation

This documentation provides further details on RBPBench (version 1.1.x).

Program modes

RBPBench is a multi-function tool featuring several modes of operation. To get an overview of the currently available modes:

$ rbpbench -h
usage: rbpbench [-h] [-v]
                {search,batch,compare,searchseq,searchregex,searchlong,searchrna,searchlongrna,enmo,nemo,con,sponge,isocomp,streme,tomtom,goa,optex,dist,info}
                ...

Evaluate CLIP-seq and other genomic region data using a comprehensive collection of known RBP binding motifs (RNA sequence +
structure). RBPBench can be used for a variety of purposes, from RBP motif search in genomic regions, over motif enrichment and co-
occurrence analysis, in-depth comparisons over multiple datasets via sequence and genomic annotation statistics, to benchmarking
CLIP-seq peak callers, as well as comparisons across cell types and CLIP-seq protocols.

positional arguments:
  {search,batch,compare,searchseq,searchregex,searchlong,searchrna,searchlongrna,enmo,nemo,con,sponge,isocomp,streme,tomtom,goa,optex,dist,info}
                        Program modes
    search              Search motifs in genomic sites
    batch               Search motifs on > 1 dataset
    compare             Compare different search results
    searchseq           Search motifs in sequences
    searchregex         Search regex in genomic sites or sequences
    searchlong          Search motifs in long genomic regions
    searchrna           Search motifs in spliced transcript sites
    searchlongrna       Search motifs in spliced full transcripts
    enmo                Check for enriched motifs in input sites
    nemo                Check for neighboring motifs in input sites
    con                 Compare conservation in genomic sites
    sponge              Check for sponge transcripts
    isocomp             Compare motif hits on transcript isoforms
    streme              Discover motifs in input sites using STREME
    tomtom              Compare motif(s) with database using TOMOTM
    goa                 Run GO enrichment analysis on gene list
    optex               Investigate optimal extension
    dist                Plot nt distribution at genomic positions
    info                Inform about motif database content

options:
  -h, --help            show this help message and exit
  -v, --version         show program's version number and exit

Single and batch motif search

rbpbench search can be used to search for RBP binding motifs in genomic regions, selecting any number of RBPs (database or user-defined (sequence or structure), or regular expressions) and to look at motif enrichment and co-occurrences. rbpbench batch is rbpbench search extended to multiple input files (BED files containing genomic regions), with one RBP to search for each input file. Informative HTML reports are produced in most of the modes, e.g. for rbpbench batch the input datasets are compared via various interactive plots and statistics to learn more about common as well as dataset-specific features.

Compare search results

rbpbench search and rbpbench batch produce the RBP and motif hit statistics files which can then be input into rbpbench compare in order to benchmark peak callers, or generate comparisons between different cell types and CLIP-seq protocols. Again an HTML report is produced with plots for each possible comparison found in the input files.

Additional motif search modes

Additional more specialised search modes are available as well: rbpbench searchseq allows inputting FASTA sequences to search for motifs, as well as showing motif hit and k-mer similarities in an HTML report over all input sequences. rbpbench searchregex allows regular expression (regex) search in genomic regions or sequences, and supports single as well as multiple regexes (more details on regexes here). Note that regex search is also possible in the other search modes, e.g. combined with the database sequence and structure motifs. rbpbench searchlong allows searching long genomic regions for RBP motifs and visualize motif prevalences in genomic regions (e.g. whether a motif tends to occur more often in introns, 5'UTRs, 3'UTRs, etc.). rbpbench searchrna enables RBP motif search in transcript regions, i.e., in spliced transcripts, which can be useful for studying e.g. RBP binding on mRNAs. rbpbench searchlongrna similarly allows for RBP motif search in spliced transcripts, but broadens the search to all transcripts (or a user-defined set) of the GTF provided transriptome (optionally only on mRNAs), and e.g. outputs mRNA region preferences of the specified motifs (5'UTR, CDS, 3'UTR coverages).

Single motif enrichment search modes

The next two modes, rbpbench enmo and rbpbench nemo, further zoom in on single motif level analysis of motif enrichment and co-occurrence. While rbpbench enmo looks at enriched motifs over background sequences in general, rbpbench nemo goes one step further and searches for significantly enriched neighboring motifs. I.e., given some genomic or transcript positions, rbpbench nemo checks whether there are enriched motifs up- or downstream of the input positions, and also tests for statistical preference regarding up- or downstream occurrence.

Checking for similar database motifs

If you have one or more sequence motifs of interest (MEME format or regex) and want to know if there are similar motifs in the database, you can run rbpbench tomtom (incorporating MEME's TOMTOM). This mode also informs us whether the reported motif hits are enriched in certain RBP functions (e.g., for regex input --in TTTTTT, the top 3 enriched functions are: splicing regulation, translation regulation, and RNA stability & decay).

More modes

rbpbench con be used to compare conservation scores in two sets of genomic sites. For example, we can compare different sets of motif hit regions, or sets from different peak callers. rbpbench sponge allows to search for sponge transcripts, i.e., transcripts that serve as sponges for RBPs or miRNAs. rbpbench isocomp can be used to check whether there are differences in motif hits between transcript isoforms of a gene. rbpbench optex allows us to investigate the optimal extension of input regions for motif search, i.e., to find the best up- and downstream extension for motif search in genomic regions. rbpbench streme allows to discover new motifs using STREME from MEME suite. rbpbench goa enables us to run GO term enrichment analysis (GOA) on a set of genes, e.g. obtained from the output tables of other modes (like genes covered by input regions from rbpbench search or rbpbench batch). Note that some of the modes also can run GOA, optimized for the specific mode (see GO term analysis). Furthermore, we can use rbpbench dist to plot the nucleotide distribution at genomic positions, and rbpbench info informs about database RBP motifs and annotated RBP functions.

Connecting modes

All motif search modes additionally output motif hit regions in BED format. These files for each hit also contain the matched sequence, plus the genomic region annotation (if a GTF file was supplied). This way, the motif hit regions themselves (or a subset of interest after filtering, e.g. certain matched sequences or only motifs residing in 3'UTRs) can again be used as input regions for the other modes, allowing for easily refined hypothesis checking.

Inputs

Motifs

Motifs for search can be sequence motifs (MEME motif format), structure motifs (covariance models obtained through Infernal), or regular expressions (given via --regex argument). Search motifs are selected via --rbps, e.g., to select PUM1 (sequence) and SLBP (structure) motifs --rbps PUM1 SLBP. To select all database motifs, set --rbps ALL (internal motif database can be changed via --motif-db, a custom motif database can be supplied too). Additionally, a regular expression (regex) (e.g. AATAAA) can be added to search via --regex AA[AG]AA. To search only for a regular expression, set --rbps REGEX --regex AA[AG]AA. Co-occurrence and enrichment statistics are calculated on the RBP level in all modes, except rbpbench enmo and rbpbench nemo, which enable enrichment and co-occurrence statistics on single motif level (details). Alternatively, single motifs for search can be selected via --motifs, e.g. --motifs CSTF1_1 DDX3X_1, and sequence motifs (MEME format) can be filtered by their length using --motif-min-len and --motif-max-len. To list and visualize all selected motifs, use --plot-motifs (in some modes automatically output). RBP motifs can also be filtered by their annotated RBP functions. To only inlcude RBPs with e.g. annotated 3' end processing function, set --functions TEP (for available functions and annotations, run rbpbench info). Moreover, if you have a motif of interest (or a regex), and want to know if there are similar motifs in the database, you can use rbpbench tomtom (using MEME SUITE TOMTOM). This mode also informs us whether the reported motif hits are enriched in certain RBP functions (e.g. for regex input --in TTTTTT, the top 3 enriched functions are: splicing regulation, translation regulation, and RNA stability & decay).

Regular expressions

The --regex argument conveniently allows to search for user-defined sequence and structure patterns in genomic regions, transcript regions, or sequences. A regular expression string can be either a simple sequence motif string like AATAAA, but also more complex ones like [CA]CA[CT].{10,25}CGGAC. Note that the full IUPAC nucleotide code is supported, i.e., one can write either AA[ACGT][AG]TT or AANRTT. Moreover, --regex accepts structure patterns, such as AA((((ARNA))))CCC (where ( and ) denote base pairs), AA(((A[A)))CC(((C]C)))CC (where [ and ] denote pseudoknot base pairs), and even structure patterns with variable length and nucleotide content spacers in it, such as AA(((AA(((.)))CC)))CC (where . denotes the variable spacer part). Furthermore, if a structure pattern is given, a minimum GC base pair fraction (--regex-min-gc), a maximum GU base pair fraction (--regex-max-gu), and a minimum and maximum spacer length (--regex-spacer-min, --regex-spacer-max) can be defined to fine-tune the structure pattern search.

Genomic regions

Genomic input regions have to be provided in BED format. The first 6 columns are mandatory and are expected to mean the following: chromosome ID, genomic region start position (0-based index), genomic region end position (1-based index), region ID, region score, region strand (plus(+) or minus(-) strand). Here's a valid input example from the test folder:

$ head -5 eclip_clipper_idr/PUM1_K562_IDR_peaks.bed 
chr20	62139082	62139128	PUM1_K562_IDR	3.43241573178832	-
chr20	62139146	62139197	PUM1_K562_IDR	3.35874858317445	-
chr7	6156853	6157005	PUM1_K562_IDR	4.85347590189745	+
chr15	82404676	82404753	PUM1_K562_IDR	4.1721908622051	+
chr1	19094999	19095025	PUM1_K562_IDR	5.11052671530143	-

Additional columns (> 6) will be ignored, although the region score column can be different from column 5 (default). You can specify which column should be used as region score via --bed-score-col (used for Wilcoxon rank-sum test and optionally for filtering out regions by --bed-sc-thr). Before motif search, the input regions are filtered and optionally extended via --ext (e.g. --ext 20 for up- and downstream extension of 20 nt or --ext 30,10 for different up- and downstream extension). Regions with invalid chromosome IDs are removed. Furthermore, duplicated regions are merged into one region (same chromosome ID + start + end + strand).

Transcript regions

Transcript region input is also supported (rbpbench searchrna, rbpbench enmo, rbpbench nemo). Here the input regions look the following:

ENST00000645463	100	150	s1	0	+
ENST00000645463	300	350	s2	0	+
ENST00000561978	450	500	s3	0	+

The defined transcript IDs need to be present in the supplied GTF file (--gtf). This allows motif search and subsequent statistics (enrichment, co-occurrence, GO term analysis etc.) directly on the transcript regions, which enables us to examine motif binding directly on the spliced transcripts.

Genomic sequences

Genomic sequences have to be provided in FASTA format. For example, the human genome sequence (hg38.fa) can be obtained from here. To download in the command line and unpack:

wget https://hgdownload.cse.ucsc.edu/goldenpath/hg38/bigZips/hg38.fa.gz
gunzip hg38.fa.gz

Genomic annotations

Many of the statistics require a genomic annotations file in GTF format in order to be generated. RBPBench was tested mainly using GTF files downloaded from Ensembl, but you can also provide a GTF file from e.g. GENCODE. For the examples we downloaded the following GTF file:

wget https://ftp.ensembl.org/pub/release-112/gtf/homo_sapiens/Homo_sapiens.GRCh38.112.gtf.gz

User-provided motif search

Both sequence (MEME XML format) and structure (covariance model .cm) motifs can be supplied by the user on top of the database RBPs (using -rbps USER option together with -user-meme-xml or --user-cm). This way, the motifs can be used together with the database motifs for search and comparative statistics. For example, to supply a structure motif (SLBP) via --user-cm (the example motif files can be found in the RBPBench repository subfolder RBPBench/test):

rbpbench search --in SLBP_K562_IDR_peaks.bed --rbps USER --out SLBP_user_search_out --genome hg38.fa --user-cm SLBP_USER.cm  --user-rbp-id SLBP_USER

In the same way, we can supply sequence motif(s) (PUM1) via --user-meme-xml, and e.g. also combine it with any of the database RBPs (here PUM2 and RBFOX2):

rbpbench search --in eclip_clipper_idr/PUM1_K562_IDR_peaks.bed --rbps USER PUM2 RBFOX2 --out PUM1_user_search_out --genome hg38.fa --user-meme-xml PUM1_USER.xml --user-rbp-id PUM1_USER

Custom motif database

Apart from the built-in motif database and the option of user-supplied RBP motifs, it is also possible to define a custom motif database, which can then be applied just like the built-in motif database. The following command line parameters deal with defining a custom motif database:

  --custom-db str       Provide custom motif database folder. Alternatively, provide single files via --custom-db-meme-xml, --custom-
                        db-cm, --custom-db-info
  --custom-db-id str    Set ID/name for provided custom motif database via --custom-db (default: "custom")
  --custom-db-meme-xml str
                        Provide custom motif database MEME/DREME XML file containing sequence motifs
  --custom-db-cm str    Provide custom motif database covariance model (.cm) file containing covariance model(s)
  --custom-db-info str  Provide custom motif database info table file containing RBP ID -> motif ID -> motif type assignments

The database can be input as a folder (--custom-db db_folder_path), which needs to contain an info.txt file, as well as a sequence motifs file (seq_motifs.meme, MEME motif format), and/or a structure motifs file (str_motifs.cm, covariance model format). The info.txt is a table file containing the RBP ID to motif ID mappings. Here is an example of a valid info.txt (minimum 3 columns required: RBP_motif_ID, RBP_name, Motif_type) file content:

$ cat db_folder_path/info.txt
RBP_motif_ID	RBP_name	Motif_type
A1CF_1	A1CF	meme_xml
A1CF_2	A1CF	meme_xml
ACIN1_1	ACIN1	meme_xml
ACIN1_2	ACIN1	meme_xml
ACO1_1	ACO1	meme_xml
RF00032	SLBP	cm

You can also specify additional columns, which are used for the internal default databases, and are taken into account, e.g., when filtering by RBP functions, or when providing literature references in the HTML report files:

RBP_motif_ID	RBP_name	Motif_type	Organism	Gene_ID	Function_IDs	Reference	Experiment	Comments
A1CF_1	A1CF	meme_xml	human	ENSG00000148584	RM;RSD;RE	31724725;10669759	RBNS_ENCODE;RBPDB	-
A1CF_2	A1CF	meme_xml	human	ENSG00000148584	RM;RSD;RE	31724725	RBNS_ENCODE	-
ACIN1_1	ACIN1	meme_xml	human	ENSG00000100813	-	27365209	iCLIP	-
ACIN1_2	ACIN1	meme_xml	human	ENSG00000100813	-	27365209	iCLIP	-
ACO1_1	ACO1	meme_xml	human	ENSG00000122729	RSD;TR	8021254	cisBP-RNA	-
RF00032	SLBP	cm	human	ENSG00000163950	TEP;VRR	34086933	catRAPID_omics_v2.1	-

Motif_type defines whether a motif is a sequence motif (expected to be found in seq_motifs.meme), or a structure motif (expected to be found in str_motifs.cm). An ID / name for the custom database can be defined as well via --custom-db-id. Alternatively, the files can be input separately via --custom-db-info, --custom-db-meme-xml, and --custom-db-cm. For example, all human RBP motifs from CISBP-RNA can be found in the test/ folder, and we can simply use them as a custom database like this:

rbpbench search --in input_sites.bed --genome hg38.fa --out test_cisbp_rna_search_out --rbps ALL --ext 10 --custom-db-meme-xml cisbp_rna_human_rbp_motifs.meme --custom-db-info cisbp_rna_human_rbp_motifs.info.txt --motif-min-len 6 --motif-max-len 7

Here we also filter the input sequence motifs by length to allow only motifs of length 6 or 7 nt to be included in the search.

If you have some short sequences or regular expressions (regexes), you can also quickly create a MEME format motif database out of these (alternatively use rbpbench searchregex which supports multiple regexes as input). Only thing needed is a table file containing regexes and associated RBP/motif IDs:

$ cat custom_motifs.tsv
rbp_id	motif_id	regex
RBPX	RBPX_1	AAA[CG]A
RBPX	RBPX_2	AAA[CGT]C
RBPY	RBPY_1	ACGT[ACGT]

To create the database folder, we use one of the helper scripts:

create_custom_meme_motif_db.py --in custom_motifs.tsv --out custom_db_out

Afterwards the folder can be used as described above via --custom-db option, and you can specify any RBPs or motifs you want to search for from the database:

rbpbench search --in genomic_regions.bed --genome hg38.fa --gtf Homo_sapiens.GRCh38.112.gtf.gz --rbps ALL --custom-db custom_db_out --out custom_db_search_out --ext 10 --plot-motifs --fimo-pval 0.005

Here we used --plot-motifs to visualize the motifs and --rbps ALL, meaning all motifs from the custom database are included in search. To search only for specific motifs from the specified database, simply add --motifs RBPX_1 RBPX_2. Note that RBPBench's default FIMO threshold to filter out motif hits is optimized for 6-nt or longer sequence motifs. If your motifs are shorter, consider setting a more relaxed threshold (e.g., like in the example --fimo-pval 0.005). However, keep in mind that a relaxed setting also results in predicting more non-perfect hits, especially for longer motifs. You can easily check by inputting sequences with motifs you expect to be predicted into rbpbench searchseq with --custom-db option.

FIMO nucleotide frequencies

What p-value a motif hit reported by FIMO gets depends on the set background nucleotide frequencies (FIMO option: --bfile). By default, frequencies obtained from human ENSEMBL transcripts (excluding introns, A most prominent nucleotide) are used. You can change the preset via --fimo-ntf-mode (other options: transcripts including introns, or uniform distribution). Alternatively, you can provide your own background frequencies file via --fimo-ntf-file.

Filtering by expressed genes

In rbpbench search, the specified RBP motifs can be further filtered by a list of expressed genes, provided via --exp-gene-list. This way, only RBPs are selected for search which are also expressed in a specific condition. The format of the given --exp-gene-list file should be one gene ID per line, and the gene ID needs to be from ENSEMBL with no version numbers. For example:

$ cat expressed_genes.txt
ENSG00000099783
ENSG00000100320
...

Furthermore, the genomic input regions BED file (--in) can be filtered by those regions (--exp-gene-filter), to only keep --in regions overlapping with the expressed genes. In some cases it might also be interesting to only keep --in regions that do NOT overlap with the expressed genes (--exp-gene-mode 2).

Outputs

RBPBench depending on the mode outputs various useful output files (e.g. motif hits BED, region annotations, hit statistics files, HTML reports with statistics and visualizations).

Hit statistics table files

RBPBench's motif search modes (e.g. rbpbench search, rbpbench batch) output hit statistics table files which can later be used to do comparisons between peak callers or other specified conditions (e.g., cell types, CLIP-seq protocols). RBPBench outputs the RBP binding motif hit statistics into two table files stored in the results output folder (--out):

RBP hit stats .tsv:
results_output_folder/rbp_hit_stats.tsv
Motif hit stats .tsv:
results_output_folder/motif_hit_stats.tsv

The first file (RBP hit stats) contains comprehensive hit statistics for each RBP (one row per RBP), while the motif hits stats file contains hit statistics for each single motif hit (one row per motif hit).

The RBP hit statistics file rbp_hit_stats.tsv contains the following columns:

Column name Description
data_id Set Data ID (--data-id). More here
method_id Set method ID (--method-id). More here
run_id Set run ID (--run-id)
motif_db Selected motif database for search (--motif-db)
rbp_id RBP ID (i.e., RBP name), e.g. PUM1
c_regions Number of input genomic regions used for search (after filtering and extension operations)
mean_reg_len Mean length of genomic regions
median_reg_len Median length of genomic regions
median_reg_len Median length of genomic regions
min_reg_len Minimum length of genomic regions
called_reg_size Called size of genomic regions (including overlaps)
effective_reg_size Effective size of genomic regions (removed overlaps)
c_reg_with_hits Number of regions with motif hits from rbp_id
perc_reg_with_hits Percentage of regions with motif hits from rbp_id
c_motif_hits Total number of motif hits from rbp_id
c_uniq_motif_hits Number of unique motif hits from rbp_id (removed double counts)
c_uniq_motif_nts Number of unique motif nucleotides from rbp_id (removed overlaps)
perc_uniq_motif_nts_cal_reg Percentage of unique motif nucleotides over called region size
perc_uniq_motif_nts_eff_reg Percentage of unique motif nucleotides over effective region size
uniq_motif_hits_cal_1000nt Number of motif hits over 1000 nt of called region size
uniq_motif_hits_eff_1000nt Number of motif hits over 1000 nt of effective region size
wc_pval Wilcoxon rank-sum test p-value to test whether motif hit regions tend to feature higher scores
wc_rbc_eff_size Rank-biserial correlation effect size for Wilcoxon rank-sum test
wc_cl_eff_size Rank-biserial correlation effect size for Wilcoxon rank-sum test
seq_motif_ids Sequence motif IDs. Empty (-) if rbp_id has not sequence motifs
seq_motif_hits Sequence motif hit counts (count for each motif ID)
str_motif_ids Structure motif IDs. Empty (-) if rbp_id has not structure motifs
str_motif_hits Structure motif hit counts (count for each motif ID)
internal_id Internal ID (unique for each rbp_id run), used for connecting table results

The motif hit statistics file motif_hit_stats.tsv contains the following columns:

Column name Description
data_id Set Data ID (--data-id). More here
method_id Set method ID (--method-id). More here
run_id Set run ID (--run-id)
motif_db Selected motif database for search (--motif-db)
region_id Genomic region ID containing the hit, e.g. chr1:228458485-228458560(+)
rbp_id RBP ID (i.e., RBP name), e.g. SLBP
motif_id Motif ID
chr_id chromosome ID
gen_s genomic motif hit start (1-based)
gen_e genomic motif hit end (1-based)
strand Chromosome strand (+ or - strand)
region_s region motif hit start (1-based)
region_e region motif hit end (1-based)
region_len Region length
uniq_count Unique motif hit count
fimo_score FIMO score (for sequence motif hits)
fimo_pval FIMO p-value (for sequence motif hits)
cms_score cmsearch score (for structure motif hits)
cms_eval cmsearch e-value (for structure motif hits)
matched_seq matched motif hit sequence
internal_id Internal ID (unique for each rbp_id run), used for connecting table results

Motif hits BED files

Motif hits are also output in BED format in each search mode. This together with the input regions BED allows for quick studying of motif occurrences in a genome viewer (e.g., IGV). Additionally, the motif hits BED file can be further filtered by matched sequence (BED column 12) or genomic region annotation (BED column 11 if --gtf was provided, e.g. filter by 3'UTR to keep only 3'UTR hits). These motif hit regions can then again be used as search mode input regions, allowing for refined hypothesis testing.

There are various settings for controlling motif hit search. E.g., a FIMO p-value threshold for reporting sequence motif hits (--fimo-pval), or a CMSEARCH bit score threshold for reporting structure motif hits (--cmsearch-bs, also see --cmsearch-mode). Furthermore, --greatest-hits enabled results in only the best hit (lowest p-value, highest bit score) being reported for each input region.

Statistics

Input region score motif enrichment statistics

Given a set of input regions with associated scores (by default BED column 5 is used, change via --bed-score-col), RBPBench for each selected RBP checks whether motif-containing input regions have significantly higher scores (using Wilcoxon rank-sum test). This means that a low test p-value for a given RBP indicates that higher-scoring regions are more likely to contain motif hits of the respective RBP. If we assume that the score (e.g. log2 fold change) is somehow correlated with RBP binding affinity, the test thus can give clues on which RBPs preferentially bind to the provided regions. Note that the test is only informative if the scores are themselves informative w.r.t. RBP binding (e.g., not all the same), or e.g. if not all or too many of the input regions contain motif hits. For p-value interpretation, two test effect sizes are provided as well (RBC ES: rank-biserial correlation effect size, CL ES: common language effect size). RBC ES has a range from -1 to +1, where 0 means no effect, +1 means all hit region scores are greater than non-hit region scores, and -1 means all hit region scores are smaller that non-hit region scores. CL ES has a range 0 to +1. It denotes the probability that a random score from the hit region group exceeds one from the non-hit region group. A CL ES of 0.5 means there is no effect, > 0.5 means hit region scores tend to be higher, and < 0.5 means hit region scores tend to be lower.

The test results are output in the output tables, as well as in the HTML reports. The test can also check for significantly lower scores (change via --wrs-mode). Moreover, in rbpbench batch, the test can be applied for a user-defined regex (via --regex). This conveniently allows us to check, e.g. over all eCLIP datasets, whether a given regex is significantly enriched in any dataset (results again output to HTML report).

Co-occurrence statistics

To test for the co-occurrence of RBP motifs in a set of input regions, Fisher's exact test is applied, effectively counting the number of occupied and non-occupied regions for each pair of RBPs. This is done on the RBP level (rbpbench search, rbpbench searchrna), meaning that all motifs assigned to the same RBP are aggregated in the statistic, but can also be done on the single motif level for a more fine-grained analysis (rbpbench enmo, rbpbench nemo). Co-occurrence results are output as interactive heat maps into the respective HTML reports. Co-occurrence p-value threshold and multiple testing correction method can be adapted via --cooc-pval-thr and --cooc-pval-mode (default: Benjamini Hochberg). Moreover, Fisher exact test alternative hypothesis can be changed (--fisher-mode) from greater to two-sided or less, so instead of reporting significantly higher co-occurrences, we can e.g. instead report significantly lower co-occurrences (i.e., the RBPs or motifs that tend NOT to co-occur together). Since reported motifs between RBPs can often be very similar, we can further set a minimum mean distance between the motif hits required for them to be reported as significant (--min-motif-dist). This allows us to ignore significant co-occurrences between very similar motifs (i.e., they are reported as not significant in the heat map). Another way of influencing co-occurrences is to filter the input regions by their region score (if the score is meaningful), which can be done via --bed-sc-thr.

Considering the single motif level co-occurrences (rbpbench enmo, rbpbench nemo), co-occurrence attribution is more strict, since it is only checked for motifs which are significantly enriched in the input regions (details here). Furthermore, in addition to --min-motif-dist, co-occurrences can be filtered by motif pair similarity (only for MEME motif formatted sequence motifs, --motif-sim-thr), which allows us to focus on co-occurrences with more dissimilar motifs.

Input region motif enrichment statistics

Not to be confused with the region score statistic above, input region motif enrichment statistics are calculated in the modes rbpbench enmo, rbpbench nemo, using the motif occurrences in the input region dataset and comparing it to their occurrences in a background dataset (again via Fisher's exact test). The background set can be generated in various ways, either by shuffling the input sequences (--bg-mode 2), or by sampling background regions from the genome or transcriptome (default: --bg-mode 1if input regions are on transcripts). Various options to modify background set generation exist: --bg-min-size to sample additional background regions; --bg-mask-bed to define regions from which no background regions should be sampled; --bg-incl-bed to define regions from which to sample background regions; --bg-shuff-factor, --bg-shuff-k to customize shuffling; and some more options (--bg-mask-blacklist, --bg-ada-sampling, --random-seed, check rbpbench enmo -h for details). Similar to the co-occurrence statistics, p-value threshold and multiple testing correction method can be adapted (--enmo-pval-thr, --enmo-pval-mode, --nemo-pval-thr, --nemo-pval-mode), and --fisher-mode is also available.

Additional region annotations

Additional region annotation statistics are output in the HTML reports of rbpbench search and rbpbench batch. These include the percentages of input regions that overlap with: regions outside of gene regions, putative promoter regions, as well as user-defined regions (via --add-annot-bed, with the option --add-annot-comp to use the complement of the provided regions). Promoter regions can be further defined via --prom-ext (by default using regions 1000 nt upstream to 100 nt downstream of the transcript start positions), --prom-min-tr-len (minimum transcript length for promoter region extraction, by default all lengths), and --prom-mrna-only (using only mRNA transcript regions, by default all selected transcripts, details here). These statistics are useful to check whether the input regions are located in the expected genomic regions. For example, high percentages of input regions located outside gene regions or inside promoter regions can point at dataset issues (assuming RBPs bind primarily to gene/transcript regions) or distinct protein functions (e.g., binding to nascent transcripts).

Additional information

Handling genomic overlaps

As input genomic regions can overlap (also due to extending them via --ext), RBPBench considers both the called genomic region size as well as the effective genomic region size. While the called size ignores overlaps, the effective size only counts unique input regions (i.e., merging overlapping parts, only counting them once). This also holds for the reported motif hits, where unique counts refer to effective counts. For example, a motif at a specific genomic location can appear several times in the input data. The unique count takes care of this and counts it only once. This is important e.g. when comparing the results of different peak callers. Another interesting statistic (full list of statistics and descriptions here) is e.g. the number of unique motif hits over 1000 nt of called and effective region size. This gives us an idea of how many motifs are included in the regions, normalized over the total size of the regions (called or effective size).

Comparing different search results

Both single (rbpbench search) and batch search (rbpbench batch) results (i.e., their output folders) can be combined and compared using rbpbench compare. What search results are going to be compared is defined by the set method and data IDs (in rbpbench search via --method-id, --data-id, in rbpbench batch best to use input table via --bed, see example above). rbpbench compare essentially looks for compatible combinations in the provided search output folders. There are two types of possible comparisons:

  1. compare search results based on common data ID and RBP ID (thus comparing different methods IDs)
  2. compare search results based on common data ID and RBP ID (thus comparing different data IDs)

So whenever there are search results encountered in the provided search output folders, rbpbench compare looks for cases where the RBP ID is the same, plus either the method ID or the data ID being fixed. For example, suppose we have the following table that we can use as input to rbpbench batch:

cat batch_compare_test.batch_in.txt
PUM1	clipper_idr	k562_eclip	batch_compare_test/PUM1.k562_eclip.clipper_idr.bed
PUM1	dewseq_w100_s5	k562_eclip	batch_compare_test/PUM1.k562_eclip.dewseq_w100_s5.bed
RBFOX2	clipper_idr	hepg2_eclip	batch_compare_test/RBFOX2.hepg2_eclip.clipper_idr.bed
RBFOX2	clipper_idr	k562_eclip	batch_compare_test/RBFOX2.k562_eclip.clipper_idr.bed

Column 1 is the RBP ID, column 2 the method ID, column 3 the data ID, and column 4 the path to the BED file containing the genomic regions (typically binding regions of the RBP determined through CLIPseq). In this example, method ID describes the peak calling methods, while data ID describes the CLIP protocol+cell type combination. From these, we ge the following to comparisons:

  1. compare the peak calling methods (i.e., their results) clipper_idr, dewseq_w100_s5 with common data ID k562_eclip and RBP ID PUM1
  2. compare the conditions (i.e., eCLIP on two cell types HepG2 and K562) hepg2_eclip, k562_eclip with common method ID clipper_idr and RBP ID RBFOX2

For the first comparison, we thus have two datasets (i.e., two different peak calling methods) that get compared, and for the second two datasets (i.e., two different conditions). Each comparison is output by rbpbench compare in a separate section of the output HTML file.

RBP functions

The selected RBPs can be further filtered by their annotated RBP functions prior to search. Currently the following RBP functions appear in the database:

Function ID                     Function description
EJC                           Exon Junction Complex
MRP                             microRNA processing
MTR                    mitochondrial RNA regulation
PSG                        P-body / stress granules
RBT                    Ribosome & basic translation
RE                                       RNA export
RL                                 RNA localization
RM                                 RNA modification
RRP                                 rRNA processing
RSD                           RNA stability & decay
SNT                     snoRNA / snRNA / telomerase
SR                              Splicing regulation
SS                                      Spliceosome
TEP                               3' end processing
TR                           Translation regulation
TRR                                 tRNA regulation
VRR                            Viral RNA regulation

For example, to select all RBPs with annotated 3' end processing and RNA stability & decay functions, specify in any search mode --rbps ALL --functions TEP RSD. To see which RBPs are assigned to which functions, run rbpbench info.

GO term analysis

GO term (enrichment) analysis (GOA) can also be performed with RBPBench (see --goa and related options). What target genes are selected for enrichment in GO terms depends on the set mode and command line options. E.g. in rbpbench search, by default GOA is performed on all genes covered by input regions. This can be changed via --goa-cooc-mode, allowing to select only genes with motif hits from any specified RBPs or from all specified RBPs to test different hypotheses. This differs from rbpbench batch, where GOA is performed on genes with input regions from all input datasets (further restricted via --goa-only-cooc, focussing only on genes with motif containing input regions from all datasets). A third way to run GOA is offered by rbpbench searchlongrna, which searches transcripts (e.g. all or selected sets of mRNA transcripts) for motifs, and then uses the transcripts (i.e., its underlying genes) with motif hits (either by any RBP or all RBPs --goa-only-cooc) as target genes for GOA. We can thus test diverse hypotheses by taking advantage of RBPBench's various program modes. If you just want to perform GOA on some specific gene list, use rbpbench goa. Some useful options are: --goa-max-child (e.g. --goa-max-child 200 to filter out very broad GO terms), --goa-min-depth (e.g. --goa-min-depth 5 minimum depth for significant GO terms), or --goa-pval (set GOA p-value threshold).

Most prominent transcript selection

In order to obtain genomic region annotations on the transcript level, by default one representative transript (i.e., the most prominent transcript (MPT)) is chosen for each gene in the GTF file. MPT selection aims to select the transcript with the strongest experimental support. Currently, for a pair of transcripts from the same gene, to decide which one is "more prominent":

  1. Choose transcript with "MANE_select" (details) tag if present
  2. If "MANE_select" tag not present, choose transcript with "basic" tag
  3. If both have "basic" tag, chose transcript with higher transcript support level (TSL)
  4. If same TSL, chose transcript with "Ensembl_canonical" tag
  5. If both have "Ensembl_canonical" tag, chose the longer transcript as current MPT

This is done for all possible pairs, which leaves one transcript as MPT in the end for each gene. Regions that do not overlap with the selected transcript regions are assigned to "intergenic". Note that for lncRNA genes, MPT selection favors transcripts with the gencode_primary tag if present. Alternatively, a list of transcript IDs can be supplied --tr-list, bypassing the MPT selection. Which region annotations are to be considered can be further defined via --tr-types, and the minimum region annotation overlap can be set via --gtf-feat-min-overlap.

Comparing top scoring and bottom scoring sites via k-mer distribution

In rbpbench search, we can further analyze the input regions by looking at their k-mer contents (see Fig. 4a example). Specifically, we can split the input sequences based on their site scores, and see if the k-mer distributions differ between the top scoring and bottom scoring sites. This usually works well if the BED region scores (default in column 5, change via --bed-score-col) are somehow indicative of the binding site quality, and thus also the presence of known binding motifs. This presence (or change in k-mer contents) can then be observed in the k-mer distribution plot. Further options include setting k (--kmer-plot-k), or change the splitting behavior via setting --kmer-plot-top-n and/or --kmer-plot-bottom-n. This way, we can e.g. compare the top 1000 scoring (--kmer-plot-top-n 1000) with the bottom 1000 scoring (--kmer-plot-bottom-n 1000) sites.

Helper scripts

Various helper scripts are included as well on the command line:

batch_get_common_dataset_gene_ids.py
bed_extend_regions.py
bed_merge_ol_reg.py
bed_print_first_n_pos.py
bed_print_last_n_pos.py
bed_shift_regions.py
create_custom_meme_motif_db.py
get_genomic_conservation_scores.py
gtf_extract_exon_intron_border_bed.py
gtf_extract_exon_intron_region_bed.py
gtf_extract_gene_region_bed.py
gtf_extract_mpt_region_bed.py
gtf_extract_tr_feat_bed.py
gtf_extract_transcript_data.py
gtf_get_gene_region_nt_freqs.py
gtf_get_gene_transcripts.py
gtf_get_mpt_nt_freqs.py
gtf_get_mpt_with_introns_nt_freqs.py

You can call their help pages to get more infos on what they do and how to use them (e.g., bed_merge_ol_reg.py -h). To get a quick overview: batch_get_common_dataset_gene_ids.py extracts gene IDs which occur in all datasets, given the gene_region_occupancies.tsv file from RBPBench batch output folder. bed_extend_regions.py extends genomic regions in a BED file by a given number of nucleotides up- and downstream. bed_merge_ol_reg.py takes a BED file and merges bookend or overlapping regions outputs the merged regions to a new BED file. bed_print_first_n_pos.py prints the first n positions of each region from the provided BED file. bed_print_last_n_pos.py prints the last n positions of each region from the provided BED file. bed_shift_regions.py shifts all regions from the provided BED file a given number of nucleotides up- or downstream. get_genomic_conservation_scores.py extracts conservation scores for a set of genomic regions of interest (given via --in BED). create_custom_meme_motif_db.py as described above generates a custom sequence motifs database which can be used in all search modes via --custom-db. gtf_extract_exon_intron_border_bed.py extracts exon-intron border positions from a GTF file and stores them in a BED file, which can be used as input e.g. in rbpbench nemo. gtf_extract_exon_intron_region_bed.py extracts exon or intron regions from a GTF file and stores them in a BED file. gtf_extract_gene_region_bed.py extracts gene regions from a GTF file and stores them in a BED file. gtf_extract_mpt_region_bed.py extracts most prominent transcript regions from a GTF file and stores them in a BED file. Additionally, mRNA regions (5'UTR, CDS, 3'UTR) can be output to a separate BED file. gtf_extract_tr_feat_bed.py extracts transcript feature regions from a GTF file and stores them in a BED file (e.g. stop_codon). gtf_extract_transcript_data.py extracts transcript data from a GTF file, such exon + intron regions, mRNA regions and transcript sequences. gtf_get_gene_region_nt_freqs.py calculates nucleotide frequencies from all gene regions extracted from a GTF and the corresponding genome FASTA. gtf_get_gene_transcripts.py extracts for a given gene list all transcripts from a GTF file and stores the transcript information in a TSV file. FIMO can be given this information as a nucleotide frequencies file (see options --fimo-ntf-file, --fimo-ntf-mode). gtf_get_mpt_nt_freqs.py calculates nucleotide frequencies of from all most prominent transcript (MPT) sequences (introns excluded) extracted from a GTF and the corresponding genome FASTA. gtf_get_mpt_with_introns_nt_freqs.py calculates nucleotide frequencies of from all most prominent transcript (MPT) sequences (introns included) extracted from a GTF and the corresponding genome FASTA.

About

Evaluate CLIP-seq and other genomic region data using a comprehensive collection of known RBP binding motifs

Resources

License

Stars

Watchers

Forks

Packages

No packages published

Languages