ribbit

Important

Ribbit is under active development. Please refer to the Change log and make sure you have the latest version of the repository.

Ribbit identifies tandem repeat (TR) regions of variable motif sizes in DNA sequences. Tandem repeats are DNA segments consisting of two or more nearly identical copies of a motif occurring contiguously. Ribbit is designed to improve motif decomposition and resolve complex repeat structures, accurately detecting TRs with motif sizes up to 100 bp.

The algorithm converts DNA sequences to 2-bit format and uses basic bit operations to delineate potentially repetitive stretches of a given periodicity. The sequence of each candidate stretch is decomposed to identify a representative motif of the identified periodicity; one of two approaches is used to identify the representative motif, depending on the periodicity. The candidate sequence is then aligned with a perfect repeat of the representative motif to calculate the purity of the repeat, and is either trimmed or dropped based on the user-defined purity and minimum repeat length thresholds. Each input sequence is independently processed for potential repeats of all user-requested motif sizes. The results of these independent searches are merged and compared to resolve nested or overlapping interpretations of the same sequence as repeats of different periodicities. Overlapping tandem repeat interpretations with different motif sizes are retained or dropped with a preference for purer stretches.
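As a rough illustration of the first step, the sketch below packs a sequence into 2 bits per base and uses a shift and an XOR to flag positions where a base equals the base `period` positions downstream. It is a minimal Python sketch, not Ribbit's implementation: the base-to-bit mapping, the function names (`encode_2bit`, `period_matches`, `longest_run`) and the restriction to A/C/G/T are assumptions made for illustration.

```python
# Illustrative sketch only; the mapping and function names are assumptions,
# not taken from the Ribbit source code.
CODES = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}


def encode_2bit(seq):
    """Pack a DNA string into one integer, 2 bits per base (first base in the highest bits)."""
    value = 0
    for base in seq.upper():
        value = (value << 2) | CODES[base]  # raises KeyError for N or other symbols
    return value


def period_matches(seq, period):
    """Return a list of booleans: True at index i when seq[i] == seq[i + period]."""
    n = len(seq)
    bits = encode_2bit(seq)
    left = bits >> (2 * period)                     # bases 0 .. n-period-1
    right = bits & ((1 << (2 * (n - period))) - 1)  # bases period .. n-1
    diff = left ^ right                             # a 00 pair means the two bases are equal
    flags = []
    for i in range(n - period):
        shift = 2 * (n - period - 1 - i)
        flags.append(((diff >> shift) & 0b11) == 0)
    return flags


def longest_run(flags):
    """Length of the longest run of consecutive True values."""
    best = current = 0
    for flag in flags:
        current = current + 1 if flag else 0
        best = max(best, current)
    return best
```

For example, `longest_run(period_matches("TTGGAGGAGGATT", 3))` gives 6. A run of k consecutive matches at period p corresponds to a candidate stretch of k + p bases, here the 9-base `GGAGGAGGA`.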

Converting DNA to a 2-bit representation makes the identification of potentially repetitive stretches fast, which leaves time for careful motif decomposition of the repeat sequence. Ribbit outputs a comprehensive BED file and takes about 5-7 seconds per Mb of DNA sequence. The program can also be run in multi-threaded mode, making it well suited to processing large genomes.

Compiling

To compile Ribbit, please follow these instructions:

1. Installing dependencies

sudo apt-get install libboost-all-dev
sudo apt-get install zlib1g-dev

2. Instructions for compiling

git clone https://github.com/SowpatiLab/ribbit.git
cd ribbit
git checkout dev
make

Usage

Here are some basic usage examples:

$ ./ribbit [options] -i sequence.fasta 

# with output file provided
$ ./ribbit [options] -i sequence.fasta -o results.bed

# pipe input from standard input
$ cat sequence.fasta | ./ribbit [options] -i - 
$ echo "ATgcatgcGGAGGAGGAGGAGGAGGAcagtcgata" | ./ribbit -i -

To view detailed help information

./ribbit -h

Ribbit: identification of tandem repeats and annotation of complex TRs in genomes
Version: 1.0.2

Options for running the tool:
  -h [ --help ]                 Ribbit detects tandem repeat regions in DNA, accurately resolving 
                                complex repeat structures and motif sizes up to 100 bp.
  --version                     Prints out the version of ribbit.
  -i [ --input-file ] arg       Input sequence. Can be a fasta file (optionally gzipped) or '-' for
                                stdin.
  -o [ --output-file ] arg      File path for output file. Default: stdout
  -m [ --min-motif-length ] arg The minimum length of the motif of the TR loci. [int] Default: 2
  -M [ --max-motif-length ] arg The maximum length of the motif of the TR loci. [int] Default: 100
  -p [ --min-purity ] arg       The minimum allowed purity of repeat sequence. Purity is calculated
                                as the (matches/(matches+mismatches+indels)) in the alignment of 
                                region sequence to perfect repeat of consensus motif. [float] 
                                Default: 0.8
  -q [ --min-motif-purity ] arg Minimum purity of each motif with consensus motif. Calculated as 
                                the average of (matches/(matches+mismatches+indels)) for each motif
                                length in the alignment of region sequence to perfect repeat of 
                                consensus motif. [float] Default: 0.8
  -l [ --min-length ] arg       The minimum length of the repeat. Input can be an integer or a 
                                tab-separated file with two columns of motif length and the length 
                                cutoff. [int or file] Default: 12 for STRs (motif length <= 6), 
                                2*(motif length) for others.
  --min-units arg               The minimum number of units of the repeat. Input can be a integer 
                                or a tab-separated file with two columns, first is the motif size 
                                and second unit cutoff. [int or file] Default: 2 for all motif 
                                sizes.
  --perfect-units arg           The minimum number of complete units with 100% match with the 
                                consensus motif in the repeat. Input can be an integer or a 
                                tab-separated file with two columns of the motif length and the 
                                unit cutoff. [int or file] Default: 2
  --cigar                       Include cigar string of the alignment of the sequence with the 
                                perfect repeat of the consensus motif in the output. Default: 
                                false.
  -t [ --threads ] arg          Number of threads to be used for running. [int] Default: 1
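The default `--min-length` rule and the two-column cutoff file described in the help text can be sketched as follows. This is a small Python illustration under stated assumptions: the function names and the file name `min_length.tsv` are hypothetical, and it assumes the file has no header line, which the help text does not specify.

```python
def default_min_length(motif_length):
    """Default repeat-length cutoff from the help text:
    12 for STRs (motif length <= 6), 2 * motif length otherwise."""
    return 12 if motif_length <= 6 else 2 * motif_length


def write_cutoff_table(path, motif_lengths, cutoff=default_min_length):
    """Write a tab-separated file with two columns: motif length, length cutoff.
    Assumption: no header line (not stated in the help text)."""
    with open(path, "w") as handle:
        for motif_length in motif_lengths:
            handle.write(f"{motif_length}\t{cutoff(motif_length)}\n")
```

Calling `write_cutoff_table("min_length.tsv", range(2, 101))` would write one row per motif size from 2 to 100, and the resulting file could then be supplied through `-l`.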

Options description

1. Repeat purity `-p` or `--min-purity`

After identifying a potential repetitive region with a given periodicity, Ribbit determines the consensus motif of the tandem repeat. The sequence is then aligned to a perfect tandem repeat generated from this consensus motif. Repeat purity is calculated as the number of matching bases divided by the alignment length between the sequence and the perfect repeat. The --min-purity option allows users to set the minimum purity threshold for tandem repeat detection. It accepts a floating-point value between 0 and 1.

$purity = matches / (matches + mismatches + indels)$
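The formula can be checked with a few lines of Python. This is only a sketch of the arithmetic: the function names are hypothetical, and the counts are assumed to come from an alignment of the region against the perfect repeat of its consensus motif.

```python
def purity(matches, mismatches, indels):
    """Repeat purity: matching bases divided by the alignment length."""
    aligned = matches + mismatches + indels
    if aligned == 0:
        raise ValueError("alignment has no columns")
    return matches / aligned


def passes_min_purity(matches, mismatches, indels, min_purity=0.8):
    """True when the repeat meets the threshold (0.8 is the documented default)."""
    return purity(matches, mismatches, indels) >= min_purity
```

An alignment with 18 matches, 1 mismatch and 1 indel has purity 18 / 20 = 0.9, so it would pass the default threshold of 0.8.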

2. Motif purity `-q` or `--min-motif-purity`

Motif purity is the purity calculated for each motif unit in the alignment, averaged across all units. The `--min-motif-purity` option sets the minimum average motif purity for a TR. It is used to trim motif units of lower purity from the edges of the repeat, so that the reported stretch has an average motif purity greater than or equal to the user-defined threshold. It accepts a floating-point value between 0 and 1.

The motif purity at the i-th motif unit is calculated as

$motif\_purity_i = matches_i / (matches_i + mismatches_i + indels_i)$

The motif purity of the repeat is the average across all motif units

$motif\_purity = (motif\_purity_0 + motif\_purity_1 + ... + motif\_purity_n) / (n + 1)$
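A short Python sketch of the averaging step follows. The function name and the input layout, one `(matches, mismatches, indels)` tuple per motif unit, are assumptions for illustration; how Ribbit splits the alignment into units is not shown here.

```python
def motif_purity(per_unit_counts):
    """Average of per-unit purities.

    per_unit_counts: list of (matches, mismatches, indels) tuples,
    one tuple per motif unit in the alignment (hypothetical layout).
    """
    if not per_unit_counts:
        raise ValueError("at least one motif unit is required")
    unit_purities = [
        matches / (matches + mismatches + indels)
        for matches, mismatches, indels in per_unit_counts
    ]
    return sum(unit_purities) / len(unit_purities)
```

For three units with counts `(4, 0, 0)`, `(3, 1, 0)` and `(2, 1, 1)`, the per-unit purities are 1.0, 0.75 and 0.5, so the motif purity is 0.75. The overall repeat purity of the same alignment is 9 / 12 = 0.75 as well only because the units have equal alignment lengths; the two measures differ in general.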

Output

Output columns

| S.No | Column | Description |
|---|---|---|
| 1 | Chromosome | chromosome or sequence name as specified in the FASTA header |
| 2 | Repeat Start | 0-based start position of the tandem repeat in the chromosome |
| 3 | Repeat Stop | end position of the tandem repeat in the chromosome |
| 4 | Motif | sequence of the unit motif of the tandem repeat |
| 5 | Purity | purity of the repeat |
| 6 | Motif length | length of the unit motif in bases |
| 7 | Repeat Length | length of the repeat based on the coordinates (Repeat Stop minus Repeat Start) |
| 8 | Repeat Units | number of unit motifs in the TR |
| 9 | Info | additional information about the repeat region, including the CIGAR string of the alignment and information on nested TRs in the region |

Output file example

chromosome	start	end	motif	purity	motif_length	repeat_length	repeat_units	info
test	292	305	AC	0.93	2	13	6	I
test	481	496	AGC	0.82	3	15	5	I
test	827	843	GCCCAGGT	1.00	8	16	2	I
test	1017	1032	TGCGGAG	0.93	7	15	2	I
test	1508	1523	AGGC	0.81	4	15	3	I
test	1682	1833	TGGAGGGTGGGGCCAAATGGAAGTGGGCGGGGCTGTGG	0.90	38	151	3	I
test	1863	1890	AGGGC	0.83	5	27	5	M:1869-1890-6-0.81:AGGGGC
test	2182	2197	CCGGT	0.93	5	15	3	I
test	2277	2296	TGGCCTCC	1.00	8	19	2	I
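The output can be loaded with a few lines of Python. The sketch below is a minimal reader, assuming the nine tab-separated columns listed above and a header line starting with `chromosome` as in the example; the names `TandemRepeat` and `parse_ribbit_output` are hypothetical and not part of Ribbit.

```python
import csv
import io
from typing import Iterator, NamedTuple, TextIO


class TandemRepeat(NamedTuple):
    chromosome: str
    start: int
    end: int
    motif: str
    purity: float
    motif_length: int
    repeat_length: int
    repeat_units: int
    info: str


def parse_ribbit_output(handle: TextIO) -> Iterator[TandemRepeat]:
    """Yield one TandemRepeat per data row of a Ribbit output file."""
    for row in csv.reader(handle, delimiter="\t"):
        # Skip the header line and any row without the nine expected columns.
        if len(row) < 9 or row[0] == "chromosome":
            continue
        yield TandemRepeat(
            row[0], int(row[1]), int(row[2]), row[3], float(row[4]),
            int(row[5]), int(row[6]), int(row[7]), row[8],
        )


# Two rows copied from the example output above.
example = io.StringIO(
    "chromosome\tstart\tend\tmotif\tpurity\tmotif_length\trepeat_length\trepeat_units\tinfo\n"
    "test\t292\t305\tAC\t0.93\t2\t13\t6\tI\n"
    "test\t1863\t1890\tAGGGC\t0.83\t5\t27\t5\tM:1869-1890-6-0.81:AGGGGC\n"
)
repeats = list(parse_ribbit_output(example))
for tr in repeats:
    # In the example rows, the repeat length equals end minus start.
    assert tr.end - tr.start == tr.repeat_length
```

To read a real file, pass an open file handle instead, for example `parse_ribbit_output(open("results.bed"))`.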

INFO Field Description

The INFO field provides detailed information about tandem repeat (TR) structure, purity, and nested sub-repeats.
Each attribute in the field is separated by a colon (:).

Format:

<M_or_I> : <subrepeat_info> : <motifs>

If the `--cigar` option is enabled:

<M_or_I> : <CIGAR> : <subrepeat_info> : <motifs> : <subrepeat_CIGARs>

Attribute Description:

| Attribute | Description |
|---|---|
| Type | Indicates whether the repeat is isolated (`I`) or contains nested, more pure repeats (`M`). |
| CIGAR (optional) | If the `--cigar` option is enabled, this reports the CIGAR string of the alignment of the sequence with a perfect repeat of the consensus motif. |
| Subrepeat info | Lists subrepeats in the format `{start}-{stop}-{period}-{purity}`, separated by commas. Example: `1068-1115-6-1.00` means a subrepeat from position 1068 to 1115, with motif length 6 bp and 100% purity. |
| Motifs | Provides the motifs of the corresponding subrepeats, separated by commas. |
| Subrepeat CIGAR strings (optional) | If the `--cigar` option is used, lists the CIGAR strings of each subrepeat, separated by commas. |

Example:

M:1068-1115-6-1.00,1103-1126-11-1.00,1114-1126-6-1.00:AACCCT,accctaaccct

# with --cigar enabled
M:100M:1068-1115-6-1.00,1103-1126-11-1.00:AACCCT,accctaaccct:47M,23M
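The INFO field can be split along the colon-separated format described above. The following Python sketch is one possible parser, not part of Ribbit: the names `Subrepeat`, `parse_subrepeat` and `parse_info` are hypothetical, and the handling of a two-field value (type plus CIGAR for an isolated repeat) is an assumption, since that case is not documented.

```python
from typing import NamedTuple


class Subrepeat(NamedTuple):
    start: int
    stop: int
    period: int
    purity: float


def parse_subrepeat(text):
    """Parse one '{start}-{stop}-{period}-{purity}' entry."""
    start, stop, period, purity = text.split("-")
    return Subrepeat(int(start), int(stop), int(period), float(purity))


def parse_info(info):
    """Split an INFO value into its colon-separated attributes."""
    fields = info.split(":")
    result = {"type": fields[0], "cigar": None,
              "subrepeats": [], "motifs": [], "subrepeat_cigars": []}
    if len(fields) == 1:        # isolated repeat, e.g. "I"
        return result
    if len(fields) == 2:        # assumption: type plus CIGAR only
        result["cigar"] = fields[1]
        return result
    if len(fields) == 3:        # <M_or_I>:<subrepeat_info>:<motifs>
        subrepeats, motifs = fields[1], fields[2]
    elif len(fields) == 5:      # same, with CIGAR and subrepeat CIGARs
        result["cigar"] = fields[1]
        subrepeats, motifs = fields[2], fields[3]
        result["subrepeat_cigars"] = fields[4].split(",")
    else:
        raise ValueError(f"unexpected INFO layout: {info!r}")
    result["subrepeats"] = [parse_subrepeat(s) for s in subrepeats.split(",")]
    result["motifs"] = motifs.split(",")
    return result
```

For the row with INFO `M:1869-1890-6-0.81:AGGGGC` in the output example, `parse_info` returns type `M`, one subrepeat spanning 1869 to 1890 with period 6 and purity 0.81, and the motif `AGGGGC`. Subrepeats and motifs are kept as separate lists rather than paired, so a value with unequal counts still parses.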

Citation

If you find Ribbit useful, please cite our manuscript: Ribbit: Accurate identification and annotation of complex tandem repeat sequences in genomes

Change log

version 1.0.3 - 11-04-2025

Removed bug to exclude N at the end of the repeat.

version 1.0.2 - 28-10-2025

Optimized alignment for seeds larger than 10kb.

version 1.0.1 - 23-10-2025

Fixed concatenating outputs in multi-threaded mode.
Taking sequence input from standard input.
Default output directed to standard output.
Fixed seed sorting for seed with identical coordinates.

version 1.0.0 - 20-10-2025

Ribbit underwent major changes in this version. The algorithmic logic for identifying the potential tandem repeats (seed TRs) has changed.
The output format has also changed: TR regions are now reported with their nested repeats listed in the info column.

Authors

Anukrati Sharma
Akshay Kumar Avvaru

Contact

For queries or suggestions, please contact:
Akshay Kumar Avvaru - avvaruakshay@gmail.com
Divya Tej Sowpati - tej@ccmb.res.in
