aigerabae/repeats_benchmarking


This repository documents my benchmarking of existing software for identifying and masking repeats.

In the mid-1960s, scientists discovered that many genomes contain stretches of highly repetitive DNA sequences (see reassociation kinetics experiments and the C-value paradox). These sequences were later characterized and placed into the following categories:

  • Simple Repeats - Duplications of short motifs (typically 1-5 bp) such as A, CA, or CGG.
  • Tandem Repeats - Duplications of more complex 100-200 base sequences, typically found at the centromeres and telomeres of chromosomes.
  • Segmental Duplications - Large blocks of 10-300 kilobases that have been copied to another region of the genome.
  • Interspersed Repeats
    • Processed Pseudogenes, Retrotranscripts, SINEs - Non-functional copies of RNA genes that have been reintegrated into the genome with the assistance of a reverse transcriptase.
    • DNA Transposons - Elements that move as DNA, typically excised and reinserted elsewhere by a transposase ("cut and paste").
    • Retrovirus Retrotransposons (LTR retrotransposons) - Elements flanked by long terminal repeats that spread through an RNA intermediate copied back into DNA by a reverse transcriptase.
    • Non-Retrovirus Retrotransposons (LINEs) - Autonomous elements that lack terminal repeats, encode their own reverse transcriptase, and also spread through an RNA intermediate.

Currently, up to 50% of the human genome is considered repetitive, and as detection methods improve this figure is expected to increase.


Key steps for benchmarking:

  1. Define the task: virus detection, selection inference, taxonomic assignment, etc.
  2. Select test datasets: simulated (ground truth known), real-world (biological complexity).
  3. Choose toolset: relevant, baseline or newly developed tools.
  4. Establish evaluation metrics: accuracy, false discovery rate, runtime, parameter sensitivity.
  5. Visualize results: ROC curves, barplots, runtime vs. accuracy trade‑offs.
  6. Assess parameter optimization: default vs. fine‑tuned thresholds.
  7. Ensure reproducibility: provide code, containerized environments, public datasets.

Core Components of Benchmarking Studies

  1. Gold Standard Data
  2. Evaluation Metrics
     a) Accuracy (precision, recall, F1 score)
     b) Specificity, sensitivity
     c) Runtime (CPU time), memory (RAM)
     d) Pareto frontier for multi-objective performance ranking: the Pareto frontier is the set of tools that are not outperformed in all evaluation metrics by any other tool (see the sketch after this list).
  3. Parameter Optimization
     a) Tuning parameters is essential but computationally expensive.
     b) Many benchmarking studies still rely on default parameters, which can be suboptimal.
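
To make the Pareto-frontier idea concrete, here is a minimal Python sketch with made-up tool names and scores, keeping every tool that no other tool beats on both F1 and runtime:

```python
# Minimal sketch of a Pareto frontier over two objectives:
# maximize F1, minimize runtime. Tool names and scores are made up.

tools = {
    "tool_a": {"f1": 0.92, "runtime_h": 10.0},
    "tool_b": {"f1": 0.90, "runtime_h": 2.0},
    "tool_c": {"f1": 0.85, "runtime_h": 1.5},
    "tool_d": {"f1": 0.80, "runtime_h": 8.0},  # dominated by tool_b
}

def dominates(a, b):
    """True if a is at least as good as b on both axes and strictly better on one."""
    return (a["f1"] >= b["f1"] and a["runtime_h"] <= b["runtime_h"]
            and (a["f1"] > b["f1"] or a["runtime_h"] < b["runtime_h"]))

frontier = sorted(
    name for name, m in tools.items()
    if not any(dominates(other, m)
               for other_name, other in tools.items() if other_name != name)
)
print(frontier)  # ['tool_a', 'tool_b', 'tool_c']
```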

Important:

  1. Include a comprehensive tool list and log excluded tools.
  2. Document data preparation thoroughly.
  3. Choose proper evaluation metrics and share them as reusable scripts.
  4. Perform parameter optimization to fairly assess tool performance.
  5. Summarize tool features and provide commands and installation instructions.
  6. Unify output formats for easier comparison.
  7. Share all benchmarking data and scripts via user-friendly interfaces.

Confusion matrix:

|                  | Predicted: Positive | Predicted: Negative |
|------------------|---------------------|---------------------|
| Actual: Positive | True Positive (TP)  | False Negative (FN) |
| Actual: Negative | False Positive (FP) | True Negative (TN)  |

| Metric | Formula | What it tells you |
|--------|---------|-------------------|
| Precision | TP / (TP + FP) | How many predicted positives were correct |
| Recall (Sensitivity) | TP / (TP + FN) | How many real positives were found |
| Specificity | TN / (TN + FP) | How many real negatives were correctly avoided |
| Accuracy | (TP + TN) / (TP + TN + FP + FN) | Overall correctness of predictions |
| F1 Score | 2 × (Precision × Recall) / (Precision + Recall) | Balance between precision and recall |
| Matthews Correlation Coefficient (MCC) | (TP×TN − FP×FN) / √((TP+FP)(TP+FN)(TN+FP)(TN+FN)) | Correlation-like score in [−1, 1]; best when classes are imbalanced |
  • Always report multiple metrics, not just accuracy: a single metric can be misleading, especially with unbalanced datasets.
  • Use F1 score or MCC for balanced performance insights.
  • Report performance across different thresholds (ROC, PR curves), not just at one fixed cutoff.
  • If the tool returns continuous predictions (like probabilities), plot:
    • ROC curve (True Positive Rate vs. False Positive Rate)
    • Precision-Recall curve (Precision vs. Recall)

This helps show performance over a range of sensitivity/specificity trade-offs.
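
As a concrete sketch, every metric in the table above follows directly from the four confusion-matrix counts (the counts here are invented):

```python
import math

# Made-up confusion-matrix counts for one hypothetical tool.
tp, fp, fn, tn = 820, 90, 180, 8910

precision   = tp / (tp + fp)
recall      = tp / (tp + fn)                      # a.k.a. sensitivity
specificity = tn / (tn + fp)
accuracy    = (tp + tn) / (tp + tn + fp + fn)
f1          = 2 * precision * recall / (precision + recall)

# MCC: correlation-like score in [-1, 1]; robust to class imbalance.
mcc = (tp * tn - fp * fn) / math.sqrt(
    (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))

print(f"precision={precision:.3f} recall={recall:.3f} "
      f"specificity={specificity:.3f} accuracy={accuracy:.3f} "
      f"F1={f1:.3f} MCC={mcc:.3f}")
```

For the threshold sweeps, scikit-learn's `roc_curve` and `precision_recall_curve` (in `sklearn.metrics`) take the ground-truth labels and continuous scores directly.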


To benchmark tools properly:

  • Run them on a dataset with known ground truth.
  • Fill in the confusion matrix for each tool.
  • Calculate multiple metrics from the matrix.
  • Compare across tools using these metrics.
  • Prefer tools on the Pareto frontier (good balance between metrics like F1, runtime, etc.).

Computational cost:

| Component of Computational Cost | What it means | How to measure it | Example |
|---|---|---|---|
| Execution time (wall time) | Total time from start to finish, including I/O | `time` or `/usr/bin/time -v`; cluster job logs (e.g., Slurm) | Tool A took 2 hours to finish annotating the genome |
| CPU time | Actual time the CPU worked on the task | `/usr/bin/time -v` ("User time" + "System time") or job log summary | Tool B used 5 hours of CPU time (across 4 cores) |
| Maximum RAM usage | Peak memory used during execution | `/usr/bin/time -v` ("Maximum resident set size"); cluster memory logs | Tool C required 42 GB RAM at its peak |
| Number of threads / cores used | Degree of parallelism the tool supports | Tool documentation plus resource monitoring during the run | Tool D used 8 threads for faster performance |
| Disk usage | Temporary and output storage space required | Input/output file sizes plus disk usage during the run (e.g., `du -sh ./`) | Tool E created 90 GB of temporary output files |
| Installation burden | Time and effort required to install all dependencies | Manual notes or user experience (optional rating: easy/moderate/hard) | Tool F needed Conda + 5 external packages and took 1 hour to set up |
| Scalability | How well the tool handles larger datasets | Test with small vs. large input and compare resource scaling | Tool G crashed on the whole genome but worked on 1 chromosome |
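
A minimal sketch of capturing these numbers programmatically, assuming GNU time is available at `/usr/bin/time` (the tool command is a placeholder):

```python
import re
import subprocess

# Run a tool under GNU time and pull wall time, CPU time, and peak RAM
# out of the "-v" report, which GNU time prints to stderr.
# The command below is a placeholder; substitute the real invocation.
cmd = ["/usr/bin/time", "-v", "some_tool", "--input", "genome.fa"]
result = subprocess.run(cmd, capture_output=True, text=True)

def grab(label, report):
    """Return the value printed after 'label:' in the GNU time report."""
    m = re.search(rf"{re.escape(label)}: (.+)", report)
    return m.group(1).strip() if m else None

print("User time (s):  ", grab("User time (seconds)", result.stderr))
print("System time (s):", grab("System time (seconds)", result.stderr))
print("Wall clock:     ", grab("Elapsed (wall clock) time (h:mm:ss or m:ss)", result.stderr))
print("Peak RAM (kB):  ", grab("Maximum resident set size (kbytes)", result.stderr))
```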

How do we choose gold standard data in this case?

  • Use hg19's masked repeats? That annotation was produced with RepeatMasker, which I didn't intend to include since RepeatModeler is built on it.

Plan: the input is a .out file produced by RepeatMasker when a tool-generated repeat library is used to mask a curated reference genome.

For reference, the RepeatModeler2 paper (which benchmarked RM1 vs. RM2) assessed false positives this way: eukaryotic genomes contain complex structure and tandem repeats, which may produce false positives for TE discovery software. The authors measured the false positive rate of RepeatModeler2 by running it on artificial genomes devoid of TEs, simulated with GARLIC (43) for D. melanogaster and D. rerio; GARLIC generates background sequences with realistic complexity, isochore structure, and tandem repeat content similar to the modeled genome. RepeatModeler2 produced only one false positive family on the D. melanogaster artificial genome and five on the D. rerio artificial genome, with none from the LTR module, suggesting its false positive rate is very low.

Choosing the software to use for benchmarking. Task: provide a critical comparative assessment of the developed tool relative to existing, widely used tools such as RepeatModeler, RepeatExplorer, RepeatScout, PILER, Tandem Repeats Finder, RECON, and others. A thorough benchmarking analysis is essential: it should demonstrate where this method improves upon or complements established tools in terms of algorithmic performance, repeat type detection, accuracy, and computational efficiency.

Including:

  • RepeatModeler ✅ (bundles RepeatScout and RECON, as well as LTR_retriever and LTRharvest)
  • RepeatExplorer ✅
  • PILER ❌ (hasn't been updated since 2005: https://academic.oup.com/bioinformatics/article/21/suppl_1/i152/202952?login=true)
  • Tandem Repeats Finder ❌ (only detects tandem repeats)

Tasks:

  • Find other software to use for comparison.
  • Choose sequences to run on (need both raw reads and an assembly).
  • Write a script to compare the output of RepeatMasker with other software (see the sketch after this list).
  • Describe additional features of each software package.
  • Record the time each software package takes to run.
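
A minimal sketch of the comparison script, assuming the ground truth is a plain BED file of curated repeat intervals and the prediction is a RepeatMasker .out file; all file names are placeholders:

```python
# Base-level comparison of a RepeatMasker .out file against a ground-truth
# BED of repeat intervals. File names below are placeholders.

def load_rm_out(path):
    """Collect (seq, start, end) from RepeatMasker .out as 0-based half-open intervals."""
    intervals = []
    with open(path) as fh:
        for line in fh:
            fields = line.split()
            # Data rows start with the integer Smith-Waterman score;
            # this skips the three header lines and any blank lines.
            if len(fields) < 11 or not fields[0].isdigit():
                continue
            seq, start, end = fields[4], int(fields[5]), int(fields[6])
            intervals.append((seq, start - 1, end))  # .out coordinates are 1-based
    return intervals

def load_bed(path):
    """Assumes a plain 3+ column BED with no header or track lines."""
    with open(path) as fh:
        return [(f[0], int(f[1]), int(f[2]))
                for f in (line.split() for line in fh) if len(f) >= 3]

def to_base_set(intervals):
    """Expand intervals into (seq, position) pairs; fine for small test sequences."""
    return {(seq, pos) for seq, s, e in intervals for pos in range(s, e)}

predicted = to_base_set(load_rm_out("tool_library_vs_ref.out"))  # placeholder
truth     = to_base_set(load_bed("curated_repeats.bed"))         # placeholder

tp = len(predicted & truth)
precision = tp / len(predicted) if predicted else 0.0
recall    = tp / len(truth) if truth else 0.0
print(f"base-level precision={precision:.3f} recall={recall:.3f}")
```

Per-base sets only scale to small test sequences; for whole genomes, interval arithmetic (e.g., bedtools intersect) is the realistic route.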

Cool facts I found while reading about repeats:

For instance, about 50% of the human genome consists of repeats, while roughly 4% of human genes harbor transposable elements in their protein-coding regions. Because many of these repeats (~89.5%) are located within introns, they have been erroneously assumed to be non-functional. However, increasing research indicates the significant impact that repeats in coding and noncoding regions can have on evolution, gene expression regulation, and variation induction. For example, repeats present in coding regions are translated canonically; non-coding repeats can be translated by non-canonical mechanisms, and even telomeric repeat RNAs can be translated. Moreover, recent studies have shown that such repeats are closely related to a variety of diseases, such as genetic disorders (e.g., hemophilia), neurological diseases (e.g., polyQ diseases), and cancers (e.g., endometrial, stomach, and colorectal cancers).

In genetics, tandem repeats occur in DNA when a pattern of one or more nucleotides is repeated and the repetitions are directly adjacent to each other, e.g. ATTCG ATTCG ATTCG, in which the sequence ATTCG is repeated three times.

Here’s a comprehensive comparison between RepeatModeler and RepeatExplorer, based on documentation and standard bioinformatics knowledge:


🧩 FUNCTIONALITY

| Feature | RepeatModeler | RepeatExplorer |
|---|---|---|
| Goal | De novo identification and classification of repeats | De novo identification, quantification, and annotation of repeats |
| Repeat types | All repeat families (transposons, satellites, etc.) | Mainly high-copy repeats, including satellites and LTRs |
| Repeat masking support | Yes; integrates with RepeatMasker | No direct masking, but produces sequences usable by RepeatMasker |
| Quantification of genome proportion | Partial (inferred) | Yes; calculates how much of the genome each repeat occupies |
| Graph-based clustering | No | Yes; unique approach using graph layout of read similarity |
| Satellite-specific detection (TAREAN) | No | Yes; specialized module for tandem repeat analysis |
| Custom repeat database annotation | Yes; uses RepeatClassifier, integrates with RepBase | Yes; matches clusters against REXdb, rDNA, and optional custom databases |

📥 INPUT FORMAT

| Format | RepeatModeler | RepeatExplorer |
|---|---|---|
| Input type | Assembled genome in FASTA format | Unassembled genomic reads (FASTQ or FASTA) |
| Genome size | Suitable for assembled draft/finished genomes | Suitable for large genome surveys or unassembled genomes |
| Paired-end reads | Not used | Yes; paired-end reads improve clustering |

📤 OUTPUT FORMAT

| Output type | RepeatModeler | RepeatExplorer |
|---|---|---|
| Repeat library | FASTA file of consensus sequences | FASTA file of consensus sequences (contigs.fasta, TAREAN ranks) |
| Classification | Yes; RepeatClassifier | Yes; similarity hits + annotation |
| RepeatMasker-ready library | Yes | Indirectly (can export sequences for use with RepeatMasker) |
| HTML reports | No | Yes; interactive cluster and supercluster reports |
| Graph visualization | No | Yes; detailed read similarity graphs for each cluster |
| Genome proportion table | No direct output | Yes; CLUSTER_TABLE.csv and SUPERCLUSTER_TABLE.csv |

🧪 ADDITIONAL FUNCTIONS

| Function | RepeatModeler | RepeatExplorer |
|---|---|---|
| Integration with RepeatMasker | Full integration | Not directly integrated, but outputs compatible sequences |
| Tandem repeat annotation (TAREAN) | ❌ Not available | ✅ Yes; unique strength |
| LTR structure detection | Partial (via LTRharvest in RM2) | Yes; built-in LTR detection with PBS check |
| Paired-read coherence (P index) | ❌ No | ✅ Yes |

🏛 YEAR OF CREATION & NOTABLE POINTS

| Feature | RepeatModeler | RepeatExplorer |
|---|---|---|
| First release | ~2008–2009 (RepeatModeler 1); RepeatModeler 2 in 2020 | ~2013 (TAREAN added in 2017) |
| Developed by | Institute for Systems Biology (ISB) | Institute of Plant Molecular Biology, Czech Academy of Sciences |
| Notable strength | Gold standard for de novo repeat library building | Best for repeat profiling in unassembled genomes |
| Scalability | Works well on complete genomes | Ideal for low-coverage surveys; fast to run |
| Graph-based novelty | ❌ Traditional homology/structure-based approach | ✅ Graph-based clustering is a key innovation |

✅ ADVANTAGES AND ❌ DISADVANTAGES

🔹 RepeatModeler

Advantages:

  • Works on complete assemblies
  • High-quality consensus sequences
  • Integrates well with RepeatMasker
  • Broad repeat classification

Disadvantages:

  • Requires a full genome assembly
  • Doesn’t quantify genome proportion
  • No visualization or tandem repeat-specific features

🔸 RepeatExplorer

Advantages:

  • Works directly on raw reads (no assembly required)
  • Estimates abundance of repeat families
  • Graph-based clustering reveals relationships
  • TAREAN module gives rich satellite repeat analysis
  • Interactive HTML reports for exploration

Disadvantages:

  • Doesn’t produce a masked genome
  • Not designed for building full repeat libraries for RepeatMasker
  • Slower on huge datasets or very complex genomes

🧠 Summary

| Use case | Recommended tool |
|---|---|
| You have an assembled genome | ✅ RepeatModeler |
| You have only reads / no assembly | ✅ RepeatExplorer |
| You need a RepeatMasker-ready library | ✅ RepeatModeler |
| You want repeat quantification | ✅ RepeatExplorer |
| You're studying tandem/satellite repeats | ✅ RepeatExplorer (TAREAN) |
| You want graph-based visualization | ✅ RepeatExplorer |

Main metrics from RepeatModeler's benchmarking:

  • Runtime.
  • Ratio of families found to total sequences in the library (a higher ratio indicates less fragmentation and redundancy).
  • Comparison of family quality (perfect, good, present, not found), as sketched below.
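
The family-quality binning could be scripted roughly as follows; this is a sketch with invented thresholds and invented inputs, not the published criteria:

```python
# Sketch of family-quality binning. Thresholds and example values are made up;
# the published criteria differ in detail. Assumes each curated family has
# already been aligned against the de novo library to get coverage fractions.

def classify_family(best_cov, total_cov):
    """best_cov: coverage of the single best de novo hit;
    total_cov: combined coverage across all de novo hits
    (both as fractions of the curated family's length)."""
    if best_cov >= 0.95:
        return "perfect"    # one consensus recovers nearly the whole family
    if total_cov >= 0.95:
        return "good"       # recovered, but fragmented across several consensi
    if total_cov >= 0.10:
        return "present"    # partially recovered
    return "not found"

print(classify_family(best_cov=0.98, total_cov=0.98))  # perfect
print(classify_family(best_cov=0.50, total_cov=0.97))  # good
print(classify_family(best_cov=0.05, total_cov=0.08))  # not found
```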
