GDBr : Genomic signature interpretation tool for DNA double-strand break repair mechanism

Refer to this paper for further details and citation.

GDBr (pronounced "Genome Debugger") is a tool designed to annotate genetic variants with their underlying double-strand break (DSB) repair mechanisms using long-read-based genome assemblies. The annotation process in GDBr consists of three key steps:

Preprocessing Step

In this initial step, contig-level genome assemblies (the Query) are scaffolded into chromosome-level assemblies using RagTag, followed by variant calling using SVIM-asm.

Correction Step

During this step, each genetic variant is searched for in both the reference and query genomes using BLAST. Repetitive variants are filtered out using TRF and RepeatMasker. Additionally, micro/homology signatures of variants are detected at this stage.

Annotation Step

In the final step, micro/homology distributions are separated, and potential DSB repair mechanisms are annotated for each variant.
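To make the idea of a homology signature at a variant more concrete, the following Python sketch measures the sequence shared between the two breakpoints of a deletion. It is an illustration only and not GDBr's implementation: the function name `deletion_microhomology`, the 0-based half-open coordinates, and the exact matching rule are assumptions made for this example.

```python
def deletion_microhomology(ref: str, start: int, end: int) -> str:
    """Return the sequence shared by both breakpoints of the deletion ref[start:end].

    Hypothetical helper for illustration; coordinates are 0-based, half-open.
    Tandem repeats, where the match could exceed the deletion length,
    are not treated specially here.
    """
    left_flank = ref[:start]
    deleted = ref[start:end]
    right_flank = ref[end:]

    # Extend rightwards: start of the deleted segment vs. start of the right flank.
    fwd = 0
    while (fwd < len(deleted) and fwd < len(right_flank)
           and deleted[fwd] == right_flank[fwd]):
        fwd += 1

    # Extend leftwards: end of the deleted segment vs. end of the left flank.
    bwd = 0
    while (bwd < len(deleted) and bwd < len(left_flank)
           and deleted[-1 - bwd] == left_flank[-1 - bwd]):
        bwd += 1

    # The shared sequence spans the left breakpoint: bwd bases before it, fwd after it.
    return left_flank[len(left_flank) - bwd:] + deleted[:fwd]


if __name__ == "__main__":
    # "AGGATTTTC" is deleted; "AGGA" occurs at both breakpoints.
    homology = deletion_microhomology("CCTAGGATTTTCAGGACC", 3, 12)
    print(homology, len(homology))
```

A short shared sequence is commonly described as microhomology and a longer one as homology; where GDBr draws that line is decided by the tool itself and is not modeled here.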

To use GDBr, you need only a reference sequence file and one or more query sequence files.

Install

We strongly recommend using the conda package manager to install GDBr.

conda create -n GDBr -c conda-forge -c bioconda -c chemical118 gdbr
conda activate GDBr
gdbr --version

Alternatively, you can use the mamba package manager to install GDBr more quickly.

mamba create -n GDBr -c conda-forge -c bioconda -c chemical118 gdbr
mamba activate GDBr
gdbr --version

Quick Start

For accurate results, GDBr requires references and pangenomes generated from long-read sequencing. We recommend that the reference be assembled at the chromosome level and that each query be assembled at the scaffold level. An SSD is not necessary to run GDBr, as it improves processing speed by only about 3%.

gdbr analysis -r <reference.fa> -q <query1.fa query2.fa ...> -s <species of data> -t <number of threads>

Example

mkdir gdbr_test
cd gdbr_test

# download reference
wget https://s3-us-west-2.amazonaws.com/human-pangenomics/T2T/CHM13/assemblies/analysis_set/chm13v2.0.fa.gz

# download pangenome
wget https://s3-us-west-2.amazonaws.com/human-pangenomics/working/HPRC_PLUS/HG002/assemblies/year1_f1_assembly_v2_genbank/HG002.paternal.f1_assembly_v2_genbank.fa.gz
wget https://s3-us-west-2.amazonaws.com/human-pangenomics/working/HPRC_PLUS/HG005/assemblies/year1_f1_assembly_v2_genbank/HG005.paternal.f1_assembly_v2_genbank.fa.gz

# decompress genome
gzip -d chm13v2.0.fa.gz
gzip -d HG002.paternal.f1_assembly_v2_genbank.fa.gz
gzip -d HG005.paternal.f1_assembly_v2_genbank.fa.gz

# install GDBr
conda create -n GDBr -c conda-forge -c bioconda -c chemical118 gdbr
conda activate GDBr

# run GDBr
gdbr analysis -r chm13v2.0.fa -q HG002.paternal.f1_assembly_v2_genbank.fa HG005.paternal.f1_assembly_v2_genbank.fa -s human -o gdbr_output -t 10

Steps of GDBr

The above command runs the following three steps in a single invocation. If you want to redo only some of the steps, you can run the commands below manually.

Preprocess

Using RagTag and SVIM-asm, GDBr preprocesses the data and returns a properly scaffolded query sequence file (.fa) and a variant file (.vcf).

gdbr preprocess -r <reference.fa> -q <query1.fa query2.fa ...> -o prepro -t <number of threads>

The preprocess step relies on a sorting program for scaffolding and variant calling, and it often underutilizes the allocated threads. To compensate, GDBr distributes multiple queries across a reduced number of threads each, which improves efficiency but significantly increases memory usage. The --low_memory option lets you choose whether to apply this optimization, depending on your resource constraints.

Correct

Using BLAST, GDBr corrects the variant file so that DSBR can be analyzed accurately. It also filters out repetitive variants using TRF and RepeatMasker.

gdbr correct -r <reference.fa> -q prepro/query/*.GDBr.preprocess.fa -v prepro/vcf/*.GDBr.preprocess.vcf -s <species of data> -o sv -t <number of threads>

Analysis

GDBr analyzes the variants and identifies their DSBR mechanisms.

gdbr analysis -r <reference.fa> -q prepro/query/*.GDBr.preprocess.fa -v sv/*.GDBr.correct.csv -o dsbr -t <number of threads>

You can enable DSBR analysis across different loci with --diff_locus_dsbr_analysis. However, this analysis can produce false positives because of partial homology on the sex chromosomes.
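If you prefer to drive the three steps from a script, the sketch below chains them with Python's `subprocess` module. It uses only the subcommands, flags, and output paths shown above; the helper names (`preprocess_cmd`, `correct_cmd`, `analysis_cmd`, `run_steps`) and the default output directories are assumptions for illustration, not part of GDBr.

```python
import glob
import subprocess


def preprocess_cmd(reference, queries, threads, out_dir="prepro"):
    # Mirrors: gdbr preprocess -r <reference.fa> -q <queries> -o prepro -t <threads>
    return ["gdbr", "preprocess", "-r", reference, "-q", *queries,
            "-o", out_dir, "-t", str(threads)]


def correct_cmd(reference, query_fastas, vcfs, species, threads, out_dir="sv"):
    # Mirrors: gdbr correct -r ... -q ... -v ... -s <species> -o sv -t <threads>
    return ["gdbr", "correct", "-r", reference, "-q", *query_fastas,
            "-v", *vcfs, "-s", species, "-o", out_dir, "-t", str(threads)]


def analysis_cmd(reference, query_fastas, corrected, threads, out_dir="dsbr"):
    # Mirrors: gdbr analysis -r ... -q ... -v sv/*.GDBr.correct.csv -o dsbr -t <threads>
    return ["gdbr", "analysis", "-r", reference, "-q", *query_fastas,
            "-v", *corrected, "-o", out_dir, "-t", str(threads)]


def run_steps(reference, queries, species, threads):
    """Run the three steps in order; check=True stops at the first failure."""
    subprocess.run(preprocess_cmd(reference, queries, threads), check=True)

    # Expand the same glob patterns the shell commands above rely on.
    fastas = sorted(glob.glob("prepro/query/*.GDBr.preprocess.fa"))
    vcfs = sorted(glob.glob("prepro/vcf/*.GDBr.preprocess.vcf"))
    subprocess.run(correct_cmd(reference, fastas, vcfs, species, threads), check=True)

    corrected = sorted(glob.glob("sv/*.GDBr.correct.csv"))
    subprocess.run(analysis_cmd(reference, fastas, corrected, threads), check=True)


# Example call (requires gdbr on PATH and the input files to exist):
# run_steps("chm13v2.0.fa", ["HG002.paternal.f1_assembly_v2_genbank.fa"], "human", 10)
```

Building each command as a list avoids shell quoting issues, and keeping the steps separate makes it easy to rerun only the one that needs repeating.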

Final output

GDBr's final output is <query basename>.GDBr.result.tsv. The fields of this file are briefly described below.

Field	Description
ID	GDBr.<query order>.<variant order>
CALL_TYPE	Variant type : INS, DEL, etc
SV_TYPE	Corrected variant type : INS, DEL, SUB, etc
CHR	variant chromosome
REF_START	variant reference start location
REF_END	variant reference end location
QRY_START	variant query start location
QRY_END	variant query end location
GDBR_TYPE	GDBr variant type
HOM_LEN/HOM_START_LEN	INDEL : homology length / SUB : left homology length
HOM_END_LEN	SUB : right homology length
TEMP_INS_SEQ_LOC	templated insertion sequence location (REF or QRY)
DSBR_CHR	different locus DSBR chromosome
DSBR_START	different locus DSBR start
DSBR_END	different locus DSBR end
HOM_SEQ/HOM_START_SEQ	INDEL : homology sequence / SUB : left homology sequence
HOM_END_SEQ	SUB : right homology sequence
PUTATIVE_MECHANISM	GDBr DSB repair putative mechanism
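As a starting point for downstream analysis, the Python sketch below tallies the variants in a result file by PUTATIVE_MECHANISM. It assumes the .tsv file begins with a header row using the field names listed above; the function name `count_mechanisms` and the example path are hypothetical.

```python
import csv
from collections import Counter


def count_mechanisms(tsv_path):
    """Count variants per PUTATIVE_MECHANISM value in a GDBr result file.

    Assumes a tab-separated file whose first row holds the field names.
    """
    with open(tsv_path, newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        return Counter(row["PUTATIVE_MECHANISM"] for row in reader)


# Example call (the path is a placeholder for one of your own result files):
# for mechanism, n in count_mechanisms("gdbr_output/<query basename>.GDBr.result.tsv").most_common():
#     print(mechanism, n)
```

The same pattern works for any other column, for example grouping by SV_TYPE or CHR.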

Benchmarking

You can benchmark any GDBr command with the --benchmark option, which uses GNU time and psutil. It reports user time, system time, average CPU usage, multiprocessing efficiency, maximum RAM usage, and wall clock time.

...
[2023-08-18 13:44:16] GDBr benchmark complete
User time (seconds) : 8007.44
System time (seconds) : 19901.00
Percent of CPU this job got : 7267%
Multiprocessing efficiency : 0.5118
Wall clock time (h:mm:ss or m:ss) : 6:23.99
Max memory usage (GB) : 23.5805
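If you want to compare benchmark runs programmatically, the sketch below parses the "name : value" lines of a report like the one above. It assumes the layout stays exactly as shown; the function names `parse_benchmark` and `wall_clock_seconds` are hypothetical. The last lines use GNU time's definition of CPU percentage, (user + system) / elapsed, to estimate the average number of busy cores; how GDBr derives the multiprocessing efficiency is not reproduced here.

```python
def parse_benchmark(log_text):
    """Collect 'name : value' pairs from a GDBr benchmark report as strings."""
    stats = {}
    for line in log_text.splitlines():
        if line.startswith("["):  # skip timestamped status lines
            continue
        key, sep, value = line.partition(" : ")
        if sep:
            stats[key.strip()] = value.strip()
    return stats


def wall_clock_seconds(text):
    """Convert 'h:mm:ss' or 'm:ss' (seconds may be fractional) to seconds."""
    seconds = 0.0
    for part in text.split(":"):
        seconds = seconds * 60 + float(part)
    return seconds


if __name__ == "__main__":
    report = """[2023-08-18 13:44:16] GDBr benchmark complete
User time (seconds) : 8007.44
System time (seconds) : 19901.00
Wall clock time (h:mm:ss or m:ss) : 6:23.99
"""
    stats = parse_benchmark(report)
    wall = wall_clock_seconds(stats["Wall clock time (h:mm:ss or m:ss)"])
    cpu = float(stats["User time (seconds)"]) + float(stats["System time (seconds)"])
    # Average number of cores kept busy over the whole run.
    print(round(cpu / wall, 2))
```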
