Refer to this paper for further details and citation.
GDBr (pronounced "Genome Debugger") is a tool designed to annotate genetic variants with their underlying double-strand break (DSB) repair mechanisms using long-read-based genome assemblies. The annotation process in GDBr consists of three key steps:
In this initial step, contig-level genome assemblies (the Query) are scaffolded into chromosome-level assemblies using RagTag, followed by variant calling using SVIM-asm.
During this step, each genetic variant is searched for in both the reference and query genomes using BLAST. Repetitive variants are filtered out using TRF and RepeatMasker. Additionally, micro/homology signatures of variants are detected at this stage.
In the final step, micro/homology distributions are separated, and potential DSB repair mechanisms are annotated for each variant.
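To make the idea of a homology signature concrete, here is a minimal sketch for the simplest case, a deletion in the reference. It is not GDBr's implementation: the function names, the 0-based half-open coordinates, and the toy sequence are all assumptions made for illustration. The sketch measures how far the deleted segment could be shifted left or right without changing the resulting sequence, which is the length of the repeat shared by the two breakpoints.

```python
def common_prefix_len(a, b):
    """Length of the longest common prefix of two strings."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def deletion_homology(ref, start, end):
    """Return (length, sequence) of the homology at a deletion junction.

    ref[start:end] is the deleted segment (hypothetical 0-based,
    half-open coordinates; not necessarily GDBr's convention).
    """
    deleted = ref[start:end]
    left, right = ref[:start], ref[end:]
    # Bases by which the deletion can slide right without changing the result.
    right_shift = common_prefix_len(deleted, right)
    # Bases by which it can slide left (compare reversed strings for suffixes).
    left_shift = common_prefix_len(deleted[::-1], left[::-1])
    total = min(left_shift + right_shift, len(deleted))
    seq = (left[len(left) - left_shift:] + deleted[:right_shift])[:total]
    return total, seq


# Toy reference: AA + GCT + TTTT + GCT + CC, with one GCT copy deleted.
toy_ref = "AAGCTTTTTGCTCC"
print(deletion_homology(toy_ref, 2, 9))
print(deletion_homology(toy_ref, 3, 10))  # same event, shifted call
```

Both calls describe the same deletion and report the same 3 bp homology, which is why the shift in both directions is summed rather than trusting the caller's exact breakpoint.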
You only need a reference sequence file and one or more query sequence files to use GDBr.
We strongly recommend using the conda package manager to install GDBr.
conda create -n GDBr -c conda-forge -c bioconda -c chemical118 gdbr
conda activate GDBr
gdbr --version
You can also use the mamba package manager to install GDBr more quickly.
mamba create -n GDBr -c conda-forge -c bioconda -c chemical118 gdbr
mamba activate GDBr
gdbr --version
For accurate results, GDBr requires a reference and pangenome assemblies generated from long-read sequencing. We recommend that the reference be assembled at the chromosome level and that each query be assembled at the scaffold level. An SSD is not necessary to run GDBr, as it improves processing speed by only about 3%.
gdbr analysis -r <reference.fa> -q <query1.fa query2.fa ...> -s <species of data> -t <number of threads>
mkdir gdbr_test
cd gdbr_test
# download reference
wget https://s3-us-west-2.amazonaws.com/human-pangenomics/T2T/CHM13/assemblies/analysis_set/chm13v2.0.fa.gz
# download pangenome
wget https://s3-us-west-2.amazonaws.com/human-pangenomics/working/HPRC_PLUS/HG002/assemblies/year1_f1_assembly_v2_genbank/HG002.paternal.f1_assembly_v2_genbank.fa.gz
wget https://s3-us-west-2.amazonaws.com/human-pangenomics/working/HPRC_PLUS/HG005/assemblies/year1_f1_assembly_v2_genbank/HG005.paternal.f1_assembly_v2_genbank.fa.gz
# decompress genome
gzip -d chm13v2.0.fa.gz
gzip -d HG002.paternal.f1_assembly_v2_genbank.fa.gz
gzip -d HG005.paternal.f1_assembly_v2_genbank.fa.gz
# install GDBr
conda create -n GDBr -c conda-forge -c bioconda -c chemical118 gdbr
conda activate GDBr
# run GDBr
gdbr analysis -r chm13v2.0.fa -q HG002.paternal.f1_assembly_v2_genbank.fa HG005.paternal.f1_assembly_v2_genbank.fa -s human -o gdbr_output -t 10
The above command runs all three of the following processes in a single invocation. If you want to redo only some of them, you can run the commands below manually.
Using RagTag and SVIM-asm, GDBr preprocesses the data and returns a properly scaffolded query .fa sequence file and a variant .vcf file.
gdbr preprocess -r <reference.fa> -q <query1.fa query2.fa ...> -o prepro -t <number of threads>
The preprocess step relies on a sorting program for scaffolding and variant calling, which often underutilizes the allocated threads. To address this, GDBr distributes multiple queries across a reduced number of threads, which improves efficiency but significantly increases memory usage. GDBr therefore offers the --low_memory option, which lets users choose whether to apply this optimization according to their resource constraints.
Using BLAST, GDBr corrects the variant file so that DSBR can be analyzed accurately, and filters out repeats using TRF and RepeatMasker.
gdbr correct -r <reference.fa> -q prepro/query/*.GDBr.preprocess.fa -v prepro/vcf/*.GDBr.preprocess.vcf -s <species of data> -o sv -t <number of threads>
GDBr analyzes the variants and identifies the DSBR mechanism.
gdbr analysis -r <reference.fa> -q prepro/query/*.GDBr.preprocess.fa -v sv/*.GDBr.correct.csv -o dsbr -t <number of threads>
You can enable different-locus DSBR analysis with --diff_locus_dsbr_analysis; however, this analysis can produce false positives due to partial homology on the sex chromosomes.
GDBr's final output is <query basename>.GDBr.result.tsv. Its fields are described below.
| Field | Description |
|---|---|
| ID | GDBr.<query order>.<variant order> |
| CALL_TYPE | Variant type : INS, DEL, etc |
| SV_TYPE | Corrected variant type : INS, DEL, SUB, etc |
| CHR | variant chromosome |
| REF_START | variant reference start location |
| REF_END | variant reference end location |
| QRY_START | variant query start location |
| QRY_END | variant query end location |
| GDBR_TYPE | GDBr variant type |
| HOM_LEN/HOM_START_LEN | INDEL : homology length / SUB : left homology length |
| HOM_END_LEN | SUB : right homology length |
| TEMP_INS_SEQ_LOC | templated insertion sequence location (REF or QRY) |
| DSBR_CHR | different locus DSBR chromosome |
| DSBR_START | different locus DSBR start |
| DSBR_END | different locus DSBR end |
| HOM_SEQ/HOM_START_SEQ | INDEL : homology sequence / SUB : left homology sequence |
| HOM_END_SEQ | SUB : right homology sequence |
| PUTATIVE_MECHANISM | GDBr DSB repair putative mechanism |
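As a quick way to summarize a result file, the sketch below tallies variants by the PUTATIVE_MECHANISM field. It assumes the TSV begins with a header row whose column names match the field names in the table above; that assumption, the function name `count_mechanisms`, and the sample rows are not taken from GDBr itself. The mechanism labels `MECH_A` and `MECH_B` are placeholders, not real GDBr output values.

```python
import csv
import io
from collections import Counter


def count_mechanisms(handle):
    """Tally PUTATIVE_MECHANISM values from a tab-separated result table.

    Assumes a header row containing a PUTATIVE_MECHANISM column.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    return Counter(row["PUTATIVE_MECHANISM"] for row in reader)


# Made-up rows for illustration only; the mechanism labels are placeholders.
example = io.StringIO(
    "ID\tSV_TYPE\tCHR\tPUTATIVE_MECHANISM\n"
    "GDBr.0.1\tDEL\tchr1\tMECH_A\n"
    "GDBr.0.2\tINS\tchr1\tMECH_B\n"
    "GDBr.0.3\tDEL\tchr2\tMECH_A\n"
)
counts = count_mechanisms(example)
for mechanism, n in counts.most_common():
    print(mechanism, n)
```

For a real file, pass a handle opened with `open(path, newline="")` instead of the in-memory example.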
You can benchmark any GDBr command with the --benchmark option, which uses GNU time and psutil. It reports user time, system time, average CPU usage, multiprocessing efficiency, maximum RAM usage, and wall clock time.
...
[2023-08-18 13:44:16] GDBr benchmark complete
User time (seconds) : 8007.44
System time (seconds) : 19901.00
Percent of CPU this job got : 7267%
Multiprocessing efficiency : 0.5118
Wall clock time (h:mm:ss or m:ss) : 6:23.99
Max memory usage (GB) : 23.5805
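The CPU percentage in this report can be sanity-checked from the other figures. GNU time defines the percentage of CPU as (user time + system time) / elapsed time, and the sketch below applies that definition to the numbers shown above. It assumes GDBr passes these values through from GNU time unchanged; the helper name `parse_wall_clock` is hypothetical. The definition of "Multiprocessing efficiency" is not given here, so it is not reproduced.

```python
def parse_wall_clock(text):
    """Convert 'h:mm:ss' or 'm:ss' (seconds may be fractional) to seconds."""
    seconds = 0.0
    for part in text.split(":"):
        seconds = seconds * 60 + float(part)
    return seconds


# Values copied from the benchmark output above.
user_time = 8007.44
system_time = 19901.00
wall = parse_wall_clock("6:23.99")

# GNU time's definition: CPU share = (user + system) / elapsed.
cpu_percent = 100 * (user_time + system_time) / wall
print(f"wall clock: {wall:.2f} s")
print(f"CPU usage: {cpu_percent:.0f}% (about {cpu_percent / 100:.1f} cores busy on average)")
```

The computed value lands within a percentage point of the reported 7267%; the small gap is expected because the printed times are rounded to hundredths of a second.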