This C++ program identifies sequencing reads from a CRAM (or BAM) file likely to be derived from long alleles of short tandem repeats (STRs). The approach is similar to ExpansionHunter Denovo (Dolzhenko et al. 2020) with a few optimizations to improve computational efficency. In brief, the extractLongSTRs program scans all reads (regardless of mapping location) for evidence of periodicity and identifies highly repetitive reads (at most 10 mismatches between base i and base i+k for some period k between 2-6bp). Sequence and alignment information about these reads is recorded to an output file for downstream processing.
Details are provided in a forthcoming revision of our manuscript, Hujoel et al. bioRxiv.
- Linux operating system (tested on Ubuntu 20.04, CentOS 7, RHEL 9)
- ~0.1 GB RAM
- ~8 mins of computation for a 30x WGS CRAM file
Two precompiled binary executables are provided:
extractLongSTRs_no_curl(should be portable to most Linux systems; cannot directly take a remote CRAM/BAM URL as input, but this can be circumvented using process substitution as in the example below)extractLongSTRs(links libcurl but is less portable)
Alternatively, if you wish to compile the C++ source code, you can do so easily if HTSlib is already installed:
g++ -O3 -fpermissive ExtractLongSTRs.cpp -o extractLongSTRs -lhts -lz -lpthreadIf HTSlib is not in your default paths, you may need to specify -I and -L flags for includes and linking (see below).
If HTSlib is not already installed, you will need to download and install it first. To maximize performance, we recommend configuring it with libdeflate (which can be installed via sudo apt --yes install libdeflate-dev or by downloading and building libdeflate manually). To use libdeflate but minimize other dependencies, HTSlib can be configured using the following flags:
./configure --disable-bz2 --disable-lzma --disable-plugins --disable-gcs --disable-s3 --with-libdeflateNon-default paths to libdeflate can be specified in the HTSlib configuration as follows:
./configure CPPFLAGS=-I/path/to/libdeflate LDFLAGS=-L/path/to/libdeflate --disable-bz2 --disable-lzma --disable-plugins --disable-gcs --disable-s3 --with-libdeflate
makeAfter installing libdeflate and HTSlib, the following command will compile the C++ source code and generate the extractLongSTRs binary executable:
g++ -O3 -fpermissive ExtractLongSTRs.cpp -o extractLongSTRs -I/path/to/htslib/ -L/path/to/htslib/ -L/path/to/libdeflate/ -lhts -lcurl -ldeflate -lz -lpthread./extractLongSTRs <input.cram|input.bam> <reference.fasta> <output.tsv><input.cram|input.bam>: Input file (CRAM or BAM)<reference.fasta>: Reference genome (required for CRAM; ignored for BAM)<output.tsv>: Output file path
GRCh38 reference fasta and fasta index files can be downloaded from the following locations:
wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/GRCh38_reference_genome/GRCh38_full_analysis_set_plus_decoy_hla.fa
wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/GRCh38_reference_genome/GRCh38_full_analysis_set_plus_decoy_hla.fa.faiNote that even for BAM input, a second input argument must be provided but will be ignored (so any input string can be supplied).
The output is a tab-delimited file with the following columns:
| Column | Description |
|---|---|
bpDelta |
Optimal motif periodicity (2 to 6 bases) |
mismatches |
Mismatch count at the optimal periodicity |
QNAME |
Query name (read ID) |
FLAG |
Bitwise SAM flag |
RNAME |
Reference sequence name |
POS |
1-based leftmost mapping position |
QUAL |
Mapping quality |
RNEXT |
Mate reference sequence name |
PNEXT |
Mate position |
SEQ |
Read sequence (decoded from 4-bit encoding) |
The 3rd and subsequent columns are obtained directly from the CRAM/BAM file.
The following command runs the software on a publicly available 1000 Genomes Project 30x WGS CRAM file for HG00118, streamed from a remote URL:
./extractLongSTRs http://1000genomes-dragen-v4.0.3.s3.us-east-1.amazonaws.com/data/1kgp_cram/1000G_2504_high_coverage/ERR3240193/HG00118.final.cram GRCh38_full_analysis_set_plus_decoy_hla.fa example_output_HG00118.tsvThe same example can be run using the (more portable) extractLongSTRs_no_curl precompiled binary executable via process substitution:
./extractLongSTRs_no_curl <(curl -s http://1000genomes-dragen-v4.0.3.s3.us-east-1.amazonaws.com/data/1kgp_cram/1000G_2504_high_coverage/ERR3240193/HG00118.final.cram) GRCh38_full_analysis_set_plus_decoy_hla.fa example_output_HG00118.tsvSample output:
Processed 811912061 reads; wrote 14871 to example_output_HG00118.tsv (493.074 sec)
The mismatch threshold is currently hardcoded as:
int maxMismatches = 10;This can be edited in the ExtractLongSTRs.cpp source file if desired, and similarly, SAM flag filters can be modified as well. The program currently excludes secondary, supplementary, and QC-failed QC reads (but not duplicate reads, as in-repeat read (IRR) pairs are sometimes marked as duplicates).
GNU General Public License version 3 (GPLv3)