Skip to content

Efficiently identifies WGS reads primarily composed of short tandem repeats

Notifications You must be signed in to change notification settings

poruloh/extractLongSTRs

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

2 Commits
 
 
 
 
 
 

Repository files navigation

ExtractLongSTRs

This C++ program identifies sequencing reads from a CRAM (or BAM) file likely to be derived from long alleles of short tandem repeats (STRs). The approach is similar to ExpansionHunter Denovo (Dolzhenko et al. 2020) with a few optimizations to improve computational efficency. In brief, the extractLongSTRs program scans all reads (regardless of mapping location) for evidence of periodicity and identifies highly repetitive reads (at most 10 mismatches between base i and base i+k for some period k between 2-6bp). Sequence and alignment information about these reads is recorded to an output file for downstream processing.

Details are provided in a forthcoming revision of our manuscript, Hujoel et al. bioRxiv.

Requirements

  • Linux operating system (tested on Ubuntu 20.04, CentOS 7, RHEL 9)
  • ~0.1 GB RAM
  • ~8 mins of computation for a 30x WGS CRAM file

Installation guide

Two precompiled binary executables are provided:

  • extractLongSTRs_no_curl (should be portable to most Linux systems; cannot directly take a remote CRAM/BAM URL as input, but this can be circumvented using process substitution as in the example below)
  • extractLongSTRs (links libcurl but is less portable)

Alternatively, if you wish to compile the C++ source code, you can do so easily if HTSlib is already installed:

g++ -O3 -fpermissive ExtractLongSTRs.cpp -o extractLongSTRs -lhts -lz -lpthread

If HTSlib is not in your default paths, you may need to specify -I and -L flags for includes and linking (see below).

If HTSlib is not already installed, you will need to download and install it first. To maximize performance, we recommend configuring it with libdeflate (which can be installed via sudo apt --yes install libdeflate-dev or by downloading and building libdeflate manually). To use libdeflate but minimize other dependencies, HTSlib can be configured using the following flags:

./configure --disable-bz2 --disable-lzma --disable-plugins --disable-gcs --disable-s3 --with-libdeflate

Non-default paths to libdeflate can be specified in the HTSlib configuration as follows:

./configure CPPFLAGS=-I/path/to/libdeflate LDFLAGS=-L/path/to/libdeflate --disable-bz2 --disable-lzma --disable-plugins --disable-gcs --disable-s3 --with-libdeflate
make

After installing libdeflate and HTSlib, the following command will compile the C++ source code and generate the extractLongSTRs binary executable:

g++ -O3 -fpermissive ExtractLongSTRs.cpp -o extractLongSTRs -I/path/to/htslib/ -L/path/to/htslib/ -L/path/to/libdeflate/ -lhts -lcurl -ldeflate -lz -lpthread

Usage

./extractLongSTRs <input.cram|input.bam> <reference.fasta> <output.tsv>

Arguments:

  • <input.cram|input.bam>: Input file (CRAM or BAM)
  • <reference.fasta>: Reference genome (required for CRAM; ignored for BAM)
  • <output.tsv>: Output file path

GRCh38 reference fasta and fasta index files can be downloaded from the following locations:

wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/GRCh38_reference_genome/GRCh38_full_analysis_set_plus_decoy_hla.fa
wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/GRCh38_reference_genome/GRCh38_full_analysis_set_plus_decoy_hla.fa.fai

Note that even for BAM input, a second input argument must be provided but will be ignored (so any input string can be supplied).

Output Format

The output is a tab-delimited file with the following columns:

Column Description
bpDelta Optimal motif periodicity (2 to 6 bases)
mismatches Mismatch count at the optimal periodicity
QNAME Query name (read ID)
FLAG Bitwise SAM flag
RNAME Reference sequence name
POS 1-based leftmost mapping position
QUAL Mapping quality
RNEXT Mate reference sequence name
PNEXT Mate position
SEQ Read sequence (decoded from 4-bit encoding)

The 3rd and subsequent columns are obtained directly from the CRAM/BAM file.

Example

The following command runs the software on a publicly available 1000 Genomes Project 30x WGS CRAM file for HG00118, streamed from a remote URL:

./extractLongSTRs http://1000genomes-dragen-v4.0.3.s3.us-east-1.amazonaws.com/data/1kgp_cram/1000G_2504_high_coverage/ERR3240193/HG00118.final.cram GRCh38_full_analysis_set_plus_decoy_hla.fa example_output_HG00118.tsv

The same example can be run using the (more portable) extractLongSTRs_no_curl precompiled binary executable via process substitution:

./extractLongSTRs_no_curl <(curl -s http://1000genomes-dragen-v4.0.3.s3.us-east-1.amazonaws.com/data/1kgp_cram/1000G_2504_high_coverage/ERR3240193/HG00118.final.cram) GRCh38_full_analysis_set_plus_decoy_hla.fa example_output_HG00118.tsv

Sample output:

Processed 811912061 reads; wrote 14871 to example_output_HG00118.tsv (493.074 sec)

Configuration

The mismatch threshold is currently hardcoded as:

int maxMismatches = 10;

This can be edited in the ExtractLongSTRs.cpp source file if desired, and similarly, SAM flag filters can be modified as well. The program currently excludes secondary, supplementary, and QC-failed QC reads (but not duplicate reads, as in-repeat read (IRR) pairs are sometimes marked as duplicates).

License

GNU General Public License version 3 (GPLv3)

About

Efficiently identifies WGS reads primarily composed of short tandem repeats

Resources

Stars

Watchers

Forks

Packages

No packages published

Languages