Helicase is a carefully optimized FASTA/FASTQ parser that extensively uses vectorized instructions.
It is designed for three main goals: being highly configurable, handling non-ACTG bases and computing bitpacked representations of DNA.
This library requires the AVX2 or NEON instruction set, so make sure to enable target-cpu=native when using it:
RUSTFLAGS="-C target-cpu=native" cargo run --release

use helicase::input::*;
use helicase::*;
// set the options of the parser (at compile-time)
const CONFIG: Config = ParserOptions::default().config();
fn main() {
    let path = "...";
    // create a parser with the desired options
    let mut parser = FastxParser::<CONFIG>::from_file(&path).expect("Cannot open file");
    // iterate over records
    while let Some(_event) = parser.next() {
        // get a reference to the header
        let header = parser.get_header();
        // get a reference to the sequence (without newlines)
        let seq = parser.get_dna_string();
        // ...
    }
}

The parser supports options that can be adjusted in the ParserOptions.
For instance, if you don't need to look at the headers and you want to skip non-ACTG bases, you can change the configuration to:
const CONFIG: Config = ParserOptions::default()
    .ignore_headers()
    .skip_non_actg()
    .config();

The parser can output a bitpacked representation of the sequence in two different formats:
- PackedDNA, which maps each base to two bits and packs them (compatible with packed-seq using the corresponding feature).
- ColumnarDNA, which separates the high bit and the low bit of each base and stores them in two bitmasks.
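
To make the difference between the two layouts concrete, here is a minimal standalone sketch in plain Rust. It does not use helicase and is not its implementation: the function names, the `(byte >> 1) & 3` encoding (A=0, C=1, T=2, G=3), the bit order within a byte, and the 64-bit word size are all assumptions chosen for illustration.

```rust
/// Maps an ACTG base (upper or lower case) to two bits: A=0, C=1, T=2, G=3.
/// Assumption: this particular mapping is illustrative only.
fn encode_base(b: u8) -> u8 {
    (b >> 1) & 3
}

/// Packed layout: four bases per byte, the first base in the two lowest bits.
fn pack_2bit(seq: &[u8]) -> Vec<u8> {
    let mut out = vec![0u8; (seq.len() + 3) / 4];
    for (i, &b) in seq.iter().enumerate() {
        out[i / 4] |= encode_base(b) << (2 * (i % 4));
    }
    out
}

/// Columnar layout: one bitmask holds the high bit of every base, another
/// holds the low bit. Base `i` lives at bit `i % 64` of word `i / 64`.
fn pack_columnar(seq: &[u8]) -> (Vec<u64>, Vec<u64>) {
    let words = (seq.len() + 63) / 64;
    let mut high = vec![0u64; words];
    let mut low = vec![0u64; words];
    for (i, &b) in seq.iter().enumerate() {
        let code = encode_base(b) as u64;
        high[i / 64] |= (code >> 1) << (i % 64);
        low[i / 64] |= (code & 1) << (i % 64);
    }
    (high, low)
}

fn main() {
    let seq = b"ACTGGA";
    println!("packed:   {:?}", pack_2bit(seq));
    println!("columnar: {:?}", pack_columnar(seq));
}
```

The packed form keeps the two bits of a base adjacent, which suits k-mer extraction, while the columnar form lets bitwise operations act on the high or low bits of 64 bases at once.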
Since each base is encoded using two bits, we have to handle non-ACTG bases differently. Three options are available for that:
- split_non_actg splits the sequence into contiguous chunks of ACTG bases, stopping the iterator at each chunk.
- skip_non_actg skips the non-ACTG bases and merges the remaining chunks together, stopping once at the end of the record.
- keep_non_actg keeps the non-ACTG bases and encodes them with a lossy representation.
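
The effect of the three options on a single sequence can be sketched in plain Rust as follows. This is only an illustration of the semantics described above, not helicase code: the helper names are hypothetical, and the lossy encoding shown for the "keep" case (reusing `(byte >> 1) & 3`, so that `N` aliases to the same code as `G`) is an assumption, since the actual lossy representation is not specified here.

```rust
/// Returns true for A, C, T, G in either case.
fn is_actg(b: u8) -> bool {
    matches!(b.to_ascii_uppercase(), b'A' | b'C' | b'T' | b'G')
}

/// "split" semantics: one contiguous ACTG chunk per iterator stop.
fn split_chunks(seq: &[u8]) -> Vec<&[u8]> {
    seq.split(|&b| !is_actg(b))
        .filter(|chunk| !chunk.is_empty())
        .collect()
}

/// "skip" semantics: non-ACTG bases are dropped and the rest is merged.
fn skip_merge(seq: &[u8]) -> Vec<u8> {
    seq.iter().copied().filter(|&b| is_actg(b)).collect()
}

/// "keep" semantics with a hypothetical lossy 2-bit code per base:
/// every byte gets a code, so non-ACTG bases collide with real ones.
fn keep_lossy(seq: &[u8]) -> Vec<u8> {
    seq.iter().map(|&b| (b >> 1) & 3).collect()
}

fn main() {
    let seq = b"ACNNGTNA";
    for chunk in split_chunks(seq) {
        println!("chunk: {}", String::from_utf8_lossy(chunk));
    }
    println!("merged: {}", String::from_utf8_lossy(&skip_merge(seq)));
    println!("lossy codes: {:?}", keep_lossy(seq));
}
```

Splitting preserves where the ambiguous bases were, at the cost of more iterator stops; skipping yields one sequence per record but loses the positions; keeping preserves the length but not the identity of the non-ACTG bases.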
use helicase::input::*;
use helicase::*;
const CONFIG: Config = ParserOptions::default()
    .dna_packed()
    // don't stop the iterator at the end of a record
    .return_record(false)
    .config();
fn main() {
    let path = "...";
    let mut parser = FastxParser::<CONFIG>::from_file(&path).expect("Cannot open file");
    // iterate over each chunk of ACTG
    while let Some(_event) = parser.next() {
        // we still have access to the header
        let header = parser.get_header();
        // get a reference to the packed sequence
        let seq = parser.get_dna_packed();
        // or directly get a PackedSeq (requires the packed-seq feature)
        let packed_seq = parser.get_packed_seq();
        // ...
    }
}

This library supports transparent file decompression using deko. You can choose the supported formats using the following features:
- bz2 for bzip2 (disabled by default)
- gz for gzip (enabled by default)
- xz for xz (disabled by default)
- zstd for zstd (enabled by default)

The PackedDNA format is compatible with packed-seq and can be converted when the packed-seq feature is enabled (disabled by default).
This can be useful for hashing k-mers or computing minimizers & syncmers.
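
As a rough illustration of why a 2-bit encoding helps here, the sketch below builds integer k-mers from 2-bit base codes with a rolling shift and picks the smallest k-mer in each window as its minimizer. It is written in plain Rust and does not use helicase or packed-seq; the function names are hypothetical, and it orders k-mers by their integer value rather than by a hash, which a real minimizer scheme would typically use.

```rust
/// Builds every k-mer of `codes` (2-bit values, one per base) as a u64,
/// updating the previous k-mer with a shift instead of re-reading k bases.
fn kmers(codes: &[u8], k: usize) -> Vec<u64> {
    assert!(k >= 1 && k <= 32, "a u64 holds at most 32 bases");
    let mask = if k == 32 { u64::MAX } else { (1u64 << (2 * k)) - 1 };
    let mut out = Vec::new();
    let mut kmer = 0u64;
    for (i, &c) in codes.iter().enumerate() {
        kmer = ((kmer << 2) | (c & 3) as u64) & mask;
        if i + 1 >= k {
            out.push(kmer);
        }
    }
    out
}

/// For each window of `w` consecutive k-mers, records the position of the
/// smallest one (leftmost on ties), without repeating consecutive positions.
fn minimizer_positions(kmers: &[u64], w: usize) -> Vec<usize> {
    assert!(w >= 1);
    let mut out: Vec<usize> = Vec::new();
    for (start, window) in kmers.windows(w).enumerate() {
        let (offset, _) = window
            .iter()
            .enumerate()
            .min_by_key(|&(_, &value)| value)
            .expect("windows are never empty");
        let pos = start + offset;
        if out.last() != Some(&pos) {
            out.push(pos);
        }
    }
    out
}

fn main() {
    // Codes for "ACGTAC" under the illustrative mapping A=0, C=1, T=2, G=3.
    let codes = [0u8, 1, 3, 2, 0, 1];
    let ks = kmers(&codes, 2);
    println!("k-mers: {:?}", ks);
    println!("minimizer positions: {:?}", minimizer_positions(&ks, 3));
}
```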
Benchmarks against needletail and paraseq are available in the bench directory.
You can run them on any (possibly compressed) FASTA/FASTQ file using:
RUSTFLAGS="-C target-cpu=native" cargo r -r --bin bench -- <file>

For instance, you can run it on this human genome, these short reads or these long reads.
Note that the FASTQ files can easily be converted to FASTA using:
RUSTFLAGS="-C target-cpu=native" cargo r -r --example fq_to_fa -- <file.fastq>

More information in the bench README.
This project was started by Loup Lobet during his internship with Charles Paperman.