
imartayan/helicase

Helicase

Helicase is a carefully optimized FASTA/FASTQ parser that extensively uses vectorized instructions.

It is designed for three main goals: being highly configurable, handling non-ACTG bases and computing bitpacked representations of DNA.

Documentation

Requirements

This library requires the AVX2 or NEON instruction set. Make sure to enable target-cpu=native when using it:

RUSTFLAGS="-C target-cpu=native" cargo run --release
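To check that the flag took effect, you can inspect the target features seen by the compiler. The sketch below relies only on the standard cfg! macro; the helper name compiled_simd is hypothetical and is not part of helicase.

```rust
/// Hypothetical helper (not part of helicase): reports which of the two
/// required instruction sets was enabled when this code was compiled.
fn compiled_simd() -> Option<&'static str> {
    if cfg!(target_feature = "avx2") {
        // true on x86_64 when built with target-cpu=native on an AVX2 machine
        Some("avx2")
    } else if cfg!(target_feature = "neon") {
        // true on ARM targets where NEON is enabled
        Some("neon")
    } else {
        // neither instruction set is enabled for this build
        None
    }
}
```

If this returns None in your build, the RUSTFLAGS setting above was most likely not picked up.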

Usage

Minimal example

use helicase::input::*;
use helicase::*;

// set the options of the parser (at compile-time)
const CONFIG: Config = ParserOptions::default().config();

fn main() {
    let path = "...";

    // create a parser with the desired options
    let mut parser = FastxParser::<CONFIG>::from_file(&path).expect("Cannot open file");

    // iterate over records
    while let Some(_event) = parser.next() {
        // get a reference to the header
        let header = parser.get_header();

        // get a reference to the sequence (without newlines)
        let seq = parser.get_dna_string();

        // ...
    }
}

Adjusting the configuration

The parser supports options that can be adjusted in the ParserOptions. For instance, if you don't need to look at the headers and you want to skip non-ACTG bases, you can change the configuration to:

const CONFIG: Config = ParserOptions::default()
    .ignore_headers()
    .skip_non_actg()
    .config();

Bitpacked DNA formats

The parser can output a bitpacked representation of the sequence in two different formats:

  • PackedDNA which maps each base to two bits and packs them (compatible with packed-seq using the corresponding feature).
  • ColumnarDNA which separates the high bit and the low bit of each base, and stores them in two bitmasks.
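To make the difference between the two layouts concrete, here is a minimal scalar sketch. It is not the crate's vectorized implementation. It assumes the mapping A=0, C=1, T=2, G=3, four bases per byte with the first base in the lowest bits for the packed layout, and one bit per base in 64-bit words for the columnar layout. The function names encode_base, pack and columnar are hypothetical.

```rust
/// Map a base to a 2-bit code.
/// ASSUMPTION: A=0, C=1, T=2, G=3; the crate's actual mapping may differ.
fn encode_base(b: u8) -> Option<u8> {
    match b {
        b'A' | b'a' => Some(0),
        b'C' | b'c' => Some(1),
        b'T' | b't' => Some(2),
        b'G' | b'g' => Some(3),
        _ => None, // non-ACTG bases do not fit in two bits
    }
}

/// Packed layout: four bases per byte, first base in the lowest two bits.
/// Returns None if the sequence contains a non-ACTG base.
fn pack(seq: &[u8]) -> Option<Vec<u8>> {
    let mut out = vec![0u8; (seq.len() + 3) / 4];
    for (i, &b) in seq.iter().enumerate() {
        out[i / 4] |= encode_base(b)? << (2 * (i % 4));
    }
    Some(out)
}

/// Columnar layout: the high bits and the low bits of the codes are stored
/// in two separate bitmasks, 64 bases per u64 word.
/// Returns (high, low), or None if the sequence contains a non-ACTG base.
fn columnar(seq: &[u8]) -> Option<(Vec<u64>, Vec<u64>)> {
    let words = (seq.len() + 63) / 64;
    let mut high = vec![0u64; words];
    let mut low = vec![0u64; words];
    for (i, &b) in seq.iter().enumerate() {
        let code = encode_base(b)? as u64;
        high[i / 64] |= (code >> 1) << (i % 64);
        low[i / 64] |= (code & 1) << (i % 64);
    }
    Some((high, low))
}
```

Under these assumptions, pack(b"ACTG") yields the single byte 0b1110_0100, while columnar(b"ACTG") yields the bitmasks 0b1100 (high bits) and 0b1010 (low bits).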

Since each base is encoded using two bits, we have to handle non-ACTG bases differently. Three options are available for that:

  • split_non_actg splits the sequence into contiguous chunks of ACTG bases, stopping the iterator at each chunk.
  • skip_non_actg skips the non-ACTG bases and merges the remaining chunks together, stopping once at the end of the record.
  • keep_non_actg keeps the non-ACTG bases and encodes them with a lossy representation.
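The effect of the three options can be illustrated on a plain byte string. The sketch below only mimics their semantics and does not use the parser; the function names are hypothetical, only uppercase bases are treated as ACTG, and replacing non-ACTG bases with A in keep_lossy is an arbitrary stand-in for the lossy encoding, which is not specified here.

```rust
/// True for the four uppercase bases that fit a 2-bit encoding.
fn is_actg(b: u8) -> bool {
    matches!(b, b'A' | b'C' | b'T' | b'G')
}

/// Semantics of split_non_actg: one contiguous ACTG chunk at a time.
fn split_chunks(seq: &[u8]) -> Vec<&[u8]> {
    seq.split(|&b| !is_actg(b))
        .filter(|chunk| !chunk.is_empty())
        .collect()
}

/// Semantics of skip_non_actg: drop non-ACTG bases and merge what remains.
fn skip_merge(seq: &[u8]) -> Vec<u8> {
    seq.iter().copied().filter(|&b| is_actg(b)).collect()
}

/// Semantics of keep_non_actg: keep every position, losing information.
/// ASSUMPTION: non-ACTG bases become 'A' here purely for illustration.
fn keep_lossy(seq: &[u8]) -> Vec<u8> {
    seq.iter()
        .map(|&b| if is_actg(b) { b } else { b'A' })
        .collect()
}
```

For the input ACNNGT, split_chunks returns the two chunks AC and GT, skip_merge returns ACGT, and keep_lossy returns ACAAGT, so only the last option preserves the original positions.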

Iterating over chunks of packed DNA

use helicase::input::*;
use helicase::*;

const CONFIG: Config = ParserOptions::default()
    .dna_packed()
    // don't stop the iterator at the end of a record
    .return_record(false)
    .config();

fn main() {
    let path = "...";

    let mut parser = FastxParser::<CONFIG>::from_file(&path).expect("Cannot open file");

    // iterate over each chunk of ACTG
    while let Some(_event) = parser.next() {
        // we still have access to the header
        let header = parser.get_header();

        // get a reference to the packed sequence
        let seq = parser.get_dna_packed();

        // or directly get a PackedSeq (requires the packed-seq feature)
        let packed_seq = parser.get_packed_seq();

        // ...
    }
}

Crate features

Decompression

This library supports transparent file decompression using deko. You can choose the supported formats using the following features:

  • bz2 for bzip2 (disabled by default)
  • gz for gzip (enabled by default)
  • xz for xz (disabled by default)
  • zstd for zstd (enabled by default)

Packed-seq

The PackedDNA format is compatible with packed-seq and can be converted when the packed-seq feature is enabled (disabled by default).

This can be useful for hashing k-mers or computing minimizers & syncmers.
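As a rough illustration of why a 2-bit representation helps here, the sketch below enumerates k-mers as integers directly from a packed buffer. It does not use packed-seq or helicase: the function names get_code and kmers are hypothetical, and the layout (four bases per byte, first base in the lowest two bits) is an assumption.

```rust
/// Read the 2-bit code of base `i` from a packed buffer.
/// ASSUMPTION: four bases per byte, first base in the lowest two bits.
fn get_code(packed: &[u8], i: usize) -> u64 {
    ((packed[i / 4] >> (2 * (i % 4))) & 3) as u64
}

/// Enumerate every k-mer of a packed sequence of `len` bases as a u64,
/// with the first base of the k-mer in the most significant position.
fn kmers(packed: &[u8], len: usize, k: usize) -> Vec<u64> {
    assert!(k >= 1 && k <= 32, "a k-mer must fit in 64 bits");
    // mask keeping only the lowest 2*k bits
    let mask = if k == 32 { u64::MAX } else { (1u64 << (2 * k)) - 1 };
    let mut out = Vec::new();
    let mut kmer = 0u64;
    for i in 0..len {
        // shift in the next base and drop the oldest one
        kmer = ((kmer << 2) | get_code(packed, i)) & mask;
        if i + 1 >= k {
            out.push(kmer);
        }
    }
    out
}
```

Each step costs one shift, one OR and one mask, with no per-base branching on the characters themselves. The resulting integers can be fed to any hash function.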

Benchmarks

Benchmarks against needletail and paraseq are available in the bench directory. You can run them on any (possibly compressed) FASTA/FASTQ file using:

RUSTFLAGS="-C target-cpu=native" cargo r -r --bin bench -- <file>

For instance, you can run it on this human genome, these short reads or these long reads.

Note that the FASTQ files can easily be converted to FASTA using:

RUSTFLAGS="-C target-cpu=native" cargo r -r --example fq_to_fa -- <file.fastq>

More information is available in the bench README.

Acknowledgements

This project was initially started by Loup Lobet during his internship with Charles Paperman.
