A high-performance Rust tool for filtering DNA/RNA reads based on a set of reference k-mers. Inspired by BBDuk by Brian Bushnell. It provides fast, memory-efficient read processing and supports both paired and unpaired FASTA/FASTQ input, with paired reads supplied either as separate files or in interleaved format.
K-mer based read filtering:
- Reads are compared to reference sequences by matching k-mers.
- If a read has at least x k-mers that are also found in the reference dataset, it is a match.
- x is 1 by default and can be changed with `--minhits <int>`.
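The matching rule can be sketched in a few lines of standard-library Rust. This is only an illustration: the helper names `index_kmers` and `is_match` are hypothetical, k-mers are stored as plain byte vectors rather than the bit-packed k-mers Nucleaze gets from Needletail, and reverse complements are ignored.

```rust
use std::collections::HashSet;

/// Collect every k-mer of a reference sequence (uppercased) into a set.
/// Hypothetical helper, not part of Nucleaze.
fn index_kmers(reference: &[u8], k: usize) -> HashSet<Vec<u8>> {
    if k == 0 || reference.len() < k {
        return HashSet::new();
    }
    reference
        .windows(k)
        .map(|kmer| kmer.to_ascii_uppercase())
        .collect()
}

/// Return true as soon as `min_hits` k-mers of `read` are found in `index`.
/// A `min_hits` of 0 behaves like 1: at least one hit is still required.
fn is_match(read: &[u8], index: &HashSet<Vec<u8>>, k: usize, min_hits: usize) -> bool {
    if k == 0 || read.len() < k {
        return false; // read is too short to contain a single k-mer
    }
    let mut hits = 0;
    for kmer in read.windows(k) {
        if index.contains(&kmer.to_ascii_uppercase()) {
            hits += 1;
            if hits >= min_hits {
                return true; // stop early once the threshold is reached
            }
        }
    }
    false
}
```

With an index built from `ACGTACGTTT` and k = 4, the read `GGACGTGG` shares exactly one k-mer (`ACGT`) with the reference, so it matches at the default threshold of 1 but not at a threshold of 2.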
Piping:
- Use `--in stdin` to pipe from stdin
- Use `--outm/outu/outm2/outu2 stdout.fa/stdout.fq` to pipe results to stdout
Paired reads support:
- Paired inputs and outputs can be specified by adding more input/output files
- Interleaved inputs and outputs are supported; signify interleaved input with `--interinput`
- Automatic detection of input/output mode
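In an interleaved file, each read is immediately followed by its mate. A minimal sketch of splitting such a stream back into pairs is shown below; the function name `pair_interleaved` is hypothetical, and rejecting an odd record count is an assumption, since how Nucleaze handles a trailing unpaired read is not documented here.

```rust
/// Pair up consecutive records from an interleaved stream (R1, R2, R1, R2, ...).
/// Assumption: an odd number of records is treated as an error.
fn pair_interleaved<T: Clone>(records: &[T]) -> Result<Vec<(T, T)>, String> {
    if records.len() % 2 != 0 {
        return Err(format!(
            "odd number of records ({}): the last read has no mate",
            records.len()
        ));
    }
    Ok(records
        .chunks_exact(2)
        .map(|pair| (pair[0].clone(), pair[1].clone()))
        .collect())
}
```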
Multithreading with Rayon:
- Adjustable thread count via the `--threads` argument
- Defaults to all available CPU cores
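The defaulting behaviour can be sketched as follows. This uses the standard library's `std::thread::available_parallelism` in place of the Num-Cpus crate that Nucleaze lists as a dependency, and `resolve_threads` is a hypothetical name. Treating a request of 0 as "use the default" is also an assumption.

```rust
use std::thread;

/// Resolve the worker count: an explicit positive request wins, otherwise
/// use every core the OS reports, falling back to 1 if that query fails.
fn resolve_threads(requested: Option<usize>) -> usize {
    match requested {
        Some(n) if n > 0 => n,
        _ => thread::available_parallelism()
            .map(|n| n.get())
            .unwrap_or(1),
    }
}
```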
Memory Limit:
- Specify maximum memory usage with `--maxmem <String>` (e.g., `5G` for 5 gigabytes, `500M` for 500 megabytes)
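A size string of this shape could be parsed roughly as shown below. The function `parse_mem_limit` is hypothetical, and two details are assumptions rather than documented behaviour: suffixes are read as binary multiples (1K = 1024 bytes), and a bare number is read as bytes.

```rust
/// Parse a size such as "5G" or "500M" into a byte count.
/// Assumption: K/M/G are binary multiples; Nucleaze itself may use decimal ones.
fn parse_mem_limit(text: &str) -> Option<u64> {
    let text = text.trim();
    // The suffixes are single-byte ASCII, so slicing off the last byte is safe.
    let (digits, multiplier) = match text.chars().last()? {
        'K' | 'k' => (&text[..text.len() - 1], 1u64 << 10),
        'M' | 'm' => (&text[..text.len() - 1], 1u64 << 20),
        'G' | 'g' => (&text[..text.len() - 1], 1u64 << 30),
        _ => (text, 1),
    };
    // Reject non-numeric input and values that would overflow u64.
    digits.parse::<u64>().ok()?.checked_mul(multiplier)
}
```

Under these assumptions, `parse_mem_limit("5G")` yields `Some(5_368_709_120)` and `parse_mem_limit("500M")` yields `Some(524_288_000)`.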
Automatic Reference Indexing:
- If `--binref <file>` is provided, builds a serialized reference k-mer index with Bincode from the references given with `--ref <file>`
- Uses the saved index on subsequent runs if `--binref <file>` is included
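The build-once, reuse-later pattern might look like the sketch below. It is not Nucleaze's implementation: instead of Bincode it writes each k-mer as 8 little-endian bytes so that the example needs only the standard library, the name `load_or_build_index` is hypothetical, and representing k-mers as `u64` values is an assumption.

```rust
use std::collections::HashSet;
use std::convert::TryInto;
use std::fs;
use std::io;
use std::path::Path;

/// Load a cached k-mer index if `cache` exists; otherwise build it with
/// `build`, save it to `cache`, and return it.
fn load_or_build_index<F>(cache: &Path, build: F) -> io::Result<HashSet<u64>>
where
    F: FnOnce() -> HashSet<u64>,
{
    if cache.exists() {
        // Subsequent runs: decode the saved index and skip the build step.
        let bytes = fs::read(cache)?;
        return Ok(bytes
            .chunks_exact(8)
            .map(|chunk| u64::from_le_bytes(chunk.try_into().expect("chunk is 8 bytes")))
            .collect());
    }
    // First run: build from the references and persist for next time.
    let index = build();
    let mut bytes = Vec::with_capacity(index.len() * 8);
    for kmer in &index {
        bytes.extend_from_slice(&kmer.to_le_bytes());
    }
    fs::write(cache, bytes)?;
    Ok(index)
}
```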
If using UNIX, run this command and follow the ensuing instructions:
`curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh`
If using Windows, download the correct installer from Rustup.
`brew install nucleaze`
Nucleaze requires `--in` and at least one of `--ref` or `--binref` to be provided.
See more parameter documentation at ./src/main.rs
`./nucleaze --in reads.fq --ref refs.fa --outm matched.fq --outu unmatched.fq --k 21`
This command:
- Builds a 21-mer index from the `refs.fa` sequences
- Reads input reads from `reads.fq` in chunks of 10,000
- Processes each read into 21-mers and checks them against the reference index
- Outputs matched reads to `matched.fq` and unmatched reads to `unmatched.fq`
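The chunk-then-classify flow described above can be sketched with scoped threads from the standard library. Nucleaze uses Rayon's thread pool, whereas this simplified version spawns one thread per chunk, so it only illustrates the shape of the pipeline. The name `partition_reads` and the use of plain `String` reads are assumptions.

```rust
use std::thread;

/// Split `reads` into chunks, classify each chunk on its own thread, and
/// return (matched, unmatched) with the input order preserved.
fn partition_reads<F>(reads: &[String], chunk_size: usize, is_match: F) -> (Vec<String>, Vec<String>)
where
    F: Fn(&str) -> bool + Sync,
{
    let flags: Vec<Vec<bool>> = thread::scope(|scope| {
        // Spawn all workers first, then join them in order.
        let handles: Vec<_> = reads
            .chunks(chunk_size.max(1))
            .map(|chunk| {
                let is_match = &is_match;
                scope.spawn(move || chunk.iter().map(|r| is_match(r)).collect::<Vec<bool>>())
            })
            .collect();
        handles
            .into_iter()
            .map(|handle| handle.join().expect("worker thread panicked"))
            .collect()
    });

    let mut matched = Vec::new();
    let mut unmatched = Vec::new();
    for (read, hit) in reads.iter().zip(flags.into_iter().flatten()) {
        if hit {
            matched.push(read.clone());
        } else {
            unmatched.push(read.clone());
        }
    }
    (matched, unmatched)
}
```

Joining the workers in spawn order keeps each output in the same order as the input, which makes results reproducible regardless of how the threads are scheduled.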
Output files support either FASTA or FASTQ format.
- Output defaults to FASTQ unless the file extension is `.fa`, `.fna`, or `.fasta`
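The extension rule is simple enough to express directly. In this sketch the `OutputFormat` enum and `output_format` function are hypothetical names, and the comparison is case-sensitive, which is an assumption.

```rust
use std::path::Path;

#[derive(Debug, PartialEq)]
enum OutputFormat {
    Fasta,
    Fastq,
}

/// Choose the output format from the file extension: FASTA for
/// .fa/.fna/.fasta, FASTQ for everything else (including no extension).
fn output_format(path: &str) -> OutputFormat {
    match Path::new(path).extension().and_then(|ext| ext.to_str()) {
        Some("fa") | Some("fna") | Some("fasta") => OutputFormat::Fasta,
        _ => OutputFormat::Fastq,
    }
}
```

The same rule would cover the stdout targets, since `stdout.fa` and `stdout.fq` carry the extensions that select FASTA and FASTQ respectively.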
Example console output:
Indexing time: 0.040 seconds
Processing reads from reads.fq using 14 threads
Input and output is processed as interleaved
Processing time: 0.110 seconds
Input: 1000000 reads 150000000 bases
Matches: 20000 reads (2.00%) 3000000 bases (2.00%)
Nonmatches: 980000 reads (98.00%) 147000000 bases (98.00%)
Time: 0.150 seconds
Reads Processed: 1.00m reads 6.65m reads/sec
Bases Processed: 150.00m bases 1000.00m bases/sec
Nucleaze automatically detects the appropriate read handling mode:
| Mode | Description |
|---|---|
| Unpaired | Single input file, unpaired reads |
| Paired | Two input files, separate outputs |
| PairedInInterOut | Two input files, interleaved output |
| InterInPairedOut | Interleaved input file, separate paired outputs |
| Interleaved | Interleaved input and output |
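One way such detection could work is sketched below. The mode names come from the table, but the decision logic is an assumption: it supposes the mode follows from whether a second input file was given, whether `--interinput` was set, and whether second output files (`--outm2`/`--outu2`) were given. The actual rules live in the Nucleaze source and may differ.

```rust
#[derive(Debug, PartialEq)]
enum ReadMode {
    Unpaired,
    Paired,
    PairedInInterOut,
    InterInPairedOut,
    Interleaved,
}

/// Pick a read handling mode from the shape of the inputs and outputs.
/// Hypothetical logic: the three flags are assumed to fully determine the mode.
fn detect_mode(two_inputs: bool, interleaved_input: bool, paired_outputs: bool) -> ReadMode {
    match (two_inputs, interleaved_input, paired_outputs) {
        (true, _, true) => ReadMode::Paired,
        (true, _, false) => ReadMode::PairedInInterOut,
        (false, true, true) => ReadMode::InterInPairedOut,
        (false, true, false) => ReadMode::Interleaved,
        (false, false, _) => ReadMode::Unpaired,
    }
}
```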
This project is licensed under the MIT License; see LICENSE for details.
There is plenty of room for improvement here, so new additions and suggestions are welcome!
- Needletail — FASTA/FASTQ file parsing and bitkmer operations
- Bincode — K-mer hashset serialization/deserialization
- Rayon — Multithreading
- Clap — CLI
- Num-Cpus — detection of available threads
- Sysinfo — system memory information
- Crossbeam — asynchronous channels
