KaMRaT is a C++ tool for finding substrings with interesting properties in large NGS datasets.
KaMRaT requires a k-mer count matrix extracted from the NGS files (e.g. with Jellyfish), and potential design table or FASTA file (see below for more information).
KaMRaT then provides a set of tools for reducing the k-mer matrix and extending k-mers to longer contigs. The main modules are:
-
kamrat index: index feature* count table on disk
-
kamrat filter: remove/retain features* by expression level
-
kamrat mask: select and suppress k-mers matching given fasta sequences
-
kamrat merge: merge k-mers into contigs
-
kamrat score: score features* by classification performance, statistical significance, correlation, or variability
-
kamrat query: estimate count vectors of given list of contigs
Note: * features can be not only k-mers or k-mer contigs, but also general features such as genes or transcripts.
KaMRaT means "k-mer Matrix Reduction Toolkit", or "k-mer Matrix, Really Tremendous !".
KaMRaT has been published on Bioinformatics.
Before using KaMRaT, please refer to our License.
Note: this page contains a TL;DR version of the software installation and usages. Please refer to our Wiki page for detailed information.
KaMRaT is designed as a flexible toolkit for combinations of different operations. The workflow always starts from index, then other operations can be combined according to user's analysis design, including:
- a single operation of filter, mask, merge, score or query;
- filter-merge or filter-score;
- mask-merge or mask-score;
- merge-score;
- score-merge.
Currently, KaMRaT only accepts k-mers no longer than 32nt, since the k-mers are coded in an uint64 variable.
Besides, we recommend users to choose k as an odd number, to avoid confounding one k-mer with its reverse complement counterpart in unstranded data. For example, in the situation k=6, 6-mers such as AAATTT lose their information of strandedness.
The current release of KaMRaT is v1.2. Compared to its previous release, v1.1, it introduces several new characteristics:
- Each module of
filter,mergeandscorenow supports outputting a fasta file containing selected/merged sequences. - The output tables of all modules
filter,mask,merge,scoreandquerynow output the count table with values being rounted to the nearest integers. The decimal values can be output by setting-countsargument. - The
indexmodule now allows a normalisation factor base too large or too small, instead of throwing an exception, it puts a warning. - The
maskmodule now allows simultaneously indicating sequences to select and suppress. - Updated license.
- Bugfix: the
mergemodule was mistakenly output '<' as fasta header leading character, corrected to '>'.
For full release notes, please refer to our wiki page.
It's highly recommended to directly use KaMRaT within apptainer/singularity container for users at any level unless the task involves in software development, because:
- Building the container is simple and only requires the dependency of
apptainer/singularitywhich should be pre-installed on most of HPC clusters. - Usage of KaMRaT in container ensures better reproducibility of results.
- Several companion scripts can be easily run within the container, e.g., for input k-mer matrix construction.
Pre-built image is provided on DockerHub. To build in local, please run:
apptainer build KaMRaT.sif docker://xuehl/kamrat:latest
# alternatively to build singularity image: simply replace "apptainer" to "singularity"Please refer to our Wiki page for:
- more detailed information on KaMRaT usage within
apptainer/singularitycontainer; - alternatively way of installation by building the software from source.
KaMRaT can be run generally in the fashion:
apptainer exec -B /bind_src:/bind_des kamrat <CMD> [options] /path/from/{bind_des}/to/input/kmer/table
# <CMD> can be one of index, filter, mask, merge, score, query
# replace "apptainer" to "singularity" when KaMRaT is built by singularityThe top-level helper is reachable by:
apptainer exec kamratHelpers of specific KaMRaT modules are accessible via:
apptainer exec kamrat <CMD>
# <CMD> can be one of index, filter, mask, merge, score, queryPlease refer to our Wiki page for detailed software usage description.
Demostrations of example usecases can be found here.
Armadillo:
- Conrad Sanderson and Ryan Curtin. Armadillo: a template-based C++ library for linear algebra. Journal of Open Source Software, Vol. 1, pp. 26, 2016.
- Conrad Sanderson and Ryan Curtin. A User-Friendly Hybrid Sparse Matrix Class in C++. Lecture Notes in Computer Science (LNCS), Vol. 10931, pp. 422-430, 2018.
DE-kupl: Audoux, J., Philippe, N., Chikhi, R. et al. DE-kupl: exhaustive capture of biological variation in RNA-seq data through k-mer decomposition. Genome Biol 18, 243, 2017.
MLPack: R.R. Curtin, M. Edel, M. Lozhnikov, Y. Mentekidis, S. Ghaisas, S. Zhang. mlpack 3: a fast, flexible machine learning library. Journal of Open Source Software 3:26, 2018.
