This repository provides a learning implementation of similarity estimation between genomic sequences using MinHash, inspired by Mash (https://github.com/marbl/Mash; paper: https://doi.org/10.1186/s13059-016-0997-x).
- C++ compiler (e.g. g++, clang++)
Each DNA sequence is split into overlapping k-mers. For a sequence of length L, there are (L − k + 1) k-mers. Representing a sequence by its k-mers turns it into a set of fixed-length features that can be hashed and compared. For each k-mer, the reverse complement is computed and only the lexicographically smaller of the two (the canonical k-mer) is kept, so that a sequence and its reverse complement yield the same k-mers.
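The k-mer step described above can be sketched roughly as follows. The function names `reverse_complement` and `canonical_kmers` are illustrative and not necessarily those used in this repository, and mapping non-ACGT characters to `N` is an assumption made for this example.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Reverse complement of a DNA string (A<->T, C<->G).
// Assumption: any other character is mapped to 'N'.
std::string reverse_complement(const std::string& kmer) {
    std::string rc(kmer.rbegin(), kmer.rend());
    for (char& base : rc) {
        switch (base) {
            case 'A': base = 'T'; break;
            case 'C': base = 'G'; break;
            case 'G': base = 'C'; break;
            case 'T': base = 'A'; break;
            default:  base = 'N'; break;
        }
    }
    return rc;
}

// Returns the L - k + 1 canonical k-mers of seq, in order of position.
// Returns an empty vector if k == 0 or k > L.
std::vector<std::string> canonical_kmers(const std::string& seq, std::size_t k) {
    std::vector<std::string> kmers;
    if (k == 0 || k > seq.size()) return kmers;
    kmers.reserve(seq.size() - k + 1);
    for (std::size_t i = 0; i + k <= seq.size(); ++i) {
        const std::string kmer = seq.substr(i, k);
        const std::string rc = reverse_complement(kmer);
        // Keep the lexicographically smaller of the k-mer and its reverse complement.
        kmers.push_back(std::min(kmer, rc));
    }
    return kmers;
}
```

For example, `canonical_kmers("ACGTT", 3)` yields `ACG`, `ACG`, `AAC`: the second k-mer `CGT` is replaced by its reverse complement `ACG`, and `GTT` by `AAC`.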
Each k-mer is hashed to a 32-bit integer using MurmurHash3, a fast, widely used, non-cryptographic hash function. The code for this is adapted from https://github.com/aappleby/smhasher/blob/master/src/MurmurHash3.cpp
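For orientation, here is a compact sketch of the 32-bit x86 variant of MurmurHash3 as it is commonly described. It is a re-implementation written for this example, not the code in this repository; the name `murmur3_32`, the `std::string` interface, and the default seed of 0 are assumptions. It also assumes a little-endian machine. The upstream file linked above remains the authoritative version.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>

// Rotate a 32-bit value left by r bits (0 < r < 32).
static inline std::uint32_t rotl32(std::uint32_t x, int r) {
    return (x << r) | (x >> (32 - r));
}

// Sketch of MurmurHash3 (x86, 32-bit output). Assumes little-endian byte order.
std::uint32_t murmur3_32(const std::string& key, std::uint32_t seed = 0) {
    const std::uint32_t c1 = 0xcc9e2d51u;
    const std::uint32_t c2 = 0x1b873593u;
    const unsigned char* data = reinterpret_cast<const unsigned char*>(key.data());
    const std::size_t len = key.size();
    const std::size_t nblocks = len / 4;
    std::uint32_t h = seed;

    // Body: mix the input four bytes at a time.
    for (std::size_t i = 0; i < nblocks; ++i) {
        std::uint32_t k;
        std::memcpy(&k, data + 4 * i, 4);  // memcpy avoids unaligned reads
        k *= c1; k = rotl32(k, 15); k *= c2;
        h ^= k; h = rotl32(h, 13); h = h * 5 + 0xe6546b64u;
    }

    // Tail: mix the remaining 0 to 3 bytes.
    const unsigned char* tail = data + 4 * nblocks;
    std::uint32_t k1 = 0;
    switch (len & 3) {
        case 3: k1 ^= static_cast<std::uint32_t>(tail[2]) << 16; /* fall through */
        case 2: k1 ^= static_cast<std::uint32_t>(tail[1]) << 8;  /* fall through */
        case 1: k1 ^= tail[0];
                k1 *= c1; k1 = rotl32(k1, 15); k1 *= c2; h ^= k1;
    }

    // Finalization: fold in the length and apply the avalanche mix.
    h ^= static_cast<std::uint32_t>(len);
    h ^= h >> 16; h *= 0x85ebca6bu;
    h ^= h >> 13; h *= 0xc2b2ae35u;
    h ^= h >> 16;
    return h;
}
```

A canonical k-mer would then be hashed with a call such as `murmur3_32(kmer)`. The same seed must be used for every sequence, otherwise the resulting sketches are not comparable.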
To avoid storing all k-mers, a bottom-s MinHash sketch (also called a “signature”) is created, which makes comparing multiple datasets much faster. The sketch is built as follows:
- Initialize an empty array of size s.
- Insert a new hash only if the sketch is not yet full or the hash is smaller than the largest hash in the sketch, in which case it replaces that largest hash. The sketch is kept sorted so that this check is fast.
The result is the s smallest hash values. Two sequences’ sketches can be compared rapidly in O(s) time, independent of original length.
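The sketch construction above might look roughly like this. This version uses a `std::set` instead of a sorted array, which is a substitution made for brevity, and it ignores duplicate hash values so that the sketch holds distinct hashes. The name `bottom_s_sketch` is hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <iterator>
#include <set>
#include <vector>

// Returns the (at most) s smallest distinct hash values, sorted ascending.
std::vector<std::uint32_t> bottom_s_sketch(const std::vector<std::uint32_t>& hashes,
                                           std::size_t s) {
    std::set<std::uint32_t> sketch;  // ordered; the largest element is *rbegin()
    for (std::uint32_t h : hashes) {
        if (sketch.size() < s) {
            // Sketch not full yet: always insert (duplicates are ignored by the set).
            sketch.insert(h);
        } else if (s > 0 && h < *sketch.rbegin()) {
            // Sketch full: a smaller hash replaces the current largest one,
            // unless it is already present.
            if (sketch.insert(h).second) {
                sketch.erase(std::prev(sketch.end()));
            }
        }
    }
    return std::vector<std::uint32_t>(sketch.begin(), sketch.end());
}
```

For the input hashes `9, 3, 7, 3, 1, 8` and s = 3, this returns `1, 3, 7`. Each insertion costs O(log s), so building a sketch from n hashes takes O(n log s) time and O(s) memory.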
Given two bottom-s sketches
Empirically, the fraction of matching hash values in the sketches approximates the true Jaccard index.
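One way to compute this fraction in O(s) is a merge over the two sorted sketches, following the estimator described in the Mash paper: take the s smallest hashes of the merged sketches and count how many occur in both. The code below is a sketch under that assumption and may differ from how this repository counts matches. The name `estimate_jaccard` is hypothetical, and both inputs are assumed to be sorted ascending with no duplicates.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Estimates the Jaccard index from two sorted, duplicate-free bottom-s sketches.
// Walks the merged order of both sketches for at most s distinct hashes and
// returns the share of those hashes that appear in both sketches.
double estimate_jaccard(const std::vector<std::uint32_t>& a,
                        const std::vector<std::uint32_t>& b,
                        std::size_t s) {
    std::size_t i = 0, j = 0;
    std::size_t seen = 0;    // distinct hashes visited in the merged sketch
    std::size_t shared = 0;  // visited hashes present in both sketches
    while (seen < s && i < a.size() && j < b.size()) {
        if (a[i] == b[j]) { ++shared; ++i; ++j; }
        else if (a[i] < b[j]) { ++i; }
        else { ++j; }
        ++seen;
    }
    // If one sketch ran out early, the leftovers of the other still belong
    // to the merged sketch, capped at s.
    if (seen < s) {
        seen = std::min(s, seen + (a.size() - i) + (b.size() - j));
    }
    return seen == 0 ? 0.0 : static_cast<double>(shared) / static_cast<double>(seen);
}
```

With sketches `1, 3, 7` and `1, 4, 7` and s = 3, the three smallest merged hashes are `1, 3, 4`, of which only `1` is shared, so the estimate is 1/3.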
Assuming a simple Poisson model of substitutions, Mash shows that an estimated Jaccard
This converts the estimated Jaccard index into an approximate mutation rate per base pair, which
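The Mash paper gives this conversion as D = −(1/k) · ln(2j / (1 + j)), where j is the estimated Jaccard index and k is the k-mer length. A minimal sketch of that formula follows. The name `mash_distance` is hypothetical, and returning 1 when j is 0 and clamping the result to at most 1 are conventions assumed for this example.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Mash distance from an estimated Jaccard index j and k-mer length k:
//   D = -(1/k) * ln(2j / (1 + j))
// Assumptions: j <= 0 gives distance 1, j >= 1 gives distance 0,
// and the result is clamped to at most 1.
double mash_distance(double jaccard, int k) {
    assert(k > 0);
    if (jaccard <= 0.0) return 1.0;  // no shared hashes: the logarithm would diverge
    if (jaccard >= 1.0) return 0.0;  // identical sketches
    const double d = -std::log(2.0 * jaccard / (1.0 + jaccard)) / k;
    return std::min(d, 1.0);
}
```

For example, with j = 1/3 the fraction 2j / (1 + j) equals 1/2, so for k = 21 the distance is ln(2) / 21, roughly 0.033.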
- Ondov, B.D., Treangen, T.J., Melsted, P. et al. Mash: fast genome and metagenome distance estimation using MinHash. Genome Biol 17, 132 (2016). https://doi.org/10.1186/s13059-016-0997-x
- Appleby, A. MurmurHash3. https://github.com/aappleby/smhasher/blob/master/src/MurmurHash3.cpp