A pure Python 3 k-mer analysis library. It is agnostic to the type of sequence used and treats all sequences internally as strings.
Python 2 is not supported. For Python 3, the easiest way to install is from PyPI:
pip3 install kmers
First, create a Composition object, supplying the value of k.
Initial data can be supplied either through seq (see self.process) or
fh, which accepts a file-like object and reads the composition data
from a file. If both are omitted, an empty Composition object is
created.
import kmers.kmers as kmers
composition = kmers.Composition(k=3, seq=None, fh=None)
Data can be added to an existing Composition later. You can add either a single SeqRecord object or an iterable of SeqRecords. The amount of data a single Composition can hold is limited only by hardware. In practice, the 12-mer distribution of the complete H. sapiens proteome, including isoforms, takes more than 10 GB of RAM using Cython.
composition.process(seq, update=False)
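For intuition about what accumulating data into a composition involves, here is a minimal standalone sketch of overlapping k-mer counting over plain strings. It does not use the library; `count_kmers` is a hypothetical helper, and the library's own counting and its handling of `update` may differ.

```python
from collections import Counter


def count_kmers(seq, k):
    """Count overlapping k-mers in one sequence (hypothetical helper, not part of kmers)."""
    seq = str(seq)
    # A sequence shorter than k yields an empty range and therefore no k-mers.
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


# Accumulate counts from several sequences into one table,
# analogous to adding sequences to a single composition.
counts = Counter()
for s in ["MKVLA", "KVLAA"]:
    counts.update(count_kmers(s, 3))
```

The table grows with the number of distinct k-mers observed, which is why large values of k on large datasets are memory-hungry.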
Relative and logarithmic distributions are computed lazily, so the first access after adding a sequence may take some time. They are available as Composition attributes that support the dict API:
composition.relative_distribution['CMLD']
composition.log_distribution['CMLD']
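The lazy behaviour described above can be sketched as cached properties that are invalidated whenever new data arrives. This is an illustration only: the class `LazyDistributions`, its `add` method, and the choice of natural logarithm are assumptions, not the library's actual implementation.

```python
import math
from collections import Counter


class LazyDistributions:
    """Hypothetical illustration of lazily computed k-mer distributions."""

    def __init__(self, k):
        self.k = k
        self.counts = Counter()
        self._relative = None  # cache, rebuilt on first access after new data
        self._log = None

    def add(self, seq):
        k = self.k
        self.counts.update(seq[i:i + k] for i in range(len(seq) - k + 1))
        # New data invalidates both caches.
        self._relative = None
        self._log = None

    @property
    def relative_distribution(self):
        if self._relative is None:
            total = sum(self.counts.values())
            self._relative = {kmer: n / total for kmer, n in self.counts.items()}
        return self._relative

    @property
    def log_distribution(self):
        if self._log is None:
            # Natural log is an assumption; the library may use another base.
            self._log = {kmer: math.log(p)
                         for kmer, p in self.relative_distribution.items()}
        return self._log


dist = LazyDistributions(k=2)
dist.add("AAAB")  # counts: AA=2, AB=1
```

Repeated reads between additions return the cached dict, so only the first access after `add` pays the computation cost.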
Given a sequence a_seq, you can find the probability that it was generated by this distribution. Comparing such probabilities across a series of distributions gives a primitive but functional sequence classifier.
p = composition.prob(a_seq)
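The classifier idea can be sketched without the library as follows. This sketch assumes the probability of a sequence is the product of the relative frequencies of its k-mers, computed in log space, with a small floor value for unseen k-mers. Both the scoring rule and the helper names (`relative_freqs`, `log_prob`, `classify`) are assumptions for illustration; the library's `prob` may be defined differently.

```python
import math
from collections import Counter


def relative_freqs(seq, k):
    """Relative k-mer frequencies of one training sequence (hypothetical helper)."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()}


def log_prob(seq, freqs, k, floor=1e-9):
    """Sum of log frequencies of the k-mers in seq; unseen k-mers get `floor`."""
    return sum(math.log(freqs.get(seq[i:i + k], floor))
               for i in range(len(seq) - k + 1))


def classify(seq, models, k):
    """Return the name of the model that gives seq the highest log-probability."""
    return max(models, key=lambda name: log_prob(seq, models[name], k))


# Two toy models built from very different training strings.
models = {
    "poly_a": relative_freqs("AAAAAAAAGA", 2),
    "poly_c": relative_freqs("CCCCCCCCGC", 2),
}
label = classify("AAAA", models, 2)
```

Working in log space avoids numeric underflow when long sequences multiply many small probabilities.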
Given two Composition objects, you can find the distance between them. Currently only n-dimensional Euclidean and feature frequency profile (Sims et al. 2008) distance metrics are supported.
e = kmers.euclidean(comp_a, comp_b)
f = kmers.ffp_distance(comp_a, comp_b)
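To make the two metrics concrete, here is a standalone sketch operating on plain dicts of relative k-mer frequencies. Sims et al. compare feature frequency profiles using the Jensen-Shannon divergence; whether `ffp_distance` uses exactly this form (log base, normalisation) is an assumption. The function names `euclidean_distance` and `js_divergence` are hypothetical and not part of the library.

```python
import math


def euclidean_distance(p, q):
    """Euclidean distance between two frequency dicts; missing k-mers count as 0."""
    keys = set(p) | set(q)
    return math.sqrt(sum((p.get(key, 0.0) - q.get(key, 0.0)) ** 2 for key in keys))


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two frequency dicts."""
    total = 0.0
    for key in set(p) | set(q):
        a = p.get(key, 0.0)
        b = q.get(key, 0.0)
        m = (a + b) / 2  # mixture distribution; positive whenever a or b is
        if a > 0:
            total += 0.5 * a * math.log2(a / m)
        if b > 0:
            total += 0.5 * b * math.log2(b / m)
    return total


# Toy profiles that share one k-mer and differ in the other.
p = {"AA": 0.5, "AB": 0.5}
q = {"AA": 0.5, "BB": 0.5}
e = euclidean_distance(p, q)
f = js_divergence(p, q)
```

With base-2 logarithms the divergence is bounded between 0 (identical profiles) and 1 (profiles with no k-mers in common).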
This code is distributed under the terms of the MIT license. Unrestricted use or modification of the library is allowed provided that the original author (A. A. Morozov) is properly cited. If you use it in a scientific publication, please also cite my abstract from BGRS/SB-2016.