SVMlight text files to scipy CSR

Many sparse datasets are distributed in a lightweight text format called svmlight. While simple and familiar, the format is terribly slow to read in Python, even with C++ loaders, because parsing is serial. svm2csr instead uses a parallel Rust extension that splits the file into byte blocks and seeks to each block so that the blocks are parsed in parallel.
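To make the chunking idea concrete, here is a minimal Python sketch of it. This is not the Rust implementation: the names chunk_offsets, parse_chunk, and load_labels, and the n_chunks parameter, are hypothetical, and the "parsing" only extracts labels. It shows how byte ranges can be aligned to line starts so that each worker can seek to its own block independently.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def chunk_offsets(path, min_chunk_size=16 * 1024, n_chunks=8):
    """Return contiguous (start, end) byte ranges that each begin at a line start."""
    size = os.path.getsize(path)
    if size == 0:
        return []
    # Aim for n_chunks pieces, but never go below min_chunk_size bytes.
    step = max(min_chunk_size, -(-size // n_chunks))
    bounds = [0]
    with open(path, "rb") as f:
        pos = step
        while pos < size:
            f.seek(pos - 1)
            f.readline()  # advance to the first line start at or after pos
            cut = f.tell()
            if cut >= size:
                break
            bounds.append(cut)
            pos = cut + step
    bounds.append(size)
    return list(zip(bounds[:-1], bounds[1:]))


def parse_chunk(path, start, end):
    """Read one byte range and return the label of every line in it."""
    with open(path, "rb") as f:
        f.seek(start)
        data = f.read(end - start)
    return [float(line.split()[0].decode()) for line in data.splitlines() if line.strip()]


def load_labels(path, min_chunk_size=16 * 1024, n_chunks=8):
    """Parse all chunks with a thread pool and concatenate results in file order."""
    ranges = chunk_offsets(path, min_chunk_size, n_chunks)
    with ThreadPoolExecutor() as pool:
        parts = pool.map(lambda r: parse_chunk(path, *r), ranges)
        return [label for part in parts for label in part]
```

Python threads will not speed up pure-Python parsing because of the GIL; the pool is here only to show the structure. The real speedup comes from doing the per-block parsing in native code.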

# benchmark dataset is kdda training set, 2.5GB flat text
# https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html

import sklearn.datasets
%timeit sklearn.datasets.load_svmlight_file('kdda')
1min 56s ± 1.72 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

# https://github.com/mblondel/svmlight-loader
import svmlight_loader
%timeit svmlight_loader.load_svmlight_file('kdda')
1min 52s ± 3.11 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

import svm2csr
%timeit svm2csr.load_svmlight_file('kdda')
11.4 s ± 527 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

The micro-benchmark above was run on my 8-core laptop.

Install

pip install svm2csr

Note that this package is only available pre-built for the Python versions, operating systems, and machine architectures I can build wheels for (see Publishing). Configurations other than the following need Rust installed to compile from source (pip install should still work, but it will compile for your platform).

cp36-cp39, manylinux2010, x86_64

One important difference from SVMlight is that this package lets a feature omit its value, which then defaults to 1. I.e., the line

3.2 3:-0.2 6 7 8:1.0 9

is parsed as having label 3.2 and sparse vector {3: -0.2, 6: 1.0, 7: 1.0, 8: 1.0, 9: 1.0}. This is just done so that Vowpal Wabbit style input data can be accepted without preprocessing.
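The default-value rule can be sketched in a few lines of Python. The helper parse_line below is hypothetical and only illustrates the rule described above; the actual parsing happens in the Rust extension.

```python
def parse_line(line, default_value=1.0):
    """Split one line into (label, {index: value}); a bare index gets default_value."""
    label, *tokens = line.split()
    features = {}
    for token in tokens:
        index, sep, value = token.partition(":")
        # sep is empty when the token has no ":value" part.
        features[int(index)] = float(value) if sep else default_value
    return float(label), features


label, features = parse_line("3.2 3:-0.2 6 7 8:1.0 9")
print(label, features)
# 3.2 {3: -0.2, 6: 1.0, 7: 1.0, 8: 1.0, 9: 1.0}
```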

Unsupported Features

dtype (currently only doubles supported)
an svmlight ranking mode where query ids are identified with qid
comments in svmlight files (start with #)
empty or blank lines
multilabel extension
reading from compressed files
reading from multiple files and stacking
reading from streams
writing SVMlight files
n_features option
graceful client multiprocessing
mac and windows wheels

All of these are fixable (even stream reading, with a parallel bridge). Let me know if you'd like to make a PR.
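Until comments and blank lines are supported, one possible workaround is to clean the file before loading it. The sketch below uses a hypothetical helper, strip_comments_and_blanks, that is not part of this package. It assumes that # only ever starts a comment in your data, and that the cleaned file is then acceptable to the loader.

```python
def strip_comments_and_blanks(src_path, dst_path):
    """Copy src to dst without '#' comments or blank lines; return the lines kept."""
    kept = 0
    with open(src_path) as src, open(dst_path, "w") as dst:
        for line in src:
            # Drop everything from the first '#' on, then surrounding whitespace.
            content = line.split("#", 1)[0].strip()
            if content:
                dst.write(content + "\n")
                kept += 1
    return kept
```

This is a serial pass over the file, so on large inputs it gives back some of the time saved by parallel loading.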

Documentation

def load_svmlight_file(fname, zero_based="auto", min_chunk_size=(16 * 1024)):
    """
    Loads an SVMlight file into a CSR matrix.

    fname (str): the file name of the file to load.
    zero_based ("auto" or bool): whether the corresponding svmlight file uses
        zero-based indexing; if False, or if "auto" and all indices are
        nonzero, shifts indices down uniformly by 1 for Python's zero indexing.
    min_chunk_size (int): minimum chunk size in bytes per
        parallel processing task

    Returns (X, y) where X is a sparse CSR matrix and y is a numpy double array
    with length equal to the number of rows in X. Values of X are doubles.
    """
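The zero_based rule from the docstring can be illustrated on a plain list of column indices. The helper normalize_indices is hypothetical, and raising an error for a 0 index in a one-based file is an assumption of this sketch, not documented behavior of the package.

```python
def normalize_indices(indices, zero_based="auto"):
    """Return indices shifted to zero-based form following the docstring's rule."""
    if zero_based == "auto":
        # "auto": the file counts as zero-based only if some index is 0.
        zero_based = 0 in indices
    if zero_based:
        return list(indices)
    if 0 in indices:
        # Assumption of this sketch: a 0 index cannot occur in a one-based file.
        raise ValueError("index 0 found in a file declared one-based")
    return [i - 1 for i in indices]


print(normalize_indices([1, 3, 9]))                   # [0, 2, 8]
print(normalize_indices([0, 3, 9]))                   # [0, 3, 9]
print(normalize_indices([1, 3, 9], zero_based=True))  # [1, 3, 9]
```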

Dev Info

Install maturin and pytest first.

pip install maturin pytest

Local development.

cargo test # test rust only
maturin develop # create py bindings for rust code
pytest # test python bindings

Publishing

Fetch the most recent master.
Bump the version in Cargo.toml appropriately if needed. Commit these changes.
Tag the release: git tag -a -m "v<CURRENT VERSION>" "v<CURRENT VERSION>"
Push to GitHub, triggering a Travis build that tests, packages, and uploads to PyPI: git push --follow-tags

Every Travis build on master attempts to publish to PyPI (but may fail if a build with the same version is already present).
