Home

Welcome to the docs wiki!

The Accumulo documentation is located elsewhere.

The pan-database NR is a union of the NRs from NCBI, M5NR,
KBase, UniProt, SEED, PATRIC, KEGG,

Papers

Broder, A. Z. On the resemblance and containment of documents. In Compression and Complexity of Sequences 1997. Proceedings, 21-29 (IEEE, 1997). URL http://dx.doi.org/10.1109/sequen.1997.666900. This is a classic paper on the resemblance and containment of documents. It describes concepts such as the Jaccard index and minimum hashing (MinHash). One application in the paper is the detection of duplicate and near-duplicate documents.
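
To make the concepts from the Broder paper concrete, here is a minimal Python sketch of resemblance (the Jaccard index), containment, and a MinHash estimate of resemblance. It is only an illustration, not the code used in our pipeline and not necessarily the exact scheme from the paper: the function names, the 3-word shingle size, the choice of BLAKE2b as the hash, and the 128-hash signature length are all assumptions made for this example.

```python
import hashlib


def shingles(text, k=3):
    """Return the set of k-word shingles (contiguous word windows) of text."""
    words = text.lower().split()
    if len(words) < k:
        # Short texts yield a single shingle (or none if the text is empty).
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def resemblance(a, b):
    """Jaccard index: |A & B| / |A | B|."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def containment(a, b):
    """Fraction of A that also appears in B: |A & B| / |A|."""
    return len(a & b) / len(a) if a else 1.0


def _hash(item, seed):
    """Deterministic 64-bit hash of item, salted with seed (assumed scheme)."""
    data = repr((seed, item)).encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(items, num_hashes=128):
    """Keep the minimum hash value under each of num_hashes salted hashes.

    items must be a non-empty set; min() raises ValueError otherwise.
    """
    return [min(_hash(x, seed) for x in items) for seed in range(num_hashes)]


def estimate_resemblance(sig_a, sig_b):
    """Estimate the Jaccard index as the fraction of matching signature slots."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)


doc_a = shingles("the quick brown fox jumps over the lazy dog")
doc_b = shingles("the quick brown fox jumps over the sleepy cat")

exact = resemblance(doc_a, doc_b)          # 5 shared shingles out of 9 total
contained = containment(doc_a, doc_b)      # 5 of doc_a's 7 shingles are in doc_b
approx = estimate_resemblance(minhash_signature(doc_a), minhash_signature(doc_b))
```

The exact measures need both full shingle sets, while the MinHash signatures are fixed-size and can be compared without the original documents, which is what makes the approach attractive for large collections. The estimate gets closer to the exact value as the number of hashes grows.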

Right now, we are using Accumulo (owner: dan). Accumulo is similar to Cassandra and HBase; all three draw on the design of Google's Bigtable.

Cassandra is also a nice distributed NoSQL database. It provides CQL (Cassandra Query Language), a query language with SQL-like syntax.

The commons is accum1. That's where the group works.
On UC CI's Beagle, the dependencies for the pipeline are installed at /lustre/beagle2/PUP