sean-mccorkle edited this page Mar 10, 2016 · 14 revisions

Welcome to the docs wiki!

The Accumulo documentation is located elsewhere.

The pan-database NR is the union of the NRs from NCBI, M5NR,
KBase, UniProt, SEED, PATRIC, KEGG,

Papers

  • Broder, A. Z. On the resemblance and containment of documents. In Compression and Complexity of Sequences 1997. Proceedings, 21-29 (IEEE, 1997). URL http://dx.doi.org/10.1109/sequen.1997.666900. A classic paper on the resemblance and containment of documents. It describes concepts such as the Jaccard index and minimum hashing (MinHash). One application in the paper is detecting duplicate and near-duplicate documents.
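
To make the concepts in the Broder paper concrete, here is a minimal Python sketch of resemblance (Jaccard index), containment, and a MinHash estimate of resemblance. It is an illustration only, not the pipeline's code. The function names, the word-shingle size `k`, and the use of salted BLAKE2b hashes as a stand-in for random permutations are all assumptions made for this example.

```python
import hashlib


def shingles(text, k=3):
    """Return the set of k-word shingles (contiguous word windows) in text."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(len(words) - k + 1, 1))}


def jaccard(a, b):
    """Resemblance: |A & B| / |A | B|."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def containment(a, b):
    """Containment of A in B: |A & B| / |A|."""
    return len(a & b) / len(a) if a else 1.0


def _hash(item, seed):
    """Deterministic 64-bit hash of a string; the seed selects the hash function."""
    digest = hashlib.blake2b(
        item.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(items, num_hashes=128):
    """For each seeded hash function, keep the minimum hash over a non-empty set of strings."""
    return [min(_hash(x, seed) for x in items) for seed in range(num_hashes)]


def estimate_jaccard(sig_a, sig_b):
    """Fraction of signature positions that agree; this approximates the Jaccard index."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)


if __name__ == "__main__":
    a = shingles("the quick brown fox jumps over the lazy dog")
    b = shingles("the quick brown fox leaps over the lazy dog")
    print("exact resemblance:", jaccard(a, b))
    print("containment of a in b:", containment(a, b))
    print("MinHash estimate:", estimate_jaccard(minhash_signature(a), minhash_signature(b)))
```

The signature has a fixed length regardless of the size of the input set, so two large sets can be compared by looking only at their signatures. More hash functions give a tighter estimate at the cost of more computation.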

Design

Database

Right now, we are using Accumulo (owner: dan). Accumulo is similar to Cassandra and HBase; all three are modeled on Google's Bigtable.

Cassandra is another distributed NoSQL database. It provides CQL (Cassandra Query Language), whose syntax is similar to SQL.

Staging

The commons is accum1; that is where the group works.
On UC CI's Beagle, the pipeline dependencies are installed at /lustre/beagle2/PUP
