Query API

Trying to move toward a more concrete plan before the BNL hackathon.

Definitions, terminology

sequence - string sequence of amino acids
sequence id - a unique identifier for each input FASTA sequence, i.e. taken from the header in the FASTA sequence file
md5 - hexadecimal md5 computed from the protein sequence after converting it to all caps and removing all whitespace and any trailing stop
source_type - one of "reference", "isolate", "metagenome"
- "reference" - one of several reference databases
- "isolate" - an isolated whole-genome assembly; implies a known chromosome or contig, position, and orientation for each gene
- "metagenome" - an ANL-processed metagenome; implies a known contig, position, and orientation for each gene
match_threshold - some to-be-determined low-end match parameters (% identity, length, etc.)
coordinates - nt position of a gene in its contig or chromosome (integer >= 0); applies to gene start and stop
coordinate offset - difference between the coordinates of the query and result genes (not sure about this; it gets complicated with ends)
position rank - integer indicating order of gene in contig or chromosome relative to other genes (> 0)
relative rank - difference between ranks (- or +)
chromosome/contig - unique identifiers for every isolate and metagenome chromosome or contig
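
The rank and offset definitions above can be made concrete with a small sketch. Everything here is an assumption for illustration: the `Gene` record and its field names are hypothetical, and the coordinate offset is taken start-to-start, which is only one of the possible conventions (the "ends" question above is exactly whether to use start, stop, or the nearest ends).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gene:
    """Hypothetical record for one gene on a contig or chromosome."""
    seq_id: str
    contig: str
    start: int  # nt coordinate, integer >= 0
    stop: int   # nt coordinate, integer >= 0
    rank: int   # order of the gene on the contig, > 0


def _require_same_contig(query: Gene, result: Gene) -> None:
    # Ranks and coordinates are only comparable within one contig/chromosome.
    if query.contig != result.contig:
        raise ValueError("genes are on different contigs/chromosomes")


def relative_rank(query: Gene, result: Gene) -> int:
    """Signed rank difference: negative is to the left of the query, positive to the right."""
    _require_same_contig(query, result)
    return result.rank - query.rank


def coordinate_offset(query: Gene, result: Gene) -> int:
    """Signed start-to-start distance in nt (an assumed convention, see lead-in)."""
    _require_same_contig(query, result)
    return result.start - query.start


# Made-up example genes on one contig.
left = Gene("gene_a", "contig_1", start=100, stop=1000, rank=1)
query = Gene("gene_b", "contig_1", start=1200, stop=2100, rank=2)
right = Gene("gene_c", "contig_1", start=2300, stop=3000, rank=3)

print(relative_rank(query, left), relative_rank(query, right))
print(coordinate_offset(query, left), coordinate_offset(query, right))
```

A start-to-start offset ignores gene length and orientation, so two genes that nearly touch can still show a large offset; that is one reason rank may be the simpler quantity to settle on first.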

Data schema

Basal (low-level) utilities

sequence - md5 relationship

  md5 = seq_to_md5( sequence )

  sequence = md5_to_seq( md5 )

sequence - sequence_id relationship

  sequence = seq_id_to_seq( seq_id )

  [seq_id, ...] = seq_to_seq_id( sequence )

seq_id - md5 relationship

  md5 = seq_id_to_md5( seq_id )

  [seq_id, ...] = md5_to_seq_id( md5 )
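
As a rough sketch of how the six mapping functions above fit together, here is an in-memory version. The `SequenceIndex` class, the `normalize` helper, and the `add` method are hypothetical names, not part of the plan. It also assumes that "all caps" refers to the sequence before hashing, that the trailing stop is a single `*`, and that the stored sequence is the normalized one.

```python
import hashlib
from collections import defaultdict


def normalize(sequence: str) -> str:
    """Upper-case, remove all whitespace, drop one trailing stop (assumed to be '*')."""
    cleaned = "".join(sequence.split()).upper()
    return cleaned[:-1] if cleaned.endswith("*") else cleaned


def seq_to_md5(sequence: str) -> str:
    """Hexadecimal md5 of the normalized protein sequence."""
    return hashlib.md5(normalize(sequence).encode("ascii")).hexdigest()


class SequenceIndex:
    """Hypothetical in-memory stand-in for the real lookup tables."""

    def __init__(self):
        self._seq_by_md5 = {}                     # md5 -> normalized sequence
        self._md5_by_seq_id = {}                  # seq_id -> md5
        self._seq_ids_by_md5 = defaultdict(list)  # md5 -> [seq_id, ...]

    def add(self, seq_id: str, sequence: str) -> None:
        md5 = seq_to_md5(sequence)
        self._seq_by_md5[md5] = normalize(sequence)
        self._md5_by_seq_id[seq_id] = md5
        self._seq_ids_by_md5[md5].append(seq_id)

    def md5_to_seq(self, md5: str) -> str:
        return self._seq_by_md5[md5]

    def seq_id_to_md5(self, seq_id: str) -> str:
        return self._md5_by_seq_id[seq_id]

    def md5_to_seq_id(self, md5: str) -> list:
        return list(self._seq_ids_by_md5.get(md5, []))

    def seq_id_to_seq(self, seq_id: str) -> str:
        # Returns the normalized sequence, not the raw FASTA text.
        return self.md5_to_seq(self.seq_id_to_md5(seq_id))

    def seq_to_seq_id(self, sequence: str) -> list:
        return self.md5_to_seq_id(seq_to_md5(sequence))


# Two made-up FASTA entries that differ only in case, whitespace and stop.
index = SequenceIndex()
index.add("isolate_1|gene_7", "mkv l*")
index.add("metagenome_9|gene_2", "MKVL")
print(index.seq_to_seq_id("MKVL"))
```

Only the md5 is a one-to-one key for a sequence; a sequence or md5 can map to several seq_ids, which is why those two lookups return lists.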

Similarity matrix

At the base level, accessed by md5 only.

get_matches( [ list of md5s ], match_threshold )

-> [ [ query_md5_1, [ [ matching_md5, match_score_data ],
                      [ matching_md5, match_score_data ], ... ] ],
     [ query_md5_2, [ [ matching_md5, match_score_data ], ... ] ],
     ... ]

(Using the md5-to-seq_id and md5-to-sequence mapping functions, higher-level versions can accept and return any of these data types.) The query takes a list of md5s because the matrix lookup will almost certainly be highly parallelized.
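
A minimal sketch of the md5-level lookup and of one higher-level wrapper, to make the return shape concrete. The `SIMILARITY` table, the score fields `pct_id` and `length`, the treatment of match_threshold as a dict of minimum values, and the wrapper name `get_matches_by_seq_id` are all assumptions for illustration; the real match parameters are still to be determined.

```python
# Hypothetical in-memory similarity matrix:
# query md5 -> {matching md5 -> match_score_data}
SIMILARITY = {
    "md5_a": {
        "md5_b": {"pct_id": 92.0, "length": 310},
        "md5_c": {"pct_id": 41.5, "length": 120},
    },
    "md5_b": {
        "md5_a": {"pct_id": 92.0, "length": 310},
    },
}


def passes(score, match_threshold):
    """True if every thresholded field meets its minimum (assumed semantics)."""
    return all(score.get(key, 0) >= minimum for key, minimum in match_threshold.items())


def get_matches(md5s, match_threshold, matrix=SIMILARITY):
    """Return [[query_md5, [[matching_md5, match_score_data], ...]], ...]."""
    results = []
    for query in md5s:
        hits = [
            [matching, score]
            for matching, score in matrix.get(query, {}).items()
            if passes(score, match_threshold)
        ]
        results.append([query, hits])
    return results


def get_matches_by_seq_id(seq_ids, match_threshold, seq_id_to_md5, md5_to_seq_id):
    """Higher-level version: seq_ids in, seq_ids out, md5 lookup in between.

    seq_id_to_md5 and md5_to_seq_id are passed in as callables so this sketch
    does not depend on any particular implementation of the mapping functions.
    """
    md5s = [seq_id_to_md5(seq_id) for seq_id in seq_ids]
    out = []
    for seq_id, (_, hits) in zip(seq_ids, get_matches(md5s, match_threshold)):
        # One md5 hit can expand to several seq_ids.
        expanded = [[hit_id, score] for md5, score in hits for hit_id in md5_to_seq_id(md5)]
        out.append([seq_id, expanded])
    return out


print(get_matches(["md5_a", "md5_b"], {"pct_id": 50.0}))
```

Queries with no hits still appear in the result with an empty list, so the output stays aligned with the input list when a batch is split across workers.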

Sequence neighborhood

(other sequences on the same contig or chromosome)

At the basal level this only makes sense with seq_id: an md5 can be shared by identical sequences on different contigs/chromosomes, each with different neighbors. Conversion from and to md5 will have to be handled before and after by the mapping functions.

[seq_id, ...] = get_co_occurring( seq_id )    # everything on the same contig or chromosome

Do we want coordinates or rank returned as well? Now, or maybe later?

get_neighbors( seq_id, limit )      # limit is a rank limit, or a coordinate offset (nt) limit?

Returns a list of seq_ids with their associated position offsets or rank offsets (for example, -1 to the left, +1 to the right).

These will require lower-level queries: one to get the contig/chromosome for a given seq_id, and one to get all seq_ids on a given chromosome/contig. Not fleshed out yet.

chromosome = get_chromosome( seq_id )

[ [seq_id, coords, rank], ... ] = get_all_seq_ids( chromosome )
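
To show how the neighborhood queries could be layered on the two lower-level ones, here is a sketch over a made-up in-memory layout table. The `CONTIGS` and `CONTIG_OF` tables are hypothetical. The sketch also picks answers to questions the notes leave open, purely for illustration: `limit` is treated as a rank limit, `get_neighbors` returns rank offsets, and the query gene itself is left out of both results.

```python
# Hypothetical layout table: contig -> [(seq_id, (start, stop), rank), ...]
CONTIGS = {
    "contig_1": [
        ("g1", (100, 1000), 1),
        ("g2", (1200, 2100), 2),
        ("g3", (2300, 3000), 3),
        ("g4", (3100, 4000), 4),
    ],
    "contig_2": [
        ("g5", (50, 900), 1),
    ],
}

# Reverse lookup: seq_id -> contig
CONTIG_OF = {
    seq_id: contig
    for contig, genes in CONTIGS.items()
    for seq_id, _, _ in genes
}


def get_chromosome(seq_id):
    """Lower-level query: contig/chromosome that holds this seq_id."""
    return CONTIG_OF[seq_id]


def get_all_seq_ids(chromosome):
    """Lower-level query: [[seq_id, coords, rank], ...] for one contig/chromosome."""
    return [[seq_id, coords, rank] for seq_id, coords, rank in CONTIGS[chromosome]]


def get_co_occurring(seq_id):
    """All other seq_ids on the same contig or chromosome."""
    genes = get_all_seq_ids(get_chromosome(seq_id))
    return [other for other, _, _ in genes if other != seq_id]


def get_neighbors(seq_id, limit):
    """[[seq_id, rank_offset], ...] for genes within `limit` ranks of the query."""
    genes = get_all_seq_ids(get_chromosome(seq_id))
    query_rank = next(rank for other, _, rank in genes if other == seq_id)
    return [
        [other, rank - query_rank]
        for other, _, rank in genes
        if other != seq_id and abs(rank - query_rank) <= limit
    ]


print(get_co_occurring("g2"))
print(get_neighbors("g2", 1))
```

Switching `limit` to a coordinate offset in nt would only change the filter in `get_neighbors`, since `get_all_seq_ids` already returns coordinates alongside rank.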

Other Metadata

Table design sheet

Leverage m5nr for this?

MG-RAST API

KBase?
