sean-mccorkle edited this page Mar 10, 2016 · 4 revisions

Query API

Trying to move towards a more concrete plan before the BNL hackathon.

Definitions, terminology

  • sequence - string of amino acids
  • sequence id - a unique identifier for each FASTA input sequence, e.g. taken from the header line of the FASTA sequence file
  • md5 - hexadecimal md5 computed from the protein sequence after converting it to all caps and removing all whitespace and the trailing stop
  • source_type - e.g. "reference", "isolate", "metagenome"
    • "reference" - one of several reference databases
    • "isolate" - an isolated whole-genome assembly; implies a known chromosome or contig, position, and orientation for each gene
    • "metagenome" - an ANL-processed metagenome; implies a known contig, position, and orientation for each gene
  • match_threshold - some to-be-determined low-end match parameters (%id, length, etc.)
  • coordinates - nt position of a gene in its contig or chromosome (integer >= 0); applies to gene start and stop
  • coordinate offset - difference between the coordinates of the query and result genes (not sure about this; it gets complicated with ends)
  • position rank - integer (> 0) indicating the order of a gene in its contig or chromosome relative to other genes
  • relative rank - difference between two position ranks (negative or positive)
  • chromosome/contig - unique identifier for every isolate and metagenome chromosome or contig
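The md5 definition above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the trailing stop is assumed to be a single "*" character, the sequence is assumed to be plain ASCII, and the helper name normalize_sequence is hypothetical (only seq_to_md5 appears elsewhere on this page).

```python
import hashlib


def normalize_sequence(sequence: str) -> str:
    """Upper-case the sequence, drop all whitespace, drop one trailing stop."""
    # str.split() with no arguments splits on any run of whitespace,
    # so joining the pieces removes spaces, tabs, and newlines.
    cleaned = "".join(sequence.split()).upper()
    # Assumption: the trailing stop is written as a single "*".
    if cleaned.endswith("*"):
        cleaned = cleaned[:-1]
    return cleaned


def seq_to_md5(sequence: str) -> str:
    """Return the hexadecimal md5 of the normalized protein sequence."""
    return hashlib.md5(normalize_sequence(sequence).encode("ascii")).hexdigest()


# Differently formatted copies of the same protein share one md5.
print(seq_to_md5("mkv l\n*") == seq_to_md5("MKVL"))
```

Normalizing before hashing is what lets one md5 stand for every copy of a protein, regardless of how the FASTA file happened to be wrapped or cased.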

Data schema

Basal (low-level) utilities

  • sequence - md5 relationship

      md5 = seq_to_md5( sequence )
    
      sequence = md5_to_seq( md5 )
    
  • sequence - sequence_id relationship

      sequence = seq_id_to_seq( seq_id )
    
      [seq_id, ...] = seq_to_seq_id( sequence )
    
  • seq_id - md5 relationship

      md5 = seq_id_to_md5( seq_id )
    
      [seq_id, ...] = md5_to_seq_id( md5 )
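As a rough sketch of how the six mapping functions above could fit together, here is an in-memory Python version. The SequenceIndex class, its add method, and the example seq_ids are hypothetical and used only for illustration; a real backend would presumably be a database. The sketch also assumes that seq_to_seq_id is resolved through the md5, so sequences that differ only in case, whitespace, or trailing "*" are treated as the same sequence.

```python
import hashlib


def seq_to_md5(sequence):
    """md5 of the sequence: upper-cased, whitespace and one trailing '*' removed."""
    cleaned = "".join(sequence.split()).upper()
    if cleaned.endswith("*"):  # assumption: the stop is written as "*"
        cleaned = cleaned[:-1]
    return hashlib.md5(cleaned.encode("ascii")).hexdigest()


class SequenceIndex:
    """Hypothetical in-memory store backing the basal mapping functions."""

    def __init__(self):
        self._seq_by_id = {}    # seq_id -> sequence as loaded
        self._seq_by_md5 = {}   # md5 -> first sequence seen with that md5
        self._ids_by_md5 = {}   # md5 -> [seq_id, ...]

    def add(self, seq_id, sequence):
        md5 = seq_to_md5(sequence)
        self._seq_by_id[seq_id] = sequence
        self._seq_by_md5.setdefault(md5, sequence)
        self._ids_by_md5.setdefault(md5, []).append(seq_id)

    def md5_to_seq(self, md5):
        return self._seq_by_md5[md5]

    def seq_id_to_seq(self, seq_id):
        return self._seq_by_id[seq_id]

    def seq_id_to_md5(self, seq_id):
        return seq_to_md5(self._seq_by_id[seq_id])

    def md5_to_seq_id(self, md5):
        # One md5 can map to many seq_ids, so a list is returned.
        return list(self._ids_by_md5.get(md5, []))

    def seq_to_seq_id(self, sequence):
        return self.md5_to_seq_id(seq_to_md5(sequence))


index = SequenceIndex()
index.add("genomeA|gene1", "MKVL*")
index.add("genomeB|gene7", "mkvl")
print(index.seq_to_seq_id("MKVL"))
```

Note the asymmetry: a seq_id maps to exactly one md5, while an md5 (or a sequence) can map back to several seq_ids.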
    

Similarity matrix

At the base level, the similarity matrix is accessed by md5 only.

get_matches( [ list of md5s ], match_threshold )

-> [ [ query_md5_1, [ matching_md5, match_score_data ],
                    [ matching_md5, match_score_data ], ... ],
     [ query_md5_2, [ matching_md5, match_score_data ], ... ],
     ... ]

(Using the md5-to-seq_id and md5-to-sequence mapping functions, higher-level versions can accept and return any of these data types.) The query is specified as a list of md5s because the matrix will certainly be highly parallelized.
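A minimal Python sketch of the get_matches return shape follows. Everything concrete here is an assumption for illustration: the matrix is passed in as a plain nested dict (the real one would live behind a parallel service), match_threshold is taken to be a dict with hypothetical keys min_pct_id and min_length since the real parameters are still to be determined, and the short strings stand in for real md5 values.

```python
def get_matches(md5s, match_threshold, matrix):
    """Return [[query_md5, [matching_md5, match_score_data], ...], ...].

    matrix is a hypothetical stand-in: {query_md5: {matching_md5: score_data}},
    where score_data is a dict with "pct_id" and "length" entries.
    """
    results = []
    for query in md5s:
        row = [query]
        for target, score in matrix.get(query, {}).items():
            # Keep only matches at or above both low-end cutoffs.
            if (score["pct_id"] >= match_threshold["min_pct_id"]
                    and score["length"] >= match_threshold["min_length"]):
                row.append([target, score])
        results.append(row)
    return results


# Placeholder keys stand in for real md5 strings.
toy_matrix = {
    "md5_a": {
        "md5_b": {"pct_id": 92.0, "length": 310},
        "md5_c": {"pct_id": 41.5, "length": 120},
    },
}
threshold = {"min_pct_id": 60.0, "min_length": 100}
print(get_matches(["md5_a", "md5_x"], threshold, toy_matrix))
```

In this sketch a query with no matches still gets a row containing only its own md5, so the output stays aligned with the input list.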

Sequence neighborhood

(other sequences on the same contig or chromosome)

This only makes sense when working with seq_id at the basal level: one md5 can be shared by sequences on different contigs/chromosomes, each with different neighbor sequences. Conversion to and from md5 will have to be handled before and after by the mapping functions.

[seq_id, ...] = get_co_occurring( seq_id )    # everything on the same contig or chromosome

Do we want coordinates or rank returned as well? Now, or maybe later?

get_neighbors( seq_id, limit )      # limit is a rank limit, or a coordinate offset (nt) limit?

Returns a list of seq_ids with associated position offsets or rank offsets (for example, -1 for the gene to the left, +1 for the gene to the right).

These will require lower-level queries to get the contig/chromosome for a given seq_id, and to get all seq_ids on a given chromosome/contig. Not fleshed out yet.

chromosome = get_chromosome( seq_id )

[ [seq_id, coords, rank], ... ] = get_all_seq_ids( chromosome )
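To show how get_co_occurring and get_neighbors could be built on the two lower-level queries, here is a small Python sketch over toy data. The choices made here are assumptions, not decisions: limit is interpreted as a rank limit, the query gene itself is excluded from both results, coords is taken to be a (start, stop) pair, and the dictionaries and identifiers are made up for illustration.

```python
# Hypothetical toy data: seq_id -> contig, and contig -> [seq_id, coords, rank].
CONTIG_OF = {"g1": "contig_1", "g2": "contig_1", "g3": "contig_1", "g9": "contig_2"}
GENES_ON = {
    "contig_1": [["g1", (0, 299), 1], ["g2", (350, 900), 2], ["g3", (950, 1400), 3]],
    "contig_2": [["g9", (10, 500), 1]],
}


def get_chromosome(seq_id):
    return CONTIG_OF[seq_id]


def get_all_seq_ids(chromosome):
    return GENES_ON[chromosome]


def get_co_occurring(seq_id):
    """All other seq_ids on the same contig or chromosome."""
    genes = get_all_seq_ids(get_chromosome(seq_id))
    return [sid for sid, _coords, _rank in genes if sid != seq_id]


def get_neighbors(seq_id, limit):
    """[[seq_id, relative_rank], ...] for genes within `limit` ranks of the query."""
    genes = get_all_seq_ids(get_chromosome(seq_id))
    query_rank = next(rank for sid, _coords, rank in genes if sid == seq_id)
    neighbors = [
        [sid, rank - query_rank]
        for sid, _coords, rank in genes
        if sid != seq_id and abs(rank - query_rank) <= limit
    ]
    # Order left to right: most negative relative rank first.
    return sorted(neighbors, key=lambda pair: pair[1])


print(get_co_occurring("g2"))
print(get_neighbors("g2", 1))
```

A coordinate-offset version would look the same, with the rank difference replaced by a difference of start or stop coordinates, which is where the question about gene ends comes in.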

Other Metadata

Leverage m5nr for this?

MG-RAST API

KBase?
