Query API
Trying to move towards a more concrete plan before the BNL hackathon.
- sequence - string sequence of amino acids
- sequence id (seq_id) - a unique identifier for each FASTA sequence input, e.g. taken from the header in the FASTA sequence file
- md5 - hexadecimal md5 computed from the protein sequence (all caps, with all whitespace and any trailing stop removed)
- source_type - one of "reference", "isolate", "metagenome"
  - "reference" - one of several reference databases
  - "isolate" - an isolated whole-genome assembly; implies a known chromosome or contig, position, and orientation for the gene
  - "metagenome" - an ANL-processed metagenome; implies a known contig, position, and orientation for the gene
- match_threshold - some to-be-determined low-end match parameters (% identity, length, etc.)
- coordinates - nucleotide (nt) position of a gene in its contig or chromosome (integer >= 0); applies to both gene start and stop
- coordinate offset - difference between the coordinates of the query and result genes (not sure about this; it gets complicated with gene ends)
- position rank - integer indicating order of gene in contig or chromosome relative to other genes (> 0)
- relative rank - difference between ranks (- or +)
- chromosome/contig - unique identifiers for every isolate and metagenome chromosome or contig
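The md5 definition above could be sketched roughly as follows. This is only an illustration under stated assumptions: the trailing stop is taken to be a single `*` character, "all caps" is read as applying to the sequence before hashing, and `normalize_sequence` is a hypothetical helper name. Python's `hexdigest()` returns lower-case hex; whether the digest itself should be upper-cased is left open here.

```python
import hashlib


def normalize_sequence(sequence: str) -> str:
    """Upper-case the sequence, remove all whitespace, drop one trailing stop."""
    cleaned = "".join(sequence.split()).upper()
    # Assumption: the trailing stop is written as '*'.
    if cleaned.endswith("*"):
        cleaned = cleaned[:-1]
    return cleaned


def seq_to_md5(sequence: str) -> str:
    """Hexadecimal md5 of the normalized protein sequence."""
    normalized = normalize_sequence(sequence)
    return hashlib.md5(normalized.encode("ascii")).hexdigest()
```

With this normalization, `seq_to_md5("mkv la\n*")` and `seq_to_md5("MKVLA")` give the same digest, so identical proteins from differently formatted FASTA files collapse to one md5.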
Data schema

- sequence - md5 relationship

      md5 = seq_to_md5( sequence )
      sequence = md5_to_seq( md5 )

- sequence - sequence_id relationship

      sequence = seq_id_to_seq( seq_id )
      [seq_id, ...] = seq_to_seq_id( sequence )

- seq_id - md5 relationship

      md5 = seq_id_to_md5( seq_id )
      [seq_id, ...] = md5_to_seq_id( md5 )
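As a rough sketch of how the six mapping functions above relate, here is a small in-memory index. The `SequenceIndex` class, its `add` method, and the dictionary layout are assumptions for illustration only; a real version would sit on top of whatever store is chosen. Sequences are assumed to be already normalized before they are added.

```python
import hashlib
from collections import defaultdict


class SequenceIndex:
    """Hypothetical in-memory store backing the mapping functions."""

    def __init__(self):
        self._seq_by_id = {}                 # seq_id -> sequence
        self._seq_by_md5 = {}                # md5 -> sequence
        self._ids_by_md5 = defaultdict(list)  # md5 -> [seq_id, ...]

    @staticmethod
    def seq_to_md5(sequence):
        return hashlib.md5(sequence.encode("ascii")).hexdigest()

    def add(self, seq_id, sequence):
        md5 = self.seq_to_md5(sequence)
        self._seq_by_id[seq_id] = sequence
        self._seq_by_md5[md5] = sequence
        self._ids_by_md5[md5].append(seq_id)

    def md5_to_seq(self, md5):
        return self._seq_by_md5[md5]

    def seq_id_to_seq(self, seq_id):
        return self._seq_by_id[seq_id]

    def seq_id_to_md5(self, seq_id):
        return self.seq_to_md5(self._seq_by_id[seq_id])

    def md5_to_seq_id(self, md5):
        # One md5 can map to many seq_ids, so this returns a list.
        return list(self._ids_by_md5.get(md5, []))

    def seq_to_seq_id(self, sequence):
        return self.md5_to_seq_id(self.seq_to_md5(sequence))
```

The asymmetry is the point: sequence and md5 map one-to-one, while either of them maps to a list of seq_ids.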
At the base level, accessed by md5 only.
get_matches( [ list of md5s ], match_threshold )
-> [ [ query_md5_1, [ [ matching_md5, match_score_data ],
                      [ matching_md5, match_score_data ], ... ] ],
     [ query_md5_2, [ [ matching_md5, match_score_data ], ... ] ],
     ... ]
(Using the md5 to seq_id and sequence mapping functions, higher-level versions can accept and return any of these data types.) A list of query md5s is specified here because the matrix will almost certainly be highly parallelized.
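To make the return shape of get_matches concrete, here is a minimal sketch over a toy lookup table. Everything about the data is assumed: the `SIMILARITY` table, the `pct_id` and `length` score fields, and the `min_pct_id` / `min_length` threshold keys are placeholders, since the actual match parameters are still to be determined.

```python
# Hypothetical precomputed table: query md5 -> [(matching md5, score data), ...]
SIMILARITY = {
    "md5_a": [
        ("md5_b", {"pct_id": 92.0, "length": 310}),
        ("md5_c", {"pct_id": 41.5, "length": 120}),
    ],
    "md5_b": [("md5_a", {"pct_id": 92.0, "length": 310})],
}


def get_matches(md5s, match_threshold, table=SIMILARITY):
    """Return [[query_md5, [[matching_md5, match_score_data], ...]], ...]."""
    results = []
    for query in md5s:
        hits = [
            [match, score]
            for match, score in table.get(query, [])
            if score["pct_id"] >= match_threshold["min_pct_id"]
            and score["length"] >= match_threshold["min_length"]
        ]
        # Queries with no hits still get an entry, with an empty hit list.
        results.append([query, hits])
    return results
```

Each query is handled independently of the others, which is what would let a list of md5s be split across workers.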
(other sequences on the same contig or chromosome)
This only makes sense when working with sequence_id at the basal level: an md5 can be shared by different sequences on different contigs/chromosomes, each with different neighboring sequences. md5 conversion will have to be handled before and after by the mapping functions.
[seq_id, ...] = get_co_occurring( seq_id ) # everything on the same contig or chromosome
Do we want coordinates or rank returned as well? Now, or maybe later?
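A small sketch of why the co-occurrence query has to run on seq_id, with md5 handled by a wrapper before and after. The `GENES` table and the `get_co_occurring_md5` wrapper are hypothetical, and excluding the query seq_id from its own result is an assumption.

```python
# Hypothetical toy data: seq_id -> (contig, md5)
GENES = {
    "g1": ("contigA", "md5_x"),
    "g2": ("contigA", "md5_y"),
    "g3": ("contigB", "md5_x"),  # same md5 as g1, but on a different contig
    "g4": ("contigB", "md5_z"),
}


def get_co_occurring(seq_id):
    """All other seq_ids on the same contig or chromosome."""
    contig = GENES[seq_id][0]
    return [sid for sid, (c, _) in GENES.items() if c == contig and sid != seq_id]


def md5_to_seq_id(md5):
    return [sid for sid, (_, m) in GENES.items() if m == md5]


def seq_id_to_md5(seq_id):
    return GENES[seq_id][1]


def get_co_occurring_md5(md5):
    """md5-level wrapper: expand to seq_ids, query each, convert back to md5s."""
    return {
        seq_id: [seq_id_to_md5(n) for n in get_co_occurring(seq_id)]
        for seq_id in md5_to_seq_id(md5)
    }
```

Here `md5_x` occurs on two contigs, so the wrapper returns one neighbor set per seq_id rather than a single merged list.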
get_neighbors( seq_id, limit ) # limit is a rank limit, or a coordinate offset (nt) limit?
Returns a list of seq_ids with associated position offsets or rank offsets (for example, -1 to the left, +1 to the right).
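One possible reading of get_neighbors, treating `limit` as a rank limit and returning both offsets so the choice can be deferred. The `CONTIG` table is made up, and the coordinate offset is computed between gene starts only, which sidesteps the open question about gene ends rather than answering it.

```python
# Hypothetical genes on one contig: (seq_id, start, stop, rank), rank starts at 1
CONTIG = [
    ("g1", 0, 899, 1),
    ("g2", 950, 1700, 2),
    ("g3", 1800, 2500, 3),
    ("g4", 2600, 3000, 4),
]


def get_neighbors(seq_id, limit, contig=CONTIG):
    """Return (seq_id, relative_rank, coordinate_offset) within `limit` ranks."""
    by_id = {sid: (start, rank) for sid, start, _stop, rank in contig}
    query_start, query_rank = by_id[seq_id]
    neighbors = []
    for sid, start, _stop, rank in contig:
        relative_rank = rank - query_rank  # negative = left, positive = right
        if sid != seq_id and abs(relative_rank) <= limit:
            # Assumption: offset is measured start-to-start.
            neighbors.append((sid, relative_rank, start - query_start))
    return neighbors
```

For example, `get_neighbors("g2", 1)` gives `g1` at relative rank -1 and `g3` at +1.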
These will require lower-level queries to get the contig/chromosome for a given seq_id, and to get all seq_ids on a given chromosome/contig. Not fleshed out yet.
chromosome = get_chromosome( seq_id )
[ [seq_id, coords, rank], ... ] = get_all_seq_ids( chromosome )
Leverage m5nr for this? Or KBase?
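Whatever backend ends up serving them, the two lower-level queries could look something like the sketch below. The `ContigIndex` class and `add_gene` method are hypothetical, and deriving rank by sorting on the leftmost coordinate is an assumption rather than something decided in these notes.

```python
from collections import defaultdict


class ContigIndex:
    """Hypothetical in-memory index of gene positions per chromosome/contig."""

    def __init__(self):
        self._chromosome_of = {}             # seq_id -> chromosome/contig id
        self._genes_on = defaultdict(list)   # chromosome -> [(seq_id, (start, stop))]

    def add_gene(self, seq_id, chromosome, start, stop):
        self._chromosome_of[seq_id] = chromosome
        self._genes_on[chromosome].append((seq_id, (start, stop)))

    def get_chromosome(self, seq_id):
        return self._chromosome_of[seq_id]

    def get_all_seq_ids(self, chromosome):
        """Return [[seq_id, coords, rank], ...] ordered along the contig."""
        # Assumption: rank follows the leftmost coordinate, regardless of strand.
        genes = sorted(self._genes_on.get(chromosome, []), key=lambda g: min(g[1]))
        return [
            [seq_id, coords, rank]
            for rank, (seq_id, coords) in enumerate(genes, start=1)
        ]
```

get_co_occurring and get_neighbors can then both be written as get_chromosome followed by get_all_seq_ids, with filtering on rank or coordinates.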