Lucene module, search by accession and other issues #24

@marco-brandizi

The LuceneEnv and LuceneQueryBuilder components are rather messy, and it's often unclear how they should be used and what their methods were designed for. For instance, searchConceptByConceptAccessionExact() and the similar *Exact methods were searching accessions by keyword, so they weren't really exact. I've now fixed them, but the search is still case-insensitive, because the Lucene standard analyzer can't deal with case-sensitive indexing and searching, and it isn't easy to switch to another analyzer. Read on for details.

For the same reason, I've had to introduce an analyzer (DEFAULTANALYZER) that relies on PerFieldAnalyzerWrapper to apply different analyzers to different fields: for instance, the concept class ID uses the keyword analyzer, while the concept attribute value uses the standard analyzer. The rationale is that the ID fields are to be indexed and searched with a full identity criterion, while the others are dedicated to user free-text searches and hence are best served by the standard analyzer (ie, tokenisation, stop words, case insensitivity, etc). In fact, if fields like Concept Class ID or data source name are indexed and searched with the standard analyzer, we have a number of problems, like upper case strings not working at all when saved as StringField, or unwanted substring matching (eg, "00633" matches both "00633" and "go 00633", which, in general, is wrong). Details are discussed here and some tests of mine are here.
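To make the "00633" example concrete, here is a minimal plain-Java sketch of the two matching behaviours. It does not use Lucene: tokenise() is only a rough stand-in for what a tokenising, lower-casing analyzer does, and the class and method names (IdMatchingSketch, tokenisedMatch, keywordMatch) are hypothetical, introduced just for illustration.

```java
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

class IdMatchingSketch {

    // Rough stand-in for tokenised, case-insensitive analysis (NOT Lucene's
    // actual analyzer): lower-case the text, split on non-alphanumerics.
    static Set<String> tokenise(String text) {
        Set<String> tokens = new LinkedHashSet<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Free-text style match (simplified): every query token must occur
    // among the tokens of the indexed value.
    static boolean tokenisedMatch(String indexedValue, String query) {
        return tokenise(indexedValue).containsAll(tokenise(query));
    }

    // Identity match: the whole value is compared verbatim, case included.
    static boolean keywordMatch(String indexedValue, String query) {
        return indexedValue.equals(query);
    }

    public static void main(String[] args) {
        System.out.println(tokenisedMatch("go 00633", "00633")); // unwanted hit for an ID field
        System.out.println(keywordMatch("go 00633", "00633"));   // rejected under identity matching
        System.out.println(keywordMatch("00633", "00633"));      // the only value that should match
    }
}
```

Under identity matching an ID only matches itself, which is the behaviour wanted for fields like concept class ID, while the tokenised variant remains the better fit for free-text fields.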

Switching the accession fields to the same approach isn't so easy, and I'll do it later. The problem with them is that they're saved with a field name like ConceptAccession_<dataSourceId>, eg, ConceptAccession_GO. This doesn't fit the way PerFieldAnalyzerWrapper works (ie, it uses a map of field name -> analyzer), and it doesn't seem to play well with un-tokenised fields that can be multi-value.

As the last link suggests, the proper solution is to store separate documents for the 1-n accession values (ie, one Lucene document with concept ID + accession + data source for each accession, which might result in multiple documents of this type sharing the same concept ID, or even the same concept ID + data source).
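The data model behind that proposal can be sketched in plain Java, without Lucene. This is only an illustration under assumptions: AccessionIndexSketch, AccessionEntry and findConceptIds are hypothetical names, and an in-memory map stands in for the index. Each entry plays the role of one document holding concept ID + data source + accession, and the lookup uses identity matching on the data source and accession pair.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class AccessionIndexSketch {

    // One entry per accession value, mirroring "one document per accession".
    static final class AccessionEntry {
        final int conceptId;
        final String dataSource;
        final String accession;

        AccessionEntry(int conceptId, String dataSource, String accession) {
            this.conceptId = conceptId;
            this.dataSource = dataSource;
            this.accession = accession;
        }
    }

    private final Map<String, List<AccessionEntry>> byKey = new HashMap<>();

    // Exact, case-sensitive lookup key; the NUL separator avoids ambiguous joins.
    private static String key(String dataSource, String accession) {
        return dataSource + '\u0000' + accession;
    }

    // A concept with n accessions yields n entries sharing the same concept ID.
    void add(int conceptId, String dataSource, String accession) {
        byKey.computeIfAbsent(key(dataSource, accession), k -> new ArrayList<>())
             .add(new AccessionEntry(conceptId, dataSource, accession));
    }

    // Returns the IDs of all concepts having exactly this accession in this data source.
    Set<Integer> findConceptIds(String dataSource, String accession) {
        Set<Integer> ids = new TreeSet<>();
        List<AccessionEntry> hits =
            byKey.getOrDefault(key(dataSource, accession), Collections.<AccessionEntry>emptyList());
        for (AccessionEntry e : hits) {
            ids.add(e.conceptId);
        }
        return ids;
    }
}
```

With this layout the field names are fixed (concept ID, data source, accession) rather than derived from the data source ID, so a static field name -> analyzer map would be enough to cover them.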

But even before that, it would be worth checking the latest fixes mentioned above. Regarding this:

Knetminer doesn't seem to be affected.
As for Ondex, a few plugins use the accession search methods mentioned above, but these plugins don't seem to be in use. They are:

  • decypher module (Blast, Decypher, Hmmer, Mapping), this is in the modules-opt subtree
  • generic module
    • relationneighbours/Filter
    • net.sourceforge.ondex.mapping.accessionbased.Mapping (seems no longer in use, replaced by lowmemoryaccessionbased)
    • net.sourceforge.ondex.mapping.lowmemoryaccessionbased.Mapping (fixed)
  • Mappers in go module (I think it was replaced by the OWL parser).
