Lucene module, search by accession and other issues #24

The `LuceneEnv` and `LuceneQueryBuilder` components are rather messy: it's often unclear how they should be used and what their methods were designed for. For instance, `searchConceptByConceptAccessionExact()` and similar `*Exact` methods used to search accessions by keyword, so they weren't really exact. I've now fixed them, but the search is still case-insensitive, because the Lucene standard analyzer can't do case-sensitive indexing and searching, and switching to another analyzer isn't easy (read on for details).

For the same reason, I've had to introduce an analyzer (`DEFAULTANALYZER`) based on `PerFieldAnalyzerWrapper`, which applies different analyzers to different fields: fields like concept class ID use the keyword analyzer, while fields like concept attribute value use the standard analyzer. The rationale is that ID fields must be indexed and searched with a full identity criterion, while the other fields serve free-text user searches and hence are best served by the standard analyzer (ie, tokenisation, stop words, case insensitivity, etc). In fact, if fields like concept class ID or data source name are indexed and searched with the standard analyzer, we get a number of problems, such as upper case strings not working at all when saved as `StringField`, or unwanted substring matching (eg, "00633" matches both "00633" and "go 00633", which, in general, is wrong). [Details are discussed here][10] and [some tests of mine are here][20].
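
To make the difference concrete, here is a minimal sketch of the two behaviours. It's a plain-Java model, it doesn't use Lucene and it isn't the code in `LuceneEnv`: the `KEYWORD` and `STANDARD` functions only approximate what the keyword and standard analyzers do (stop words are left out), and the field names `conceptClassId`, `dataSourceName` and `attributeValue` are made up for the example.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

/** Plain-Java model of per-field analysis (no Lucene involved). */
class PerFieldAnalysisSketch {
  /** Keyword-like analysis: the whole value is a single, unchanged term. */
  static final Function<String, List<String>> KEYWORD = value -> List.of(value);

  /** Standard-like analysis: split on non-alphanumerics, then lower-case. */
  static final Function<String, List<String>> STANDARD = value ->
      Arrays.stream(value.split("[^\\p{L}\\p{N}]+"))
          .filter(term -> !term.isEmpty())
          .map(term -> term.toLowerCase(Locale.ROOT))
          .collect(Collectors.toList());

  /** Exact field name -> analysis. These field names are hypothetical. */
  static final Map<String, Function<String, List<String>>> FIELD_ANALYZERS = Map.of(
      "conceptClassId", KEYWORD,
      "dataSourceName", KEYWORD);

  /** Fields that aren't in the map fall back to the standard-like analysis. */
  static List<String> analyze(String field, String value) {
    return FIELD_ANALYZERS.getOrDefault(field, STANDARD).apply(value);
  }

  /** The query is analysed like the field; all its terms must be indexed. */
  static boolean matches(String field, String indexedValue, String query) {
    List<String> queryTerms = analyze(field, query);
    return !queryTerms.isEmpty()
        && analyze(field, indexedValue).containsAll(queryTerms);
  }
}
```

In this model, `matches("attributeValue", "go 00633", "00633")` is true (the unwanted substring match), while `matches("conceptClassId", "go 00633", "00633")` is false, and a keyword-like field only matches when the case is the same.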

Switching the accession fields to this approach isn't as easy, and I'll do it later. The problem is that they're saved with a field name like `ConceptAccession_<dataSourceId>`, eg, `ConceptAccession_GO`. This doesn't fit the way `PerFieldAnalyzerWrapper` works (ie, it uses a map of field name -> analyzer), plus, it [doesn't seem to play well with un-tokenised fields that can be multi-value][30].
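
The mismatch with a fixed field name -> analyzer map can be sketched as follows. Again, this is a plain-Java model and not Lucene code: the `Kind` enum, the `conceptClassId` field name and the `prefixAwareLookup()` workaround are assumptions made for illustration, and only the `ConceptAccession_` prefix comes from the existing index.

```java
import java.util.Map;

/** Plain-Java model of how an analyzer could be chosen for a field name. */
class FieldAnalyzerLookupSketch {
  /** Stand-in for the two kinds of analyzer discussed above. */
  enum Kind { KEYWORD, STANDARD }

  /** Prefix of the accession fields, eg, ConceptAccession_GO. */
  static final String ACCESSION_PREFIX = "ConceptAccession_";

  /** Exact-name map; the field name in it is hypothetical. */
  static final Map<String, Kind> EXACT = Map.of("conceptClassId", Kind.KEYWORD);

  /** Map-based lookup: names that aren't listed get the default analyzer. */
  static Kind exactLookup(String field) {
    return EXACT.getOrDefault(field, Kind.STANDARD);
  }

  /** Hypothetical workaround: treat any accession field as a keyword field. */
  static Kind prefixAwareLookup(String field) {
    if (field.startsWith(ACCESSION_PREFIX)) {
      return Kind.KEYWORD;
    }
    return exactLookup(field);
  }
}
```

With the exact-name map, `ConceptAccession_GO` falls through to the default analyzer unless every data source ID is listed in advance. A prefix rule would cover that part, but it wouldn't address the multi-value problem mentioned above.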

As the last link suggests, the proper solution is to store separate documents for the 1-n accession values, ie, one Lucene document with concept ID + accession + data source for each accession. This might result in multiple documents of this type sharing the same concept ID, or even the same concept ID + data source.
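
A rough sketch of that layout, modelled in plain Java rather than with Lucene documents: `AccessionDoc` stands in for one such document, the class and method names are made up, and the GO accessions used in the usage note are only illustrative values.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Plain-Java model of the one-document-per-accession layout. */
class AccessionDocsSketch {
  /** Stand-in for one Lucene document: concept ID + data source + accession. */
  static final class AccessionDoc {
    final int conceptId;
    final String dataSource;
    final String accession;

    AccessionDoc(int conceptId, String dataSource, String accession) {
      this.conceptId = conceptId;
      this.dataSource = dataSource;
      this.accession = accession;
    }
  }

  /** Flattens the accessions of one concept into one document per value. */
  static List<AccessionDoc> toDocs(
      int conceptId, Map<String, List<String>> accessionsBySource) {
    List<AccessionDoc> docs = new ArrayList<>();
    accessionsBySource.forEach((dataSource, accessions) -> {
      for (String accession : accessions) {
        docs.add(new AccessionDoc(conceptId, dataSource, accession));
      }
    });
    return docs;
  }

  /** Exact, case-sensitive search: both fields must be identical. */
  static Set<Integer> findConcepts(
      List<AccessionDoc> index, String dataSource, String accession) {
    Set<Integer> conceptIds = new TreeSet<>();
    for (AccessionDoc doc : index) {
      if (doc.dataSource.equals(dataSource) && doc.accession.equals(accession)) {
        conceptIds.add(doc.conceptId);
      }
    }
    return conceptIds;
  }
}
```

For example, a concept with the accessions `GO:0006633` and `GO:0008610` under the `GO` data source yields two documents with the same concept ID. Every document then has fixed, single-valued field names, so in principle a fixed field name -> analyzer map would be enough.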

But even before that, it would be worth checking the latest fixes mentioned above. Regarding this:

Knetminer doesn't seem to be affected.
As for Ondex, a few plugins use the accession search methods mentioned above, and these plugins don't seem to be in use. They are:

* `decypher` module (`Blast`, `Decypher`, `Hmmer`, `Mapping`), this is in the `modules-opt` subtree
* `generic` module
  * `relationneighbours/Filter`
  * `net.sourceforge.ondex.mapping.accessionbased.Mapping` (seems no longer in use, replaced by `lowmemoryaccessionbased`)
  * `net.sourceforge.ondex.mapping.lowmemoryaccessionbased.Mapping` (fixed)
* Mappers in the `go` module (I think they were replaced by the OWL parser).

[10]: https://stackoverflow.com/questions/62119328
[20]: https://github.com/marco-brandizi/lucene-learn/blob/master/src/test/java/info/marcobrandizi/learn/lucene/IDLuceneTest.java
[30]: https://stackoverflow.com/a/21028490/529286
