IN DEVELOPMENT
This SDK is intended to:
- Read molecules from SDF, SMILES, Mol, CML files, etc.
- Index them into Elasticsearch
- Search molecules efficiently using different similarity metrics (Tanimoto, Tversky, Euclid)
- Additionally filter results by text or numeric fields attached to the records
We support Elasticsearch 7.15.x and most major distributions available (AWS, Elastic, OpenDistro, etc.)
TBD: test against other 7.x versions
Install dependency using pip
pip install bingo-elastic
Install async version
pip install bingo-elastic[async]
The async version of bingo-elastic supports all the same methods to index and search
molecules as the sync version. To use it, instantiate AsyncElasticRepository instead of ElasticRepository
You can use any Elasticsearch distribution you prefer:
- Open Distro Elasticsearch
- Elasticsearch
- OpenSearch
- many more, available on premises and as cloud products and services
For example, a simple single-node instance can be started with Docker as follows:
docker run -p 9200:9200 -p 9300:9300 -e "discovery.type=single-node" -e "indices.query.bool.max_clause_count=4096" docker.elastic.co/elasticsearch/elasticsearch:7.17.28
Sync
repository = ElasticRepository(IndexName.BINGO_MOLECULE, host="127.0.0.1", port=9200)
Async
repository = AsyncElasticRepository(IndexName.BINGO_MOLECULE, host="127.0.0.1", port=9200)
...
await repository.close()
The async version also supports the async context manager protocol, which closes connections automatically:
async with AsyncElasticRepository(IndexName.BINGO_MOLECULE, host="127.0.0.1", port=9200) as rep:
    ...
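To illustrate what an async context manager does on exit, here is a minimal sketch. `FakeAsyncRepository` is a hypothetical stand-in written for this example; it is not part of bingo_elastic and makes no claim about how AsyncElasticRepository is implemented internally.

```python
import asyncio


class FakeAsyncRepository:
    """Hypothetical stand-in for a repository; not part of bingo_elastic."""

    def __init__(self) -> None:
        self.closed = False

    async def close(self) -> None:
        # A real repository would release its network connections here.
        self.closed = True

    async def __aenter__(self) -> "FakeAsyncRepository":
        return self

    async def __aexit__(self, exc_type, exc, tb) -> None:
        # Runs on normal exit and also when the body raises.
        await self.close()


async def demo() -> bool:
    async with FakeAsyncRepository() as rep:
        assert not rep.closed  # still open inside the block
    return rep.closed  # closed once the block has been left


print(asyncio.run(demo()))  # True
```

The benefit of the pattern is that the connection is closed even if the code inside the block raises an exception.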
Using with FastAPI:
from bingo_elastic.elastic import AsyncElasticRepository, IndexName
from fastapi import FastAPI

app = FastAPI()
rep = AsyncElasticRepository(IndexName.BINGO_MOLECULE, host="127.0.0.1", port=9200)

# This gets called once the app is shutting down.
@app.on_event("shutdown")
async def app_shutdown():
    await rep.close()
Other customisations, such as SSL, a custom number of shards and replicas, the refresh interval, and more, are supported by both ElasticRepository and AsyncElasticRepository
An IndigoRecord can be created from an IndigoObject.
Full usage example:
from bingo_elastic.model.record import IndigoRecord
from indigo import Indigo
indigo = Indigo()
compound = indigo.loadMoleculeFromFile("composition.mol")
indigo_record = IndigoRecord(indigo_object=compound)
bingo_elastic provides helpers to load sdf, cml, smiles and smi files
from bingo_elastic.model import helpers
sdf = helpers.iterate_sdf("compounds.sdf")
cml = helpers.iterate_cml("compounds.cml")
smi = helpers.iterate_smiles("compounds.smi")
The function helpers.iterate_file(file: Path) is also available. It
selects the correct iterate function based on the file extension. The file argument must
be a pathlib.Path instance
from bingo_elastic.model import helpers
from pathlib import Path
sdf = helpers.iterate_file(Path("compounds.sdf"))
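Selecting a reader by file extension can be sketched as a simple lookup table. This is an illustration of the idea only: the reader functions, the `READERS` table, the case-insensitive match, and the `ValueError` for unknown extensions are all assumptions of this example, not the actual implementation of helpers.iterate_file.

```python
from pathlib import Path
from typing import Callable, Dict, Iterator


# Hypothetical placeholder readers; real ones would parse the file contents.
def read_sdf(path: Path) -> Iterator[str]:
    yield f"sdf:{path.name}"


def read_cml(path: Path) -> Iterator[str]:
    yield f"cml:{path.name}"


def read_smiles(path: Path) -> Iterator[str]:
    yield f"smiles:{path.name}"


# Map each supported extension to the function that handles it.
READERS: Dict[str, Callable[[Path], Iterator[str]]] = {
    ".sdf": read_sdf,
    ".cml": read_cml,
    ".smi": read_smiles,
    ".smiles": read_smiles,
}


def iterate_by_extension(path: Path) -> Iterator[str]:
    """Pick a reader from the file suffix (assumed case-insensitive here)."""
    try:
        reader = READERS[path.suffix.lower()]
    except KeyError:
        raise ValueError(f"Unsupported file extension: {path.suffix!r}") from None
    return reader(path)


print(list(iterate_by_extension(Path("compounds.sdf"))))  # ['sdf:compounds.sdf']
```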
Full usage example sync:
from bingo_elastic.model import helpers
from bingo_elastic.elastic import ElasticRepository, IndexName
from pathlib import Path
repository = ElasticRepository(IndexName.BINGO_MOLECULE, host="127.0.0.1", port=9200)
sdf = helpers.iterate_file(Path("compounds.sdf"))
repository.index_records(sdf)
Full usage example async:
Async indexing and search require a running event loop
import asyncio
from bingo_elastic.model import helpers
from bingo_elastic.elastic import AsyncElasticRepository, IndexName
from pathlib import Path
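async def index_compounds():
    repository = AsyncElasticRepository(IndexName.BINGO_MOLECULE, host="127.0.0.1", port=9200)
    sdf = helpers.iterate_file(Path("compounds.sdf"))
    await repository.index_records(sdf)
    # Close the connection once indexing is done
    await repository.close()

# asyncio.run expects a coroutine object, so the function must be called
asyncio.run(index_compounds())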
CAVEAT: Elasticsearch doesn't have a strict notion of commit, so records might appear in the index with some delay. Read more about it here: https://www.elastic.co/guide/en/elasticsearch/reference/master/index-modules.html#index-refresh-interval-setting
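Because freshly indexed records may not be searchable immediately, code that indexes and then searches (tests, for example) may need to wait. One possible approach is a small polling helper, sketched below. `wait_until` and `record_is_visible` are hypothetical names for this example; the predicate here only simulates a record becoming visible and does not talk to Elasticsearch.

```python
import time
from typing import Callable


def wait_until(predicate: Callable[[], bool],
               timeout: float = 5.0,
               interval: float = 0.1) -> bool:
    """Poll predicate until it returns True or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        if predicate():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


# Simulated index: the record becomes "visible" on the third check.
checks = {"count": 0}


def record_is_visible() -> bool:
    checks["count"] += 1
    return checks["count"] >= 3


print(wait_until(record_is_visible, timeout=2.0, interval=0.01))  # True
```

In real code the predicate would run a search against the repository and check whether the expected record is returned.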
To index a single record, use the ElasticRepository.index_record method
Sync:
from bingo_elastic.predicates import SimilarityMatch
alg = SimilarityMatch(target, 0.9)
similar_records = repository.filter(similarity=alg, limit=20)
Async:
from bingo_elastic.predicates import SimilarityMatch
alg = SimilarityMatch(target, 0.9)
similar_records = await repository.filter(similarity=alg, limit=20)
In this case we requested the top 20 molecules most similar to target, based on the Tanimoto similarity metric
Supported similarity algorithms:
- SimilarityMatch or TanimotoSimilarityMatch
- EuclidSimilarityMatch
- TverskySimilarityMatch
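For intuition about what these metrics measure, the sketch below computes the textbook Tanimoto and Tversky coefficients on fingerprints represented as sets of "on" bit positions. It is an assumption-laden illustration: the function names and the example fingerprints are made up, and the scoring that bingo-elastic performs inside Elasticsearch may be defined or parameterised differently.

```python
from typing import Set


def tanimoto(a: Set[int], b: Set[int]) -> float:
    """Shared bits divided by all distinct bits: |A & B| / |A | B|."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def tversky(a: Set[int], b: Set[int],
            alpha: float = 0.5, beta: float = 0.5) -> float:
    """|A & B| / (|A & B| + alpha * |A - B| + beta * |B - A|)."""
    common = len(a & b)
    denom = common + alpha * len(a - b) + beta * len(b - a)
    return common / denom if denom else 0.0


# Made-up fingerprints: 2 shared bits, 6 distinct bits in total.
target = {1, 2, 3, 4}
candidate = {3, 4, 5, 6}

print(tanimoto(target, candidate))
# With alpha = beta = 1 the Tversky coefficient reduces to Tanimoto.
print(tversky(target, candidate, alpha=1.0, beta=1.0))
```

The alpha and beta weights let Tversky penalise bits unique to the target and bits unique to the candidate differently, which makes the measure asymmetric when the two weights differ.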
To run an exact match, your target must be either an IndigoRecordMolecule or an IndigoRecordReaction
Sync:
indigo = Indigo()
molecule = indigo.loadMolecule("CCO")
target = IndigoRecordMolecule(indigo_object=molecule)
exact_records = repo.filter(exact=target, indigo_session=indigo, limit=20)
indigo = Indigo()
reaction = indigo.loadReaction("C=C.BrBr>>C(CBr)Br")
target = IndigoRecordReaction(indigo_object=reaction)
exact_records = repo.filter(exact=target, indigo_session=indigo, limit=20)
Async:
indigo = Indigo()
molecule = indigo.loadMolecule("CCO")
target = IndigoRecordMolecule(indigo_object=molecule)
exact_records = await repo.filter(exact=target, indigo_session=indigo, limit=20)
indigo = Indigo()
reaction = indigo.loadReaction("C=C.BrBr>>C(CBr)Br")
target = IndigoRecordReaction(indigo_object=reaction)
exact_records = await repo.filter(exact=target, indigo_session=indigo, limit=20)
To run a substructure search, your target must be a query molecule or a query reaction
Sync:
indigo = Indigo()
target = indigo.loadQueryMolecule("CCO")
submatch_records = repo.filter(substructure=target, indigo_session=indigo, limit=20)
indigo = Indigo()
target = indigo.loadQueryReaction("C=C>>")
submatch_records = repo.filter(substructure=target, indigo_session=indigo, limit=20)
Async:
indigo = Indigo()
target = indigo.loadQueryMolecule("CCO")
submatch_records = await repo.filter(substructure=target, indigo_session=indigo, limit=20)
indigo = Indigo()
target = indigo.loadQueryReaction("C=C>>")
submatch_records = await repo.filter(substructure=target, indigo_session=indigo, limit=20)
Note: Bingo requests data from Elasticsearch in batches. You can control the batch size with the page_size argument
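The general idea of fetching results in batches of a fixed size can be sketched as follows. `in_pages` is a hypothetical helper for this example and works on any local iterable; it only illustrates the concept behind page_size and is not how bingo-elastic pages through Elasticsearch results.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def in_pages(items: Iterable[T], page_size: int) -> Iterator[List[T]]:
    """Yield consecutive lists of at most page_size items."""
    if page_size < 1:
        raise ValueError("page_size must be positive")
    iterator = iter(items)
    while True:
        page = list(islice(iterator, page_size))
        if not page:
            return
        yield page


pages = list(in_pages(range(7), page_size=3))
print(pages)  # [[0, 1, 2], [3, 4, 5], [6]]
```

A larger page size means fewer round trips but more data per response; a smaller one does the opposite.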
The async protocol is exactly the same; just don't forget to await
Indexing records with custom fields
indigo_record = IndigoRecord(indigo_object=compound)
indigo_record.chembl_id = "CHEMBL2063090"
indigo_record.compound_key = "GRAZOPREVIR"
indigo_record.internal_id = 10001
Searching for molecules similar to the target and keeping only those whose chembl_id equals CHEMBL2063090
from bingo_elastic.predicates import TanimotoSimilarityMatch
from bingo_elastic.queries import KeywordQuery
alg = TanimotoSimilarityMatch(target)
result = elastic_repository.filter(similarity=alg,
                                   chembl_id=KeywordQuery("CHEMBL2063090"))
Or you can just write:
result = elastic_repository.filter(similarity=alg,
                                   chembl_id="CHEMBL2063090")
You can use wildcard and range queries in the same way
from bingo_elastic.queries import WildcardQuery
result = elastic_repository.filter(chembl_id=WildcardQuery("CHEMBL2063*"))
from bingo_elastic.queries import RangeQuery
result = elastic_repository.filter(internal_id=RangeQuery(1000, 100000))
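For readers unfamiliar with the Elasticsearch query DSL, the sketch below builds the kind of wildcard and range clauses that such filters correspond to in Elasticsearch. The helper names are hypothetical, the inclusive bounds (gte/lte) are an assumption, and the exact request body that bingo-elastic generates may differ.

```python
from typing import Any, Dict


def wildcard_clause(field: str, pattern: str) -> Dict[str, Any]:
    """Elasticsearch wildcard query: * matches any sequence of characters."""
    return {"wildcard": {field: {"value": pattern}}}


def range_clause(field: str, lower: int, upper: int) -> Dict[str, Any]:
    """Elasticsearch range query; inclusive bounds are assumed in this sketch."""
    return {"range": {field: {"gte": lower, "lte": upper}}}


# Combine both clauses so that a document must satisfy each of them.
query = {
    "bool": {
        "filter": [
            wildcard_clause("chembl_id", "CHEMBL2063*"),
            range_clause("internal_id", 1000, 100000),
        ]
    }
}

print(query)
```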