
@yang-ruoxi
Member

@yang-ruoxi yang-ruoxi commented Mar 7, 2025

Major changes:

  • Defined a custom ReadOnlyResource for the task collection (it still inherits from the maggma ReadOnlyResource)
  • Used custom query operators for TaskReadOnlyResource that work with Atlas Search
  • Enabled previously non-working queries
  • Added facet query for returning facet information
  • Removed hint scheme for task collection (no longer needed with the new search index)

Designed to work with updated maggma release. Also, the roadmap is to use the same design pattern across all our collections.

TODOs:

  • Update tests

@yang-ruoxi
Member Author

Search Index:

  1. Only task_id, formula_pretty, chemsys, task_type, calc_type, run_type, and last_updated are supported. To support element queries, reduced_composition has dynamic mapping turned on.
  2. calcs_reversed and orig_inputs are excluded from the source_fields; the default behavior of the API is to not include these fields in the returned docs (they take too much bandwidth). They are returned only when users request them (MongoDB then needs to do a full collection lookup based on _id, which is slow). In all other cases we use the [returnStoredSource](https://www.mongodb.com/docs/atlas/atlas-search/return-stored-source/#std-label-fts-return-stored-source-option) option to retrieve the stored fields.
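
As a rough illustration of the stored-source behavior described in item 2, the sketch below switches returnStoredSource off only when one of the heavy fields is requested. The helper name build_task_search_stage, the HEAVY_FIELDS set, and the index name "default" are assumptions made for this example; they are not taken from the PR's code.

```python
from typing import Any, Optional

# Assumption: these are the fields left out of the index's stored source.
HEAVY_FIELDS = {"calcs_reversed", "orig_inputs"}


def build_task_search_stage(
    operator: dict[str, Any], requested_fields: Optional[list[str]] = None
) -> dict[str, Any]:
    """Return a $search stage. Hypothetical helper, not part of maggma."""
    requested = set(requested_fields or [])
    needs_full_doc = bool(requested & HEAVY_FIELDS)
    stage: dict[str, Any] = {"index": "default", **operator}
    # With returnStoredSource the documents come straight from the search
    # index, so it is only usable when no excluded field was requested.
    stage["returnStoredSource"] = not needs_full_doc
    return {"$search": stage}


calc_type_clause = {"equals": {"path": "calc_type", "value": "GGA Static"}}
fast = build_task_search_stage(calc_type_clause)
slow = build_task_search_stage(
    calc_type_clause, requested_fields=["task_id", "calcs_reversed"]
)
```

Here `fast` reads only stored fields, while `slow` falls back to the full-document lookup that the comment above describes as slow.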

API service:

  1. TaskReadOnlyResource is a custom class that inherits from maggma's ReadOnlyResource, because we want to override its build_dynamic_model_search() method.
  2. The merge_queries and generate_query_pipeline methods needed significant changes, so I wrote two new util functions, merge_atlas_queries and generate_atlas_search_pipeline, which are only used in the TaskResource build_dynamic_model_search().
  3. See below for how exactly these queries are handled.
  4. We are using the existing API architecture, so the default STORE_PARAMS stay the same.
  5. The count and hint schemes are no longer needed.

In the future, the aim is to completely replace the current query operator design. Until then, as a transitional step, we have:

Remaining items:

  • make the ReadOnlyResource and QueryOperator accept a parameter that decides whether to use Atlas Search.
  • make the search index building dynamic

Client:

  1. By default, we exclude calcs_reversed and orig_inputs unless they are requested.
  2. It now exposes the meta info containing facets.

Query pattern shift:

Previous query pipeline:

All query keys are the field names, e.g.:

 {'nelements': 4, 'composition_reduced.Si': {'$exists': True}, 'composition_reduced.O': {'$exists': True}, 'composition_reduced.K': {'$exists': True}}

These queries go under the “criteria” dictionary, e.g.

{"criteria": {'nelements': 4, 'composition_reduced.Si': {'$exists': True}, 'composition_reduced.O': {'$exists': True}, 'composition_reduced.K': {'$exists': True}}}

They are later inserted into an aggregation pipeline. Because the actual field path is the key, there is little to no risk of duplicate keys in the query params, so the design of query_operator follows {"criteria": {field: query}}.

Atlas Search, in contrast, uses the operator itself as the key, e.g.:

{
  equals: {
    path: "nelements",
    value: 3,
  },
},
{
  exists: {
    path: "composition_reduced.Ac",
  },
},
{
  exists: {
    path: "composition_reduced.Cu",
  },
},

Because of this, the query_operator may have duplicate keys (e.g. exists). The solution is to have the query operators follow a {"criteria": list} type. When merging all incoming queries, we simply append to a list of dictionaries to make a compound query like this:

operator: {
  compound: {
    must: [
      {
        equals: {
          path: "nelements",
          value: 3,
        },
      },
      {
        exists: {
          path: "composition_reduced.Ac",
        },
      },
      {
        exists: {
          path: "composition_reduced.Cu",
        },
      },
      {
        equals: {
          path: "calc_type",
          value: "GGA Static",
        },
      },
    ],
  },
},
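
The merge step can be sketched in Python as follows. This is a minimal illustration, assuming each query operator returns its clauses as a list under "criteria". The function name merge_criteria and the two sample operator outputs are hypothetical and are not the PR's merge_atlas_queries implementation.

```python
from typing import Any

# Illustrative operator outputs: each one returns a list of clauses.
element_query = {
    "criteria": [
        {"equals": {"path": "nelements", "value": 3}},
        {"exists": {"path": "composition_reduced.Ac"}},
        {"exists": {"path": "composition_reduced.Cu"}},
    ]
}
calc_type_query = {
    "criteria": [{"equals": {"path": "calc_type", "value": "GGA Static"}}]
}


def merge_criteria(queries: list[dict[str, Any]]) -> dict[str, Any]:
    """Concatenate list-type criteria into one compound 'must' clause."""
    must: list[dict[str, Any]] = []
    for query in queries:
        must.extend(query.get("criteria", []))
    return {"compound": {"must": must}}


merged = merge_criteria([element_query, calc_type_query])

# Contrast: keying a dict by operator name silently drops repeated operators.
lossy: dict[str, Any] = {}
for clause in element_query["criteria"]:
    lossy.update(clause)  # the second "exists" overwrites the first
```

The `lossy` dict ends up with only one exists clause, which is the duplication problem that the list-based criteria avoid.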

Note that, because we are using the existing query infrastructure, the parameters are still defined as:

STORE_PARAMS = dict[
    Literal[
        "criteria",
        "properties",
        "sort",
        "skip",
        "limit",
        "request",
        "pipeline",
        "count_hint",
        "agg_hint",
        "update",
    ],
    Any,
]

This lets us keep using the current implementation. The difference is that the "criteria" entry was previously translated to a $match stage, but is now translated to a $search stage.
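
To make the $match versus $search difference concrete, here is a small sketch that turns a STORE_PARAMS-style dict into a pipeline in either mode. The function build_pipeline, its use_atlas_search flag, and the index name "default" are assumptions for illustration. It only handles "criteria", "properties", "skip", and "limit", and it does not handle an empty criteria list in search mode.

```python
from typing import Any


def build_pipeline(
    params: dict[str, Any], use_atlas_search: bool
) -> list[dict[str, Any]]:
    """Turn STORE_PARAMS-style input into an aggregation pipeline (sketch)."""
    criteria = params.get("criteria")
    if use_atlas_search:
        # New pattern: criteria is a list of Atlas Search clauses.
        first_stage: dict[str, Any] = {
            "$search": {
                "index": "default",
                "compound": {"must": list(criteria or [])},
            }
        }
    else:
        # Old pattern: criteria is a {field: query} dict.
        first_stage = {"$match": dict(criteria or {})}
    pipeline = [first_stage]
    if params.get("properties"):
        pipeline.append({"$project": {f: 1 for f in params["properties"]}})
    if params.get("skip"):
        pipeline.append({"$skip": params["skip"]})
    if params.get("limit"):
        pipeline.append({"$limit": params["limit"]})
    return pipeline


old_style = build_pipeline(
    {"criteria": {"nelements": 4}, "limit": 10}, use_atlas_search=False
)
new_style = build_pipeline(
    {"criteria": [{"equals": {"path": "nelements", "value": 4}}], "limit": 10},
    use_atlas_search=True,
)
```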

Improvements:

  • No more counting needed for the progress bar (automatically returned in the meta info)
  • No hint scheme needed.
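
One way Atlas Search can return a total count together with facet buckets is a $searchMeta stage with the facet collector, sketched below. Whether the PR uses $searchMeta or the $$SEARCH_META variable is not stated here, so treat this as an assumption. The helper name build_meta_stage and the index name "default" are hypothetical, and the faceted fields must be indexed for faceting for the stage to work.

```python
from typing import Any


def build_meta_stage(
    must: list[dict[str, Any]], facet_fields: list[str]
) -> dict[str, Any]:
    """Build a $searchMeta stage with a total count and string facets."""
    facets = {
        field: {"type": "string", "path": field} for field in facet_fields
    }
    return {
        "$searchMeta": {
            "index": "default",
            "facet": {
                "operator": {"compound": {"must": must}},
                "facets": facets,
            },
            # Ask for an exact count instead of a lower bound.
            "count": {"type": "total"},
        }
    }


meta_stage = build_meta_stage(
    [{"equals": {"path": "chemsys", "value": "O-Si"}}],
    ["calc_type", "task_type", "run_type"],
)
```

Running `collection.aggregate([meta_stage])` against an Atlas cluster would yield a single metadata document, which a client could use for the progress bar total.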

Other considerations:

  • Should we always use a compound search? See performance comparison:

The single operator query is three times faster than the compound query with a single operator.

  • last_updated is a string field, but a BSON date is needed to create a "date" search index. Solution: convert it to ISODate format. We can then use the near and range operators on this field.

    from datetime import datetime

    from bson import ObjectId

    # `collection` is an existing pymongo collection
    collection.update_one(
        {"_id": ObjectId("your_document_id")},
        {"$set": {"date_field": datetime.strptime("2024-02-20T12:34:56", "%Y-%m-%dT%H:%M:%S")}},
    )
  • Avoid [$match](https://www.mongodb.com/docs/manual/reference/operator/aggregation/match/) After $search

Using a $match aggregation pipeline stage after a [$search](https://www.mongodb.com/docs/atlas/atlas-search/query-syntax/#std-label-query-syntax-ref) stage can drastically slow down query results. If possible, design your $search query so that all necessary filtering occurs in the $search stage to remove the need for a $match stage.

  • Avoid [$sort](https://www.mongodb.com/docs/manual/reference/operator/aggregation/sort/) After $search

Using a $sort aggregation pipeline stage after a [$search](https://www.mongodb.com/docs/atlas/atlas-search/query-syntax/#std-label-query-syntax-ref) stage can drastically slow down query results.
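
The considerations above can be combined in one stage: filter on the converted date field with the range operator and sort inside $search, so that no $match or $sort stage has to follow it. This is a sketch under assumptions: the helper name build_recent_tasks_stage and the index name "default" are hypothetical, and last_updated is assumed to be already stored as a BSON date and indexed as a date.

```python
from datetime import datetime
from typing import Any


def build_recent_tasks_stage(
    since: str, until: str, calc_type: str
) -> dict[str, Any]:
    """Filter by date range and calc_type, sorting inside $search."""
    return {
        "$search": {
            "index": "default",
            "compound": {
                # "filter" clauses restrict results without scoring them.
                "filter": [
                    {"equals": {"path": "calc_type", "value": calc_type}},
                    {
                        "range": {
                            "path": "last_updated",
                            "gte": datetime.fromisoformat(since),
                            "lt": datetime.fromisoformat(until),
                        }
                    },
                ]
            },
            # Sorting here replaces a separate $sort stage after $search.
            "sort": {"last_updated": -1},
        }
    }


stage = build_recent_tasks_stage(
    "2024-01-01T00:00:00", "2025-01-01T00:00:00", "GGA Static"
)
```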

Benchmarking

Direct query on MongoDB:

V7 cluster + index vs. V8 cluster + atlas index:

  • v7 cluster
import time

query = {"formula_pretty": "NaCl"}
start_time = time.time()
results = list(v7_collection.find(query))
end_time = time.time()
print(f"Total API Query Time: {end_time - start_time:.4f} seconds")
print(f"Number of results: {len(results)}")

Total API Query Time: 1.6338 seconds
Number of results: 57
  • v8 cluster
# make a compound search:
must = []
for k, v in query.items():
    must.append({"equals": {"path": k, "value": v}})
compound = {"must": must}

# single-operator form of the same query (not used in the pipeline below)
equals = {
    "path": list(query.keys())[0],
    "value": list(query.values())[0],
}

pipeline = [
    {
        "$search": {
            "index": "default",
            "compound": compound,
            "returnStoredSource": True,
        },
    },
]
start_time = time.time()
results = list(v8_collection.aggregate(pipeline))  # Run the actual query, not explain()
end_time = time.time()
print(f"Total API Query Time: {end_time - start_time:.4f} seconds")
print(f"Number of results: {len(results)}")

Total API Query Time: 0.6591 seconds
Number of results: 57
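
For repeating this kind of comparison, including the single-operator versus compound question raised earlier, a small timing helper may be useful. The sketch below is an assumption-laden illustration: single_and_compound and best_of are hypothetical names, and the actual database call is left as a comment because it needs a live Atlas cluster.

```python
import time
from typing import Any, Callable


def single_and_compound(
    path: str, value: Any
) -> tuple[list[dict], list[dict]]:
    """Return (single-operator pipeline, compound pipeline) for one equality."""
    equals = {"path": path, "value": value}
    common = {"index": "default", "returnStoredSource": True}
    single = [{"$search": {**common, "equals": equals}}]
    compound = [
        {"$search": {**common, "compound": {"must": [{"equals": equals}]}}}
    ]
    return single, compound


def best_of(run: Callable[[], Any], repeats: int = 3) -> float:
    """Return the fastest wall-clock time of several runs, in seconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        run()
        timings.append(time.perf_counter() - start)
    return min(timings)


single, compound = single_and_compound("formula_pretty", "NaCl")
# Against a real collection one could call, for example:
#   best_of(lambda: list(v8_collection.aggregate(single)))
#   best_of(lambda: list(v8_collection.aggregate(compound)))
```

Taking the best of several runs reduces the influence of cold caches and network jitter on a single measurement.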

Other heavier queries:

query = {"chemsys": "O-Si"} (text search)

v7 + index = 402s, 4198 results
v8 + atlas search = 59s, 4198 results

Compound query (current chemsys query):

query = {"nelements": 2, "composition_reduced.Si": {"$exists": True}, "composition_reduced.O": {"$exists": True}}

v7 + index = TIME OUT
v8 + atlas search = 71.9603 seconds
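
The benchmark snippet earlier only maps equality conditions to equals clauses, so running the compound query above through Atlas Search also needs the $exists conditions translated. A minimal sketch of such a translation follows. The function to_atlas_clauses is hypothetical and only supports plain equality and {"$exists": True}; anything else raises an error rather than guessing.

```python
from typing import Any


def to_atlas_clauses(mongo_filter: dict[str, Any]) -> list[dict[str, Any]]:
    """Translate a flat find()-style filter into Atlas Search clauses."""
    clauses: list[dict[str, Any]] = []
    for path, condition in mongo_filter.items():
        if isinstance(condition, dict):
            if condition == {"$exists": True}:
                clauses.append({"exists": {"path": path}})
            else:
                raise ValueError(
                    f"unsupported condition for {path}: {condition}"
                )
        else:
            clauses.append({"equals": {"path": path, "value": condition}})
    return clauses


query = {
    "nelements": 2,
    "composition_reduced.Si": {"$exists": True},
    "composition_reduced.O": {"$exists": True},
}
search_stage = {
    "$search": {
        "index": "default",  # assumed index name
        "compound": {"must": to_atlas_clauses(query)},
    }
}
```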

V7 cluster + regular index + DEV API Server:

mpr.materials.tasks.search(elements=["O", "Si", "K"]): timed out
mpr.materials.tasks.search(formula="SiO2"): 64 seconds
mpr.materials.tasks.search(formula="CsPbBr3"): timed out

(Compound search really sucks)

V8 cluster + Search Index + DEV API server:

mpr.materials.tasks.search(elements=["O", "Si", "K"]): 15 seconds
mpr.materials.tasks.search(formula="SiO2"): 24 seconds
mpr.materials.tasks.search(formula="CsPbBr3"): 0.6 seconds

@tschaume
Member

tschaume commented May 1, 2025

FYI @kbuma

@tsmathis tsmathis marked this pull request as draft October 1, 2025 19:19

4 participants