Make task resource use atlas search #1203

yang-ruoxi · 2025-03-07T21:17:40Z

Major changes:

Defined a custom ReadOnlyResource for the task collection (it still inherits from the maggma ReadOnlyResource)
Used custom query operators for TaskReadOnlyResource that work with Atlas Search
Enabled previously non-working queries
Added a facet query for returning facet information
Removed the hint scheme for the task collection (no longer needed with the new search index)

Designed to work with the updated maggma release. The roadmap is to use the same design pattern across all our collections.

TODOs:

Update tests

yang-ruoxi · 2025-03-19T22:21:18Z

Search Index:

Only task_id, formula_pretty, chemsys, task_type, calc_type, run_type, and last_updated are supported. In addition, reduced_composition has dynamic mapping turned on so that its sub-fields can be searched.
calcs_reversed and orig_inputs are excluded from the stored source fields. By default, the API does not include these fields in the returned docs because they take too much bandwidth. They are returned only when users request them, in which case MongoDB needs to do a full collection lookup based on _id, which is slow. Otherwise, we use the [returnStoredSource](https://www.mongodb.com/docs/atlas/atlas-search/return-stored-source/#std-label-fts-return-stored-source-option) option to retrieve the stored fields.
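
As a rough illustration of the index described above, here is a minimal sketch of an Atlas Search index definition written as a Python dict. The field types (`token`, `date`) and the helper name `build_index_definition` are assumptions made for this example; the actual index used in this PR may be defined differently.

```python
from typing import Any

# Assumed field types, for illustration only (not taken from the PR).
ASSUMED_FIELD_TYPES = {
    "task_id": "token",
    "formula_pretty": "token",
    "chemsys": "token",
    "task_type": "token",
    "calc_type": "token",
    "run_type": "token",
    "last_updated": "date",
}


def build_index_definition() -> dict[str, Any]:
    """Return a hypothetical search index definition for the task collection."""
    fields: dict[str, Any] = {
        name: {"type": field_type} for name, field_type in ASSUMED_FIELD_TYPES.items()
    }
    # Dynamic mapping on the composition sub-document indexes every element key.
    fields["reduced_composition"] = {"type": "document", "dynamic": True}
    return {
        "mappings": {"dynamic": False, "fields": fields},
        # Keep the two large fields out of the stored source.
        "storedSource": {"exclude": ["calcs_reversed", "orig_inputs"]},
    }
```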

API service:

TaskReadOnlyResource is a custom class that inherits from maggma's ReadOnlyResource. We need it because we want to override the existing build_dynamic_model_search() method.
The merge_queries and generate_query_pipeline methods needed significant changes, so I wrote two new util functions, merge_atlas_queries and generate_atlas_search_pipeline, which are only used in the TaskResource build_dynamic_model_search().
See below for how exactly these queries are handled.
We are using the existing API architecture, so the default STORE_PARAMS stays the same.
The count and hint schemes are no longer needed.

In the future, the aim is to completely replace the current query operator design. Before that, as a transitional step, we have:

Remaining items:

Make ReadOnlyResource and QueryOperator accept a parameter that decides whether to use Atlas Search.
Make the search index building dynamic.

Client:

By default, we exclude calcs_reversed and orig_inputs unless they are requested.
The client now exposes the meta info containing facets.
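
For context on where facet information can come from, the following is a minimal sketch of a `$searchMeta` stage that uses the Atlas Search facet collector. The helper name `facet_stage` is hypothetical, and the sketch assumes each faceted field is indexed with a facet-capable type; it is not necessarily how this PR builds its facet query.

```python
from typing import Any


def facet_stage(
    must: list[dict[str, Any]],
    facet_fields: list[str],
    index: str = "default",
) -> dict[str, Any]:
    """Build a $searchMeta stage that returns string facet buckets (hypothetical helper)."""
    return {
        "$searchMeta": {
            "index": index,
            "facet": {
                # The operator restricts which documents are counted in the buckets.
                "operator": {"compound": {"must": list(must)}},
                # One string facet per requested field, named after the field.
                "facets": {
                    name: {"type": "string", "path": name} for name in facet_fields
                },
            },
        }
    }
```

Such a stage would be run as its own aggregation, for example `collection.aggregate([facet_stage(clauses, ["calc_type", "task_type"])])`, with `collection` being a pymongo collection.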

Query pattern shift:

Previous query pipeline:

All query keys are the field names, e.g.:

 {'nelements': 4, 'composition_reduced.Si': {'$exists': True}, 'composition_reduced.O': {'$exists': True}, 'composition_reduced.K': {'$exists': True}}

These queries go under the "criteria" dictionary, e.g.

{"criteria": {'nelements': 4, 'composition_reduced.Si': {'$exists': True}, 'composition_reduced.O': {'$exists': True}, 'composition_reduced.K': {'$exists': True}}}

They are later inserted into an aggregation pipeline. Because the actual field path is the key, there is little risk of duplicate keys across the query params, so the query_operator design follows {"criteria": {field: query}}.
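
The field-keyed design can be sketched as follows. This is a simplified illustration, not maggma's actual merge_queries implementation; the function names are made up for the example.

```python
from typing import Any


def merge_field_criteria(queries: list[dict[str, Any]]) -> dict[str, Any]:
    """Merge field-keyed criteria from several query operators into one dict."""
    criteria: dict[str, Any] = {}
    for query in queries:
        # Each key is a field path, so collisions between operators are unlikely.
        criteria.update(query.get("criteria", {}))
    return criteria


def match_pipeline(queries: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Wrap the merged criteria in a $match stage."""
    return [{"$match": merge_field_criteria(queries)}]
```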

Atlas Search, in contrast, uses the operator itself as the key, e.g.:

{
  equals: {
    path: "nelements",
    value: 3,
  },
},
{
  exists: {
    path: "composition_reduced.Ac",
  },
},
{
  exists: {
    path: "composition_reduced.Cu",
  },
},

Because of this, the query_operator may produce duplicate keys (e.g. exists). The solution is to have the query operators follow a {"criteria": list} type. When merging all incoming queries, we simply append to a list of dictionaries to build a compound query like this:

operator: {
  compound: {
    must: [
      {
        equals: {
          path: "nelements",
          value: 3,
        },
      },
      {
        exists: {
          path: "composition_reduced.Ac",
        },
      },
      {
        exists: {
          path: "composition_reduced.Cu",
        },
      },
      {
        equals: {
          path: "calc_type",
          value: "GGA Static",
        },
      },
    ],
  },
},
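
The list-based merge can be sketched roughly like this. The function name `merge_atlas_criteria` is hypothetical and the code is only an illustration of the idea, not the merge_atlas_queries util from this PR.

```python
from typing import Any


def merge_atlas_criteria(queries: list[dict[str, Any]]) -> dict[str, Any]:
    """Concatenate list-typed criteria into a single compound/must operator."""
    must: list[dict[str, Any]] = []
    for query in queries:
        # Each operator contributes a list of clauses, so repeated operator
        # names such as "exists" never overwrite each other.
        must.extend(query.get("criteria", []))
    return {"compound": {"must": must}}
```

Because the clauses live in a list rather than a dict, two `exists` clauses on different paths can coexist in the same query.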

Note that, because we are reusing the existing query infrastructure, the parameters are still defined as:

STORE_PARAMS = dict[
    Literal[
        "criteria",
        "properties",
        "sort",
        "skip",
        "limit",
        "request",
        "pipeline",
        "count_hint",
        "agg_hint",
        "update",
    ],
    Any,
]

This lets us keep using the current implementation. The difference is that the "criteria" content was previously translated into a $match stage, but is now translated into a $search stage.
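
A minimal sketch of that translation, assuming "criteria" already holds a non-empty list of Atlas Search clauses. The function name `atlas_search_pipeline` and its defaults are assumptions for illustration; the PR's generate_atlas_search_pipeline may differ.

```python
from typing import Any


def atlas_search_pipeline(
    params: dict[str, Any],
    index: str = "default",
    return_stored_source: bool = True,
) -> list[dict[str, Any]]:
    """Turn STORE_PARAMS-style parameters into a $search-based pipeline."""
    search: dict[str, Any] = {
        "index": index,
        # Assumes at least one clause is present in "criteria".
        "compound": {"must": list(params.get("criteria", []))},
        "returnStoredSource": return_stored_source,
    }
    if params.get("sort"):
        # Sorting inside $search avoids a separate $sort stage afterwards.
        search["sort"] = params["sort"]
    pipeline: list[dict[str, Any]] = [{"$search": search}]
    if params.get("skip"):
        pipeline.append({"$skip": params["skip"]})
    if params.get("limit"):
        pipeline.append({"$limit": params["limit"]})
    return pipeline
```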

Improvements:

No more counting needed for the progress bar (automatically returned in the meta info)
No hint scheme needed.
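
One way a total count can be obtained without a separate count query is the `count` option of Atlas Search. The sketch below builds a `$searchMeta` stage for it; the helper name `count_stage` is hypothetical, and whether this PR obtains its count this way is an assumption.

```python
from typing import Any


def count_stage(must: list[dict[str, Any]], index: str = "default") -> dict[str, Any]:
    """Build a $searchMeta stage that asks Atlas Search for an exact total count."""
    return {
        "$searchMeta": {
            "index": index,
            "compound": {"must": list(must)},
            # "total" requests an exact count rather than a lower bound.
            "count": {"type": "total"},
        }
    }
```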

Other considerations:

Should we always use a compound search? See the performance comparison:

The single-operator query is three times faster than a compound query that contains a single operator.
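
If that difference holds in general, one option is to skip the compound wrapper when there is only one clause. The following is a sketch of that idea with a hypothetical helper name; it is not part of this PR.

```python
from typing import Any


def to_search_operator(clauses: list[dict[str, Any]]) -> dict[str, Any]:
    """Return a bare operator for one clause, or compound/must for several."""
    if not clauses:
        raise ValueError("at least one search clause is required")
    if len(clauses) == 1:
        # A single clause is used directly, without the compound wrapper.
        return dict(clauses[0])
    return {"compound": {"must": list(clauses)}}
```

The returned dict can be merged into the `$search` stage body, for example `{"index": "default", **to_search_operator(clauses)}`.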

last_updated is a string field, but a BSON date is needed to build the "date" search index. Solution: convert it to ISODate format. We can then use the near and range operators on this field.

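from datetime import datetime

from bson import ObjectId

# `collection` is a pymongo Collection; replace the _id and field name with real values.
collection.update_one(
    {"_id": ObjectId("your_document_id")},
    # Store a BSON date (ISODate) instead of the original string.
    {"$set": {"date_field": datetime.strptime("2024-02-20T12:34:56", "%Y-%m-%dT%H:%M:%S")}},
)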
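
Once the field is stored as a date, a date window can be expressed with the Atlas Search `range` operator. The sketch below uses a hypothetical helper name and assumes last_updated is indexed with the `date` type.

```python
from datetime import datetime, timezone
from typing import Any


def last_updated_range(start: datetime, end: datetime) -> dict[str, Any]:
    """Build a range clause selecting start <= last_updated < end."""
    return {"range": {"path": "last_updated", "gte": start, "lt": end}}


# Example: everything updated during 2024 (UTC).
clause = last_updated_range(
    datetime(2024, 1, 1, tzinfo=timezone.utc),
    datetime(2025, 1, 1, tzinfo=timezone.utc),
)
```

The resulting clause can be appended to the `must` list of a compound query like any other operator.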

Avoid [$match](https://www.mongodb.com/docs/manual/reference/operator/aggregation/match/) After $search

Using a $match aggregation pipeline stage after a [$search](https://www.mongodb.com/docs/atlas/atlas-search/query-syntax/#std-label-query-syntax-ref) stage can drastically slow down query results. If possible, design your $search query so that all necessary filtering occurs in the $search stage to remove the need for a $match stage.

Avoid [$sort](https://www.mongodb.com/docs/manual/reference/operator/aggregation/sort/) After $search

Using a $sort aggregation pipeline stage after a [$search](https://www.mongodb.com/docs/atlas/atlas-search/query-syntax/#std-label-query-syntax-ref) stage can drastically slow down query results.

Benchmarking

Direct query on MongoDB:

V7 cluster + index vs. V8 cluster + atlas index:

v7 cluster

import time

# `v7_collection` is a pymongo Collection on the v7 cluster.
query = {"formula_pretty": "NaCl"}
start_time = time.time()
results = list(v7_collection.find(query))
end_time = time.time()
print(f"Total API Query Time: {end_time - start_time:.4f} seconds")
print(f"Number of results: {len(results)}")

Total API Query Time: 1.6338 seconds
Number of results: 57

v8 cluster

# Build a compound search with one "equals" clause per query key.
must = [{"equals": {"path": k, "value": v}} for k, v in query.items()]
compound = {"must": must}
# Single-operator form of the same query (not used in the pipeline below).
equals = {"path": list(query.keys())[0], "value": list(query.values())[0]}

pipeline = [
    {
        "$search": {
            "index": "default",
            "compound": compound,
            "returnStoredSource": True,
        },
    },
]
start_time = time.time()
results = list(v8_collection.aggregate(pipeline))  # Run the actual query, not explain()
end_time = time.time()
Total API Query Time: 0.6591 seconds
Number of results: 57
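
The timing pattern used in both snippets could be factored into a small helper such as the one below. This is a convenience sketch rather than code from the PR, and it uses `time.perf_counter`, which is generally better suited to measuring elapsed time than `time.time`.

```python
import time
from typing import Any, Callable


def timed(fn: Callable[[], Any]) -> tuple[Any, float]:
    """Call fn once and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start


# Hypothetical usage with the collections from the benchmark above:
# results, elapsed = timed(lambda: list(v7_collection.find(query)))
# print(f"Total API Query Time: {elapsed:.4f} seconds")
```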

Other heavier queries:

query = {"chemsys": "O-Si"} (text search)

v7 + index = 402s, 4198 results
v8 + atlas search = 59s, 4198 results

Compound query (current chemsys query):

query = {"nelements": 2, "composition_reduced.Si": {"$exists": True}, "composition_reduced.O": {"$exists": True}}

v7 + index = TIME OUT
v8 + atlas search = 71.9603 seconds

V7 cluster + regular index + DEV API Server:

mpr.materials.tasks.search(elements = ["O", "Si", "K"]): timed out
mpr.materials.tasks.search(formula = "SiO2" ): 64 seconds
mpr.materials.tasks.search(formula = "CsPbBr3" ): timed out

(Compound search really sucks)

V8 cluster + Search Index + DEV API server:

mpr.materials.tasks.search(elements = ["O", "Si", "K"]): 15 seconds
mpr.materials.tasks.search(formula = "SiO2" ): 24 seconds
mpr.materials.tasks.search(formula = "CsPbBr3" ): 0.6 seconds

tschaume · 2025-05-01T20:07:17Z

FYI @kbuma

yang-ruoxi added 14 commits February 19, 2025 15:18

use atlase search param for task endpoint

3615df1

chemsys compound search, facet query

64257e0

add more queries and separate facet

e921d31

add facet to meta and remove print

c430b1d

fix las_updated query

dae5348

add sort and skip query

9b9e8fe

add task elements query

d0a702d

remove hint, lint

1385457

lint

814ce60

Merge branch 'main' into atlas_search_update

c8168da

fix test and lint

cde3d58

add formula to search util

62ddc5c

use formula_to_search util for formula query

58d3359

pre-commit

ffbc144

tschaume assigned kbuma Sep 12, 2025

tsmathis marked this pull request as draft October 1, 2025 19:19
