
Conversation

hholb (Member) commented Dec 10, 2025

Overview

This PR adds the first benchmarking feature to Garden, built on Matbench Discovery.

Discussion

Everything lives in a new garden_ai.benchmarks module. Here is an example script that runs the full Matbench Discovery benchmark on MACE (more examples in the garden_ai/benchmarks/matbench_discovery/examples/ folder):

from garden_ai.benchmarks.matbench_discovery import MatbenchDiscovery

ANVIL = "5aafb4c1-27b2-40d8-a038-a0277611868f"

def create_mace_model(device):
    from mace.calculators import mace_mp
    return mace_mp(model="medium-mpa-0", device=device, default_dtype="float64")


def main():
    results = MatbenchDiscovery.IS2RE.remote(
        endpoint=ANVIL,
        user_endpoint_config={
            "scheduler_options": "#SBATCH --gpus-per-node=4\n",
            "walltime": "05:00:00",
            "qos": "gpu",
            "partition": "gpu",
            "account": "cis250461-gpu",
            "cores_per_node": 16,
            "requirements": "",  # 'requirements' is required for Anvil endpoint
        },
        model_factory=create_mace_model,
        model_packages=[
            "mace-torch",
            "cuequivariance",
            "cuequivariance-torch",
            "cuequivariance-ops-torch-cu12",
        ],
        # this tells the remote runner where to save periodic checkpoints,
        # will resume a previous run if it finds an existing checkpoint there
        checkpoint_path="~/.garden/benchmarks/matbench_mace-torch_cuequivariance_full_20251208_115719_ed2e47af.json",
    )

    if "error" in results.get("metrics", {}):
        print(f"Error: {results['metrics']['error']}")
    else:
        print("Benchmark Results:", results)


if __name__ == "__main__":
    main()

The benchmark tasks are implemented as @hog.method()s on the MatbenchDiscovery class. This makes it easy to run them on remote sites through globus-compute, or to use the garden SDK directly on a system you already have access to via a .local() call. The downside is that tasks.py is pretty huge, since all of the logic for running the benchmark and calculating the metrics has to live in one file; the upside is that we don't need to add matbench_discovery as a dependency of the garden SDK, because groundhog installs it in the venv it creates to run the functions.
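For example, someone with direct access to a GPU machine could skip globus-compute entirely and run a task in place with .local(). Here is a minimal sketch based on the .remote() example above; I'm assuming .local() accepts the same model_factory / model_packages / checkpoint_path kwargs and simply omits the endpoint configuration:

from garden_ai.benchmarks.matbench_discovery import MatbenchDiscovery

def create_mace_model(device):
    from mace.calculators import mace_mp
    return mace_mp(model="medium-mpa-0", device=device, default_dtype="float64")

# Runs the IS2RE task in a locally-created venv instead of on a remote endpoint.
results = MatbenchDiscovery.IS2RE.local(
    model_factory=create_mace_model,
    model_packages=["mace-torch"],
    checkpoint_path="~/.garden/benchmarks/local_mace_is2re.json",  # hypothetical path
)
print(results)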

I implemented the full list of tasks, defined by the matbench_discovery.enums.Task enum:

class Task(LabelEnum):
    """Thermodynamic stability prediction task types."""
    RP2RE = "RP2RE", "relaxed prototype to relaxed energy"
    RS2RE = "RS2RE", "relaxed structure to relaxed energy"
    S2E = "S2E", "structure to energy"
    # S2RE is for models that learned a discrete version of PES like CGCNN+P
    S2RE = "S2RE", "structure to relaxed energy"
    S2EF = "S2EF", "structure to energy, force"
    S2EFS = "S2EFS", "structure to energy, force, stress"
    S2EFSM = "S2EFSM", "structure to energy, force, stress, magmoms"
    IP2E = "IP2E", "initial prototype to energy"
    IS2E = "IS2E", "initial structure to energy"
    # IS2RE is for models that learned a discrete version of PES like CGCNN+P
    IS2RE = "IS2RE", "initial structure to relaxed energy"

So in theory, any model that can be used as an ASE calculator can run the benchmark, no matter what it was trained to do. So far I have only tested MACE, SevenNet, Mattersim, and EquiformerV2.

The general idea is that you pass in a model_factory, a function that builds and returns a model instance, and model_packages, the list of Python packages the model factory needs to run. The tasks call these inside the venv to set up the model for benchmarking. It currently only supports models that are pip-installable, but it shouldn't be too hard to pull down and instantiate a model from git (or the future project formerly known as Graft).
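As a toy illustration of the factory pattern (and of the "any ASE calculator" point above), here is a sketch using ASE's built-in EMT calculator, which needs no extra pip packages; it isn't a model you would actually benchmark, and whether an empty model_packages list is accepted is an assumption:

def create_emt_model(device):
    # Any callable that returns an ASE Calculator works; EMT is CPU-only and ignores `device`.
    from ase.calculators.emt import EMT
    return EMT()

results = MatbenchDiscovery.IS2RE.local(
    model_factory=create_emt_model,
    model_packages=[],  # EMT ships with ase itself, so nothing extra to install (assumption)
)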

Since it takes ~20 GPU hours to run the full benchmark, I implemented a checkpoint/resume system that writes the calculated energies and the index of each processed structure to a JSON file in ~/.garden/benchmarks/ on the system running the benchmark. If you give a .submit(), .remote(), or .local() call a checkpoint_path kwarg, it looks there for an existing checkpoint file, figures out which structures have already been processed, and resumes from there. We print the checkpoint path to stdout when the job starts and also attach it to the future returned by a .submit() call, so you can grab the checkpoint path like this:

fut = MatbenchDiscovery.IS2RE.submit(...)
checkpoint = fut.checkpoint_path
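Passing that same path back into a later call picks the run up wherever the checkpoint left off:

# Resume the interrupted run: same task and model arguments, same checkpoint file.
fut = MatbenchDiscovery.IS2RE.submit(..., checkpoint_path=checkpoint)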

Metrics Calculations

I reuse the metric calculation functions matbench uses internally, but I had issues importing them directly from matbench, so I copied the implementation into the tasks.py file. We reproduce the key metrics from the official matbench leaderboard and add some of our own more 'meta' metrics. Here is an example blob of metrics from a MACE run:

TODO add some metrics when the running job finishes
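In the meantime, here is a rough numpy sketch of the kind of calculation behind the headline discovery metrics: threshold the predicted energy above the convex hull to classify each structure as stable or unstable, then score against DFT. This is only an illustration of the idea, not the implementation copied from matbench:

import numpy as np

def stability_metrics(e_above_hull_true, e_above_hull_pred, threshold=0.0):
    # Classify structures as stable if the energy above hull (eV/atom) is <= threshold.
    true_stable = e_above_hull_true <= threshold
    pred_stable = e_above_hull_pred <= threshold

    tp = np.sum(pred_stable & true_stable)
    fp = np.sum(pred_stable & ~true_stable)
    fn = np.sum(~pred_stable & true_stable)

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    mae = np.mean(np.abs(e_above_hull_pred - e_above_hull_true))

    return {"precision": precision, "recall": recall, "f1": f1, "mae": mae}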

Publishing results

Super users can publish results to the official Garden leaderboard using the publish_benchmark_result helper function. Regular users can call the function, but the backend will reject non-super users' requests.
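Roughly, a super user's workflow would look something like the sketch below; only the publish_benchmark_result name comes from this PR, so the import path and call signature here are guesses:

from garden_ai.benchmarks import publish_benchmark_result  # import path is an assumption

# 'results' is the dict returned by a completed benchmark run.
publish_benchmark_result(results)  # non-super users will get rejected by the backend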

Testing

Manually tested a bunch of times using the scripts in garden_ai/benchmarks/matbench_discovery/examples/

Documentation

No documentation updates yet, but I will need to write up some tutorials to help users figure out how to run and debug it.


📚 Documentation preview 📚: https://garden-ai--651.org.readthedocs.build/en/651/


codecov bot commented Dec 10, 2025

Codecov Report

❌ Patch coverage is 1.85676% with 740 lines in your changes missing coverage. Please review.
✅ Project coverage is 28.81%. Comparing base (0489a92) to head (964c4f1).
⚠️ Report is 1 commit behind head on main.

Files with missing lines                                 Patch %   Lines
garden_ai/benchmarks/matbench_discovery/tasks.py           0.00%   684 Missing ⚠️
garden_ai/benchmarks/matbench_discovery/enums.py           0.00%   29 Missing ⚠️
garden_ai/benchmarks/__init__.py                           0.00%   22 Missing ⚠️
...arden_ai/benchmarks/matbench_discovery/__init__.py      0.00%   3 Missing ⚠️
garden_ai/backend_client.py                               33.33%   2 Missing ⚠️

❗ There is a different number of reports uploaded between BASE (0489a92) and HEAD (964c4f1).

HEAD has 2 uploads less than BASE:
Flag    BASE (0489a92)    HEAD (964c4f1)
              4                 2
Additional details and impacted files
@@             Coverage Diff             @@
##             main     #651       +/-   ##
===========================================
- Coverage   39.30%   28.81%   -10.49%     
===========================================
  Files          30       35        +5     
  Lines        1743     2513      +770     
===========================================
+ Hits          685      724       +39     
- Misses       1058     1789      +731     

☔ View full report in Codecov by Sentry.

