
Conversation

hholb (Member) commented Dec 10, 2025

Overview

This PR adds the first benchmarking feature to Garden, built on Matbench Discovery.

Discussion

Everything lives in a new garden_ai.benchmarks module. Here is an example script that runs the full Matbench Discovery benchmark on MACE (more examples in the garden_ai/benchmarks/matbench_discovery/examples/ folder):

from garden_ai.benchmarks.matbench_discovery import MatbenchDiscovery

ANVIL = "5aafb4c1-27b2-40d8-a038-a0277611868f"

def create_mace_model(device):
    from mace.calculators import mace_mp
    return mace_mp(model="medium-mpa-0", device=device, default_dtype="float64")


def main():
    results = MatbenchDiscovery.IS2RE.remote(
        endpoint=ANVIL,
        user_endpoint_config={
            "scheduler_options": "#SBATCH --gpus-per-node=4\n",
            "walltime": "05:00:00",
            "qos": "gpu",
            "partition": "gpu",
            "account": "cis250461-gpu",
            "cores_per_node": 16,
            "requirements": "",  # 'requirements' is required for Anvil endpoint
        },
        model_factory=create_mace_model,
        model_packages=[
            "mace-torch",
            "cuequivariance",
            "cuequivariance-torch",
            "cuequivariance-ops-torch-cu12",
        ],
        # this tells the remote runner where to save periodic checkpoints,
        # will resume a previous run if it finds an existing checkpoint there
        checkpoint_path="~/.garden/benchmarks/matbench_mace-torch_cuequivariance_full_20251208_115719_ed2e47af.json",
    )

    if "error" in results.get("metrics", {}):
        print(f"Error: {results['metrics']['error']}")
    else:
        print("Benchmark Results:", results)


if __name__ == "__main__":
    main()

The benchmark tasks are implemented as @hog.method()s on the MatbenchDiscovery class. This makes it easy to run them on remote sites through globus-compute, or to use the garden SDK directly on a system you already have access to via a .local() call. The downside is that tasks.py is pretty huge, since all of the logic for running the benchmark and calculating the metrics has to live in one file; the upside is that we don't need to add matbench_discovery as a dependency of the garden SDK, because groundhog installs it in the venv it creates to run the functions.
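For example, someone with direct access to a GPU machine could skip globus-compute entirely and run a task in place with .local(). Here is a minimal sketch based on the .remote() example above; I'm assuming .local() accepts the same model_factory / model_packages / checkpoint_path kwargs and simply omits the endpoint configuration:

from garden_ai.benchmarks.matbench_discovery import MatbenchDiscovery

def create_mace_model(device):
    from mace.calculators import mace_mp
    return mace_mp(model="medium-mpa-0", device=device, default_dtype="float64")

# Runs the IS2RE task in a locally-created venv instead of on a remote endpoint.
results = MatbenchDiscovery.IS2RE.local(
    model_factory=create_mace_model,
    model_packages=["mace-torch"],
    checkpoint_path="~/.garden/benchmarks/local_mace_is2re.json",  # hypothetical path
)
print(results)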

I implemented the full list of tasks, defined by the matbench_discovery.enums.Task enum:

class Task(LabelEnum):
    """Thermodynamic stability prediction task types."""
    RP2RE = "RP2RE", "relaxed prototype to relaxed energy"
    RS2RE = "RS2RE", "relaxed structure to relaxed energy"
    S2E = "S2E", "structure to energy"
    # S2RE is for models that learned a discrete version of PES like CGCNN+P
    S2RE = "S2RE", "structure to relaxed energy"
    S2EF = "S2EF", "structure to energy, force"
    S2EFS = "S2EFS", "structure to energy, force, stress"
    S2EFSM = "S2EFSM", "structure to energy, force, stress, magmoms"
    IP2E = "IP2E", "initial prototype to energy"
    IS2E = "IS2E", "initial structure to energy"
    # IS2RE is for models that learned a discrete version of PES like CGCNN+P
    IS2RE = "IS2RE", "initial structure to relaxed energy"

So in theory, any model that can be used as an ASE calculator can run the benchmark, no matter what it was trained to do. So far I have only tested MACE, SevenNet, Mattersim, and EquiformerV2.

The general idea is that you pass in a model_factory, a function that builds and returns a model instance, and model_packages, the list of Python packages the model factory needs to run. The tasks call these inside the venv to set up the model for benchmarking. It currently only supports models that are pip-installable, but it shouldn't be too hard to pull down and instantiate a model from git (or the future project formerly known as Graft).
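As a toy illustration of the factory pattern (and of the "any ASE calculator" point above), here is a sketch using ASE's built-in EMT calculator, which needs no extra pip packages; it isn't a model you would actually benchmark, and whether an empty model_packages list is accepted is an assumption:

def create_emt_model(device):
    # Any callable that returns an ASE Calculator works; EMT is CPU-only and ignores `device`.
    from ase.calculators.emt import EMT
    return EMT()

results = MatbenchDiscovery.IS2RE.local(
    model_factory=create_emt_model,
    model_packages=[],  # EMT ships with ase itself, so nothing extra to install (assumption)
)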

Since it takes ~20 GPU hours to run the full benchmark, I implemented a checkpoint/resume system that writes the calculated energies and the index of each processed structure to a JSON file in ~/.garden/benchmarks/ on the system running the benchmark. If you give a .submit(), .remote(), or .local() call a checkpoint_path kwarg, it looks there for an existing checkpoint file, figures out which structures have already been processed, and resumes from there. We print the checkpoint path to stdout when the job starts and also attach it to the future returned by a .submit() call, so you can grab the checkpoint path like this:

fut = MatbenchDiscovery.IS2RE.submit(...)
checkpoint = fut.checkpoint_path
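Passing that same path back into a later call picks the run up wherever the checkpoint left off:

# Resume the interrupted run: same task and model arguments, same checkpoint file.
fut = MatbenchDiscovery.IS2RE.submit(..., checkpoint_path=checkpoint)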

Metrics Calculations

I reuse the metric calculation functions matbench uses internally, but I had issues importing them directly from matbench, so I copied the implementation into the tasks.py file. We reproduce the key metrics from the official matbench leaderboard and add some of our own more 'meta' metrics. Here is an example blob of metrics from a MACE run:

TODO add some metrics when the running job finishes
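In the meantime, here is a rough numpy sketch of the kind of calculation behind the headline discovery metrics: threshold the predicted energy above the convex hull to classify each structure as stable or unstable, then score against DFT. This is only an illustration of the idea, not the implementation copied from matbench:

import numpy as np

def stability_metrics(e_above_hull_true, e_above_hull_pred, threshold=0.0):
    # Classify structures as stable if the energy above hull (eV/atom) is <= threshold.
    true_stable = e_above_hull_true <= threshold
    pred_stable = e_above_hull_pred <= threshold

    tp = np.sum(pred_stable & true_stable)
    fp = np.sum(pred_stable & ~true_stable)
    fn = np.sum(~pred_stable & true_stable)

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    mae = np.mean(np.abs(e_above_hull_pred - e_above_hull_true))

    return {"precision": precision, "recall": recall, "f1": f1, "mae": mae}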

Publishing results

Super users can publish results to the official Garden leaderboard using the publish_benchmark_result helper function. Regular users can call the function, but the backend will reject non-super users' requests.
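Roughly, a super user's workflow would look something like the sketch below; only the publish_benchmark_result name comes from this PR, so the import path and call signature here are guesses:

from garden_ai.benchmarks import publish_benchmark_result  # import path is an assumption

# 'results' is the dict returned by a completed benchmark run.
publish_benchmark_result(results)  # non-super users will get rejected by the backend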

Testing

Manually tested a bunch of times using the scripts in garden_ai/benchmarks/matbench_discovery/examples/

Documentation

No documentation updates yet, but I will need to write up some tutorials to help users figure out how to run and debug it.


📚 Documentation preview 📚: https://garden-ai--651.org.readthedocs.build/en/651/


codecov bot commented Dec 10, 2025

Codecov Report

❌ Patch coverage is 1.85676% with 740 lines in your changes missing coverage. Please review.
✅ Project coverage is 28.81%. Comparing base (0489a92) to head (964c4f1).
⚠️ Report is 1 commit behind head on main.

Files with missing lines                                 Patch %   Lines
garden_ai/benchmarks/matbench_discovery/tasks.py           0.00%   684 Missing ⚠️
garden_ai/benchmarks/matbench_discovery/enums.py           0.00%   29 Missing ⚠️
garden_ai/benchmarks/__init__.py                           0.00%   22 Missing ⚠️
...arden_ai/benchmarks/matbench_discovery/__init__.py      0.00%   3 Missing ⚠️
garden_ai/backend_client.py                               33.33%   2 Missing ⚠️

❗ There is a different number of reports uploaded between BASE (0489a92) and HEAD (964c4f1).

HEAD has 2 uploads less than BASE:
Flag    BASE (0489a92)    HEAD (964c4f1)
              4                 2
Additional details and impacted files
@@             Coverage Diff             @@
##             main     #651       +/-   ##
===========================================
- Coverage   39.30%   28.81%   -10.49%     
===========================================
  Files          30       35        +5     
  Lines        1743     2513      +770     
===========================================
+ Hits          685      724       +39     
- Misses       1058     1789      +731     

☔ View full report in Codecov by Sentry.

