Neighbor list benchmark #452

janosh · 2026-02-12T01:35:21Z

janosh
Feb 12, 2026
Maintainer

posting this here for later reference and as a prompt in case others have related/contradictory results worth sharing: NL benchmark comparing matscipy, vesin, ase, alchemi on 1, 10, 100, 1000 structure relaxations using NequIP-OAM-L + LBFGS running on a single H200 with InFlightAutoBatcher, f_max=5e-3, max_steps=1000

NL	n systems	jobid	status	model_load_s	opt_s	total_s	converged
matscipy	1	212954	OK	25.84	3.41	30.75	1
matscipy	10	212955	OK	26.82	4.22	32.71	10
matscipy	100	212956	OK	27.14	37.42	66.92	95
matscipy	1000	212957	OK	27.69	463.25	501.64	903
vesin	1	212958	OK	26.97	3.45	31.66	1
vesin	10	212959	OK	25.67	4.31	31.33	10
vesin	100	212960	OK	27.02	35.48	64.66	96
vesin	1000	212961	OK	26.68	431.99	468.48	903
ase	1	212962	OK	26.74	3.52	31.48	1
ase	10	212963	OK	26.39	4.89	32.55	10
ase	100	212964	OK	25.78	127.12	154.89	95
ase	1000	212965	TIMEOUT / missing metrics	—	—	—	—
alchemi	1	212966	OK	25.44	3.39	30.29	1
alchemi	10	212967	OK	25.52	4.30	31.32	10
alchemi	100	212968	OK	25.40	38.33	65.78	95
alchemi	1000	212969	OK	25.28	443.90	479.78	902

main surprise was that nvalchemi-toolkit-ops==0.2.0 didn't provide a speedup even though it's the (only?) GPU-compatible batched neighbor list implementation. vesin beats it on opt_s at 100 and 1000 structures (at 1 and 10 the two are within 0.1 s of each other) and matscipy is not far behind. maybe i didn't use nvalchemi-toolkit-ops==0.2.0 correctly but i thought there's no setup. ase is the main outlier, a lot slower than the other 3

nikitafedik · 2026-02-13T17:22:32Z

nikitafedik
Feb 13, 2026

Hey @janosh! I am a technical marketing engineer with ALCHEMI. These numbers for NL in Toolkit-Ops are not what we expected, and I would like to take a closer look if possible. Could you please share the structures you are testing on and, if possible, the benchmark suite or script? Thanks a lot

1 reply

janosh Feb 14, 2026
Maintainer Author

i can't share the structures i'm running but here's a script that downloads the WBM structures via matbench_discovery and should produce the same results (did not verify) since the structure sizes are similar. you can run it directly with uv (deps embedded in script metadata):

uv run --script nl_benchmark.py --source wbm --n-structures 1000 --seed 0 --nl-backend matscipy --max-steps 1000 --f-max 5e-3 --device cuda
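
to reproduce the whole table rather than a single row, the same command can be generated for every backend and structure count. a minimal sketch, assuming the script is saved as nl_benchmark.py in the working directory; the helper name build_sweep_commands is made up for illustration and it only uses flags the script defines:

```python
import shlex

BACKENDS = ("matscipy", "vesin", "ase", "alchemi")
N_STRUCTURES = (1, 10, 100, 1000)


def build_sweep_commands(script: str = "nl_benchmark.py") -> list[list[str]]:
    """Return one argv list per (backend, n_structures) combination."""
    commands = []
    for backend in BACKENDS:
        for n_structures in N_STRUCTURES:
            commands.append([
                "uv", "run", "--script", script,
                "--source", "wbm",
                "--n-structures", str(n_structures),
                "--seed", "0",
                "--nl-backend", backend,
                "--max-steps", "1000",
                "--f-max", "5e-3",
                "--device", "cuda",
            ])
    return commands


if __name__ == "__main__":
    # Print the commands instead of running them, so each can go to its own job.
    for argv in build_sweep_commands():
        print(shlex.join(argv))
```

each printed line can be run directly or submitted as a separate job, which is how the rows in the table map to job IDs.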

nl_benchmark.py

# /// script
# requires-python = ">=3.11"
# dependencies = [
#   "ase",
#   "ferrox",
#   "matbench-discovery",
#   "mp-api",
#   "nequip",
#   "numpy",
#   "openequivariance",
#   "pymatgen",
#   "torch",
#   "torch-sim",
# ]
# ///
"""Neighbor-list benchmark using random MP or WBM structures.

Example:
    uv run python scripts/geo_opt_pipeline/nl_benchmark_public.py \
      --source wbm --n-structures 100 --nl-backend matscipy --device cuda
"""

from __future__ import annotations

import argparse
import json
import os
import random
import subprocess
import sys
import time
from pathlib import Path
from typing import Any

import numpy as np

NEQUIP_MODEL_ID = "mir-group/NequIP-OAM-L:0.1"
VALID_NL_BACKENDS = ("matscipy", "vesin", "ase", "alchemi")


def parse_args() -> argparse.Namespace:
    """Parse CLI args."""
    parser = argparse.ArgumentParser(
        description=(
            "Benchmark torch-sim neighbor-list backends on random public structures."
        )
    )
    parser.add_argument(
        "--source",
        choices=("mp", "wbm"),
        required=True,
        help="Public structure source: mp (Materials Project) or wbm (Matbench Discovery).",
    )
    parser.add_argument(
        "--n-structures",
        type=int,
        default=100,
        help="How many random structures to benchmark.",
    )
    parser.add_argument(
        "--seed",
        type=int,
        default=0,
        help="Random seed for reproducible sampling.",
    )
    parser.add_argument(
        "--nl-backend",
        choices=VALID_NL_BACKENDS,
        default="matscipy",
        help="Neighbor-list backend passed to NequIPFrameworkModel.",
    )
    parser.add_argument(
        "--max-steps",
        type=int,
        default=1000,
        help="Maximum LBFGS optimization steps.",
    )
    parser.add_argument(
        "--f-max",
        "--fmax",
        dest="f_max",
        type=float,
        default=5e-3,
        help="Force convergence threshold.",
    )
    parser.add_argument(
        "--device",
        default="cuda",
        help='Torch device, e.g. "cuda" or "cpu".',
    )
    parser.add_argument(
        "--dtype",
        choices=("float32", "float64"),
        default="float64",
        help="Torch dtype for simulation tensors.",
    )
    parser.add_argument(
        "--max-memory-scaler",
        type=int,
        default=None,
        help="Optional InFlightAutoBatcher max_memory_scaler.",
    )
    parsed_args = parser.parse_args()
    if parsed_args.n_structures <= 0:
        parser.error("--n-structures must be > 0")
    if parsed_args.max_steps <= 0:
        parser.error("--max-steps must be > 0")
    if parsed_args.f_max <= 0:
        parser.error("--f-max must be > 0")
    return parsed_args


def compile_nequip_model(
    model_id: str = NEQUIP_MODEL_ID,
    device: str = "cuda",
) -> Path:
    """Compile (and cache) the NequIP model used by torch-sim."""
    compiled_dir = Path.home() / ".nequip" / "compiled"
    model_slug = model_id.split("/")[-1].replace(":", "-")
    device_slug = device.replace(":", "-")
    compiled_path = compiled_dir / f"{model_slug}.{device_slug}.oeq.nequip.pth"
    if compiled_path.exists():
        return compiled_path

    compiled_dir.mkdir(parents=True, exist_ok=True)
    compile_cmd = [
        sys.executable,
        "-m",
        "nequip.scripts.compile",
        f"nequip.net:{model_id}",
        str(compiled_path),
        "--mode",
        "torchscript",
        "--device",
        device,
        "--target",
        "batch",
        "--modifiers",
        "enable_OpenEquivariance",
    ]
    env = os.environ.copy()
    venv_bin = str(Path(sys.executable).parent)
    env["PATH"] = f"{venv_bin}:{env.get('PATH', '')}"
    result = subprocess.run(
        compile_cmd,
        capture_output=True,
        text=True,
        env=env,
        check=False,
    )
    if result.returncode != 0:
        raise RuntimeError(
            f"NequIP compile failed (exit={result.returncode}).\n"
            f"stdout:\n{result.stdout}\n\nstderr:\n{result.stderr}"
        )
    return compiled_path


def _sample_mp_structures(n_structures: int, seed: int) -> list[dict[str, Any]]:
    """Fetch random MP structures as pymatgen dicts."""
    from mp_api.client import MPRester

    if not os.environ.get("MP_API_KEY"):
        raise RuntimeError("MP_API_KEY is required for --source mp")

    with MPRester() as mpr:
        # Reservoir sampling gives an unbiased sample over the full result stream
        # without loading all IDs into memory.
        sampled_material_ids: list[str] = []
        py_rng = random.Random(seed)
        id_docs = mpr.summary.search(
            fields=["material_id"],
            all_fields=False,
            chunk_size=2_000,
        )
        for stream_idx, doc in enumerate(id_docs):
            material_id = str(doc.material_id)
            if stream_idx < n_structures:
                sampled_material_ids.append(material_id)
                continue
            replacement_idx = py_rng.randint(0, stream_idx)
            if replacement_idx < n_structures:
                sampled_material_ids[replacement_idx] = material_id

        if len(sampled_material_ids) < n_structures:
            raise RuntimeError(
                f"Requested {n_structures} structures but only found "
                f"{len(sampled_material_ids)} in MP."
            )

        structure_docs = mpr.summary.search(
            material_ids=sampled_material_ids,
            fields=["material_id", "structure"],
            all_fields=False,
        )

    structure_by_id: dict[str, dict[str, Any]] = {}
    for doc in structure_docs:
        structure_by_id[str(doc.material_id)] = doc.structure.as_dict()

    missing_ids = [
        material_id
        for material_id in sampled_material_ids
        if material_id not in structure_by_id
    ]
    if missing_ids:
        raise RuntimeError(
            f"Failed to fetch structures for {len(missing_ids)} sampled MP IDs."
        )

    return [structure_by_id[material_id] for material_id in sampled_material_ids]


def _sample_wbm_structures(n_structures: int, seed: int) -> list[dict[str, Any]]:
    """Fetch random WBM structures as pymatgen dicts.

    Uses ``matbench_discovery.data.ase_atoms_from_zip`` to match upstream WBM loading.
    """
    from matbench_discovery.data import DataFiles, ase_atoms_from_zip
    from pymatgen.io.ase import AseAtomsAdaptor

    wbm_zip_path = DataFiles.wbm_initial_atoms.path
    all_atoms = ase_atoms_from_zip(wbm_zip_path)
    if n_structures > len(all_atoms):
        raise RuntimeError(
            f"Requested {n_structures} structures, but WBM has {len(all_atoms)}"
        )

    np_rng = np.random.default_rng(seed=seed)
    sampled_indices = np_rng.choice(len(all_atoms), size=n_structures, replace=False)
    sampled_atoms = [all_atoms[int(idx)] for idx in sampled_indices.tolist()]
    adaptor = AseAtomsAdaptor()
    return [adaptor.get_structure(atoms).as_dict() for atoms in sampled_atoms]


def load_public_structures(
    source: str,
    n_structures: int,
    seed: int,
) -> list[dict[str, Any]]:
    """Load random structures from the requested public source."""
    if source == "mp":
        return _sample_mp_structures(n_structures=n_structures, seed=seed)
    if source == "wbm":
        return _sample_wbm_structures(n_structures=n_structures, seed=seed)
    raise ValueError(f"Unsupported source: {source}")


def _configure_torch_serialization_compat() -> None:
    """Configure safe globals for older e3nn constants on newer torch.

    PyTorch 2.6 switched ``torch.load(..., weights_only=True)`` by default. Some
    e3nn releases still load ``constants.pt`` with Python ``slice`` objects. This
    allowlist preserves secure loading while restoring compatibility.
    """
    import torch

    add_safe_globals = getattr(torch.serialization, "add_safe_globals", None)
    if callable(add_safe_globals):
        add_safe_globals([slice])


def run_benchmark(args: argparse.Namespace) -> dict[str, Any]:
    """Run one benchmark and return compact metrics."""
    import ferrox
    import torch
    import torch_sim as ts
    from torch_sim.autobatching import InFlightAutoBatcher

    _configure_torch_serialization_compat()
    try:
        from torch_sim.models.nequip_framework import NequIPFrameworkModel
    except Exception as exc:
        raise RuntimeError(
            "Failed to import NequIP/torch-sim dependencies. "
            "This usually means a local torch/torchvision/torchmetrics/nequip "
            "version mismatch in the active environment."
        ) from exc

    wall_start = time.perf_counter()

    structures = load_public_structures(
        source=args.source,
        n_structures=args.n_structures,
        seed=args.seed,
    )
    json_structures = [json.dumps(structure) for structure in structures]
    batch = ferrox.io.structures_to_torch_sim_state(json_structures)

    torch_dtype = torch.float64 if args.dtype == "float64" else torch.float32
    sim_state = ts.SimState(
        positions=torch.tensor(
            batch["positions"], dtype=torch_dtype, device=args.device
        ),
        masses=torch.tensor(batch["masses"], dtype=torch_dtype, device=args.device),
        cell=torch.tensor(batch["cell"], dtype=torch_dtype, device=args.device),
        pbc=True,
        atomic_numbers=torch.tensor(
            batch["atomic_numbers"], dtype=torch.int, device=args.device
        ),
        system_idx=torch.tensor(
            batch["system_idx"], dtype=torch.long, device=args.device
        ),
    )

    model_load_start = time.perf_counter()
    compiled_path = compile_nequip_model(model_id=NEQUIP_MODEL_ID, device=args.device)
    model = NequIPFrameworkModel.from_compiled_model(
        compile_path=str(compiled_path),
        device=args.device,
        chemical_species_to_atom_type_map=True,
        neighbor_list_backend=args.nl_backend,
    )
    model_load_s = time.perf_counter() - model_load_start

    opt_start = time.perf_counter()
    convergence_fn = ts.runners.generate_force_convergence_fn(force_tol=args.f_max)
    autobatcher = InFlightAutoBatcher(
        model=model,
        memory_scales_with="n_atoms_x_density",
        max_memory_scaler=args.max_memory_scaler,
    )

    final_state = ts.optimize(
        system=sim_state,
        model=model,
        optimizer=ts.Optimizer.lbfgs,
        max_steps=args.max_steps,
        convergence_fn=convergence_fn,
        steps_between_swaps=5,
        autobatcher=autobatcher,
        init_kwargs={"cell_filter": ts.CellFilter.frechet},
    )
    opt_s = time.perf_counter() - opt_start

    final_states = (
        final_state.split() if isinstance(final_state, ts.SimState) else final_state
    )

    converged_count = 0
    for single_state in final_states:
        forces = model(single_state)["forces"]
        f_max_value = float(torch.linalg.norm(forces, dim=1).max().item())
        if f_max_value <= args.f_max:
            converged_count += 1

    measured_count = len(final_states)
    total_s = time.perf_counter() - wall_start
    throughput_struct_per_min = (
        (measured_count / total_s) * 60.0 if total_s > 0 else 0.0
    )

    return {
        "source": args.source,
        "nl_backend": args.nl_backend,
        "n_structures": measured_count,
        "seed": args.seed,
        "device": args.device,
        "dtype": args.dtype,
        "f_max": args.f_max,
        "max_steps": args.max_steps,
        "model_load_s": round(model_load_s, 3),
        "opt_s": round(opt_s, 3),
        "total_s": round(total_s, 3),
        "converged": converged_count,
        "throughput_struct_per_min": round(throughput_struct_per_min, 2),
    }


if __name__ == "__main__":
    print(json.dumps(run_benchmark(parse_args()), indent=2))
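
side note on the sampling loop in _sample_mp_structures: it is standard reservoir sampling (Algorithm R). here is a self-contained sketch of the same loop over a plain iterable, with no MP API involved. the function name reservoir_sample and the fake "mp-<i>" IDs are placeholders for illustration, not part of the script:

```python
from __future__ import annotations

import random
from collections.abc import Iterable
from typing import TypeVar

T = TypeVar("T")


def reservoir_sample(stream: Iterable[T], k: int, seed: int = 0) -> list[T]:
    """Keep a uniform random sample of k items from a stream of unknown length."""
    rng = random.Random(seed)
    reservoir: list[T] = []
    for idx, item in enumerate(stream):
        if idx < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
            continue
        # Item idx replaces a random slot with probability k / (idx + 1).
        slot = rng.randint(0, idx)  # inclusive on both ends
        if slot < k:
            reservoir[slot] = item
    return reservoir


# Placeholder IDs standing in for the streamed material IDs.
sample = reservoir_sample((f"mp-{i}" for i in range(10_000)), k=5, seed=0)
print(sample)
```

the stream is consumed once and only k items are held at any time, which is why the script can sample without materializing every ID first.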

Jussmith01 · 2026-02-23T21:29:54Z

Jussmith01
Feb 23, 2026

Hi @janosh — I've been investigating this and have some findings to share, but first a quick question: what version of NequIP were you running for this benchmark?

With nequip==0.16.2 (current PyPI release and main branch), passing --nl-backend alchemi crashes on the first forward pass with an UnboundLocalError in nequip/data/_nl.py — the _nl_fn function only has branches for "ase", "matscipy", and "vesin", with no handling for "alchemi". Were you running from a development branch that added ALCHEMI support there?

Want to make sure we're looking at the same code before posting our full analysis.

3 replies

janosh Feb 26, 2026
Maintainer Author

@Jussmith01 thanks for taking a look! here's some more info about my local env:

nequip==0.16.2
nequip.data._nl._nl_fn does include an NL == "alchemi" branch in my uv venv (edit: it's been 2 weeks but i vaguely remember adding it manually. file timestamps show nequip/data/_nl.py mtime: 2026-02-12 01:09:25 UTC but nequip-0.16.2.dist-info/RECORD mtime: 2026-02-11 19:27:54 UTC, so the file was modified about 6 hours after uv install ran, that might be the culprit)

i added debug instrumentation into nequip.data._nl._nl_fn and printed selected backend on each call. logs:

[DEBUG_NL] call=1 backend=alchemi
...
[DEBUG_NL] call=370 backend=alchemi
{
  "source": "wbm",
  "nl_backend": "alchemi",
  "n_structures": 10,
  "seed": 0,
  "device": "cuda",
  "dtype": "float64",
  "f_max": 0.005,
  "max_steps": 1000,
  "model_load_s": 21.686,
  "opt_s": 5.739,
  "total_s": 63.865,
  "converged": 10,
  "throughput_struct_per_min": 9.39
}

i also reran the script i posted above since it relaxes different structures than the table i posted initially. opt_s is now lower for every NL at 100 and 1000 structures because the structures are smaller:

NL	n systems	jobid	status	model_load_s	opt_s	total_s	converged
matscipy	1	247057	OK	21.735	4.861	62.589	1
matscipy	10	247058	OK	20.847	5.498	63.231	10
matscipy	100	247059	OK	22.262	19.749	79.866	100
matscipy	1000	247060	OK	21.378	152.955	228.334	1000
vesin	1	247061	OK	21.948	4.889	62.536	1
vesin	10	247062	OK	21.432	5.529	63.198	10
vesin	100	247063	OK	21.381	20.384	79.611	100
vesin	1000	247064	OK	19.725	158.202	235.973	999
ase	1	247065	OK	19.811	6.242	66.461	1
ase	10	247066	OK	19.722	8.224	68.563	10
ase	100	247067	OK	19.778	41.873	104.216	100
ase	1000	247068	OK	19.722	354.778	436.451	1000
alchemi	1	247069	OK	19.705	5.979	66.083	1
alchemi	10	247070	OK	19.642	6.885	67.108	10
alchemi	100	247071	OK	19.721	24.726	86.628	100
alchemi	1000	247072	OK	24.653	186.862	269.095	1000

should i be adding debug logs at a different level? not seeing any UnboundLocalError in _nl.py

Jussmith01 Feb 26, 2026

Thanks @janosh for the detailed follow-up. Based on the information you've shared, we believe the ALCHEMI numbers in your benchmark may not be using ALCHEMI's batched GPU neighbor list correctly — though we can't be 100% certain without seeing your exact patch. Here's our reasoning.

What we think happened

You mentioned manually adding an elif NL == "alchemi": branch inside nequip.data._nl._nl_fn. If that's the case, there are two structural issues in stock NequIP's NL path that would prevent ALCHEMI's batched GPU kernel from working as intended:

1. Per-system loop. _nl_fn is called from compute_neighborlist_, which loops over frames one at a time:

# nequip/data/_nl.py, compute_neighborlist_()
for idx in range(AtomicDataDict.num_frames(data)):
    data_per_frame = AtomicDataDict.frame_from_batched(data, idx)
    ...
    edge_index, edge_cell_shift = _nl_fn(
        pos=data_per_frame[AtomicDataDict.POSITIONS_KEY],
        ...
    )

If ALCHEMI is called inside _nl_fn, it would receive a single system's positions on each call rather than the full batch, defeating the purpose of a batched GPU kernel.

2. CPU conversion before dispatch. _nl_fn converts positions to CPU numpy on line 75 (pos.detach().cpu().numpy()) before reaching any backend branch. If the ALCHEMI branch was added below that line, the data would already be off GPU.

Your debug logs are consistent with per-system dispatch: 370 calls for 10 structures = ~37 NL calls per structure, which is what you'd expect from the per-frame loop running once per optimization step. That said, if your patch was structured differently (e.g., short-circuiting before the numpy conversion, or patching at the compute_neighborlist_ level instead), these issues might not apply.
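
To make the difference between per-frame and batched dispatch concrete, here is a toy sketch in pure Python. It uses a brute-force, non-periodic neighbor search and is not NequIP's or ALCHEMI's code; all function names are made up for illustration. Both paths return the same edges, but the per-frame path calls the neighbor routine once per system, while the batched path makes a single call and relies on system_idx to keep systems apart:

```python
import math

Position = tuple[float, float, float]
Edge = tuple[int, int]


def brute_force_edges(
    positions: list[Position], system_idx: list[int], cutoff: float
) -> list[Edge]:
    """One 'kernel call': all ordered pairs within cutoff in the same system."""
    edges = []
    for i, pos_i in enumerate(positions):
        for j, pos_j in enumerate(positions):
            if i == j or system_idx[i] != system_idx[j]:
                continue
            if math.dist(pos_i, pos_j) <= cutoff:
                edges.append((i, j))
    return edges


def per_frame_edges(positions, system_idx, cutoff):
    """Mimic a per-frame loop: one call per system, then map back to global ids."""
    edges, n_calls = [], 0
    for system in sorted(set(system_idx)):
        atom_ids = [i for i, s in enumerate(system_idx) if s == system]
        local = brute_force_edges(
            [positions[i] for i in atom_ids], [system] * len(atom_ids), cutoff
        )
        n_calls += 1
        edges.extend((atom_ids[i], atom_ids[j]) for i, j in local)
    return edges, n_calls


def batched_edges(positions, system_idx, cutoff):
    """Mimic batched dispatch: a single call over the atoms of all systems."""
    return brute_force_edges(positions, system_idx, cutoff), 1


# Two tiny systems that overlap in space but must not share edges.
positions = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.5, 0.0, 0.0), (0.5, 1.0, 0.0)]
system_idx = [0, 0, 1, 1]

loop_edges, loop_calls = per_frame_edges(positions, system_idx, cutoff=1.5)
batch_edges, batch_calls = batched_edges(positions, system_idx, cutoff=1.5)
print(sorted(loop_edges) == sorted(batch_edges), loop_calls, batch_calls)
# prints: True 2 1
```

Under the reading above, a batch of 10 structures over roughly 37 steps gives the 370 per-frame calls seen in the debug log, whereas a batched path would need only about 37.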

Temporary fix to use ALCHEMI correctly with NequIP

The fix is to replace NeighborListTransform entirely rather than patching _nl_fn. Here's a minimal drop-in:

import torch
from nequip.data import AtomicDataDict
from nequip.data.transforms.neighborlist import (
    NeighborListTransform,
    SortedNeighborListTransform,
)
from torch_sim.neighbors import alchemiops_nl_cell_list


class AlchemiNLTransform(torch.nn.Module):
    """Drop-in replacement for NeighborListTransform that calls
    ALCHEMI's batched GPU kernel on the full batch in one shot."""

    def __init__(self, r_max: float):
        super().__init__()
        self.r_max = r_max

    def forward(self, data: AtomicDataDict.Type) -> AtomicDataDict.Type:
        positions = data[AtomicDataDict.POSITIONS_KEY]
        cell = data[AtomicDataDict.CELL_KEY]
        pbc = data[AtomicDataDict.PBC_KEY]
        system_idx = data[AtomicDataDict.BATCH_KEY]
        cutoff = torch.tensor(
            self.r_max, dtype=positions.dtype, device=positions.device
        )

        mapping, _sys_map, shifts_idx = alchemiops_nl_cell_list(
            positions=positions,
            cell=cell,
            pbc=pbc,
            cutoff=cutoff,
            system_idx=system_idx,
        )

        data[AtomicDataDict.EDGE_INDEX_KEY] = mapping
        data[AtomicDataDict.EDGE_CELL_SHIFT_KEY] = shifts_idx
        return data


def patch_nequip_with_alchemi(model):
    """Swap NeighborListTransform for AlchemiNLTransform in-place."""
    for i, transform in enumerate(model.transforms):
        if isinstance(transform, (NeighborListTransform, SortedNeighborListTransform)):
            model.transforms[i] = AlchemiNLTransform(
                r_max=transform.r_max
            ).to(model._device)
            return
    raise RuntimeError("NeighborListTransform not found in model.transforms")

Usage:

model = NequIPFrameworkModel.from_compiled_model(
    "path/to/model.nequip",
    device="cuda",
    neighbor_list_backend="matscipy",  # initial backend doesn't matter
)
patch_nequip_with_alchemi(model)
# NL now runs as a single batched GPU kernel on the full batch
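
As a quick sanity check that the swap took effect, something like the following could be used. This is a small sketch: the helper name uses_alchemi_nl is hypothetical, and it relies only on the model.transforms attribute already used in patch_nequip_with_alchemi.

```python
def uses_alchemi_nl(model) -> bool:
    """Return True if any transform on the model is an AlchemiNLTransform."""
    # Compare by class name so the check works without importing the class.
    return any(
        type(transform).__name__ == "AlchemiNLTransform"
        for transform in model.transforms
    )
```

Calling uses_alchemi_nl(model) after patch_nequip_with_alchemi(model) should return True; if it returns False, the model is still on its original neighbor-list transform.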

What we see with correct batching

In our testing with NequIP-OAM-L on an RTX 6000 Ada, we see up to ~1.7x end-to-end speedup on the full inference call (NL build + energy + autograd forces) when using ALCHEMI's batched GPU NL versus CPU alternatives, on large batches of small structures where the GPU is fully saturated.

If our diagnosis is correct, the reason your benchmark didn't see this is that the _nl_fn patch was calling ALCHEMI one structure at a time, paying kernel launch overhead on every call instead of amortizing it across the batch. Let me know if this helps.

janosh Feb 26, 2026
Maintainer Author

thanks so much, excellent troubleshooting! i tried the AlchemiNLTransform snippet you posted and it worked like a charm! here are the new wall times which look like GPU-batched neighbor list construction is now in effect:

NL	n systems	jobid	status	model_load_s	opt_s	total_s	converged
alchemi (patched)	1	247125	OK	21.641	4.889	62.670	1
alchemi (patched)	10	247126	OK	21.695	5.389	63.407	10
alchemi (patched)	100	247127	OK	21.607	18.382	77.951	100
alchemi (patched)	1000	247128	OK	21.565	132.705	208.402	1000

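at n=1000 structures, alchemi's opt_s is down from 186.862 s to 132.705 s (~29% less, ~1.4x faster) and total_s from 269.095 s to 208.402 s. it's now also ahead of matscipy (152.955 s) and vesin (158.202 s) on opt_s. very nice speedup!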
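
for reference, here is how the n=1000 opt_s values from the rerun table and the patched run compare. this is plain arithmetic on the numbers posted above, nothing re-measured, and each value is a single run:

```python
# opt_s at n=1000, copied from the rerun table and the patched run above.
opt_s_1000 = {
    "matscipy": 152.955,
    "vesin": 158.202,
    "ase": 354.778,
    "alchemi": 186.862,
    "alchemi (patched)": 132.705,
}

patched = opt_s_1000["alchemi (patched)"]
# Ratio > 1 means that backend took longer than the patched alchemi run.
speedups = {name: secs / patched for name, secs in opt_s_1000.items()}

for name, ratio in speedups.items():
    print(f"{name:>18}: {opt_s_1000[name]:8.3f} s  ({ratio:.2f}x patched alchemi)")
```

the unpatched alchemi run comes out at about 1.41x the patched time, matscipy and vesin at roughly 1.15x and 1.19x, and ase at about 2.67x.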

@Jussmith01 how preliminary is your proposed AlchemiNLTransform? too early to PR over at https://github.com/mir-group/nequip?
