Hi A3FE team,
I have been working with A3FE for quite a while now and have already opened two issues, https://github.com/michellab/a3fe/issues/33 and https://github.com/michellab/a3fe/issues/45, which describe my system in detail. Briefly, the system consists of an enzyme, a sugar acceptor (n=3), a donor substrate (UDP-Galf, carrying a -2 charge), and a metal ion (Mn²⁺) in the enzyme's active site. We are interested in the ABFE of UDP-Galf (LIG in the system; it is dually coordinated to the metal ion). Among all the problems we encountered, one was common to every run: the very slow performance of OpenMM on the cluster.
For reference, look at the relative simulation cost values and discussion here:
https://github.com/michellab/a3fe/issues/45#issuecomment-2825005315
https://github.com/michellab/a3fe/issues/45#issuecomment-2824618493
In all the subsequent discussions on those two issues, @Roy-Haolin-Du and @fjclark were kind enough to follow every detail and help sort things out, until we arrived at this remaining problem of slow OpenMM performance on the cluster.
I checked my OpenMM installation, and it reports the following:
(a3fe) o_ali@sy129:~$ python -m openmm.testInstallation
OpenMM Version: 8.2
Git Revision: 53770948682c40bd460b39830d4e0f0fd3a4b868
There are 3 Platforms available:
1 Reference - Successfully computed forces
2 CPU - Successfully computed forces
3 CUDA - Successfully computed forces
Median difference in forces between platforms:
Reference vs. CPU: 6.31069e-06
Reference vs. CUDA: 6.74498e-06
CPU vs. CUDA: 7.114e-07
All differences are within tolerance.
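Since openmm.testInstallation only checks force accuracy and says nothing about speed, one check we could still run on the node is to time a toy system on the CUDA and CPU platforms directly. The sketch below is a rough, hypothetical diagnostic (not a3fe code and not our production system): it builds a small periodic Lennard-Jones box, times a few thousand MD steps on each platform, and prints the CUDA properties (device index, device name, precision) that the context actually ends up using. The system size and parameters are arbitrary choices for illustration only.

```python
# Hypothetical OpenMM throughput check (not a3fe code); toy LJ box, arbitrary parameters.
import time
import openmm as mm
import openmm.unit as unit


def build_lj_box(n_per_side=12, spacing=0.4):
    """Build a small periodic box of Lennard-Jones 'argon-like' particles."""
    system = mm.System()
    nb = mm.NonbondedForce()
    nb.setNonbondedMethod(mm.NonbondedForce.CutoffPeriodic)
    nb.setCutoffDistance(1.0 * unit.nanometer)
    positions = []
    for i in range(n_per_side):
        for j in range(n_per_side):
            for k in range(n_per_side):
                system.addParticle(39.9 * unit.amu)
                nb.addParticle(0.0, 0.34 * unit.nanometer, 0.996 * unit.kilojoule_per_mole)
                positions.append(mm.Vec3(i, j, k) * spacing)
    box = n_per_side * spacing
    system.setDefaultPeriodicBoxVectors(
        mm.Vec3(box, 0, 0) * unit.nanometer,
        mm.Vec3(0, box, 0) * unit.nanometer,
        mm.Vec3(0, 0, box) * unit.nanometer,
    )
    system.addForce(nb)
    return system, positions


def time_platform(platform_name, n_steps=5000):
    system, positions = build_lj_box()
    integrator = mm.LangevinMiddleIntegrator(
        300 * unit.kelvin, 1.0 / unit.picosecond, 0.002 * unit.picoseconds
    )
    platform = mm.Platform.getPlatformByName(platform_name)
    context = mm.Context(system, integrator, platform)
    context.setPositions(positions * unit.nanometer)
    mm.LocalEnergyMinimizer.minimize(context)
    integrator.step(100)                      # warm up the kernels
    context.getState(getEnergy=True)          # force synchronisation before timing
    start = time.time()
    integrator.step(n_steps)
    context.getState(getEnergy=True)          # synchronise again before stopping the clock
    elapsed = time.time() - start
    ns_per_day = (n_steps * 0.002 / 1000) / elapsed * 86400
    print(f"{platform_name}: {elapsed:.2f} s for {n_steps} steps (~{ns_per_day:.1f} ns/day)")
    # Report the properties the platform actually used (DeviceIndex, Precision, ... for CUDA).
    for prop in platform.getPropertyNames():
        print(f"  {prop} = {platform.getPropertyValue(context, prop)}")


for name in ("CUDA", "CPU"):
    time_platform(name)
```

If the CUDA timing already looks poor here, or the context silently lands on an unexpected device, the problem is more likely with the node or driver setup than with a3fe's settings.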
Running nvidia-smi on one of our university's compute nodes gives the output below (note that this snapshot was not taken while we were running ABFE via a3fe; it's just for reference):
o_ali@sy129:~$ nvidia-smi
Wed Jun 18 17:58:02 2025
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.247.01 Driver Version: 535.216.03 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA H100 NVL On | 00000000:02:00.0 Off | 0 |
| N/A 38C P0 106W / 400W | 669MiB / 95830MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 1 NVIDIA H100 NVL On | 00000000:64:00.0 Off | 0 |
| N/A 75C P0 343W / 400W | 2845MiB / 95830MiB | 84% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 2 NVIDIA H100 NVL On | 00000000:82:00.0 Off | 0 |
| N/A 64C P0 285W / 400W | 1443MiB / 95830MiB | 95% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 3 NVIDIA H100 NVL On | 00000000:E3:00.0 Off | 0 |
| N/A 64C P0 297W / 400W | 1443MiB / 95830MiB | 95% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 N/A N/A 1414083 C ...iabol/anaconda3/envs/kg/bin/python3 660MiB |
| 1 N/A N/A 1060009 C ...iabol/anaconda3/envs/kg/bin/python3 2836MiB |
| 2 N/A N/A 2031809 C pmemd.cuda 1434MiB |
| 3 N/A N/A 2031802 C pmemd.cuda 1434MiB |
+---------------------------------------------------------------------------------------+
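Since the snapshot above shows other users' jobs (pmemd.cuda and another Python process) already occupying most of the GPUs, one thing worth checking inside the actual batch allocation is which devices the scheduler exposes to the a3fe job and which one OpenMM ends up on. The short sketch below is a hypothetical check, not a3fe code: it prints CUDA_VISIBLE_DEVICES, the platforms OpenMM can see, and the device/precision properties chosen when a trivial CUDA context is created.

```python
# Hypothetical diagnostic sketch (not a3fe code): run inside the same batch
# allocation as the ABFE jobs to see which GPUs the scheduler exposes and which
# device/precision OpenMM actually picks up there.
import os
import openmm as mm
import openmm.unit as unit

print("CUDA_VISIBLE_DEVICES =", os.environ.get("CUDA_VISIBLE_DEVICES", "<not set>"))

# Platforms OpenMM can see in this environment (CUDA may be missing inside a
# batch job if, for example, the submission script does not load the right modules).
for i in range(mm.Platform.getNumPlatforms()):
    print("Platform", i, ":", mm.Platform.getPlatform(i).getName())

# Create a trivial one-particle CUDA context just to see which device and
# precision OpenMM selects by default on this node.
try:
    system = mm.System()
    system.addParticle(1.0 * unit.amu)
    integrator = mm.VerletIntegrator(0.001 * unit.picoseconds)
    platform = mm.Platform.getPlatformByName("CUDA")
    context = mm.Context(system, integrator, platform)
    for prop in platform.getPropertyNames():
        print(prop, "=", platform.getPropertyValue(context, prop))
except Exception as exc:
    print("Could not create a CUDA context here:", exc)
```

If DeviceIndex points at a GPU that is already busy with another user's job, as in the snapshot above, contention alone could explain part of the slowdown.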
Following the recommendations from @fjclark and @Roy-Haolin-Du, I ran a short simulation (100 ps per lambda window) to estimate the relative simulation cost, which gave a value of roughly 15-16. I then set that value manually for the full adaptive run, but it was still similarly slow. This extreme slowness sometimes causes some of the simulations to crash, which is also why I have not yet completed any calculations for my system. :p On one occasion it even crashed an entire node with four GPUs.
At this point, we are unsure what is causing this slow performance. If you have run into something similar, or have any ideas, tuning tips, or diagnostic steps that helped you, we would appreciate hearing from you.
With thanks,
Yasir