
OpenMM with CUDA support using GPU #53

@yasirkhanqu

Hi A3FE team,

I have been working with A3FE for quite a while now and have already opened two issues, i.e., https://github.com/michellab/a3fe/issues/33 and https://github.com/michellab/a3fe/issues/45, which describe my system in detail. Briefly, I have a system with an enzyme, a sugar acceptor (n=3), a donor substrate (UDP-Galf, carrying a -2 charge), and a metal ion (Mn²⁺) in the active site of the enzyme. We are interested in the ABFE of UDP-Galf (LIG in the system; dually coordinating with the metal ion). Among all the problems we encountered, one was common to all of them: the very slow performance of OpenMM on the cluster.

For reference, look at the relative simulation cost values and discussion here:

  1. https://github.com/michellab/a3fe/issues/45#issuecomment-2825005315
  2. https://github.com/michellab/a3fe/issues/45#issuecomment-2824618493

In all the subsequent discussions on those two issues, @Roy-Haolin-Du and @fjclark were kind enough to follow every detail and help sort things out, until we arrived at this remaining issue: the slow performance of OpenMM on the cluster.

I checked my OpenMM installation, and it reports the following:

(a3fe) o_ali@sy129:~$ python -m openmm.testInstallation

OpenMM Version: 8.2
Git Revision: 53770948682c40bd460b39830d4e0f0fd3a4b868

There are 3 Platforms available:

1 Reference - Successfully computed forces
2 CPU - Successfully computed forces
3 CUDA - Successfully computed forces

Median difference in forces between platforms:

Reference vs. CPU: 6.31069e-06
Reference vs. CUDA: 6.74498e-06
CPU vs. CUDA: 7.114e-07

All differences are within tolerance.
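
In case it helps with diagnosis, I can also query the platforms and the CUDA defaults directly from Python. The snippet below is only a rough sketch using the standard OpenMM Platform API (nothing a3fe-specific); it just lists the available platforms and the CUDA platform's default properties such as DeviceIndex and Precision, to rule out an accidental fallback to the CPU or Reference platform.

# Sketch: list the platforms OpenMM can see and the CUDA platform's
# default properties, to rule out a silent fallback to CPU/Reference.
import openmm as mm

for i in range(mm.Platform.getNumPlatforms()):
    p = mm.Platform.getPlatform(i)
    print(f"{p.getName()} (nominal speed {p.getSpeed()})")

cuda = mm.Platform.getPlatformByName("CUDA")
for name in cuda.getPropertyNames():
    print(f"  {name} = {cuda.getPropertyDefaultValue(name)!r}")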

When I run nvidia-smi on one of our university's compute nodes, it shows the following (note that this was not captured while running ABFE via a3fe; it is just for reference):

o_ali@sy129:~$ nvidia-smi
Wed Jun 18 17:58:02 2025
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.247.01             Driver Version: 535.216.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA H100 NVL                On  | 00000000:02:00.0 Off |                    0 |
| N/A   38C    P0             106W / 400W |    669MiB / 95830MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA H100 NVL                On  | 00000000:64:00.0 Off |                    0 |
| N/A   75C    P0             343W / 400W |   2845MiB / 95830MiB |     84%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA H100 NVL                On  | 00000000:82:00.0 Off |                    0 |
| N/A   64C    P0             285W / 400W |   1443MiB / 95830MiB |     95%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   3  NVIDIA H100 NVL                On  | 00000000:E3:00.0 Off |                    0 |
| N/A   64C    P0             297W / 400W |   1443MiB / 95830MiB |     95%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A   1414083      C   ...iabol/anaconda3/envs/kg/bin/python3      660MiB |
|    1   N/A  N/A   1060009      C   ...iabol/anaconda3/envs/kg/bin/python3     2836MiB |
|    2   N/A  N/A   2031809      C   pmemd.cuda                                 1434MiB |
|    3   N/A  N/A   2031802      C   pmemd.cuda                                 1434MiB |
+---------------------------------------------------------------------------------------+
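
Since the node has four GPUs shared with other jobs (pmemd.cuda is occupying GPUs 2 and 3 above), one thing I can check is whether each lambda window actually ends up pinned to its own free GPU instead of piling onto a busy one. I understand a3fe handles GPU assignment through its own submission machinery, so the sketch below is not the a3fe code path; it only illustrates, at the plain OpenMM level, how a context is pinned to a device via the DeviceIndex property and how to verify afterwards which device and precision it landed on (the two-particle system is just a placeholder).

# Sketch: pin an OpenMM context to a specific GPU and verify the device
# it actually landed on. The two-particle system is only a placeholder.
import openmm as mm
import openmm.unit as unit

system = mm.System()
system.addParticle(1.0 * unit.amu)
system.addParticle(1.0 * unit.amu)

integrator = mm.VerletIntegrator(1.0 * unit.femtoseconds)
platform = mm.Platform.getPlatformByName("CUDA")

# 'DeviceIndex' and 'Precision' are standard CUDA-platform properties.
properties = {"DeviceIndex": "0", "Precision": "mixed"}
context = mm.Context(system, integrator, platform, properties)
context.setPositions([mm.Vec3(0, 0, 0), mm.Vec3(0.5, 0, 0)])

print("Platform :", context.getPlatform().getName())
print("Device   :", platform.getPropertyValue(context, "DeviceIndex"))
print("Precision:", platform.getPropertyValue(context, "Precision"))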

Following the recommendations from @fjclark and @Roy-Haolin-Du, I ran a short simulation (100 ps per lambda window) to estimate the relative simulation cost, which came out at roughly 15-16. I then set that value manually for the full adaptive run, but it was still just as slow. This extreme slowness sometimes caused some of the simulations to crash, which is also why I have not yet completed any calculations for my system. :p On one occasion it even brought down an entire node with four GPUs.
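
To separate raw OpenMM/GPU throughput on this node from anything a3fe-specific, I could also time a standalone OpenMM run. The sketch below is only an assumed, reasonable test: it builds a throwaway Lennard-Jones box entirely in memory (particle count, cutoff, and step counts are arbitrary) and reports ns/day on the CPU and CUDA platforms. If CUDA is not dramatically faster than CPU here, the problem is presumably at the driver/GPU level rather than in a3fe.

# Sketch: a self-contained ns/day benchmark to compare CPU and CUDA on
# this node. All numbers are arbitrary; the goal is only to isolate
# engine/GPU throughput from a3fe overhead.
import itertools
import time
import openmm as mm
import openmm.unit as unit

def benchmark(platform_name, n_side=20, spacing=0.4, n_steps=2000):
    n_particles = n_side ** 3
    box = n_side * spacing  # nm
    system = mm.System()
    system.setDefaultPeriodicBoxVectors(
        mm.Vec3(box, 0, 0), mm.Vec3(0, box, 0), mm.Vec3(0, 0, box))
    nb = mm.NonbondedForce()
    nb.setNonbondedMethod(mm.NonbondedForce.CutoffPeriodic)
    nb.setCutoffDistance(1.0 * unit.nanometer)
    for _ in range(n_particles):
        system.addParticle(39.9 * unit.amu)  # argon-like LJ particle
        nb.addParticle(0.0, 0.34 * unit.nanometer,
                       1.0 * unit.kilojoule_per_mole)
    system.addForce(nb)
    integrator = mm.LangevinMiddleIntegrator(
        300 * unit.kelvin, 1.0 / unit.picosecond, 2.0 * unit.femtoseconds)
    platform = mm.Platform.getPlatformByName(platform_name)
    context = mm.Context(system, integrator, platform)
    # Place particles on a cubic lattice so there are no bad overlaps.
    positions = [mm.Vec3(i * spacing, j * spacing, k * spacing)
                 for i, j, k in itertools.product(range(n_side), repeat=3)]
    context.setPositions(positions)
    context.setVelocitiesToTemperature(300 * unit.kelvin)
    integrator.step(200)  # warm-up (kernel compilation etc.)
    start = time.perf_counter()
    integrator.step(n_steps)
    elapsed = time.perf_counter() - start
    ns_per_day = n_steps * 2e-6 / elapsed * 86400  # 2 fs per step
    print(f"{platform_name}: {ns_per_day:,.0f} ns/day "
          f"({n_particles} particles)")

for name in ("CPU", "CUDA"):
    benchmark(name)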

At this point, we are unsure what could be causing this slow performance. If you have run into something similar, or have any ideas, tuning tips, or diagnostic steps that helped in a comparable situation, we would appreciate hearing about them.

With thanks,

Yasir
