Hi A3FE team,
I have been working with A3FE for quite a while now and have already opened two issues, https://github.com/michellab/a3fe/issues/33 and https://github.com/michellab/a3fe/issues/45, which describe my system in detail. Briefly, the system consists of an enzyme, a sugar acceptor (n=3), a donor substrate (UDP-Galf, carrying a -2 charge), and a metal ion (Mn²⁺) in the enzyme's active site. We are interested in the ABFE of UDP-Galf (LIG in the system; it is dually coordinated to the metal ion). Among all the problems we encountered, one was common to every run: the very slow performance of OpenMM on the cluster.
For reference, look at the relative simulation cost values and discussion here:
https://github.com/michellab/a3fe/issues/45#issuecomment-2825005315
https://github.com/michellab/a3fe/issues/45#issuecomment-2824618493
In all the subsequent discussions on those two issues, @Roy-Haolin-Du and @fjclark were kind enough to follow every detail and help sort things out, until we arrived at this remaining problem of slow OpenMM performance on the cluster.
I checked my OpenMM installation, and it reports the following:
(a3fe) o_ali@sy129:~$ python -m openmm.testInstallation
OpenMM Version: 8.2
Git Revision: 53770948682c40bd460b39830d4e0f0fd3a4b868
There are 3 Platforms available:
1 Reference - Successfully computed forces
2 CPU - Successfully computed forces
3 CUDA - Successfully computed forces
Median difference in forces between platforms:
Reference vs. CPU: 6.31069e-06
Reference vs. CUDA: 6.74498e-06
CPU vs. CUDA: 7.114e-07
All differences are within tolerance.
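Since openmm.testInstallation only checks force accuracy and says nothing about speed, one check we could still run on the node is to time a toy system on the CUDA and CPU platforms directly. The sketch below is a rough, hypothetical diagnostic (not a3fe code and not our production system): it builds a small periodic Lennard-Jones box, times a few thousand MD steps on each platform, and prints the CUDA properties (device index, device name, precision) that the context actually ends up using. The system size and parameters are arbitrary choices for illustration only.

```python
# Hypothetical OpenMM throughput check (not a3fe code); toy LJ box, arbitrary parameters.
import time
import openmm as mm
import openmm.unit as unit


def build_lj_box(n_per_side=12, spacing=0.4):
    """Build a small periodic box of Lennard-Jones 'argon-like' particles."""
    system = mm.System()
    nb = mm.NonbondedForce()
    nb.setNonbondedMethod(mm.NonbondedForce.CutoffPeriodic)
    nb.setCutoffDistance(1.0 * unit.nanometer)
    positions = []
    for i in range(n_per_side):
        for j in range(n_per_side):
            for k in range(n_per_side):
                system.addParticle(39.9 * unit.amu)
                nb.addParticle(0.0, 0.34 * unit.nanometer, 0.996 * unit.kilojoule_per_mole)
                positions.append(mm.Vec3(i, j, k) * spacing)
    box = n_per_side * spacing
    system.setDefaultPeriodicBoxVectors(
        mm.Vec3(box, 0, 0) * unit.nanometer,
        mm.Vec3(0, box, 0) * unit.nanometer,
        mm.Vec3(0, 0, box) * unit.nanometer,
    )
    system.addForce(nb)
    return system, positions


def time_platform(platform_name, n_steps=5000):
    system, positions = build_lj_box()
    integrator = mm.LangevinMiddleIntegrator(
        300 * unit.kelvin, 1.0 / unit.picosecond, 0.002 * unit.picoseconds
    )
    platform = mm.Platform.getPlatformByName(platform_name)
    context = mm.Context(system, integrator, platform)
    context.setPositions(positions * unit.nanometer)
    mm.LocalEnergyMinimizer.minimize(context)
    integrator.step(100)                      # warm up the kernels
    context.getState(getEnergy=True)          # force synchronisation before timing
    start = time.time()
    integrator.step(n_steps)
    context.getState(getEnergy=True)          # synchronise again before stopping the clock
    elapsed = time.time() - start
    ns_per_day = (n_steps * 0.002 / 1000) / elapsed * 86400
    print(f"{platform_name}: {elapsed:.2f} s for {n_steps} steps (~{ns_per_day:.1f} ns/day)")
    # Report the properties the platform actually used (DeviceIndex, Precision, ... for CUDA).
    for prop in platform.getPropertyNames():
        print(f"  {prop} = {platform.getPropertyValue(context, prop)}")


for name in ("CUDA", "CPU"):
    time_platform(name)
```

If the CUDA timing already looks poor here, or the context silently lands on an unexpected device, the problem is more likely with the node or driver setup than with a3fe's settings.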
Running nvidia-smi on one of our university's compute nodes gives the output below (note that this snapshot was not taken while we were running ABFE via a3fe; it's just for reference):
o_ali@sy129:~$ nvidia-smi
Wed Jun 18 17:58:02 2025
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.247.01 Driver Version: 535.216.03 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA H100 NVL On | 00000000:02:00.0 Off | 0 |
| N/A 38C P0 106W / 400W | 669MiB / 95830MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 1 NVIDIA H100 NVL On | 00000000:64:00.0 Off | 0 |
| N/A 75C P0 343W / 400W | 2845MiB / 95830MiB | 84% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 2 NVIDIA H100 NVL On | 00000000:82:00.0 Off | 0 |
| N/A 64C P0 285W / 400W | 1443MiB / 95830MiB | 95% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 3 NVIDIA H100 NVL On | 00000000:E3:00.0 Off | 0 |
| N/A 64C P0 297W / 400W | 1443MiB / 95830MiB | 95% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 N/A N/A 1414083 C ...iabol/anaconda3/envs/kg/bin/python3 660MiB |
| 1 N/A N/A 1060009 C ...iabol/anaconda3/envs/kg/bin/python3 2836MiB |
| 2 N/A N/A 2031809 C pmemd.cuda 1434MiB |
| 3 N/A N/A 2031802 C pmemd.cuda 1434MiB |
+---------------------------------------------------------------------------------------+
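Since the snapshot above shows other users' jobs (pmemd.cuda and another Python process) already occupying most of the GPUs, one thing worth checking inside the actual batch allocation is which devices the scheduler exposes to the a3fe job and which one OpenMM ends up on. The short sketch below is a hypothetical check, not a3fe code: it prints CUDA_VISIBLE_DEVICES, the platforms OpenMM can see, and the device/precision properties chosen when a trivial CUDA context is created.

```python
# Hypothetical diagnostic sketch (not a3fe code): run inside the same batch
# allocation as the ABFE jobs to see which GPUs the scheduler exposes and which
# device/precision OpenMM actually picks up there.
import os
import openmm as mm
import openmm.unit as unit

print("CUDA_VISIBLE_DEVICES =", os.environ.get("CUDA_VISIBLE_DEVICES", "<not set>"))

# Platforms OpenMM can see in this environment (CUDA may be missing inside a
# batch job if, for example, the submission script does not load the right modules).
for i in range(mm.Platform.getNumPlatforms()):
    print("Platform", i, ":", mm.Platform.getPlatform(i).getName())

# Create a trivial one-particle CUDA context just to see which device and
# precision OpenMM selects by default on this node.
try:
    system = mm.System()
    system.addParticle(1.0 * unit.amu)
    integrator = mm.VerletIntegrator(0.001 * unit.picoseconds)
    platform = mm.Platform.getPlatformByName("CUDA")
    context = mm.Context(system, integrator, platform)
    for prop in platform.getPropertyNames():
        print(prop, "=", platform.getPropertyValue(context, prop))
except Exception as exc:
    print("Could not create a CUDA context here:", exc)
```

If DeviceIndex points at a GPU that is already busy with another user's job, as in the snapshot above, contention alone could explain part of the slowdown.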
Following the recommendations from @fjclark and @Roy-Haolin-Du, I ran a short simulation (100 ps per lambda window) to estimate the relative simulation cost, which gave a value of roughly 15-16. I then set that value manually for the full adaptive run, but it was still similarly slow. This extreme slowness sometimes causes some of the simulations to crash, which is also why I have not yet completed any calculations for my system. :p On one occasion it even crashed an entire node with four GPUs.
At this point, we are unsure what is causing this slow performance. If you have run into something similar, or have any ideas, tuning tips, or diagnostic steps that helped you, we would appreciate hearing from you.
With thanks,
Yasir