Low GPU utilization when a3fe enters the ensemble equilibration stage #50

@gkxiao

Environment

• OS: Ubuntu 24.04.2 LTS
• Hardware: Dual NVIDIA GeForce RTX 4090 D GPUs (24 GB VRAM each), 64+ CPU cores
• Software:
  • a3fe: 0.33
  • GROMACS (compiled with CUDA); gmx --version output:

GROMACS version:     2025.1
Precision:           mixed
Memory model:        64 bit
MPI library:         thread_mpi
OpenMP support:      enabled (GMX_OPENMP_MAX_THREADS = 128)
GPU support:         CUDA
NBNxM GPU setup:     super-cluster 2x2x2 / cluster 8 (cluster-pair splitting on)
SIMD instructions:   AVX2_256
CPU FFT library:     fftw-3.3.10-sse2-avx-avx2-avx2_128
GPU FFT library:     cuFFT
Multi-GPU FFT:       none
RDTSCP usage:        enabled
TNG support:         enabled
Hwloc support:       disabled
Tracing support:     disabled
C compiler:          /usr/bin/cc GNU 13.3.0
C compiler flags:    -fexcess-precision=fast -funroll-all-loops -mavx2 -mfma -Wno-missing-field-initializers -O3 -DNDEBUG
C++ compiler:        /usr/bin/c++ GNU 13.3.0
C++ compiler flags:  -fexcess-precision=fast -funroll-all-loops -mavx2 -mfma -Wno-missing-field-initializers -Wno-cast-function-type-strict SHELL:-fopenmp -O3 -DNDEBUG
BLAS library:        Internal
LAPACK library:      Internal
CUDA compiler:       /usr/local/cuda-12.6/bin/nvcc nvcc: NVIDIA (R) Cuda compiler driver;Copyright (c) 2005-2024 NVIDIA Corporation;Built on Tue_Oct_29_23:50:19_PDT_2024;Cuda compilation tools, release 12.6, V12.6.85;Build cuda_12.6.r12.6/compiler.35059454_0
CUDA compiler flags: -O3 -DNDEBUG
CUDA driver:         12.60
CUDA runtime:        12.60

• SLURM submission script (cat run_somd.sh):

#!/bin/bash
#SBATCH -o somd-array-gpu-%A.%a.out
#SBATCH -n 1
#SBATCH --time 24:00:00
#SBATCH --gres=gpu:1

lam=$1
echo "lambda is: " $lam

srun somd-freenrg -C somd.cfg -l $lam -p CUDA

• a3fe script: run_a3fe.py

import a3fe as a3
calc = a3.Calculation(ensemble_size = 5)
calc.setup()
# Get optimised lambda schedule with thermodynamic speed
# of 2 kcal mol-1
calc.get_optimal_lam_vals(delta_er = 2)
# Run adaptively with a runtime constant of 0.0005 kcal**2 mol-2 ns**-1
# Note that automatic equilibration detection with the paired t-test
# method will also be carried out.
calc.run(adaptive=True, runtime_constant = 0.0005)
calc.wait()
calc.analyse()
calc.save()

Observed Behavior

• As soon as a3fe enters the ensemble equilibration step, the GPU load drops sharply.
SLURM details for one of the equilibration jobs:

scontrol show jobs 46054
JobId=46054 JobName=ensemble_equil_bound.sh
   UserId=gkxiao(997) GroupId=gkxiao(984) MCS_label=N/A
   Priority=1 Nice=0 Account=(null) QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:36:28 TimeLimit=1-00:00:00 TimeMin=N/A
   SubmitTime=2025-06-12T15:14:11 EligibleTime=2025-06-12T15:14:11
   AccrueTime=2025-06-12T15:14:11
   StartTime=2025-06-12T15:14:12 EndTime=2025-06-13T15:14:12 Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2025-06-12T15:14:12 Scheduler=Main
   Partition=batch AllocNode:Sid=master:250144
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=master
   BatchHost=master
   NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   ReqTRES=cpu=1,mem=1M,node=1,billing=1,gres/gpu=1
   AllocTRES=cpu=1,node=1,billing=1,gres/gpu=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/public/gkxiao/software/a3fe/4zlz_gaff2/bound/ensemble_equilibration_2/ensemble_equil_bound.sh
   WorkDir=/public/gkxiao/software/a3fe/4zlz_gaff2/bound/ensemble_equilibration_2
   StdErr=/public/gkxiao/software/a3fe/4zlz_gaff2/bound/ensemble_equilibration_2/somd-array-gpu-46054.4294967294.out
   StdIn=/dev/null
   StdOut=/public/gkxiao/software/a3fe/4zlz_gaff2/bound/ensemble_equilibration_2/somd-array-gpu-46054.4294967294.out
   Power=
   TresPerNode=gres/gpu:1
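
One detail in the job record above is easy to miss: SLURM allocated a single CPU (NumCPUs=1) while each mdrun process runs roughly 32 threads. The following is a minimal sketch for pulling those fields out of `scontrol show job` output; the helper name `parse_scontrol` and the thread count of 32 (taken from the top output further down) are assumptions for illustration, not part of a3fe or SLURM.

```python
def parse_scontrol(text: str) -> dict[str, str]:
    """Collect Key=Value tokens from `scontrol show job` output into a dict.

    Tokens are split on whitespace and each token is split on its first '=',
    so nested values such as AllocTRES=cpu=1,node=1 are kept whole.
    """
    fields: dict[str, str] = {}
    for token in text.split():
        key, sep, value = token.partition("=")
        if sep:  # ignore tokens that carry no '='
            fields.setdefault(key, value)
    return fields


# Excerpt of the job record shown above.
sample = """JobId=46054 JobName=ensemble_equil_bound.sh
   NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   AllocTRES=cpu=1,node=1,billing=1,gres/gpu=1
   TresPerNode=gres/gpu:1"""

job = parse_scontrol(sample)
allocated_cpus = int(job["NumCPUs"])
observed_threads = 32  # assumption: approximate value read from top, not from SLURM

if observed_threads > allocated_cpus:
    print(
        f"Job {job['JobId']}: {allocated_cpus} CPU allocated, "
        f"but ~{observed_threads} threads observed"
    )
```

Whether this mismatch matters depends on how the cluster confines jobs to their allocated cores, which this issue does not show.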

The batch script that this job runs:

cat bound/ensemble_equilibration_2/ensemble_equil_bound.sh
#!/bin/bash
#SBATCH -o somd-array-gpu-%A.%a.out
#SBATCH -n 1
#SBATCH --time 24:00:00
#SBATCH --gres=gpu:1

python -c 'from a3fe.run.system_prep import slurm_ensemble_equilibration_bound; slurm_ensemble_equilibration_bound()'

• Two gmx mdrun processes each consume ~32 CPU cores (about 3245% CPU usage in top).

Tasks: 1511 total,   4 running, 1504 sleeping,   0 stopped,   3 zombie
%Cpu(s): 50.9 us,  0.1 sy,  0.0 ni, 49.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
MiB Mem :  6.1/257752.1 [||||||                                                                                              ]
MiB Swap:  0.0/8192.0   [                                                                                                    ]

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
2909717 gkxiao    20   0 9953.0m 372272 145244 R  3245   0.1      6,09 /usr/local/gromacs/bin/gmx mdrun -deffnm gromacs -c /public/gkxiao/software/a3fe/4zlz_gaff2/bound/ensemble_equilibration_2/gromacs_out.gro
2909711 gkxiao    20   0 9941.9m 292484 142192 R  3242   0.1      6,21 /usr/local/gromacs/bin/gmx mdrun -deffnm gromacs -c /public/gkxiao/software/a3fe/4zlz_gaff2/bound/ensemble_equilibration_1/gromacs_out.gro
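
The "~32 cores" figure comes from the %CPU column, where top (in its default mode) reports 100% per fully busy logical CPU. A small sketch of that arithmetic, using the two values from the listing above; the function name `cores_from_top` is hypothetical.

```python
def cores_from_top(cpu_percent: float) -> float:
    """Convert top's %CPU value to an equivalent number of busy logical CPUs."""
    return cpu_percent / 100.0


# PID -> %CPU, copied from the top listing above.
mdrun_cpu_percent = {2909717: 3245.0, 2909711: 3242.0}

per_process = {pid: cores_from_top(pct) for pid, pct in mdrun_cpu_percent.items()}
total_cores = sum(per_process.values())

for pid, cores in per_process.items():
    print(f"PID {pid}: ~{cores:.1f} cores")
print(f"Both mdrun processes together: ~{total_cores:.1f} cores")
```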

• Both GPUs show 1% utilization with minimal VRAM usage (392 MiB used by each gmx process out of 24564 MiB per GPU, according to nvidia-smi).

Thu Jun 12 15:20:33 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.03              Driver Version: 560.35.03      CUDA Version: 12.6     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 4090 D      On  |   00000000:01:00.0 Off |                  Off |
| 30%   48C    P0             64W /  425W |     415MiB /  24564MiB |      1%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA GeForce RTX 4090 D      On  |   00000000:41:00.0 Off |                  Off |
| 30%   53C    P0             61W /  425W |     415MiB /  24564MiB |      1%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A      4636      G   /usr/lib/xorg/Xorg                              4MiB |
|    0   N/A  N/A   2909711      C   /usr/local/gromacs/bin/gmx                    392MiB |
|    1   N/A  N/A      4636      G   /usr/lib/xorg/Xorg                              4MiB |
|    1   N/A  N/A   2909717      C   /usr/local/gromacs/bin/gmx                    392MiB |
+-----------------------------------------------------------------------------------------+
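
For anyone who wants to track these numbers over time rather than read the table by eye, here is a rough sketch that uses nvidia-smi's CSV query mode. The parsing is kept separate from the subprocess call so it can be tried on a pasted sample; the 10% "idle" threshold and the function names are assumptions chosen for illustration.

```python
import csv
import io
import subprocess

# nvidia-smi's machine-readable query mode: one CSV row per GPU, no header, no units.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,utilization.gpu,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_csv(text: str) -> list[dict[str, int]]:
    """Parse the CSV rows produced by QUERY into one dict per GPU."""
    gpus = []
    for row in csv.reader(io.StringIO(text), skipinitialspace=True):
        if not row:  # skip blank lines
            continue
        index, util, used, total = (int(field) for field in row)
        gpus.append(
            {"index": index, "util_percent": util,
             "mem_used_mib": used, "mem_total_mib": total}
        )
    return gpus


def query_gpus() -> list[dict[str, int]]:
    """Run nvidia-smi and parse its output (requires an NVIDIA driver)."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_csv(result.stdout)


# Sample rows matching the table above, so the parser can be tried without a GPU.
sample = "0, 1, 415, 24564\n1, 1, 415, 24564\n"
gpus = parse_gpu_csv(sample)
idle = [g["index"] for g in gpus if g["util_percent"] < 10]  # 10% is an arbitrary cutoff
print(f"GPUs below 10% utilization: {idle}")
```

On the machine itself, `query_gpus()` could be called in a loop while the equilibration jobs run.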

Expected Outcome

If the ensemble equilibration runs used the GPUs effectively, I would expect:
• GPU utilization above 90%.
• CPU usage below 16 cores per process, giving a more balanced workload.
• Simulation throughput improved by roughly 5–10×, based on GROMACS benchmarks.

Labels: enhancement (New feature or request)