Describe the bug
I'm using Megatron to run RLHF alignment training for Llama2-7B on 8 H100 GPUs, with a 60-core CPU and 1200 GB of memory, and I get this error: Resource temporarily unavailable
Screenshots
Error info:


GPU usage info:
