Description
Problem Description
Hi, I am trying to run my workload using Megatron-DeepSpeed:
Here is how I launched my run with omnitrace:
torchrun --nnodes 1 --nproc_per_node ${NUM_GPUS} -m omnitrace ../pretrain_gpt_deepspeed.py \
${megatron_options} \
${data_options} \
${deepspeed_options} \
--master-addr=$MASTER_ADDR \
--zero-stage 1 \
2>&1 | tee n${NUM_GPUS}-Small-XMoE-batch${BATCH_SIZE}.log
However, it fails with the following error:
Tool lib "/home1/zixianw4/omnitrace/lib/python/site-packages/omnitrace/../../../libomnitrace.so" failed to load.
Tool lib "/home1/zixianw4/omnitrace/lib/python/site-packages/omnitrace/../../../libomnitrace.so" failed to load.
Tool lib "/home1/zixianw4/omnitrace/lib/python/site-packages/omnitrace/../../../libomnitrace.so" failed to load.
Tool lib "/home1/zixianw4/omnitrace/lib/python/site-packages/omnitrace/../../../libomnitrace.so" failed to load.
Tool lib "/home1/zixianw4/omnitrace/lib/python/site-packages/omnitrace/../../../libomnitrace.so" failed to load.
Tool lib "/home1/zixianw4/omnitrace/lib/python/site-packages/omnitrace/../../../libomnitrace.so" failed to load.
Tool lib "/home1/zixianw4/omnitrace/lib/python/site-packages/omnitrace/../../../libomnitrace.so" failed to load.
Tool lib "/home1/zixianw4/omnitrace/lib/python/site-packages/omnitrace/../../../libomnitrace.so" failed to load.
Because I am using AMD's hpcfund cluster, I don't have write access to /opt, so I created an omnitrace directory in my home directory and installed everything there.
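A "failed to load" message for a shared library often means that one of the library's own dependencies cannot be resolved, rather than that the file itself is missing. A minimal sketch for checking this, assuming `ldd` is available on the node: the helper name `missing_libs` is hypothetical, and it only filters `ldd`-style output read from stdin.

```shell
# Hypothetical helper: print the names of shared libraries that
# ldd reports as "not found". Reads ldd-style output on stdin.
missing_libs() {
    awk '/=> not found/ { print $1 }'
}

# Example invocation (path taken from the error message above):
#   ldd "$HOME/omnitrace/lib/libomnitrace.so" | missing_libs
```

If this prints any library names, those are the dependencies the dynamic loader could not find in the current environment.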
Here is how I installed omnitrace:
I downloaded omnitrace from the release page here: https://github.com/ROCm/omnitrace/releases/tag/rocm-6.2.4.
I found that hpcfund is running ROCm 6.4 on Rocky Linux 9.1. However, the closest release I found is "omnitrace-1.11.2-rhel-9.2-ROCm-60000-PAPI-OMPT-Python3.sh", so I used that instead. I also tried rhel9.1, but it failed as well.
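To make the version gap explicit, the ROCm field in the installer file name can be decoded and compared with the cluster's ROCm version. This is a sketch under the assumption that the `ROCm-NNNNN` field encodes major*10000 + minor*100 + patch; the function name `installer_rocm_version` is hypothetical.

```shell
# Hypothetical helper: decode the ROCm-NNNNN field of an installer
# file name into major.minor.patch.
# Assumption: NNNNN = major*10000 + minor*100 + patch.
installer_rocm_version() {
    name=$1
    # Extract the digits between "-ROCm-" and the next "-".
    code=$(printf '%s\n' "$name" | sed -n 's/.*-ROCm-\([0-9][0-9]*\)-.*/\1/p')
    [ -n "$code" ] || return 1
    printf '%d.%d.%d\n' $((code / 10000)) $((code / 100 % 100)) $((code % 100))
}

# Example invocation:
#   installer_rocm_version omnitrace-1.11.2-rhel-9.2-ROCm-60000-PAPI-OMPT-Python3.sh
```

Under that assumption, the installer used here targets ROCm 6.0.0, while the cluster runs ROCm 6.4.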
Then I did the following according to this video and this document:
mkdir ~/omnitrace
cd ~/omnitrace
(xmoe) [zixianw4@k004-001 omnitrace]$ wget https://github.com/ROCm/omnitrace/releases/download/rocm-6.2.4/omnitrace-1.11.2-rhel-9.2-ROCm-60000-PAPI-OMPT-Python3.sh
(xmoe) [zixianw4@k004-001 omnitrace]$ bash omnitrace-1.11.2-rhel-9.2-ROCm-60000-PAPI-OMPT-Python3.sh
(xmoe) [zixianw4@k004-001 omnitrace]$ module use share/modulefiles/
(xmoe) [zixianw4@k004-001 omnitrace]$ module load omnitrace
omnitrace omnitrace/1.11.2
(xmoe) [zixianw4@k004-001 omnitrace]$ module load omnitrace/1.11.2
(xmoe) [zixianw4@k004-001 omnitrace]$ source share/omnitrace/setup-env.sh
I am able to run:
omnitrace-instrument --help
But running omnitrace-avail --help gives an error:
omnitrace-avail --help
omnitrace-avail: error while loading shared libraries: librocm_smi64.so.6: cannot open shared object file: No such file or directory
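One way to narrow this down is to check whether any copy of the library exists on the node and under which version suffix. The sketch below uses a hypothetical helper `find_lib`; the directories in the example invocation are assumptions about where ROCm and this omnitrace install keep their libraries.

```shell
# Hypothetical helper: search the given directories for a shared
# library by file name (or glob pattern) and print each match.
find_lib() {
    lib=$1
    shift
    for dir in "$@"; do
        # Skip directories that do not exist on this node.
        [ -d "$dir" ] && find "$dir" -name "$lib" 2>/dev/null
    done
    return 0
}

# Example invocation (directories are assumptions):
#   find_lib 'librocm_smi64.so*' /opt/rocm/lib "$HOME/omnitrace/lib"
```

If a match turns up in a directory that is not on the loader's search path, adding that directory to `LD_LIBRARY_PATH` is one thing to try. If only a different version suffix is present, that would be consistent with the installer having been built against a different ROCm release than the one on the cluster.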
Operating System
NAME="Rocky Linux" VERSION="9.1 (Blue Onyx)"
CPU
AMD EPYC 7763 64-Core Processor
GPU
MI250
ROCm Version
ROCm 6.4
ROCm Component
No response
Steps to Reproduce
No response
(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support
No response
Additional Information
No response