[Issue]: Unable to use torchrun -m omnitrace on ROCm 6.4, RHEL 9.1 #455

@zixianwang2022


Problem Description

Hi, I am trying to profile a Megatron-DeepSpeed workload with omnitrace.

Here is how I launched the run:

torchrun --nnodes 1 --nproc_per_node ${NUM_GPUS} -m omnitrace  ../pretrain_gpt_deepspeed.py \
        ${megatron_options} \
        ${data_options} \
        ${deepspeed_options} \
        --master-addr=$MASTER_ADDR \
        --zero-stage 1 \
        2>&1 | tee n${NUM_GPUS}-Small-XMoE-batch${BATCH_SIZE}.log

However, it fails with the following error:

Tool lib "/home1/zixianw4/omnitrace/lib/python/site-packages/omnitrace/../../../libomnitrace.so" failed to load.
Tool lib "/home1/zixianw4/omnitrace/lib/python/site-packages/omnitrace/../../../libomnitrace.so" failed to load.
Tool lib "/home1/zixianw4/omnitrace/lib/python/site-packages/omnitrace/../../../libomnitrace.so" failed to load.
Tool lib "/home1/zixianw4/omnitrace/lib/python/site-packages/omnitrace/../../../libomnitrace.so" failed to load.
Tool lib "/home1/zixianw4/omnitrace/lib/python/site-packages/omnitrace/../../../libomnitrace.so" failed to load.
Tool lib "/home1/zixianw4/omnitrace/lib/python/site-packages/omnitrace/../../../libomnitrace.so" failed to load.
Tool lib "/home1/zixianw4/omnitrace/lib/python/site-packages/omnitrace/../../../libomnitrace.so" failed to load.
Tool lib "/home1/zixianw4/omnitrace/lib/python/site-packages/omnitrace/../../../libomnitrace.so" failed to load.
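One way to narrow down a "failed to load" message like this is to ask the dynamic linker which dependencies of the tool library it cannot resolve. The following is a minimal sketch, not a confirmed diagnosis: `missing_libs` is a hypothetical helper name, and the library path assumes the `~/omnitrace` install prefix described in this report.

```shell
# Hypothetical helper: reads ldd-style lines on stdin and prints the
# names of the dependencies that ldd marks as "not found".
missing_libs() {
    awk '/=> not found/ { print $1 }'
}

# Assumed location, derived from the path in the error message.
LIB="${HOME}/omnitrace/lib/libomnitrace.so"

if [ -e "${LIB}" ]; then
    # List only the shared libraries the loader cannot resolve.
    ldd "${LIB}" | missing_libs
else
    echo "not present: ${LIB}"
fi
```

If this prints any library names, those are candidates for the actual load failure; an empty result would point to a different cause.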

Because I am using AMD's hpcfund, I don't have write access to /opt, so I created an omnitrace directory in my home directory and installed everything there.

Here is how I downloaded omnitrace:

I downloaded omnitrace from this release page: https://github.com/ROCm/omnitrace/releases/tag/rocm-6.2.4.

I found that hpcfund is running ROCm 6.4 on Rocky Linux 9.1. However, the closest release I found is "omnitrace-1.11.2-rhel-9.2-ROCm-60000-PAPI-OMPT-Python3.sh", so I used it instead. I also tried the rhel9.1 installer, but it failed as well.
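To make the mismatch between the installer and the system concrete, the versions can be printed side by side. This is a sketch under assumptions: `rocm_tag` is a hypothetical helper, the `.info/version` file under the ROCm prefix is assumed to exist on this system, and `/opt/rocm` is only a default guess for `ROCM_PATH`.

```shell
# File name copied from the release mentioned above.
installer="omnitrace-1.11.2-rhel-9.2-ROCm-60000-PAPI-OMPT-Python3.sh"

# Hypothetical helper: extract the digits that follow "-ROCm-" in an installer name.
rocm_tag() {
    printf '%s\n' "$1" | sed -n 's/.*-ROCm-\([0-9][0-9]*\)-.*/\1/p'
}

echo "installer ROCm tag: $(rocm_tag "${installer}")"

# Assumption: the system ROCm records its version in this file; skipped if absent.
version_file="${ROCM_PATH:-/opt/rocm}/.info/version"
if [ -r "${version_file}" ]; then
    echo "installed ROCm:     $(cat "${version_file}")"
fi

# OS name and version as reported by the distribution.
if [ -r /etc/os-release ]; then
    grep -E '^(NAME|VERSION)=' /etc/os-release || true
fi
```

Including this output in the report makes it easier to judge whether the installer was built against a ROCm release that is compatible with the one on the cluster.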

Then I followed this video and this document and ran the following:

mkdir ~/omnitrace
cd ~/omnitrace

(xmoe) [zixianw4@k004-001 omnitrace]$ wget https://github.com/ROCm/omnitrace/releases/download/rocm-6.2.4/omnitrace-1.11.2-rhel-9.2-ROCm-60000-PAPI-OMPT-Python3.sh
(xmoe) [zixianw4@k004-001 omnitrace]$ bash omnitrace-1.11.2-rhel-9.2-ROCm-60000-PAPI-OMPT-Python3.sh 

(xmoe) [zixianw4@k004-001 omnitrace]$ module use share/modulefiles/
(xmoe) [zixianw4@k004-001 omnitrace]$ module load omnitrace
omnitrace         omnitrace/1.11.2  
(xmoe) [zixianw4@k004-001 omnitrace]$ module load omnitrace/1.11.2 
(xmoe) [zixianw4@k004-001 omnitrace]$ source share/omnitrace/setup-env.sh 

I am able to run

omnitrace-instrument --help

But running omnitrace-avail --help gives me an error:

omnitrace-avail --help
omnitrace-avail: error while loading shared libraries: librocm_smi64.so.6: cannot open shared object file: No such file or directory
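The loader error names `librocm_smi64.so.6` specifically, so a natural follow-up is to check which versions of that library the system ROCm actually provides. The sketch below assumes `ROCM_PATH` points at the system install (with `/opt/rocm` as a fallback guess) and uses a hypothetical helper, `list_sonames`. If only a different major version is present, that would suggest the installer does not match the ROCm on the cluster, though this remains something to verify rather than a confirmed cause.

```shell
# Assumption: ROCM_PATH points at the system ROCm install; /opt/rocm is a common default.
ROCM_PATH="${ROCM_PATH:-/opt/rocm}"

# Hypothetical helper: list files directly inside directory $1 whose
# names start with "<stem>.so", where the stem is $2.
list_sonames() {
    find "$1" -maxdepth 1 -name "$2.so*" 2>/dev/null | sort
}

# Which librocm_smi64 versions does the system ROCm ship?
list_sonames "${ROCM_PATH}/lib" librocm_smi64

# Which rocm_smi library does omnitrace-avail ask for, and is it resolved?
if command -v omnitrace-avail >/dev/null 2>&1; then
    ldd "$(command -v omnitrace-avail)" | grep -i 'rocm_smi' || true
fi
```

Comparing the two outputs shows whether the version the binary requests exists anywhere the loader searches.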

Operating System

NAME="Rocky Linux" VERSION="9.1 (Blue Onyx)"

CPU

AMD EPYC 7763 64-Core Processor

GPU

MI250

ROCm Version

ROCm 6.4

ROCm Component

No response

Steps to Reproduce

No response

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

No response

Additional Information

No response
