Description
Suggestion Description
AMD's official ROCm container images currently provide only limited support for Broadcom bnxt modules. This has been a friction point for customers who are used to NVIDIA base images working out of the box with Mellanox NICs.
Current setup docs:
https://rocm.docs.amd.com/en/latest/how-to/rocm-for-ai/system-setup/multi-node-setup.html
This limits the usability of ROCm base containers for distributed workloads: users must manually install and configure bnxt drivers, or rebuild the container images, to get RoCE working, and supporting different deployments can require additional CI work.
We understand that supporting multiple RoCE implementations (Broadcom, etc.) may be non-trivial. However, even partial out-of-the-box support, or the ability to select the target bnxt version via an environment variable, would significantly improve the usability and portability of ROCm containers for multi-node workloads.
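To illustrate the environment-variable idea, here is a rough sketch of the selection logic a container entrypoint could run. Everything named in it is an assumption for discussion: the variable `ROCM_BNXT_VERSION`, the function `select_bnxt_version`, and the placeholder version labels do not exist in any ROCm image today.

```python
import os

# Hypothetical names: neither this variable nor these labels exist in ROCm images.
ENV_VAR = "ROCM_BNXT_VERSION"
BUNDLED_VERSIONS = ("version-a", "version-b")  # placeholder labels, not real releases


def select_bnxt_version(env=None, bundled=BUNDLED_VERSIONS):
    """Return the bundled bnxt version to activate, or None to skip bnxt setup.

    `env` defaults to the process environment; passing a dict makes the
    function easy to exercise without touching real environment variables.
    """
    env = os.environ if env is None else env
    requested = env.get(ENV_VAR, "").strip()
    if not requested:
        # Variable unset or empty: leave the image as it is.
        return None
    if requested not in bundled:
        # Fail loudly instead of silently falling back to another version.
        raise ValueError(
            f"{ENV_VAR}={requested!r} is not bundled; choose one of {list(bundled)}"
        )
    return requested
```

With something like this, a deployment would only need to set the variable at `docker run` time instead of rebuilding the image, and an unsupported value would fail at startup rather than surfacing later as a RoCE problem.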
Operating System
No response
GPU
No response
ROCm Component
No response