Merged
10 changes: 10 additions & 0 deletions README.md
@@ -105,6 +105,16 @@ Default: `true`

Type: `bool`

### hpc_install_nvidia_imex

Whether to install NVIDIA IMEX (`nvidia-imex`) and enable `nvidia-imex.service`.

Note: This role installs and enables the `nvidia-imex` service but does not start it immediately. The service is configured to start at boot only on compatible multi-node NVLink switch-fabric systems, such as NVIDIA GB200 or GB300 (NVL72) racks.

Default: `true`

Type: `bool`
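
As a minimal sketch, a playbook might turn this option off on hosts that are not part of an NVLink switch-fabric rack. The role name `linux-system-roles.hpc` and the host group `hpc_nodes` are assumptions for illustration and do not appear in this change; only `hpc_install_nvidia_imex` comes from the role.

```yaml
# Hypothetical playbook: the role name and host group are assumed, not taken from this change.
- name: Configure HPC nodes without NVIDIA IMEX
  hosts: hpc_nodes
  roles:
    - role: linux-system-roles.hpc
      vars:
        hpc_install_nvidia_imex: false  # skip installing and enabling nvidia-imex
```

Leaving the variable at its default of `true` keeps the install and enable tasks active.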

### hpc_install_rdma

Whether to install the NVIDIA RDMA package.
1 change: 1 addition & 0 deletions defaults/main.yml
@@ -20,6 +20,7 @@ hpc_install_cuda_driver: true
hpc_install_cuda_toolkit: true
hpc_install_hpc_nvidia_nccl: true
hpc_install_nvidia_fabric_manager: true
hpc_install_nvidia_imex: true
hpc_install_rdma: true
hpc_enable_azure_persistent_rdma_naming: true
hpc_install_system_openmpi: true
17 changes: 17 additions & 0 deletions tasks/main.yml
@@ -527,6 +527,23 @@
        name: nvidia-fabricmanager
        enabled: true

- name: Install/enable NVIDIA IMEX (NVLink multi-node runtime)
  when:
    - hpc_install_nvidia_imex
    - ansible_facts["system_vendor"] == "Microsoft Corporation"
  block:
    - name: Install NVIDIA IMEX
      package:
        name: "{{ __hpc_nvidia_imex_package }}"
        state: present
        use: "{{ (__hpc_server_is_ostree | d(false)) |
          ternary('ansible.posix.rhel_rpm_ostree', omit) }}"

    - name: Enable NVIDIA IMEX service
      systemd:
        name: nvidia-imex.service
        enabled: true

- name: Install RDMA packages
  when: hpc_install_rdma
  block:
1 change: 1 addition & 0 deletions vars/main.yml
@@ -31,6 +31,7 @@ __hpc_cuda_driver_packages:
- cuda-drivers
__hpc_nvidia_fabric_manager_packages:
- nvidia-fabric-manager
__hpc_nvidia_imex_package: nvidia-imex
__hpc_nvidia_container_toolkit_packages:
- nvidia-container-toolkit
__hpc_rdma_packages: