feat: gate NVIDIA IMEX enablement to GB200/GB300 NVLink systems #85
Merged
richm merged 1 commit into linux-system-roles:main on Feb 27, 2026
Reviewer's Guide
Adds optional, hardware-gated installation and activation of the NVIDIA IMEX service for GB200/GB300 NVLink systems, plus defaults and documentation for the new behavior.
Sequence diagram for NVLink health check with NVIDIA IMEX enabled
sequenceDiagram
participant HealthAgent
participant NvidiaImexService
participant NvidiaDriver
participant NvlinkFabric
HealthAgent->>NvidiaImexService: Query IMEX domain status
NvidiaImexService->>NvidiaDriver: Request NVLink fabric topology and status
NvidiaDriver->>NvlinkFabric: Probe connectivity and domain state
NvlinkFabric-->>NvidiaDriver: Connectivity table and domain state UP
NvidiaDriver-->>NvidiaImexService: Normalized status (Domain State: UP)
NvidiaImexService-->>HealthAgent: IMEX domain status HEALTHY
HealthAgent-->>HealthAgent: Report BackgroundGPUHealthChecks OK for NvLink
Class diagram for new NVIDIA IMEX configuration variables
classDiagram
class HpcRoleDefaults {
bool hpc_install_cuda_driver = true
bool hpc_install_cuda_toolkit = true
bool hpc_install_hpc_nvidia_nccl = true
bool hpc_install_nvidia_fabric_manager = true
bool hpc_install_nvidia_imex = true
bool hpc_install_rdma = true
bool hpc_enable_azure_persistent_rdma_naming = true
bool hpc_install_system_openmpi = true
}
class HpcRoleVars {
list~string~ __hpc_cuda_driver_packages
list~string~ __hpc_nvidia_fabric_manager_packages
string __hpc_nvidia_imex_package = nvidia-imex
list~string~ __hpc_nvidia_container_toolkit_packages
list~string~ __hpc_rdma_packages
}
class NvidiaImexTaskBlock {
bool condition_hpc_install_nvidia_imex
string condition_system_vendor
string fact___hpc_imex_gpu_names
void detect_gpu_names()
void install_imex_package()
void enable_and_start_imex_service()
}
HpcRoleDefaults "1" o-- "1" HpcRoleVars : provides_defaults_for
HpcRoleDefaults "1" o-- "1" NvidiaImexTaskBlock : controls_via_hpc_install_nvidia_imex
HpcRoleVars "1" o-- "1" NvidiaImexTaskBlock : uses___hpc_nvidia_imex_package
cc5eb0b to a3fe355
spetrosi approved these changes Feb 26, 2026
richm reviewed Feb 26, 2026
a3fe355 to c8714ba
Add NVIDIA IMEX integration to support runtime NVLink switch-fabric (re-)configuration where applicable.

Install nvidia-imex (configurable via __hpc_nvidia_imex_package) and enable nvidia-imex.service. This role installs and enables the nvidia-imex service but does not start it immediately. The service is configured to launch at boot only on compatible multi-node NVLink switch-fabric systems, such as NVIDIA GB200 or GB300 (NVL72) racks.

Update README.md to document the IMEX behavior and requirement expectations (CycleCloud HealthAgent).

Signed-off-by: Gaurav Goklani <ggoklani@redhat.com>
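The gating described in the commit message can be illustrated with a small Python sketch. This is not the role's implementation (the role is Ansible); it only shows the decision under stated assumptions: the function name `should_enable_imex`, the regular expression, and the idea of matching on GPU product names are hypothetical, and the system-vendor condition shown in the class diagram is left out because its value is not given here.

```python
import re
from typing import List

# Assumption: GB200/GB300 systems can be recognized by a "GB200" or "GB300"
# token in the GPU product name. The pattern is illustrative only.
_IMEX_GPU_PATTERN = re.compile(r"\bGB[23]00\b", re.IGNORECASE)


def should_enable_imex(install_flag: bool, gpu_names: List[str]) -> bool:
    """Return True only when the toggle is on and a GB200/GB300 GPU is present.

    install_flag mirrors the hpc_install_nvidia_imex default from the PR;
    gpu_names stands in for the detected GPU name list.
    """
    if not install_flag:
        return False
    return any(_IMEX_GPU_PATTERN.search(name) for name in gpu_names)
```

With this sketch, a host reporting `NVIDIA GB200` qualifies when the toggle is true, while a host with no matching GPU name, or with the toggle set to false, is skipped.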
b767101 to d786130
richm approved these changes Feb 27, 2026
Enhancement:
4. Update README.md to document the IMEX behavior and requirement expectations (CycleCloud HealthAgent).
Reason:
On NVIDIA GB200 or GB300 (NVL72) NVLink systems, the CycleCloud HealthAgent fails with the error below when IMEX is not running.
"BackgroundGPUHealthChecks": {
"status": "Error",
"message": "BackgroundGPUHealthChecks reports errors",
"description": "BackgroundGPUHealthChecks report Error count=1 subsystem=NvLink",
"details": "IMEX domain status is DEGRADED (unhealthy) Check IMEX installation, configuration, domain and daemon status, and network connectivity.",
"last_update": "2026-02-26T08:06:00 UTC",
"categories": [
"NvLink"
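For anyone scripting around this failure, a minimal Python sketch of how such a report could be checked is shown here. It assumes the report has already been loaded into a dictionary with the keys visible in the fragment above; the function name `nvlink_imex_degraded` is hypothetical and is not part of HealthAgent.

```python
def nvlink_imex_degraded(report: dict) -> bool:
    """Return True when the report matches the IMEX failure shown above.

    Assumes the key names from the pasted fragment; any other fields in a
    real report are ignored.
    """
    check = report.get("BackgroundGPUHealthChecks", {})
    return (
        check.get("status") == "Error"
        and "NvLink" in check.get("categories", [])
        and "IMEX domain status is DEGRADED" in check.get("details", "")
    )
```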
Result:
[ggoklani@gaurav-test-gpu-1 ~]$ sudo systemctl status nvidia-imex.service
● nvidia-imex.service - NVIDIA IMEX service
Loaded: loaded (/usr/lib/systemd/system/nvidia-imex.service; enabled; preset: enabled)
Active: active (running) since Thu 2026-02-26 07:51:49 UTC; 4s ago
Process: 27693 ExecStart=/usr/bin/nvidia-imex -c /etc/nvidia-imex/config.cfg (code=exited, status=0/SUCCESS)
Main PID: 27695 (nvidia-imex)
Tasks: 31 (limit: 3355442)
Memory: 12.9M
CPU: 28ms
CGroup: /system.slice/nvidia-imex.service
└─27695 /usr/bin/nvidia-imex -c /etc/nvidia-imex/config.cfg
Feb 26 07:51:49 gaurav-test-gpu-1 systemd[1]: Starting NVIDIA IMEX service...
Feb 26 07:51:49 gaurav-test-gpu-1 systemd[1]: Started NVIDIA IMEX service.
[ggoklani@gaurav-test-gpu-1 ~]$ nvidia-imex-ctl -N
Connectivity Table Legend:
I - Invalid - Node wasn't reachable, no connection status available
N - Never Connected
R - Recovering - Connection was lost, but clean up has not yet been triggered.
D - Disconnected - Connection was lost, and clean up has been triggreed.
A - Authenticating - If GSSAPI enabled, client has initiated mutual authentication.
!V! - Version mismatch, communication disabled.
!M! - Node map mismatch, communication disabled.
!A! - Authentication error, communication disabled.
!R! - Not yet connected, but blocking the service from proceeding past WAITING_FOR_RECOVERY.
C - Connected - Ready for operation
2/26/2026 07:52:10.366
Nodes:
Node #0 * 172.17.1.5 * - READY - Version: 580.126.20
Node #1 - 172.17.1.6 - READY - Version: 580.126.20
Nodes From\To 0 1
0 C C
1 C C
Domain State: UP
[ggoklani@gaurav-test-gpu-1 ~]$
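The verification above can be automated with a short Python sketch that parses captured `nvidia-imex-ctl -N` output. It assumes the text layout shown in this PR (`Node #N ... - READY - Version: ...` lines and a final `Domain State:` line); other driver versions may print a different format, and the function names are hypothetical.

```python
import re


def imex_domain_is_up(ctl_output: str) -> bool:
    """Return True when the output contains the line 'Domain State: UP'."""
    match = re.search(r"^\s*Domain State:\s*(\S+)", ctl_output, re.MULTILINE)
    return match is not None and match.group(1) == "UP"


def all_nodes_ready(ctl_output: str) -> bool:
    """Return True when at least one node is listed and every node is READY."""
    # Capture the state token that sits between the address and the version.
    states = re.findall(
        r"^\s*Node #\d+.* - (\w+) - Version:", ctl_output, re.MULTILINE
    )
    return bool(states) and all(state == "READY" for state in states)
```

Both functions take the command output as a string, so they can be exercised against saved logs without access to a GPU node.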
Issue Tracker Tickets (Jira or BZ if any): https://issues.redhat.com/browse/RHELHPC-160
Summary by Sourcery
Add gated NVIDIA IMEX support for GB200/GB300 NVLink multi-node systems and document the new behavior and configuration toggle.