
feat: gate NVIDIA IMEX enablement to GB200/GB300 NVLink systems #85

Merged
richm merged 1 commit into linux-system-roles:main from ggoklani:install_nvidia_imex
Feb 27, 2026
Conversation

@ggoklani
Collaborator

@ggoklani ggoklani commented Feb 26, 2026

Enhancement:

  1. Add NVIDIA IMEX integration to support runtime NVLink switch-fabric (re)configuration where applicable.
  2. Install nvidia-imex (configurable via __hpc_nvidia_imex_package) and enable nvidia-imex.service.
  3. The role installs and enables the nvidia-imex service but does not start it immediately; the service launches at boot only on compatible multi-node NVLink switch-fabric systems, such as NVIDIA GB200 or GB300 (NVL72) racks.
  4. Update README.md to document the IMEX behavior and requirement expectations (CycleCloud HealthAgent).
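
For a consumer of the role, the new toggle could be exercised as follows (a hypothetical playbook sketch; the role reference name and host group are illustrative, only hpc_install_nvidia_imex is the variable this PR introduces):

```yaml
# Hypothetical playbook sketch: apply the HPC role with IMEX handling enabled.
# The role installs nvidia-imex and enables nvidia-imex.service, but the
# service only starts at boot on GB200/GB300 NVLink systems.
- hosts: gpu_nodes
  roles:
    - role: linux-system-roles.hpc
      vars:
        hpc_install_nvidia_imex: true   # default; set to false to skip IMEX entirely
```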

Reason:
On NVIDIA GB200/GB300 (NVL72) NVLink racks, the CycleCloud HealthAgent fails with the error below:

"BackgroundGPUHealthChecks": {
    "status": "Error",
    "message": "BackgroundGPUHealthChecks reports errors",
    "description": "BackgroundGPUHealthChecks report Error count=1 subsystem=NvLink",
    "details": "IMEX domain status is DEGRADED (unhealthy) Check IMEX installation, configuration, domain and daemon status, and network connectivity.",
    "last_update": "2026-02-26T08:06:00 UTC",
    "categories": [
        "NvLink"
    ]
}
Result:

[ggoklani@gaurav-test-gpu-1 ~]$ sudo systemctl status nvidia-imex.service
● nvidia-imex.service - NVIDIA IMEX service
     Loaded: loaded (/usr/lib/systemd/system/nvidia-imex.service; enabled; preset: enabled)
     Active: active (running) since Thu 2026-02-26 07:51:49 UTC; 4s ago
    Process: 27693 ExecStart=/usr/bin/nvidia-imex -c /etc/nvidia-imex/config.cfg (code=exited, status=0/SUCCESS)
   Main PID: 27695 (nvidia-imex)
      Tasks: 31 (limit: 3355442)
     Memory: 12.9M
        CPU: 28ms
     CGroup: /system.slice/nvidia-imex.service
             └─27695 /usr/bin/nvidia-imex -c /etc/nvidia-imex/config.cfg

Feb 26 07:51:49 gaurav-test-gpu-1 systemd[1]: Starting NVIDIA IMEX service...
Feb 26 07:51:49 gaurav-test-gpu-1 systemd[1]: Started NVIDIA IMEX service.
[ggoklani@gaurav-test-gpu-1 ~]$ nvidia-imex-ctl -N
Connectivity Table Legend:
I - Invalid - Node wasn't reachable, no connection status available
N - Never Connected
R - Recovering - Connection was lost, but clean up has not yet been triggered.
D - Disconnected - Connection was lost, and clean up has been triggreed.
A - Authenticating - If GSSAPI enabled, client has initiated mutual authentication.
!V! - Version mismatch, communication disabled.
!M! - Node map mismatch, communication disabled.
!A! - Authentication error, communication disabled.
!R! - Not yet connected, but blocking the service from proceeding past WAITING_FOR_RECOVERY.
C - Connected - Ready for operation

2/26/2026 07:52:10.366
Nodes:
Node #0   * 172.17.1.5 *       - READY                - Version: 580.126.20
Node #1   - 172.17.1.6         - READY                - Version: 580.126.20

 Nodes From\To  0   1
       0        C   C
       1        C   C
Domain State: UP
[ggoklani@gaurav-test-gpu-1 ~]$

Issue Tracker Tickets (Jira or BZ if any): https://issues.redhat.com/browse/RHELHPC-160

Summary by Sourcery

Add optional, hardware-gated NVIDIA IMEX support for NVLink multi-node GPU systems and document its configuration.

New Features:

  • Introduce a configurable flag to install NVIDIA IMEX and manage the nvidia-imex.service for supported systems.

Enhancements:

  • Gate NVIDIA IMEX service enablement and startup based on detected GB200/GB300 GPU models and Microsoft-hosted environment.

Documentation:

  • Document the hpc_install_nvidia_imex variable, its default behavior, and the hardware conditions under which NVIDIA IMEX is installed and started.
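
The hardware gating described above boils down to a regex search over the output of nvidia-smi --query-gpu=name. A minimal shell sketch of that check (the sample model string is illustrative; on real hardware the value would come from nvidia-smi itself):

```shell
# Simulate `nvidia-smi --query-gpu=name --format=csv,noheader` output.
# On a real node this variable would be populated by the actual command.
gpu_names="NVIDIA GB200"

# IMEX is only started when at least one GPU name matches GB200 or GB300.
if printf '%s\n' "$gpu_names" | grep -qE 'GB200|GB300'; then
    echo "IMEX-capable: start nvidia-imex.service"
else
    echo "not IMEX-capable: leave nvidia-imex.service stopped"
fi
```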

@sourcery-ai

sourcery-ai bot commented Feb 26, 2026

Reviewer's Guide

Adds optional, hardware-gated installation and activation of the NVIDIA IMEX service for GB200/GB300 NVLink systems, plus defaults and documentation for the new behavior.

Sequence diagram for NVLink health check with NVIDIA IMEX enabled

sequenceDiagram
  participant HealthAgent
  participant NvidiaImexService
  participant NvidiaDriver
  participant NvlinkFabric

  HealthAgent->>NvidiaImexService: Query IMEX domain status
  NvidiaImexService->>NvidiaDriver: Request NVLink fabric topology and status
  NvidiaDriver->>NvlinkFabric: Probe connectivity and domain state
  NvlinkFabric-->>NvidiaDriver: Connectivity table and domain state UP
  NvidiaDriver-->>NvidiaImexService: Normalized status (Domain State: UP)
  NvidiaImexService-->>HealthAgent: IMEX domain status HEALTHY
  HealthAgent-->>HealthAgent: Report BackgroundGPUHealthChecks OK for NvLink

Class diagram for new NVIDIA IMEX configuration variables

classDiagram
  class HpcRoleDefaults {
    bool hpc_install_cuda_driver = true
    bool hpc_install_cuda_toolkit = true
    bool hpc_install_hpc_nvidia_nccl = true
    bool hpc_install_nvidia_fabric_manager = true
    bool hpc_install_nvidia_imex = true
    bool hpc_install_rdma = true
    bool hpc_enable_azure_persistent_rdma_naming = true
    bool hpc_install_system_openmpi = true
  }

  class HpcRoleVars {
    list~string~ __hpc_cuda_driver_packages
    list~string~ __hpc_nvidia_fabric_manager_packages
    string __hpc_nvidia_imex_package = nvidia-imex
    list~string~ __hpc_nvidia_container_toolkit_packages
    list~string~ __hpc_rdma_packages
  }

  class NvidiaImexTaskBlock {
    bool condition_hpc_install_nvidia_imex
    string condition_system_vendor
    string fact___hpc_imex_gpu_names
    void detect_gpu_names()
    void install_imex_package()
    void enable_and_start_imex_service()
  }

  HpcRoleDefaults "1" o-- "1" HpcRoleVars : provides_defaults_for
  HpcRoleDefaults "1" o-- "1" NvidiaImexTaskBlock : controls_via_hpc_install_nvidia_imex
  HpcRoleVars "1" o-- "1" NvidiaImexTaskBlock : uses___hpc_nvidia_imex_package

File-Level Changes

Change Details Files
Add NVIDIA IMEX installation and service management gated by hardware detection and platform vendor.
  • Wrap IMEX setup in a task block conditioned on hpc_install_nvidia_imex and Microsoft Corporation system_vendor to scope behavior to Azure-like environments.
  • Detect GPU model names at runtime via nvidia-smi --query-gpu=name and store results in __hpc_imex_gpu_names without failing the play on errors.
  • Install the NVIDIA IMEX package using the __hpc_nvidia_imex_package variable, honoring the ostree-specific package manager when applicable.
  • Enable and start nvidia-imex.service only when GPU detection succeeds and any GPU name matches GB200 or GB300 via regex search, ensuring IMEX is only activated on supported NVLink multi-node systems.
tasks/main.yml
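
The steps above might be sketched as follows (illustrative task names and structure only; this is not the role's verbatim tasks/main.yml, and it omits the ostree-specific package-manager handling the change description mentions):

```yaml
# Illustrative sketch of the gated IMEX task block (assumed structure).
- name: Handle NVIDIA IMEX on Azure-like systems
  when:
    - hpc_install_nvidia_imex | bool
    - ansible_facts['system_vendor'] == "Microsoft Corporation"
  block:
    - name: Detect GPU model names without failing the play
      command: nvidia-smi --query-gpu=name --format=csv,noheader
      register: __hpc_imex_gpu_names
      failed_when: false
      changed_when: false

    - name: Install the NVIDIA IMEX package
      package:
        name: "{{ __hpc_nvidia_imex_package }}"
        state: present

    - name: Enable and start nvidia-imex.service on GB200/GB300 systems
      service:
        name: nvidia-imex.service
        enabled: true
        state: started
      when:
        - __hpc_imex_gpu_names.rc | default(1) == 0
        - __hpc_imex_gpu_names.stdout is search("GB200|GB300")
```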
Introduce configuration defaults and variables for NVIDIA IMEX installation.
  • Add hpc_install_nvidia_imex default variable set to true to control IMEX installation and service management.
  • Define __hpc_nvidia_imex_package as nvidia-imex to centralize the package name used by the role.
defaults/main.yml
vars/main.yml
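
Based on the description, the new entries in those files would look roughly like this (values taken from the PR text; exact file contents may differ):

```yaml
# defaults/main.yml -- user-overridable toggle
hpc_install_nvidia_imex: true

# vars/main.yml -- internal, centralizes the package name
__hpc_nvidia_imex_package: nvidia-imex
```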
Document the NVIDIA IMEX behavior and configuration toggle.
  • Add README documentation for hpc_install_nvidia_imex, including its purpose, default value, and conditions under which the IMEX service is enabled and started.
  • Clarify that the role installs the nvidia-imex package on all nodes when enabled but only starts the service on GB200/GB300 NVLink multi-node systems based on nvidia-smi output.
README.md


@ggoklani ggoklani force-pushed the install_nvidia_imex branch 2 times, most recently from cc5eb0b to a3fe355 on February 26, 2026 at 11:14
@ggoklani ggoklani marked this pull request as ready for review February 26, 2026 12:06

@sourcery-ai sourcery-ai bot left a comment


Hey - I've reviewed your changes and they look great!



@ggoklani ggoklani force-pushed the install_nvidia_imex branch from a3fe355 to c8714ba on February 27, 2026 at 05:18
Add NVIDIA IMEX integration to support runtime NVLink switch-fabric (re-)configuration where applicable.

Install nvidia-imex (configurable via __hpc_nvidia_imex_package) and enable nvidia-imex.service.

This role installs and enables the nvidia-imex service but does not start it immediately. The service is configured to launch at boot only on compatible
multi-node NVLink switch-fabric systems, such as NVIDIA GB200 or GB300 (NVL72) racks.

Update README.md to document the IMEX behavior and requirement expectations (CycleCloud HealthAgent).

Signed-off-by: Gaurav Goklani <ggoklani@redhat.com>
@ggoklani ggoklani force-pushed the install_nvidia_imex branch from b767101 to d786130 on February 27, 2026 at 08:07
@richm richm merged commit 64888ef into linux-system-roles:main Feb 27, 2026
21 of 22 checks passed
