test: add testing Nvidia docker container script #89

Merged
lixuemin2016 merged 1 commit into linux-system-roles:main from lixuemin2016:testdocker on Mar 3, 2026

Conversation

@lixuemin2016
Collaborator

@lixuemin2016 lixuemin2016 commented Mar 3, 2026

Enhancement:
Add NVIDIA Docker container GPU access validation as follows:

  • Check that the moby-engine, moby-containerd, and nvidia-container-toolkit test packages are installed
  • Validate that the containerd and docker services are active
  • Detect GPU hardware presence before running the GPU container test
  • Skip the GPU container test if no GPU is detected on the VM
  • Run an NVIDIA Docker container to verify GPU access
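
The package and service checks listed above could be sketched roughly as follows. This is an illustrative sketch, not the shipped script: the `check` helper and the pass/fail counters are hypothetical; only the package and service names come from the PR description.

```shell
#!/usr/bin/env bash
# Illustrative sketch of the package/service validation steps.
# check() and the counters are hypothetical helpers.
PASS=0
FAIL=0

check() {
    local desc="$1"; shift
    echo "Checking: $desc"
    if "$@" >/dev/null 2>&1; then
        echo "[PASS] $desc"; PASS=$((PASS + 1))
    else
        echo "[FAIL] $desc"; FAIL=$((FAIL + 1))
    fi
}

# Package checks (rpm-based, matching the RPM check mentioned in the review)
for pkg in moby-engine moby-containerd nvidia-container-toolkit; do
    check "$pkg package is installed" rpm -q "$pkg"
done

# Service checks via systemd
for svc in containerd docker; do
    check "$svc service is active" systemctl is-active --quiet "$svc"
done

echo "Checks run: $((PASS + FAIL)), passed: $PASS"
```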

Reason:
Add NVIDIA Docker container GPU-related tests.

Result:
On an instance without a GPU (e.g. Standard D2ds v4):

```
$ bash test-nvidia-docker.sh
[2026-03-03 01:54:03] ==========================================================
[2026-03-03 01:54:03] NVIDIA Container Runtime Test
[2026-03-03 01:54:03] ==========================================================

[2026-03-03 01:54:03] Test: Container runtime packages installation...

Checking: moby-engine package is installed
[PASS] moby-engine package is installed
Checking: moby-containerd package is installed
[PASS] moby-containerd package is installed
Checking: nvidia-container-toolkit package is installed
[PASS] nvidia-container-toolkit package is installed

[2026-03-03 01:54:03] ==========================================
[2026-03-03 01:54:03] Service Status Tests
[2026-03-03 01:54:03] ==========================================

[2026-03-03 01:54:03] Test: containerd service status...

Checking: containerd service is active
[PASS] containerd service is active

[2026-03-03 01:54:03] Test: Docker service status...

Checking: Docker service is active
[PASS] Docker service is active

[2026-03-03 01:54:03] ==========================================
[2026-03-03 01:54:03] GPU Access Test
[2026-03-03 01:54:03] ==========================================

[2026-03-03 01:54:03] Detecting GPU hardware...

Checking: GPU hardware presence
[INFO] nvidia-smi command failed - no GPU hardware detected

[2026-03-03 01:54:03] Test: GPU access in Docker container...

[SKIP] No GPU hardware detected - cannot test GPU access
```

On an instance with GPU access (e.g. Standard NC4as T4 v3), the test passes:

```
[2026-03-03 02:28:57] ==========================================
[2026-03-03 02:28:57] GPU Access Test
[2026-03-03 02:28:57] ==========================================

[2026-03-03 02:28:57] Detecting GPU hardware...

Checking: GPU hardware presence
GPU detected

[2026-03-03 02:28:57] Test: GPU access in Docker container...

Checking: NVIDIA runtime is registered in Docker
[PASS] NVIDIA runtime is registered in Docker
Checking: GPU is accessible from Docker container
[PASS] GPU is accessible from Docker container
```

Issue Tracker Tickets (Jira or BZ if any):
JIRA: RHELHPC-120

Summary by Sourcery

Add a scripted NVIDIA Docker GPU validation test to the Azure HPC role and install it on configured systems.

New Features:

  • Introduce a test script to validate NVIDIA Docker GPU access, including package checks, service status, and in-container GPU visibility.

Tests:

  • Add an executable NVIDIA container runtime test script that verifies required packages, ensures Docker and containerd services are active, detects GPU presence, and conditionally runs a GPU-enabled Docker container.

@sourcery-ai

sourcery-ai bot commented Mar 3, 2026

Reviewer's Guide

Adds an NVIDIA GPU validation test script and wires it into the Azure HPC role so GPU-enabled instances can be automatically validated via a Docker-based CUDA container check, while skipping GPU tests on non-GPU hosts.

Sequence diagram for NVIDIA Docker GPU validation script execution

```mermaid
sequenceDiagram
    actor User
    participant TestScript as test_nvidia_docker_sh
    participant OS
    participant Systemd
    participant Docker
    participant Containerd
    participant GPU

    User->>TestScript: Execute test_nvidia_docker_sh

    rect rgb(235, 235, 255)
        TestScript->>OS: Check moby-engine package
        OS-->>TestScript: Installed
        TestScript->>OS: Check moby-containerd package
        OS-->>TestScript: Installed
        TestScript->>OS: Check nvidia-container-toolkit package
        OS-->>TestScript: Installed
    end

    rect rgb(235, 255, 235)
        TestScript->>Systemd: Query containerd service status
        Systemd-->>TestScript: containerd active

        TestScript->>Systemd: Query docker service status
        Systemd-->>TestScript: docker active
    end

    rect rgb(255, 245, 235)
        TestScript->>GPU: Run nvidia_smi
        alt GPU not present or nvidia_smi fails
            GPU-->>TestScript: Error or no device
            TestScript-->>User: Log skip GPU access test
        else GPU present
            GPU-->>TestScript: GPU info
            TestScript->>Docker: Check NVIDIA runtime registration
            Docker-->>TestScript: NVIDIA runtime available

            TestScript->>Docker: Run CUDA container with GPU access
            Docker->>GPU: Expose GPU to container
            GPU-->>Docker: GPU accessible
            Docker-->>TestScript: Container exited successfully
            TestScript-->>User: Report GPU access PASS
        end
    end
```

Flow diagram for NVIDIA Docker GPU validation logic

```mermaid
flowchart TD
    A[Start test_nvidia_docker_sh] --> B[Check moby-engine installed]
    B --> C[Check moby-containerd installed]
    C --> D[Check nvidia-container-toolkit installed]
    D --> E[Check containerd service is active]
    E --> F[Check docker service is active]
    F --> G[Run nvidia-smi to detect GPU]

    G --> H{GPU detected?}

    H -- No --> I[Log no GPU detected]
    I --> J[Skip GPU access container test]
    J --> Z[End]

    H -- Yes --> K[Verify NVIDIA runtime is registered in Docker]
    K --> L[Run NVIDIA CUDA container with GPU access]
    L --> M{Container can access GPU?}

    M -- Yes --> N[Report PASS GPU accessible]
    M -- No --> O[Report FAIL GPU not accessible]

    N --> Z[End]
    O --> Z[End]
```

File-Level Changes

tasks/main.yml: Install a templated NVIDIA Docker GPU validation test script into the HPC tests directory during container runtime setup.
  • Add an Ansible template task that installs test-nvidia-docker.sh into the configured tests directory with executable permissions
  • Ensure the script deployment happens alongside existing container runtime configuration so it runs on nodes with containerd and Docker set up

templates/test-nvidia-docker.sh.j2: Introduce a comprehensive bash test script that validates container runtime packages, service status, GPU presence, and GPU accessibility from a Docker container using the NVIDIA runtime.
  • Implement argument parsing for verbosity and help in the test script
  • Add checks to ensure moby-engine, moby-containerd, and nvidia-container-toolkit RPMs are installed
  • Add systemd-based health checks for containerd and docker services
  • Implement GPU detection via nvidia-smi with a /dev/nvidia0 fallback and set a HAS_GPU flag
  • Skip GPU-access tests with a distinct exit code when no GPU is detected
  • Run an NVIDIA CUDA Docker image with --gpus all and validate nvidia-smi output for NVIDIA-SMI banner and CUDA version
  • Track and report passed tests and print a final summary banner on success
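
The GPU detection fallback and in-container check described above could look roughly like this. The function names and the CUDA image tag are assumptions for illustration; only the /dev/nvidia0 fallback, the HAS_GPU flag, the `--gpus all` invocation, and the banner/CUDA-version checks come from the change details.

```shell
#!/usr/bin/env bash
# Sketch of the GPU detection and container-access logic described above.
# Function names and the default image tag are illustrative assumptions.

detect_gpu() {
    # Prefer nvidia-smi; fall back to the /dev/nvidia0 device node.
    if command -v nvidia-smi >/dev/null 2>&1 && nvidia-smi >/dev/null 2>&1; then
        return 0
    fi
    [ -e /dev/nvidia0 ]
}

test_gpu_container() {
    # Run a CUDA image with --gpus all and look for the NVIDIA-SMI banner
    # and a CUDA version string in its nvidia-smi output.
    local image="${NVIDIA_IMAGE:-nvidia/cuda:12.4.1-base-ubi9}"  # assumed tag
    local out
    out="$(docker run --rm --gpus all "$image" nvidia-smi 2>&1)" || return 1
    echo "$out" | grep -q "NVIDIA-SMI" && echo "$out" | grep -q "CUDA Version"
}

if detect_gpu; then
    HAS_GPU=1
else
    HAS_GPU=0
    echo "[SKIP] No GPU hardware detected - cannot test GPU access"
    # The shipped script exits 77 here (automake's skip convention).
fi
```

On a host without a GPU this only prints the skip line; on a GPU host, `test_gpu_container` would be invoked next to verify in-container access.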


@sourcery-ai sourcery-ai bot left a comment


Hey - I've left some high-level feedback:

  • Consider making the NVIDIA_IMAGE and possibly the expected Docker runtime name configurable via environment variables or script flags so the test can be reused with different CUDA images or runtime configurations without editing the script.
  • The test_nvidia_gpu_access function exits with code 77 on skip; double-check that this code is correctly interpreted as a skipped test by your surrounding harness, or align it with whatever skip convention the rest of the test suite uses.
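
The first suggestion could be implemented with environment-variable overrides along these lines. The variable names and the default image tag here are illustrative, not taken from the actual script:

```shell
#!/usr/bin/env bash
# Sketch: take the CUDA image and expected Docker runtime name from the
# environment, falling back to defaults, so the test is reusable without
# editing the script. Names and the default tag are assumptions.
NVIDIA_IMAGE="${NVIDIA_IMAGE:-nvidia/cuda:12.4.1-base-ubi9}"
NVIDIA_RUNTIME="${NVIDIA_RUNTIME:-nvidia}"

echo "image:   $NVIDIA_IMAGE"
echo "runtime: $NVIDIA_RUNTIME"
```

A caller could then run, for example, `NVIDIA_IMAGE=nvidia/cuda:13.0-base ./test-nvidia-docker.sh` to test a different CUDA image.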

Add Nvidia Docker container GPU access validation as below:
- Add test packages installation for moby-engine,moby-containerd
  and nvidia-container-toolkit
- Validate that containerd and docker services are active
- Detect GPU hardware presence before running GPU container test
- Skip GPU container test if no GPU is detected from VM
- Run NVIDIA Docker container to verify GPU access

JIRA: RHELHPC-120

Signed-off-by: Xuemin Li <xuli@redhat.com>
@lixuemin2016 lixuemin2016 merged commit 7950260 into linux-system-roles:main Mar 3, 2026
21 of 22 checks passed