
Conversation


@jsl9208 jsl9208 commented Jan 29, 2026

What type of PR is this?

/kind bug

What this PR does / why we need it:

NVIDIA GB10 (DGX Spark) uses a unified memory architecture where CPU and GPU share the same physical memory. On these GPUs, nvmlDeviceGetMemoryInfo() returns ERROR_NOT_SUPPORTED instead of memory information. The current code treats any non-SUCCESS return as fatal and calls panic(0), which crashes the entire device plugin daemonset and prevents the node from registering any GPU devices.
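A minimal standalone probe (hypothetical, using the go-nvml bindings the plugin already depends on) illustrates the failure mode; on GB10-class unified memory GPUs, GetMemoryInfo() is expected to take the ERROR_NOT_SUPPORTED branch:

package main

import (
    "fmt"

    "github.com/NVIDIA/go-nvml/pkg/nvml"
)

func main() {
    if ret := nvml.Init(); ret != nvml.SUCCESS {
        panic(nvml.ErrorString(ret))
    }
    defer nvml.Shutdown()

    dev, ret := nvml.DeviceGetHandleByIndex(0)
    if ret != nvml.SUCCESS {
        panic(nvml.ErrorString(ret))
    }
    memory, ret := dev.GetMemoryInfo()
    switch ret {
    case nvml.SUCCESS:
        fmt.Printf("total memory: %d MiB\n", memory.Total/(1024*1024))
    case nvml.ERROR_NOT_SUPPORTED:
        // Unified memory architecture (e.g., GB10/DGX Spark): no per-device memory info.
        fmt.Println("GetMemoryInfo not supported (unified memory architecture)")
    default:
        fmt.Println("GetMemoryInfo failed:", nvml.ErrorString(ret))
    }
}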

This PR handles ERROR_NOT_SUPPORTED gracefully by:

  1. Device registration (register.go): Falls back to a new defaultDeviceMemory config value (in MiB). If not configured, the device is skipped with an error log instead of panicking.
  2. Metrics collection (metrics.go): Skips memory metrics for unsupported devices instead of returning an error every scrape cycle (a minimal sketch follows this list).
  3. Config plumbing: Adds defaultDeviceMemory field to NvidiaConfig, Helm chart configmap, and values.yaml.
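
As a minimal sketch of the metrics.go behavior (identifiers like collectMemory and the metric descriptor are illustrative, not HAMi's actual names):

package metrics

import (
    "github.com/NVIDIA/go-nvml/pkg/nvml"
    "github.com/prometheus/client_golang/prometheus"
    "k8s.io/klog/v2"
)

// collectMemory illustrates the new behavior: memory gauges are skipped when
// the device cannot report memory, instead of failing every scrape cycle.
func collectMemory(dev nvml.Device, uuid string, desc *prometheus.Desc, ch chan<- prometheus.Metric) {
    memory, ret := dev.GetMemoryInfo()
    switch ret {
    case nvml.SUCCESS:
        ch <- prometheus.MustNewConstMetric(desc, prometheus.GaugeValue, float64(memory.Total), uuid)
    case nvml.ERROR_NOT_SUPPORTED:
        // Unified memory GPUs (e.g., GB10/DGX Spark): nothing to report;
        // skip quietly rather than logging an error on every scrape.
        klog.V(4).Infof("GetMemoryInfo not supported for %s, skipping memory metrics", uuid)
    default:
        klog.Errorf("GetMemoryInfo failed for %s: %s", uuid, nvml.ErrorString(ret))
    }
}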

Which issue(s) this PR fixes:

Fixes #1511

Special notes for your reviewer:

AI assistance disclosure: This PR was developed with Claude Code assisting in code analysis and review. The fix was designed, implemented, and validated by a human on real GB10 hardware.

Tested on real hardware: NVIDIA DGX Spark (GB10, ARM64, Driver 580.95.05, CUDA 13.0, Ubuntu 24.04, K8s v1.34.1).

Scenario | Result
--- | ---
Unpatched v2.8.0 on GB10 | panic: 0 at register.go:115
Patched with defaultDeviceMemory: 131072 | Device registered, pod scheduled, vGPU isolation works ✅
Patched without defaultDeviceMemory | Device skipped gracefully, no crash ✅

Usage — for unified memory GPUs, set defaultDeviceMemory to the total GPU memory in MiB:

# values.yaml
devicePlugin:
  defaultDeviceMemory: 131072  # 128 GiB for GB10

Without this config, the device will be skipped (not registered) but the plugin won't crash.

Does this PR introduce a user-facing change?:

Yes. Adds a new optional Helm value devicePlugin.defaultDeviceMemory (default: 0). Only needed for GPUs with unified memory architecture (e.g., NVIDIA GB10/DGX Spark) where nvmlDeviceGetMemoryInfo() is not supported. When set, the device plugin uses this value as the total device memory fallback instead of panicking.
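
For reference, a sketch of the Go-side config surface (the field name matches the PR description; the surrounding struct layout, field type, and yaml tag are assumptions):

package nvidia

// NvidiaConfig (excerpt): the new fallback field added by this PR.
type NvidiaConfig struct {
    // ... existing fields ...

    // DefaultDeviceMemory is the total device memory in MiB to report when
    // nvmlDeviceGetMemoryInfo() returns ERROR_NOT_SUPPORTED (unified memory
    // GPUs such as GB10). 0 (the default) means the device is skipped instead.
    DefaultDeviceMemory int32 `yaml:"defaultDeviceMemory"`
}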

@hami-robot hami-robot bot added the kind/bug (Something isn't working) and dco-signoff: no labels on Jan 29, 2026
@hami-robot hami-robot bot requested a review from chaunceyjiang January 29, 2026 12:25
hami-robot bot commented Jan 29, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: jsl9208
Once this PR has been reviewed and has the lgtm label, please assign shouren for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details: Needs approval from an approver in each of these files.

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@hami-robot hami-robot bot requested a review from wawa0210 January 29, 2026 12:25
@hami-robot hami-robot bot added the size/S label Jan 29, 2026
@gemini-code-assist

Summary of Changes

Hello @jsl9208, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request resolves a critical bug where the device plugin would crash when encountering NVIDIA GPUs with unified memory architectures (like GB10/DGX Spark) due to nvmlDeviceGetMemoryInfo() returning ERROR_NOT_SUPPORTED. The fix introduces a robust error handling mechanism, allowing the plugin to either use a user-defined default memory value or gracefully skip the device, ensuring stability and preventing daemonset panics. It also refines metrics collection for such devices and provides a new Helm configuration option for seamless deployment.

Highlights

  • Unified Memory GPU Support: The device plugin now gracefully handles ERROR_NOT_SUPPORTED from nvmlDeviceGetMemoryInfo() on unified memory GPUs (e.g., NVIDIA GB10), preventing crashes.
  • Configurable Memory Fallback: A new defaultDeviceMemory configuration option allows users to specify the total GPU memory in MiB for devices where GetMemoryInfo is not supported.
  • Graceful Device Handling: If defaultDeviceMemory is not configured for an unsupported device, the device is now skipped with an error log instead of causing the entire daemonset to panic.
  • Metrics Collection Improvement: Memory metrics collection now skips devices that return ERROR_NOT_SUPPORTED, avoiding repeated errors during Prometheus scrapes.
  • Helm Chart Integration: The defaultDeviceMemory field has been integrated into the Helm chart's values.yaml and device-configmap.yaml for easy deployment and configuration (a sketch of the template plumbing follows this list).
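
As a sketch of that plumbing (the template path is taken from the highlight above; the surrounding keys are assumptions inferred from the values.yaml snippet in the PR description, not the chart's exact layout):

# charts/hami/templates/device-configmap.yaml (hypothetical excerpt)
nvidia:
  defaultDeviceMemory: {{ .Values.devicePlugin.defaultDeviceMemory }}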




@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request effectively addresses a crash in the device plugin when running on GPUs with a unified memory architecture, such as the NVIDIA GB10. The changes gracefully handle the ERROR_NOT_SUPPORTED return code by introducing a fallback to a configurable defaultDeviceMemory value, which is a solid approach. The modifications to the metrics collection and Helm charts are also correct and complete the feature. I've included one high-severity suggestion to prevent a potential integer overflow and one medium-severity suggestion to improve logging for better debuggability.

Comment on lines +111 to 130
switch ret {
case nvml.SUCCESS:
    memoryTotal = int(memory.Total)
case nvml.ERROR_NOT_SUPPORTED:
    // Unified memory architecture GPUs (e.g., NVIDIA GB10/DGX Spark) don't support
    // traditional memory queries. Use DefaultDeviceMemory from config as fallback.
    if plugin.schedulerConfig.DefaultDeviceMemory > 0 {
        memoryTotal = int(plugin.schedulerConfig.DefaultDeviceMemory) * 1024 * 1024
        klog.Warningf("GetMemoryInfo not supported for device %s, using configured DefaultDeviceMemory: %d MB",
            UUID, plugin.schedulerConfig.DefaultDeviceMemory)
    } else {
        klog.Errorf("GetMemoryInfo not supported for device %s (unified memory architecture) "+
            "and DefaultDeviceMemory not configured. Skipping this device. "+
            "Set 'defaultDeviceMemory' in nvidia config to the total GPU memory in MB.", UUID)
        continue
    }
default:
    klog.Error("nvml get memory error ret=", ret)
    panic(0)
}

Severity: high

The variable memoryTotal is inferred as int from its declaration on line 109. On a 32-bit system, this can lead to an integer overflow when handling GPUs with large memory, as both memory.Total (a uint64) and the calculation for defaultDeviceMemory can exceed the capacity of a 32-bit integer. This could lead to incorrect memory registration or panics.

To prevent this, memoryTotal should be declared as uint64 on line 109. Consequently, the assignments within this switch block should be updated to use uint64 as well (e.g., memoryTotal = memory.Total and memoryTotal = uint64(plugin.schedulerConfig.DefaultDeviceMemory) * 1024 * 1024).

Since line 109 is outside the diff, I'm providing this as a general comment on the block. Please consider applying this change for robustness.
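
A sketch of that suggestion, assuming the declaration site and variable names from the quoted block (this is the reviewer's proposal, not code from the PR):

var memoryTotal uint64 // instead of the int inferred at line 109
switch ret {
case nvml.SUCCESS:
    memoryTotal = memory.Total // memory.Total is already uint64
case nvml.ERROR_NOT_SUPPORTED:
    if plugin.schedulerConfig.DefaultDeviceMemory > 0 {
        // 64-bit arithmetic: safe even for very large unified memory sizes.
        memoryTotal = uint64(plugin.schedulerConfig.DefaultDeviceMemory) * 1024 * 1024
    }
}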

@jsl9208 (Author) replied:

That's pre-existing code outside this diff. Keeping this PR minimal to the bugfix scope.

        continue
    }
default:
    klog.Error("nvml get memory error ret=", ret)

Severity: medium

For more informative logging, it's better to log the error string corresponding to the nvml.Return code instead of just the integer value. The nvml.ErrorString() function can be used for this. This will make debugging easier.

Suggested change:
-   klog.Error("nvml get memory error ret=", ret)
+   klog.Errorf("nvml get memory error: %s", nvml.ErrorString(ret))

@jsl9208 (Author) replied:

Same here, out of scope for this fix.

On NVIDIA GB10 (DGX Spark) and other unified memory architecture GPUs,
nvmlDeviceGetMemoryInfo() returns ERROR_NOT_SUPPORTED, causing the device
plugin to panic.

Changes:
- register.go: Handle ERROR_NOT_SUPPORTED by using DefaultDeviceMemory
  config as fallback. Skip device gracefully (continue) instead of
  panic when config is not set.
- metrics.go: Skip memory metrics collection for unsupported devices.
- device.go: Add DefaultDeviceMemory field to NvidiaConfig.
- charts: Plumb defaultDeviceMemory through Helm values and ConfigMap.

Fixes: Project-HAMi#1511
Signed-off-by: jsl9208 <shilong@heywhale.com>

Development

Successfully merging this pull request may close these issues.

[Bug] Device plugin panics on NVIDIA GB10 (DGX Spark) - GetMemoryInfo returns "Not Supported"
