fix: handle GetMemoryInfo ERROR_NOT_SUPPORTED for unified memory GPUs #1637
base: master
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull request has been approved by: jsl9208. It still needs approval from an approver for each of the affected files.
Summary of Changes (Gemini Code Assist): This pull request resolves a bug where the device plugin would crash when encountering NVIDIA GPUs with unified memory architectures (such as GB10/DGX Spark).
Code Review
This pull request effectively addresses a crash in the device plugin when running on GPUs with a unified memory architecture, such as the NVIDIA GB10. The changes gracefully handle the ERROR_NOT_SUPPORTED return code by introducing a fallback to a configurable defaultDeviceMemory value, which is a solid approach. The modifications to the metrics collection and Helm charts are also correct and complete the feature. I've included one high-severity suggestion to prevent a potential integer overflow and one medium-severity suggestion to improve logging for better debuggability.
```go
switch ret {
case nvml.SUCCESS:
	memoryTotal = int(memory.Total)
case nvml.ERROR_NOT_SUPPORTED:
	// Unified memory architecture GPUs (e.g., NVIDIA GB10/DGX Spark) don't support
	// traditional memory queries. Use DefaultDeviceMemory from config as fallback.
	if plugin.schedulerConfig.DefaultDeviceMemory > 0 {
		memoryTotal = int(plugin.schedulerConfig.DefaultDeviceMemory) * 1024 * 1024
		klog.Warningf("GetMemoryInfo not supported for device %s, using configured DefaultDeviceMemory: %d MB",
			UUID, plugin.schedulerConfig.DefaultDeviceMemory)
	} else {
		klog.Errorf("GetMemoryInfo not supported for device %s (unified memory architecture) "+
			"and DefaultDeviceMemory not configured. Skipping this device. "+
			"Set 'defaultDeviceMemory' in nvidia config to the total GPU memory in MB.", UUID)
		continue
	}
default:
	klog.Error("nvml get memory error ret=", ret)
	panic(0)
}
```
The variable memoryTotal is inferred as int from its declaration on line 109. On a 32-bit system, this can lead to an integer overflow when handling GPUs with large memory, as both memory.Total (a uint64) and the calculation for defaultDeviceMemory can exceed the capacity of a 32-bit integer. This could lead to incorrect memory registration or panics.
To prevent this, memoryTotal should be declared as uint64 on line 109. Consequently, the assignments within this switch block should be updated to use uint64 as well (e.g., memoryTotal = memory.Total and memoryTotal = uint64(plugin.schedulerConfig.DefaultDeviceMemory) * 1024 * 1024).
Since line 109 is outside the diff, I'm providing this as a general comment on the block. Please consider applying this change for robustness.
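A minimal sketch of the reviewer's concern, runnable on its own (the variable names mirror the diff, but this is not the PR's actual code — it only demonstrates why a `uint64` declaration avoids the 32-bit overflow):

```go
package main

import "fmt"

func main() {
	// Suggested shape of the fix: keep memoryTotal as uint64 so large
	// GPU memory sizes cannot overflow a 32-bit int.
	var memoryTotal uint64

	// Example: DefaultDeviceMemory = 131072 MiB (128 GiB), converted to bytes.
	defaultDeviceMemory := int32(131072)
	memoryTotal = uint64(defaultDeviceMemory) * 1024 * 1024

	// 131072 MiB is 137438953472 bytes, which exceeds math.MaxInt32
	// (2147483647), so the value would not fit in an int on a 32-bit platform.
	fmt.Println(memoryTotal) // prints 137438953472
}
```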
That's pre-existing code outside this diff. Keeping this PR minimal to the bugfix scope.
```go
		continue
	}
default:
	klog.Error("nvml get memory error ret=", ret)
```
For more informative logging, it's better to log the error string corresponding to the nvml.Return code instead of just the integer value. The nvml.ErrorString() function can be used for this. This will make debugging easier.
```diff
-klog.Error("nvml get memory error ret=", ret)
+klog.Errorf("nvml get memory error: %s", nvml.ErrorString(ret))
```
Same here, out of scope for this fix.
On NVIDIA GB10 (DGX Spark) and other unified memory architecture GPUs, nvmlDeviceGetMemoryInfo() returns ERROR_NOT_SUPPORTED, causing the device plugin to panic.

Changes:
- register.go: Handle ERROR_NOT_SUPPORTED by using DefaultDeviceMemory config as fallback. Skip device gracefully (continue) instead of panic when config is not set.
- metrics.go: Skip memory metrics collection for unsupported devices.
- device.go: Add DefaultDeviceMemory field to NvidiaConfig.
- charts: Plumb defaultDeviceMemory through Helm values and ConfigMap.

Fixes: Project-HAMi#1511
Signed-off-by: jsl9208 <shilong@heywhale.com>
Force-pushed from f186ab2 to 2f6958e.
What type of PR is this?
/kind bug
What this PR does / why we need it:
NVIDIA GB10 (DGX Spark) uses a unified memory architecture where CPU and GPU share the same physical memory. On these GPUs, `nvmlDeviceGetMemoryInfo()` returns `ERROR_NOT_SUPPORTED` instead of memory information. The current code treats any non-SUCCESS return as fatal and calls `panic(0)`, which crashes the entire device plugin daemonset and prevents the node from registering any GPU devices.

This PR handles `ERROR_NOT_SUPPORTED` gracefully by:

- `register.go`: falls back to a new `defaultDeviceMemory` config value (in MiB). If not configured, the device is skipped with an error log instead of panicking.
- `metrics.go`: skips memory metrics for unsupported devices instead of returning an error every scrape cycle.
- Adds a `defaultDeviceMemory` field to `NvidiaConfig`, the Helm chart configmap, and `values.yaml`.

Which issue(s) this PR fixes:
Fixes #1511
Special notes for your reviewer:
AI assistance disclosure: This PR was developed with Claude Code assisting in code analysis and review. The fix was designed, implemented, and validated by a human on real GB10 hardware.
Tested on real hardware: NVIDIA DGX Spark (GB10, ARM64, Driver 580.95.05, CUDA 13.0, Ubuntu 24.04, K8s v1.34.1).
Test results on this hardware:
- Before the fix: `panic: 0` at `register.go:115` ❌
- After the fix: verified both with `defaultDeviceMemory: 131072` set and with `defaultDeviceMemory` unset.

Usage — for unified memory GPUs, set `defaultDeviceMemory` to the total GPU memory in MiB. Without this config, the device will be skipped (not registered) but the plugin won't crash.
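As a sketch of such a configuration, assuming the Helm value path `devicePlugin.defaultDeviceMemory` named in this PR (verify the exact key path against your chart version):

```yaml
# values.yaml snippet (illustrative; key path taken from this PR's release note)
devicePlugin:
  # Total GPU memory in MiB for unified-memory GPUs where
  # nvmlDeviceGetMemoryInfo() is unsupported. 131072 MiB = 128 GiB.
  defaultDeviceMemory: 131072
```

The same value can be supplied at install time with `--set devicePlugin.defaultDeviceMemory=131072`.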
Does this PR introduce a user-facing change?:
Yes. Adds a new optional Helm value `devicePlugin.defaultDeviceMemory` (default: `0`). It is only needed for GPUs with unified memory architecture (e.g., NVIDIA GB10/DGX Spark) where `nvmlDeviceGetMemoryInfo()` is not supported. When set, the device plugin uses this value as the total device memory fallback instead of panicking.