
Conversation


@jsl9208 jsl9208 commented Jan 29, 2026

What type of PR is this?

/kind bug

What this PR does / why we need it:

NVIDIA GB10 (DGX Spark) uses a unified memory architecture where CPU and GPU share the same physical memory. On these GPUs, nvmlDeviceGetMemoryInfo() returns ERROR_NOT_SUPPORTED instead of memory information. The current code treats any non-SUCCESS return as fatal and calls panic(0), which crashes the entire device plugin daemonset and prevents the node from registering any GPU devices.
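A minimal standalone probe (hypothetical, using the go-nvml bindings the plugin already depends on) illustrates the failure mode; on GB10-class unified memory GPUs, GetMemoryInfo() is expected to take the ERROR_NOT_SUPPORTED branch:

package main

import (
    "fmt"

    "github.com/NVIDIA/go-nvml/pkg/nvml"
)

func main() {
    if ret := nvml.Init(); ret != nvml.SUCCESS {
        panic(nvml.ErrorString(ret))
    }
    defer nvml.Shutdown()

    dev, ret := nvml.DeviceGetHandleByIndex(0)
    if ret != nvml.SUCCESS {
        panic(nvml.ErrorString(ret))
    }
    memory, ret := dev.GetMemoryInfo()
    switch ret {
    case nvml.SUCCESS:
        fmt.Printf("total memory: %d MiB\n", memory.Total/(1024*1024))
    case nvml.ERROR_NOT_SUPPORTED:
        // Unified memory architecture (e.g., GB10/DGX Spark): no per-device memory info.
        fmt.Println("GetMemoryInfo not supported (unified memory architecture)")
    default:
        fmt.Println("GetMemoryInfo failed:", nvml.ErrorString(ret))
    }
}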

This PR handles ERROR_NOT_SUPPORTED gracefully by:

  1. Device registration (register.go): Falls back to a new defaultDeviceMemory config value (in MiB). If not configured, the device is skipped with an error log instead of panicking.
  2. Metrics collection (metrics.go): Skips memory metrics for unsupported devices instead of returning an error every scrape cycle (a minimal sketch follows this list).
  3. Config plumbing: Adds defaultDeviceMemory field to NvidiaConfig, Helm chart configmap, and values.yaml.
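
As a minimal sketch of the metrics.go behavior (identifiers like collectMemory and the metric descriptor are illustrative, not HAMi's actual names):

package metrics

import (
    "github.com/NVIDIA/go-nvml/pkg/nvml"
    "github.com/prometheus/client_golang/prometheus"
    "k8s.io/klog/v2"
)

// collectMemory illustrates the new behavior: memory gauges are skipped when
// the device cannot report memory, instead of failing every scrape cycle.
func collectMemory(dev nvml.Device, uuid string, desc *prometheus.Desc, ch chan<- prometheus.Metric) {
    memory, ret := dev.GetMemoryInfo()
    switch ret {
    case nvml.SUCCESS:
        ch <- prometheus.MustNewConstMetric(desc, prometheus.GaugeValue, float64(memory.Total), uuid)
    case nvml.ERROR_NOT_SUPPORTED:
        // Unified memory GPUs (e.g., GB10/DGX Spark): nothing to report;
        // skip quietly rather than logging an error on every scrape.
        klog.V(4).Infof("GetMemoryInfo not supported for %s, skipping memory metrics", uuid)
    default:
        klog.Errorf("GetMemoryInfo failed for %s: %s", uuid, nvml.ErrorString(ret))
    }
}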

Which issue(s) this PR fixes:

Fixes #1511

Special notes for your reviewer:

AI assistance disclosure: This PR was developed with Claude Code assisting in code analysis and review. The fix was designed, implemented, and validated by a human on real GB10 hardware.

Tested on real hardware: NVIDIA DGX Spark (GB10, ARM64, Driver 580.95.05, CUDA 13.0, Ubuntu 24.04, K8s v1.34.1).

Scenario | Result
--- | ---
Unpatched v2.8.0 on GB10 | panic: 0 at register.go:115
Patched with defaultDeviceMemory: 131072 | Device registered, pod scheduled, vGPU isolation works ✅
Patched without defaultDeviceMemory | Device skipped gracefully, no crash ✅

Usage — for unified memory GPUs, set defaultDeviceMemory to the total GPU memory in MiB:

# values.yaml
devicePlugin:
  defaultDeviceMemory: 131072  # 128 GiB for GB10

Without this config, the device will be skipped (not registered) but the plugin won't crash.

Does this PR introduce a user-facing change?:

Yes. Adds a new optional Helm value devicePlugin.defaultDeviceMemory (default: 0). Only needed for GPUs with unified memory architecture (e.g., NVIDIA GB10/DGX Spark) where nvmlDeviceGetMemoryInfo() is not supported. When set, the device plugin uses this value as the total device memory fallback instead of panicking.
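
For reference, a sketch of the Go-side config surface (the field name matches the PR description; the surrounding struct layout, field type, and yaml tag are assumptions):

package nvidia

// NvidiaConfig (excerpt): the new fallback field added by this PR.
type NvidiaConfig struct {
    // ... existing fields ...

    // DefaultDeviceMemory is the total device memory in MiB to report when
    // nvmlDeviceGetMemoryInfo() returns ERROR_NOT_SUPPORTED (unified memory
    // GPUs such as GB10). 0 (the default) means the device is skipped instead.
    DefaultDeviceMemory int32 `yaml:"defaultDeviceMemory"`
}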

@hami-robot hami-robot bot added the kind/bug (Something isn't working) and dco-signoff: no labels on Jan 29, 2026
@hami-robot hami-robot bot requested a review from chaunceyjiang January 29, 2026 12:25
hami-robot bot commented Jan 29, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: jsl9208
Once this PR has been reviewed and has the lgtm label, please assign shouren for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details: Needs approval from an approver in each of these files.

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@hami-robot hami-robot bot requested a review from wawa0210 January 29, 2026 12:25
@hami-robot hami-robot bot added the size/S label Jan 29, 2026
@gemini-code-assist

Summary of Changes

Hello @jsl9208, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request resolves a critical bug where the device plugin would crash when encountering NVIDIA GPUs with unified memory architectures (like GB10/DGX Spark) due to nvmlDeviceGetMemoryInfo() returning ERROR_NOT_SUPPORTED. The fix introduces a robust error handling mechanism, allowing the plugin to either use a user-defined default memory value or gracefully skip the device, ensuring stability and preventing daemonset panics. It also refines metrics collection for such devices and provides a new Helm configuration option for seamless deployment.

Highlights

  • Unified Memory GPU Support: The device plugin now gracefully handles ERROR_NOT_SUPPORTED from nvmlDeviceGetMemoryInfo() on unified memory GPUs (e.g., NVIDIA GB10), preventing crashes.
  • Configurable Memory Fallback: A new defaultDeviceMemory configuration option allows users to specify the total GPU memory in MiB for devices where GetMemoryInfo is not supported.
  • Graceful Device Handling: If defaultDeviceMemory is not configured for an unsupported device, the device is now skipped with an error log instead of causing the entire daemonset to panic.
  • Metrics Collection Improvement: Memory metrics collection now skips devices that return ERROR_NOT_SUPPORTED, avoiding repeated errors during Prometheus scrapes.
  • Helm Chart Integration: The defaultDeviceMemory field has been integrated into the Helm chart's values.yaml and device-configmap.yaml for easy deployment and configuration (a sketch of the template plumbing follows this list).
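
As a sketch of that plumbing (the template path is taken from the highlight above; the surrounding keys are assumptions inferred from the values.yaml snippet in the PR description, not the chart's exact layout):

# charts/hami/templates/device-configmap.yaml (hypothetical excerpt)
nvidia:
  defaultDeviceMemory: {{ .Values.devicePlugin.defaultDeviceMemory }}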




@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request effectively addresses a crash in the device plugin when running on GPUs with a unified memory architecture, such as the NVIDIA GB10. The changes gracefully handle the ERROR_NOT_SUPPORTED return code by introducing a fallback to a configurable defaultDeviceMemory value, which is a solid approach. The modifications to the metrics collection and Helm charts are also correct and complete the feature. I've included one high-severity suggestion to prevent a potential integer overflow and one medium-severity suggestion to improve logging for better debuggability.

Comment on lines +111 to 130
switch ret {
case nvml.SUCCESS:
    memoryTotal = int(memory.Total)
case nvml.ERROR_NOT_SUPPORTED:
    // Unified memory architecture GPUs (e.g., NVIDIA GB10/DGX Spark) don't support
    // traditional memory queries. Use DefaultDeviceMemory from config as fallback.
    if plugin.schedulerConfig.DefaultDeviceMemory > 0 {
        memoryTotal = int(plugin.schedulerConfig.DefaultDeviceMemory) * 1024 * 1024
        klog.Warningf("GetMemoryInfo not supported for device %s, using configured DefaultDeviceMemory: %d MB",
            UUID, plugin.schedulerConfig.DefaultDeviceMemory)
    } else {
        klog.Errorf("GetMemoryInfo not supported for device %s (unified memory architecture) "+
            "and DefaultDeviceMemory not configured. Skipping this device. "+
            "Set 'defaultDeviceMemory' in nvidia config to the total GPU memory in MB.", UUID)
        continue
    }
default:
    klog.Error("nvml get memory error ret=", ret)
    panic(0)
}

Severity: high

The variable memoryTotal is inferred as int from its declaration on line 109. On a 32-bit system, this can lead to an integer overflow when handling GPUs with large memory, as both memory.Total (a uint64) and the calculation for defaultDeviceMemory can exceed the capacity of a 32-bit integer. This could lead to incorrect memory registration or panics.

To prevent this, memoryTotal should be declared as uint64 on line 109. Consequently, the assignments within this switch block should be updated to use uint64 as well (e.g., memoryTotal = memory.Total and memoryTotal = uint64(plugin.schedulerConfig.DefaultDeviceMemory) * 1024 * 1024).

Since line 109 is outside the diff, I'm providing this as a general comment on the block. Please consider applying this change for robustness.
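
A sketch of that suggestion, assuming the declaration site and variable names from the quoted block (this is the reviewer's proposal, not code from the PR):

var memoryTotal uint64 // instead of the int inferred at line 109
switch ret {
case nvml.SUCCESS:
    memoryTotal = memory.Total // memory.Total is already uint64
case nvml.ERROR_NOT_SUPPORTED:
    if plugin.schedulerConfig.DefaultDeviceMemory > 0 {
        // 64-bit arithmetic: safe even for very large unified memory sizes.
        memoryTotal = uint64(plugin.schedulerConfig.DefaultDeviceMemory) * 1024 * 1024
    }
}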

@jsl9208 (Author) replied:

That's pre-existing code outside this diff. Keeping this PR minimal to the bugfix scope.

        continue
    }
default:
    klog.Error("nvml get memory error ret=", ret)

Severity: medium

For more informative logging, it's better to log the error string corresponding to the nvml.Return code instead of just the integer value. The nvml.ErrorString() function can be used for this. This will make debugging easier.

Suggested change:
-   klog.Error("nvml get memory error ret=", ret)
+   klog.Errorf("nvml get memory error: %s", nvml.ErrorString(ret))

@jsl9208 (Author) replied:

Same here, out of scope for this fix.

On NVIDIA GB10 (DGX Spark) and other unified memory architecture GPUs,
nvmlDeviceGetMemoryInfo() returns ERROR_NOT_SUPPORTED, causing the device
plugin to panic.

Changes:
- register.go: Handle ERROR_NOT_SUPPORTED by using DefaultDeviceMemory
  config as fallback. Skip device gracefully (continue) instead of
  panic when config is not set.
- metrics.go: Skip memory metrics collection for unsupported devices.
- device.go: Add DefaultDeviceMemory field to NvidiaConfig.
- charts: Plumb defaultDeviceMemory through Helm values and ConfigMap.

Fixes: Project-HAMi#1511
Signed-off-by: jsl9208 <shilong@heywhale.com>

Development

Successfully merging this pull request may close these issues.

[Bug] Device plugin panics on NVIDIA GB10 (DGX Spark) - GetMemoryInfo returns "Not Supported"
