fix: handle GetMemoryInfo NOT_SUPPORTED for unified memory GPUs (e.g., NVIDIA GB10) #1636

jsl9208 · 2026-01-29T11:38:34Z

Summary

The NVIDIA GB10 (DGX Spark) uses a unified memory architecture where nvmlDeviceGetMemoryInfo() returns ERROR_NOT_SUPPORTED. This causes register.go to panic(0), crashing the device plugin on these GPUs.

Changes

register.go: Handle ERROR_NOT_SUPPORTED gracefully with a switch/case. Fall back to a new DefaultDeviceMemory config value (in MB). If not configured, skip the device with an error log instead of panicking.
metrics.go: Skip memory metrics collection when ERROR_NOT_SUPPORTED is returned, instead of treating it as a fatal error.
device.go: Add DefaultDeviceMemory int64 field to NvidiaConfig.
Helm chart: Plumb defaultDeviceMemory through configmap and values.yaml (default: 0).

Usage

For unified memory GPUs, users must set defaultDeviceMemory to the total GPU memory in MB:

# values.yaml
devicePlugin:
  defaultDeviceMemory: 131072  # 128 GB for GB10

Without this config, the device will be skipped (not registered) but the plugin won't crash.

Testing

Tested on real hardware: NVIDIA DGX Spark (GB10, ARM64, Driver 580.95.05, CUDA 13.0).

Scenario	Result
Unpatched v2.8.0 on GB10	`panic: 0` at `register.go:115` ❌
Patched v2.8.0 with `defaultDeviceMemory: 131072`	Device registered, pod scheduled, vGPU memory isolation works ✅
Patched v2.8.0 without `defaultDeviceMemory`	Device skipped gracefully, no crash ✅

Bumps [aquasecurity/trivy-action](https://github.com/aquasecurity/trivy-action) from 0.30.0 to 0.31.0. - [Release notes](https://github.com/aquasecurity/trivy-action/releases) - [Commits](aquasecurity/trivy-action@0.30.0...0.31.0) --- updated-dependencies: - dependency-name: aquasecurity/trivy-action dependency-version: 0.31.0 dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <support@github.com>

fix golint-CI Signed-off-by: limengxuan <mengxuan.li@dynamia.ai>

Signed-off-by: wawa0210 <xiao.zhang@dynamia.ai>

Support topology-awareness for Kunlunxin device Signed-off-by: limengxuan <mengxuan.li@dynamia.ai>

Signed-off-by: Lei Guo <Lei.Guo@metax-tech.com> Co-authored-by: Lei Guo <Lei.Guo@metax-tech.com>

Signed-off-by: lixd <xueduan.li@gmail.com>

…t-HAMi#1138) * fix: override node socre failure for kunlun Project-HAMi#1137 Signed-off-by: ouyangluwei(riseunion) <ouyangluwei@riseunion.io> * fix: ut Signed-off-by: ouyangluwei(riseunion) <ouyangluwei@riseunion.io> * fix: ut Signed-off-by: ouyangluwei(riseunion) <ouyangluwei@riseunion.io> * Merge branch 'master' into bug-fix Signed-off-by: ouyangluwei(riseunion) <ouyangluwei@riseunion.io> * fix: update Signed-off-by: ouyangluwei(riseunion) <ouyangluwei@riseunion.io> * fix: ut Signed-off-by: ouyangluwei(riseunion) <ouyangluwei@riseunion.io> * fix: ut Signed-off-by: ouyangluwei(riseunion) <ouyangluwei@riseunion.io> --------- Signed-off-by: ouyangluwei(riseunion) <ouyangluwei@riseunion.io> Co-authored-by: ouyangluwei(riseunion) <ouyangluwei@riseunion.io>

Signed-off-by: Jifei Wang <poff2001@outlook.com>

Signed-off-by: ouyangluwei(riseunion) <ouyangluwei@riseunion.io> Co-authored-by: ouyangluwei(riseunion) <ouyangluwei@riseunion.io>

Signed-off-by: antvirf <antti.viitala@icloud.com>

Signed-off-by: calvin chen <wen.chen@dynamia.ai>

…roject-HAMi#1041) Signed-off-by: ghostloda <78798447@qq.com>

* fix: Multi-node scoring nodes are inaccurate Signed-off-by: ouyangluwei(riseunion) <ouyangluwei@riseunion.io> * fix: ut Signed-off-by: ouyangluwei(riseunion) <ouyangluwei@riseunion.io> * fix: ut Signed-off-by: ouyangluwei(riseunion) <ouyangluwei@riseunion.io> * fix: ut Signed-off-by: ouyangluwei(riseunion) <ouyangluwei@riseunion.io> * fix: ut Signed-off-by: ouyangluwei(riseunion) <ouyangluwei@riseunion.io> --------- Signed-off-by: ouyangluwei(riseunion) <ouyangluwei@riseunion.io> Co-authored-by: ouyangluwei(riseunion) <ouyangluwei@riseunion.io>

Signed-off-by: ouyangluwei(riseunion) <ouyangluwei@riseunion.io> Co-authored-by: ouyangluwei(riseunion) <ouyangluwei@riseunion.io>

…scheduler roles + a namespace-scoped role for leader election Signed-off-by: antvirf <antti.viitala@icloud.com>

Signed-off-by: antvirf <antti.viitala@icloud.com>

…-in-release-changelog feat: Add new labels in .github/release.yml

…perms feat(scheduler-role): use a scoped-down role for scheduler

Signed-off-by: Jifei wang <poff2001@outlook.com> update vendor

…oject-HAMi#1161) Bumps [aquasecurity/trivy-action](https://github.com/aquasecurity/trivy-action) from 0.31.0 to 0.32.0. - [Release notes](https://github.com/aquasecurity/trivy-action/releases) - [Commits](aquasecurity/trivy-action@0.31.0...0.32.0) --- updated-dependencies: - dependency-name: aquasecurity/trivy-action dependency-version: 0.32.0 dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* feat(helm): optionally disable admission webhook This simplifies the deployment considerably and makes HAMi less intrusive inclusters where only a minority of workloads actually require GPU scheduling. Signed-off-by: antvirf <antti.viitala@icloud.com> * fix(chart.yaml): keep original version from master Signed-off-by: antvirf <antti.viitala@icloud.com> * docs(helm): comment to explain impact of disabling admissionWebhook Signed-off-by: antvirf <antti.viitala@icloud.com> --------- Signed-off-by: antvirf <antti.viitala@icloud.com>

Signed-off-by: Yunlu Wen <yunlu.wen@transwarp.io> Co-authored-by: Yunlu Wen <yunlu.wen@transwarp.io>

Fix e2e CI Signed-off-by: limengxuan <mengxuan.li@dynamia.ai>

Signed-off-by: Shouren Yang <yangshouren@gmail.com>

…5.0 to 1.17.8 (Project-HAMi#1170) Signed-off-by: Shouren Yang <yangshouren@gmail.com>

…t-HAMi#1183) Signed-off-by: Shouren Yang <yangshouren@gmail.com>

…Mi#1189) Signed-off-by: Shouren Yang <yangshouren@gmail.com>

Signed-off-by: Mirza-Samad-Ahmed-Baig <Mirzasamadahmedbaig@gmail.com> Co-authored-by: Your Name <you@example.com>

* support dra Signed-off-by: Jifei Wang <jifei.wang@dynamia.ai> Signed-off-by: Jifei Wang <jifei.wang@dynamia.ai> * update ci workflow to build helm dependencies Signed-off-by: Jifei Wang <jifei.wang@dynamia.ai> * update subchart Signed-off-by: Jifei Wang <jifei.wang@dynamia.ai> --------- Signed-off-by: Jifei Wang <jifei.wang@dynamia.ai>

Signed-off-by: Jifei Wang <jifei.wang@dynamia.ai>

Project-HAMi#1563) * docs(DCU document update): Improve the relevant documentation for DCU virtualization. Signed-off-by: zqwangadv <870462232@qq.com> * docs(DCU document update): Improve the user manual for virtual DCU Improve the user manual for virtual DCU Signed-off-by: zqwangadv <870462232@qq.com> --------- Signed-off-by: zqwangadv <870462232@qq.com>

* update Signed-off-by: limengxuan <mengxuan.li@dynamia.ai> * update Signed-off-by: limengxuan <mengxuan.li@dynamia.ai> * update chart for CDI Signed-off-by: limengxuan <mengxuan.li@dynamia.ai> * update docs Signed-off-by: limengxuan <mengxuan.li@dynamia.ai> * Update docs/config_cn.md Signed-off-by: limengxuan <mengxuan.li@dynamia.ai> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> * Update docs/config.md Signed-off-by: limengxuan <mengxuan.li@dynamia.ai> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> * update documents Signed-off-by: limengxuan <mengxuan.li@dynamia.ai> --------- Signed-off-by: limengxuan <mengxuan.li@dynamia.ai> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

* update Signed-off-by: limengxuan <mengxuan.li@dynamia.ai>

Signed-off-by: james <open4pd@4paradigm.com>

…roject-HAMi#1553) * add flags for leaderelection Signed-off-by: houyuxi <yuxi.hou@transwarp.io> * add leader manager Signed-off-by: houyuxi <yuxi.hou@transwarp.io> * feat: implement `Scheduler` with `LeaderManager` Signed-off-by: houyuxi <yuxi.hou@transwarp.io> * feat: implement `readyz` api and readiness probe Signed-off-by: houyuxi <yuxi.hou@transwarp.io> * refactor: rename `hostName` param or field to `hostname` Signed-off-by: houyuxi <yuxi.hou@transwarp.io> * fix: notify channel 1. update notify channel with buffer of size 1 in leaderManager 2. init notify chan for dummy Signed-off-by: houyuxi <yuxi.hou@transwarp.io> * refactor: rename `isHolder` method to `isHolderOf` Signed-off-by: houyuxi <yuxi.hou@transwarp.io> * feat: add `observedTime` field and `isLeaseValid` method Signed-off-by: houyuxi <yuxi.hou@transwarp.io> * test: add test for leaderelection package Signed-off-by: houyuxi <yuxi.hou@transwarp.io> * test: add tests for deleting lease Signed-off-by: houyuxi <yuxi.hou@transwarp.io> * fix: update readinessProbe using http or https controled by value `.scheduler.admissionWebhook.enabled` Signed-off-by: houyuxi <yuxi.hou@transwarp.io> * fix: initializing leader manager with correct param Signed-off-by: houyuxi <yuxi.hou@transwarp.io> * add more comments Signed-off-by: houyuxi <yuxi.hou@transwarp.io> * add check for empty hostname Signed-off-by: houyuxi <yuxi.hou@transwarp.io> * test: add `GinkgoHelper` in assert helper to redirect to point to the correct call when failed Signed-off-by: houyuxi <yuxi.hou@transwarp.io> * refactor: update fields order of `Scheduler` type Signed-off-by: houyuxi <yuxi.hou@transwarp.io> * add license headers Signed-off-by: houyuxi <yuxi.hou@transwarp.io> * fix: do register after schduler started Signed-off-by: houyuxi <yuxi.hou@transwarp.io> * feat: add callbacks in leaderelection package and implement `WaitForCacheSynced` for Scheduler with callbacks Signed-off-by: houyuxi <yuxi.hou@transwarp.io> * remove useless comment Signed-off-by: houyuxi <yuxi.hou@transwarp.io> * feat: check lease valid while checking leader Signed-off-by: houyuxi <yuxi.hou@transwarp.io> * feat: add pod anti affinity in deployment when leaderelect is enabled Signed-off-by: houyuxi <yuxi.hou@transwarp.io> * fix: list lease in specified namespace Signed-off-by: houyuxi <yuxi.hou@transwarp.io> * feat: extract default resync constant Signed-off-by: houyuxi <yuxi.hou@transwarp.io> * fix: use RUnlcok after RLock Signed-off-by: houyuxi <yuxi.hou@transwarp.io> * trigger action Signed-off-by: houyuxi <yuxi.hou@transwarp.io> * ci: extend timeout for Helm deploy in e2e test Signed-off-by: houyuxi <yuxi.hou@transwarp.io> * Revert "ci: extend timeout for Helm deploy in e2e test" This reverts commit e7d9e43. Signed-off-by: houyuxi <yuxi.hou@transwarp.io> * fix: use `app.kubernetes.io/component` label for podAntiAffinity to avoid pod failed scheduling in postStartHook when deploying on a single node Signed-off-by: houyuxi <yuxi.hou@transwarp.io> --------- Signed-off-by: houyuxi <yuxi.hou@transwarp.io> Co-authored-by: houyuxi <yuxi.hou@transwarp.io>

* update Signed-off-by: limengxuan <mengxuan.li@dynamia.ai> * update Signed-off-by: limengxuan <mengxuan.li@dynamia.ai> * update chart for CDI Signed-off-by: limengxuan <mengxuan.li@dynamia.ai> * update docs Signed-off-by: limengxuan <mengxuan.li@dynamia.ai> * Update docs/config_cn.md Signed-off-by: limengxuan <mengxuan.li@dynamia.ai> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> * Update docs/config.md Signed-off-by: limengxuan <mengxuan.li@dynamia.ai> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> * update documents Signed-off-by: limengxuan <mengxuan.li@dynamia.ai> * fix kunlunxin vxpu issue Signed-off-by: limengxuan <mengxuan.li@dynamia.ai> * Update pkg/device/kunlun/vdevice.go Signed-off-by: limengxuan <mengxuan.li@dynamia.ai> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> * update Signed-off-by: limengxuan <mengxuan.li@dynamia.ai> * update Makefile for helm package Signed-off-by: limengxuan <mengxuan.li@dynamia.ai> --------- Signed-off-by: limengxuan <mengxuan.li@dynamia.ai> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

* feat: add certwatcher to watch certificate Signed-off-by: houyuxi <yuxi.hou@transwarp.io> * feat: modify webhook default failurePolicy from `Ignore` to `Fail` Signed-off-by: houyuxi <yuxi.hou@transwarp.io> * fix: ignore context.Canceled error Signed-off-by: houyuxi <yuxi.hou@transwarp.io> * Revert "feat: modify webhook default failurePolicy from `Ignore` to `Fail`" This reverts commit 36062b5. Signed-off-by: houyuxi <yuxi.hou@transwarp.io> --------- Signed-off-by: houyuxi <yuxi.hou@transwarp.io> Co-authored-by: houyuxi <yuxi.hou@transwarp.io>

Signed-off-by: james <open4pd@4paradigm.com>

Signed-off-by: qinxiaowen <mail@qxw.im>

…here are multiple containers in one pod (Project-HAMi#1579) * fix: fix resource quota Signed-off-by: james <open4pd@4paradigm.com> * fix: fix test case Signed-off-by: james <open4pd@4paradigm.com> --------- Signed-off-by: james <open4pd@4paradigm.com>

Signed-off-by: Jifei Wang <jifei.wang@dynamia.ai>

* add modernize check Signed-off-by: dongjiang1989 <dongjiang1989@126.com> * add unittest case Signed-off-by: dongjiang1989 <dongjiang1989@126.com> --------- Signed-off-by: dongjiang1989 <dongjiang1989@126.com>

…oject-HAMi#1584) Bumps [github.com/onsi/ginkgo/v2](https://github.com/onsi/ginkgo) from 2.27.3 to 2.27.5. - [Release notes](https://github.com/onsi/ginkgo/releases) - [Changelog](https://github.com/onsi/ginkgo/blob/master/CHANGELOG.md) - [Commits](onsi/ginkgo@v2.27.3...v2.27.5) --- updated-dependencies: - dependency-name: github.com/onsi/ginkgo/v2 dependency-version: 2.27.5 dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

…Mi#1586) Bumps [golang.org/x/term](https://github.com/golang/term) from 0.38.0 to 0.39.0. - [Commits](golang/term@v0.38.0...v0.39.0) --- updated-dependencies: - dependency-name: golang.org/x/term dependency-version: 0.39.0 dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* update Signed-off-by: limengxuan <mengxuan.li@dynamia.ai> * update Signed-off-by: limengxuan <mengxuan.li@dynamia.ai> * update chart for CDI Signed-off-by: limengxuan <mengxuan.li@dynamia.ai> * update docs Signed-off-by: limengxuan <mengxuan.li@dynamia.ai> * Update docs/config_cn.md Signed-off-by: limengxuan <mengxuan.li@dynamia.ai> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> * Update docs/config.md Signed-off-by: limengxuan <mengxuan.li@dynamia.ai> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> * update documents Signed-off-by: limengxuan <mengxuan.li@dynamia.ai> * fix kunlunxin vxpu issue Signed-off-by: limengxuan <mengxuan.li@dynamia.ai> * Update pkg/device/kunlun/vdevice.go Signed-off-by: limengxuan <mengxuan.li@dynamia.ai> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> * update Signed-off-by: limengxuan <mengxuan.li@dynamia.ai> * update Makefile for helm package Signed-off-by: limengxuan <mengxuan.li@dynamia.ai> * update discord invitation Signed-off-by: limengxuan <mengxuan.li@dynamia.ai> * Update README.md Signed-off-by: limengxuan <mengxuan.li@dynamia.ai> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> * Update README_cn.md Signed-off-by: limengxuan <mengxuan.li@dynamia.ai> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> * Update README_ja.md Signed-off-by: limengxuan <mengxuan.li@dynamia.ai> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> --------- Signed-off-by: limengxuan <mengxuan.li@dynamia.ai> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

On NVIDIA GB10 (DGX Spark) and other unified memory architecture GPUs, nvmlDeviceGetMemoryInfo() returns ERROR_NOT_SUPPORTED, causing the device plugin to panic. Changes: - register.go: Handle ERROR_NOT_SUPPORTED by using DefaultDeviceMemory config as fallback. Skip device gracefully (continue) instead of panic when config is not set. - metrics.go: Skip memory metrics collection for unsupported devices. - device.go: Add DefaultDeviceMemory field to NvidiaConfig. - charts: Plumb defaultDeviceMemory through Helm values and ConfigMap. Fixes: Project-HAMi#1511

hami-robot · 2026-01-29T11:38:40Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: jsl9208
Once this PR has been reviewed and has the lgtm label, please assign wawa0210 for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

hami-robot · 2026-01-29T11:38:41Z

Thanks for your pull request. Before we can look at it, you'll need to add a 'DCO signoff' to your commits.

📝 Please follow instructions in the contributing guide to update your commits with the DCO

Full details of the Developer Certificate of Origin can be found at developercertificate.org.

The list of commits missing DCO signoff:

c301f9b Add unit tests for cambricon's Fit Function (Add unit tests for cambricon's Fit Function #1198)
5dd402f Add Ascend unit tests, mainly for Fit Function (Add unit tests for Ascend's Fit Function #1197)
56ffbcb refactor: clean up code and improve maintainability (refactor: clean up code and improve maintainability #1195)
0aa7bb8 fix: benchmarks/ai-benchmark/Dockerfile to reduce vulnerabilities ([Snyk] Security upgrade tensorflow/tensorflow from latest-gpu to 2.20.0rc0-gpu #1243)
1591156 fix: handle GetMemoryInfo NOT_SUPPORTED for unified memory GPUs

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

hami-robot · 2026-01-29T11:38:44Z

Welcome @jsl9208! It looks like this is your first PR to Project-HAMi/HAMi 🎉

gemini-code-assist · 2026-01-29T11:39:46Z

Summary of Changes

Hello @jsl9208, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces critical stability improvements for the NVIDIA device plugin, specifically addressing issues with GPUs that utilize a unified memory architecture, such as the NVIDIA GB10. The changes ensure that the device plugin can operate reliably on such hardware by providing a configurable fallback for device memory information or by gracefully handling unsupported memory queries, thus preventing application crashes and enhancing overall system robustness.

Highlights

Unified Memory GPU Support: The device plugin now gracefully handles nvmlDeviceGetMemoryInfo() returning ERROR_NOT_SUPPORTED for unified memory GPUs (e.g., NVIDIA GB10), preventing crashes.
Configurable Default Memory: A new DefaultDeviceMemory configuration value (in MB) has been introduced. If ERROR_NOT_SUPPORTED occurs, this value is used as a fallback for device memory.
Graceful Device Skipping: If DefaultDeviceMemory is not configured for unified memory GPUs that return ERROR_NOT_SUPPORTED, the device will be skipped with an error log instead of causing a panic, ensuring plugin stability.
Metrics Collection Adjustment: Memory metrics collection is now skipped when ERROR_NOT_SUPPORTED is returned by nvmlDeviceGetMemoryInfo(), preventing fatal errors during monitoring.
Helm Chart Integration: The defaultDeviceMemory parameter has been integrated into the Helm chart's configmap and values.yaml for easy configuration.

🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console.

Ignored Files

Ignored by pattern: .github/workflows/** (16)
- .github/workflows/auto-label-pr.yaml
- .github/workflows/auto-release.yaml
- .github/workflows/call-e2e-upgrade.yaml
- .github/workflows/call-e2e.yaml
- .github/workflows/call-release-helm.yaml
- .github/workflows/call-release-image-hamicore.yaml
- .github/workflows/call-release-image.yaml
- .github/workflows/call-release-notes.yaml
- .github/workflows/call-release-website.yaml
- .github/workflows/ci-image-scanning.yaml
- .github/workflows/ci.yaml
- .github/workflows/codeql-analysis.yml
- .github/workflows/issue-translate.yaml
- .github/workflows/lint-chart.yaml
- .github/workflows/stale.yaml
- .github/workflows/test-self-hosted.yaml

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature	Command	Description
Code Review	`/gemini review`	Performs a code review for the current pull request in its current state.
Pull Request Summary	`/gemini summary`	Provides a summary of the current pull request in its current state.
Comment	@gemini-code-assist	Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help	`/gemini help`	Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩

gemini-code-assist

Code Review

This pull request introduces a crucial fix for GPUs with unified memory architecture by gracefully handling cases where memory information is not available, preventing panics in the device plugin. The addition of the defaultDeviceMemory configuration provides a flexible fallback mechanism.
The PR also includes a significant number of repository improvements, such as adding comprehensive documentation, issue templates, CI/CD configurations with golangci-lint and dependabot, and a new Helm chart. These changes greatly enhance the project's maintainability and developer experience.
I've added one comment regarding a potential division-by-zero issue in the metrics collection logic for safer metric reporting. Overall, this is an excellent and substantial contribution to the project.

gemini-code-assist · 2026-01-29T11:54:38Z

cmd/scheduler/metrics.go

+			ch <- prometheus.MustNewConstMetric(
+				nodeGPUMemoryPercentage,
+				prometheus.GaugeValue,
+				float64(devs.Device.Usedmem)/float64(devs.Device.Totalmem),
+				nodeID, devs.Device.ID, fmt.Sprint(devs.Device.Index),
+			)


To prevent a potential division-by-zero if devs.Device.Totalmem is 0, it's safer to add a check before performing the division. A zero value could lead to NaN or Inf values in your metrics, which might cause issues with monitoring and alerting systems.

var memPercentage float64 if devs.Device.Totalmem > 0 { memPercentage = float64(devs.Device.Usedmem) / float64(devs.Device.Totalmem) } ch <- prometheus.MustNewConstMetric( nodeGPUMemoryPercentage, prometheus.GaugeValue, memPercentage, nodeID, devs.Device.ID, fmt.Sprint(devs.Device.Index), )

dependabot bot and others added 30 commits June 16, 2025 11:53

Fix golint-CI (Project-HAMi#1127)

0359e4b

fix golint-CI Signed-off-by: limengxuan <mengxuan.li@dynamia.ai>

add new reviewers (Project-HAMi#1131)

3802f2f

Signed-off-by: wawa0210 <xiao.zhang@dynamia.ai>

Support topology-awareness for Kunlunxin device (Project-HAMi#1121)

a90081d

Support topology-awareness for Kunlunxin device Signed-off-by: limengxuan <mengxuan.li@dynamia.ai>

Support Metax sGPU Qos Policy (Project-HAMi#1123)

7f0dadc

Signed-off-by: Lei Guo <Lei.Guo@metax-tech.com> Co-authored-by: Lei Guo <Lei.Guo@metax-tech.com>

docs: fix broken link and typo (Project-HAMi#1125)

e377967

Signed-off-by: lixd <xueduan.li@gmail.com>

add new interconnection support (Project-HAMi#1141)

a9e1677

Signed-off-by: Jifei Wang <poff2001@outlook.com>

fix: mig mode-change Project-HAMi#1116 (Project-HAMi#1124)

db005a8

Signed-off-by: ouyangluwei(riseunion) <ouyangluwei@riseunion.io> Co-authored-by: ouyangluwei(riseunion) <ouyangluwei@riseunion.io>

feat(scheduler-role): use a scoped-down role for scheduler

13dd7b9

Signed-off-by: antvirf <antti.viitala@icloud.com>

add global image for chart (Project-HAMi#1133)

b617bab

Signed-off-by: calvin chen <wen.chen@dynamia.ai>

fix: Skip admission webhook when Pod's scheduler is already assigned (P…

8d360ad

…roject-HAMi#1041) Signed-off-by: ghostloda <78798447@qq.com>

fix: An error occurred while create Iluvatar pod (Project-HAMi#1149)

51fefad

Signed-off-by: ouyangluwei(riseunion) <ouyangluwei@riseunion.io> Co-authored-by: ouyangluwei(riseunion) <ouyangluwei@riseunion.io>

feat(scheduler-roles): use built-in system:kube-scheduler and volume-…

08a818b

…scheduler roles + a namespace-scoped role for leader election Signed-off-by: antvirf <antti.viitala@icloud.com>

feat(scheduler-roles): remove unneeded update perm

e6468dd

Signed-off-by: antvirf <antti.viitala@icloud.com>

Merge pull request Project-HAMi#1066 from Shouren/feat/add-new-labels…

e50b913

…-in-release-changelog feat: Add new labels in .github/release.yml

Merge pull request Project-HAMi#1152 from Antvirf/feat/minimize-role-…

5634da2

…perms feat(scheduler-role): use a scoped-down role for scheduler

clearly list supported devices doc entry

5930d95

Signed-off-by: Jifei wang <poff2001@outlook.com> update vendor

Add node config (Project-HAMi#1159)

407ff3b

Signed-off-by: Yunlu Wen <yunlu.wen@transwarp.io> Co-authored-by: Yunlu Wen <yunlu.wen@transwarp.io>

Fix e2e CI (Project-HAMi#1165)

0756fa8

Fix e2e CI Signed-off-by: limengxuan <mengxuan.li@dynamia.ai>

fix: Add option for overwrite schedulerName (Project-HAMi#1163)

cbcd55e

Signed-off-by: Shouren Yang <yangshouren@gmail.com>

build(deps): upgrade golang to 1.24.4 (Project-HAMi#1172)

493fc99

Signed-off-by: Shouren Yang <yangshouren@gmail.com>

build(deps): upgrade golang image in ci to 1.24.4 (Project-HAMi#1176)

e1d43b9

Signed-off-by: Shouren Yang <yangshouren@gmail.com>

build(deps): Upgrade controller-runtime to 0.21.0 (Project-HAMi#1171)

24986ad

Signed-off-by: Shouren Yang <yangshouren@gmail.com>

build(deps): dump github.com/NVIDIA/nvidia-container-toolkit from 1.1…

c77d8b3

…5.0 to 1.17.8 (Project-HAMi#1170) Signed-off-by: Shouren Yang <yangshouren@gmail.com>

fix: using go-safecast to fix incorrect conversion of numbers (Projec…

2122e64

…t-HAMi#1183) Signed-off-by: Shouren Yang <yangshouren@gmail.com>

fix: deal with security issues reported by Trivy in image (Project-HA…

4fffa54

…Mi#1189) Signed-off-by: Shouren Yang <yangshouren@gmail.com>

Mirza-Samad-Ahmed-Baig and others added 19 commits December 30, 2025 17:24

fix(nodelock): improve parsing of lock annotation (Project-HAMi#1560)

f3ec017

Signed-off-by: Mirza-Samad-Ahmed-Baig <Mirzasamadahmedbaig@gmail.com> Co-authored-by: Your Name <you@example.com>

clean deprecated metrics (Project-HAMi#1562)

86307cf

Signed-off-by: Jifei Wang <jifei.wang@dynamia.ai>

Fix kunlunxin vxpu issue (Project-HAMi#1569)

b81987a

* update Signed-off-by: limengxuan <mengxuan.li@dynamia.ai>

feat: update mock-device-plugin version (Project-HAMi#1570)

6b57f36

Signed-off-by: james <open4pd@4paradigm.com>

add DSFans2014 to reviewers (Project-HAMi#1575)

0dbb92f

Signed-off-by: james <open4pd@4paradigm.com>

Added concurrent tests for ListNodes() (Project-HAMi#1576)

2acd27d

Signed-off-by: qinxiaowen <mail@qxw.im>

update dra version (Project-HAMi#1580)

8a06be3

Signed-off-by: Jifei Wang <jifei.wang@dynamia.ai>

Add modernize check (Project-HAMi#1578)

01d3704

* add modernize check Signed-off-by: dongjiang1989 <dongjiang1989@126.com> * add unittest case Signed-off-by: dongjiang1989 <dongjiang1989@126.com> --------- Signed-off-by: dongjiang1989 <dongjiang1989@126.com>

hami-robot bot requested a review from FouoF January 29, 2026 11:38

hami-robot bot added the dco-signoff: no label Jan 29, 2026

hami-robot bot requested a review from lengrongfu January 29, 2026 11:38

github-actions bot added the kind/bug Something isn't working label Jan 29, 2026

hami-robot bot added the size/XXL label Jan 29, 2026

jsl9208 closed this Jan 29, 2026

gemini-code-assist bot reviewed Jan 29, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: handle GetMemoryInfo NOT_SUPPORTED for unified memory GPUs (e.g., NVIDIA GB10) #1636

fix: handle GetMemoryInfo NOT_SUPPORTED for unified memory GPUs (e.g., NVIDIA GB10) #1636

Uh oh!

jsl9208 commented Jan 29, 2026

Uh oh!

hami-robot bot commented Jan 29, 2026

Uh oh!

hami-robot bot commented Jan 29, 2026

Uh oh!

hami-robot bot commented Jan 29, 2026

Uh oh!

gemini-code-assist bot commented Jan 29, 2026

Uh oh!

gemini-code-assist bot left a comment

Uh oh!

gemini-code-assist bot Jan 29, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

20 participants

fix: handle GetMemoryInfo NOT_SUPPORTED for unified memory GPUs (e.g., NVIDIA GB10) #1636

fix: handle GetMemoryInfo NOT_SUPPORTED for unified memory GPUs (e.g., NVIDIA GB10) #1636

Uh oh!

Conversation

jsl9208 commented Jan 29, 2026

Summary

Changes

Usage

Testing

Uh oh!

hami-robot bot commented Jan 29, 2026

Uh oh!

hami-robot bot commented Jan 29, 2026

Uh oh!

hami-robot bot commented Jan 29, 2026

Uh oh!

gemini-code-assist bot commented Jan 29, 2026

Summary of Changes

Highlights

Footnotes

Uh oh!

gemini-code-assist bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist bot Jan 29, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

20 participants