-
Notifications
You must be signed in to change notification settings - Fork 470
fix: handle GetMemoryInfo NOT_SUPPORTED for unified memory GPUs (e.g., NVIDIA GB10) #1636
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: handle GetMemoryInfo NOT_SUPPORTED for unified memory GPUs (e.g., NVIDIA GB10) #1636
Conversation
Bumps [aquasecurity/trivy-action](https://github.com/aquasecurity/trivy-action) from 0.30.0 to 0.31.0. - [Release notes](https://github.com/aquasecurity/trivy-action/releases) - [Commits](aquasecurity/trivy-action@0.30.0...0.31.0) --- updated-dependencies: - dependency-name: aquasecurity/trivy-action dependency-version: 0.31.0 dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <support@github.com>
fix golint-CI Signed-off-by: limengxuan <mengxuan.li@dynamia.ai>
Signed-off-by: wawa0210 <xiao.zhang@dynamia.ai>
Support topology-awareness for Kunlunxin device Signed-off-by: limengxuan <mengxuan.li@dynamia.ai>
Signed-off-by: Lei Guo <Lei.Guo@metax-tech.com> Co-authored-by: Lei Guo <Lei.Guo@metax-tech.com>
Signed-off-by: lixd <xueduan.li@gmail.com>
…t-HAMi#1138) * fix: override node socre failure for kunlun Project-HAMi#1137 Signed-off-by: ouyangluwei(riseunion) <ouyangluwei@riseunion.io> * fix: ut Signed-off-by: ouyangluwei(riseunion) <ouyangluwei@riseunion.io> * fix: ut Signed-off-by: ouyangluwei(riseunion) <ouyangluwei@riseunion.io> * Merge branch 'master' into bug-fix Signed-off-by: ouyangluwei(riseunion) <ouyangluwei@riseunion.io> * fix: update Signed-off-by: ouyangluwei(riseunion) <ouyangluwei@riseunion.io> * fix: ut Signed-off-by: ouyangluwei(riseunion) <ouyangluwei@riseunion.io> * fix: ut Signed-off-by: ouyangluwei(riseunion) <ouyangluwei@riseunion.io> --------- Signed-off-by: ouyangluwei(riseunion) <ouyangluwei@riseunion.io> Co-authored-by: ouyangluwei(riseunion) <ouyangluwei@riseunion.io>
Signed-off-by: Jifei Wang <poff2001@outlook.com>
Signed-off-by: ouyangluwei(riseunion) <ouyangluwei@riseunion.io> Co-authored-by: ouyangluwei(riseunion) <ouyangluwei@riseunion.io>
Signed-off-by: antvirf <antti.viitala@icloud.com>
Signed-off-by: calvin chen <wen.chen@dynamia.ai>
…roject-HAMi#1041) Signed-off-by: ghostloda <78798447@qq.com>
* fix: Multi-node scoring nodes are inaccurate Signed-off-by: ouyangluwei(riseunion) <ouyangluwei@riseunion.io> * fix: ut Signed-off-by: ouyangluwei(riseunion) <ouyangluwei@riseunion.io> * fix: ut Signed-off-by: ouyangluwei(riseunion) <ouyangluwei@riseunion.io> * fix: ut Signed-off-by: ouyangluwei(riseunion) <ouyangluwei@riseunion.io> * fix: ut Signed-off-by: ouyangluwei(riseunion) <ouyangluwei@riseunion.io> --------- Signed-off-by: ouyangluwei(riseunion) <ouyangluwei@riseunion.io> Co-authored-by: ouyangluwei(riseunion) <ouyangluwei@riseunion.io>
Signed-off-by: ouyangluwei(riseunion) <ouyangluwei@riseunion.io> Co-authored-by: ouyangluwei(riseunion) <ouyangluwei@riseunion.io>
…scheduler roles + a namespace-scoped role for leader election Signed-off-by: antvirf <antti.viitala@icloud.com>
Signed-off-by: antvirf <antti.viitala@icloud.com>
…-in-release-changelog feat: Add new labels in .github/release.yml
…perms feat(scheduler-role): use a scoped-down role for scheduler
Signed-off-by: Jifei wang <poff2001@outlook.com> update vendor
…oject-HAMi#1161) Bumps [aquasecurity/trivy-action](https://github.com/aquasecurity/trivy-action) from 0.31.0 to 0.32.0. - [Release notes](https://github.com/aquasecurity/trivy-action/releases) - [Commits](aquasecurity/trivy-action@0.31.0...0.32.0) --- updated-dependencies: - dependency-name: aquasecurity/trivy-action dependency-version: 0.32.0 dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* feat(helm): optionally disable admission webhook This simplifies the deployment considerably and makes HAMi less intrusive inclusters where only a minority of workloads actually require GPU scheduling. Signed-off-by: antvirf <antti.viitala@icloud.com> * fix(chart.yaml): keep original version from master Signed-off-by: antvirf <antti.viitala@icloud.com> * docs(helm): comment to explain impact of disabling admissionWebhook Signed-off-by: antvirf <antti.viitala@icloud.com> --------- Signed-off-by: antvirf <antti.viitala@icloud.com>
Signed-off-by: Yunlu Wen <yunlu.wen@transwarp.io> Co-authored-by: Yunlu Wen <yunlu.wen@transwarp.io>
Fix e2e CI Signed-off-by: limengxuan <mengxuan.li@dynamia.ai>
Signed-off-by: Shouren Yang <yangshouren@gmail.com>
Signed-off-by: Shouren Yang <yangshouren@gmail.com>
Signed-off-by: Shouren Yang <yangshouren@gmail.com>
Signed-off-by: Shouren Yang <yangshouren@gmail.com>
…5.0 to 1.17.8 (Project-HAMi#1170) Signed-off-by: Shouren Yang <yangshouren@gmail.com>
…t-HAMi#1183) Signed-off-by: Shouren Yang <yangshouren@gmail.com>
…Mi#1189) Signed-off-by: Shouren Yang <yangshouren@gmail.com>
Signed-off-by: Mirza-Samad-Ahmed-Baig <Mirzasamadahmedbaig@gmail.com> Co-authored-by: Your Name <you@example.com>
* support dra Signed-off-by: Jifei Wang <jifei.wang@dynamia.ai> Signed-off-by: Jifei Wang <jifei.wang@dynamia.ai> * update ci workflow to build helm dependencies Signed-off-by: Jifei Wang <jifei.wang@dynamia.ai> * update subchart Signed-off-by: Jifei Wang <jifei.wang@dynamia.ai> --------- Signed-off-by: Jifei Wang <jifei.wang@dynamia.ai>
Signed-off-by: Jifei Wang <jifei.wang@dynamia.ai>
Project-HAMi#1563) * docs(DCU document update): Improve the relevant documentation for DCU virtualization. Signed-off-by: zqwangadv <870462232@qq.com> * docs(DCU document update): Improve the user manual for virtual DCU Improve the user manual for virtual DCU Signed-off-by: zqwangadv <870462232@qq.com> --------- Signed-off-by: zqwangadv <870462232@qq.com>
* update Signed-off-by: limengxuan <mengxuan.li@dynamia.ai> * update Signed-off-by: limengxuan <mengxuan.li@dynamia.ai> * update chart for CDI Signed-off-by: limengxuan <mengxuan.li@dynamia.ai> * update docs Signed-off-by: limengxuan <mengxuan.li@dynamia.ai> * Update docs/config_cn.md Signed-off-by: limengxuan <mengxuan.li@dynamia.ai> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> * Update docs/config.md Signed-off-by: limengxuan <mengxuan.li@dynamia.ai> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> * update documents Signed-off-by: limengxuan <mengxuan.li@dynamia.ai> --------- Signed-off-by: limengxuan <mengxuan.li@dynamia.ai> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
* update Signed-off-by: limengxuan <mengxuan.li@dynamia.ai>
Signed-off-by: james <open4pd@4paradigm.com>
…roject-HAMi#1553) * add flags for leaderelection Signed-off-by: houyuxi <yuxi.hou@transwarp.io> * add leader manager Signed-off-by: houyuxi <yuxi.hou@transwarp.io> * feat: implement `Scheduler` with `LeaderManager` Signed-off-by: houyuxi <yuxi.hou@transwarp.io> * feat: implement `readyz` api and readiness probe Signed-off-by: houyuxi <yuxi.hou@transwarp.io> * refactor: rename `hostName` param or field to `hostname` Signed-off-by: houyuxi <yuxi.hou@transwarp.io> * fix: notify channel 1. update notify channel with buffer of size 1 in leaderManager 2. init notify chan for dummy Signed-off-by: houyuxi <yuxi.hou@transwarp.io> * refactor: rename `isHolder` method to `isHolderOf` Signed-off-by: houyuxi <yuxi.hou@transwarp.io> * feat: add `observedTime` field and `isLeaseValid` method Signed-off-by: houyuxi <yuxi.hou@transwarp.io> * test: add test for leaderelection package Signed-off-by: houyuxi <yuxi.hou@transwarp.io> * test: add tests for deleting lease Signed-off-by: houyuxi <yuxi.hou@transwarp.io> * fix: update readinessProbe using http or https controled by value `.scheduler.admissionWebhook.enabled` Signed-off-by: houyuxi <yuxi.hou@transwarp.io> * fix: initializing leader manager with correct param Signed-off-by: houyuxi <yuxi.hou@transwarp.io> * add more comments Signed-off-by: houyuxi <yuxi.hou@transwarp.io> * add check for empty hostname Signed-off-by: houyuxi <yuxi.hou@transwarp.io> * test: add `GinkgoHelper` in assert helper to redirect to point to the correct call when failed Signed-off-by: houyuxi <yuxi.hou@transwarp.io> * refactor: update fields order of `Scheduler` type Signed-off-by: houyuxi <yuxi.hou@transwarp.io> * add license headers Signed-off-by: houyuxi <yuxi.hou@transwarp.io> * fix: do register after schduler started Signed-off-by: houyuxi <yuxi.hou@transwarp.io> * feat: add callbacks in leaderelection package and implement `WaitForCacheSynced` for Scheduler with callbacks Signed-off-by: houyuxi <yuxi.hou@transwarp.io> * remove useless comment Signed-off-by: houyuxi <yuxi.hou@transwarp.io> * feat: check lease valid while checking leader Signed-off-by: houyuxi <yuxi.hou@transwarp.io> * feat: add pod anti affinity in deployment when leaderelect is enabled Signed-off-by: houyuxi <yuxi.hou@transwarp.io> * fix: list lease in specified namespace Signed-off-by: houyuxi <yuxi.hou@transwarp.io> * feat: extract default resync constant Signed-off-by: houyuxi <yuxi.hou@transwarp.io> * fix: use RUnlcok after RLock Signed-off-by: houyuxi <yuxi.hou@transwarp.io> * trigger action Signed-off-by: houyuxi <yuxi.hou@transwarp.io> * ci: extend timeout for Helm deploy in e2e test Signed-off-by: houyuxi <yuxi.hou@transwarp.io> * Revert "ci: extend timeout for Helm deploy in e2e test" This reverts commit e7d9e43. Signed-off-by: houyuxi <yuxi.hou@transwarp.io> * fix: use `app.kubernetes.io/component` label for podAntiAffinity to avoid pod failed scheduling in postStartHook when deploying on a single node Signed-off-by: houyuxi <yuxi.hou@transwarp.io> --------- Signed-off-by: houyuxi <yuxi.hou@transwarp.io> Co-authored-by: houyuxi <yuxi.hou@transwarp.io>
* update Signed-off-by: limengxuan <mengxuan.li@dynamia.ai> * update Signed-off-by: limengxuan <mengxuan.li@dynamia.ai> * update chart for CDI Signed-off-by: limengxuan <mengxuan.li@dynamia.ai> * update docs Signed-off-by: limengxuan <mengxuan.li@dynamia.ai> * Update docs/config_cn.md Signed-off-by: limengxuan <mengxuan.li@dynamia.ai> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> * Update docs/config.md Signed-off-by: limengxuan <mengxuan.li@dynamia.ai> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> * update documents Signed-off-by: limengxuan <mengxuan.li@dynamia.ai> * fix kunlunxin vxpu issue Signed-off-by: limengxuan <mengxuan.li@dynamia.ai> * Update pkg/device/kunlun/vdevice.go Signed-off-by: limengxuan <mengxuan.li@dynamia.ai> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> * update Signed-off-by: limengxuan <mengxuan.li@dynamia.ai> * update Makefile for helm package Signed-off-by: limengxuan <mengxuan.li@dynamia.ai> --------- Signed-off-by: limengxuan <mengxuan.li@dynamia.ai> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
* feat: add certwatcher to watch certificate Signed-off-by: houyuxi <yuxi.hou@transwarp.io> * feat: modify webhook default failurePolicy from `Ignore` to `Fail` Signed-off-by: houyuxi <yuxi.hou@transwarp.io> * fix: ignore context.Canceled error Signed-off-by: houyuxi <yuxi.hou@transwarp.io> * Revert "feat: modify webhook default failurePolicy from `Ignore` to `Fail`" This reverts commit 36062b5. Signed-off-by: houyuxi <yuxi.hou@transwarp.io> --------- Signed-off-by: houyuxi <yuxi.hou@transwarp.io> Co-authored-by: houyuxi <yuxi.hou@transwarp.io>
Signed-off-by: james <open4pd@4paradigm.com>
Signed-off-by: qinxiaowen <mail@qxw.im>
…here are multiple containers in one pod (Project-HAMi#1579) * fix: fix resource quota Signed-off-by: james <open4pd@4paradigm.com> * fix: fix test case Signed-off-by: james <open4pd@4paradigm.com> --------- Signed-off-by: james <open4pd@4paradigm.com>
Signed-off-by: Jifei Wang <jifei.wang@dynamia.ai>
* add modernize check Signed-off-by: dongjiang1989 <dongjiang1989@126.com> * add unittest case Signed-off-by: dongjiang1989 <dongjiang1989@126.com> --------- Signed-off-by: dongjiang1989 <dongjiang1989@126.com>
…oject-HAMi#1584) Bumps [github.com/onsi/ginkgo/v2](https://github.com/onsi/ginkgo) from 2.27.3 to 2.27.5. - [Release notes](https://github.com/onsi/ginkgo/releases) - [Changelog](https://github.com/onsi/ginkgo/blob/master/CHANGELOG.md) - [Commits](onsi/ginkgo@v2.27.3...v2.27.5) --- updated-dependencies: - dependency-name: github.com/onsi/ginkgo/v2 dependency-version: 2.27.5 dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
…Mi#1586) Bumps [golang.org/x/term](https://github.com/golang/term) from 0.38.0 to 0.39.0. - [Commits](golang/term@v0.38.0...v0.39.0) --- updated-dependencies: - dependency-name: golang.org/x/term dependency-version: 0.39.0 dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* update Signed-off-by: limengxuan <mengxuan.li@dynamia.ai> * update Signed-off-by: limengxuan <mengxuan.li@dynamia.ai> * update chart for CDI Signed-off-by: limengxuan <mengxuan.li@dynamia.ai> * update docs Signed-off-by: limengxuan <mengxuan.li@dynamia.ai> * Update docs/config_cn.md Signed-off-by: limengxuan <mengxuan.li@dynamia.ai> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> * Update docs/config.md Signed-off-by: limengxuan <mengxuan.li@dynamia.ai> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> * update documents Signed-off-by: limengxuan <mengxuan.li@dynamia.ai> * fix kunlunxin vxpu issue Signed-off-by: limengxuan <mengxuan.li@dynamia.ai> * Update pkg/device/kunlun/vdevice.go Signed-off-by: limengxuan <mengxuan.li@dynamia.ai> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> * update Signed-off-by: limengxuan <mengxuan.li@dynamia.ai> * update Makefile for helm package Signed-off-by: limengxuan <mengxuan.li@dynamia.ai> * update discord invitation Signed-off-by: limengxuan <mengxuan.li@dynamia.ai> * Update README.md Signed-off-by: limengxuan <mengxuan.li@dynamia.ai> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> * Update README_cn.md Signed-off-by: limengxuan <mengxuan.li@dynamia.ai> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> * Update README_ja.md Signed-off-by: limengxuan <mengxuan.li@dynamia.ai> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> --------- Signed-off-by: limengxuan <mengxuan.li@dynamia.ai> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
On NVIDIA GB10 (DGX Spark) and other unified memory architecture GPUs, nvmlDeviceGetMemoryInfo() returns ERROR_NOT_SUPPORTED, causing the device plugin to panic. Changes: - register.go: Handle ERROR_NOT_SUPPORTED by using DefaultDeviceMemory config as fallback. Skip device gracefully (continue) instead of panic when config is not set. - metrics.go: Skip memory metrics collection for unsupported devices. - device.go: Add DefaultDeviceMemory field to NvidiaConfig. - charts: Plumb defaultDeviceMemory through Helm values and ConfigMap. Fixes: Project-HAMi#1511
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: jsl9208 The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
|
Thanks for your pull request. Before we can look at it, you'll need to add a 'DCO signoff' to your commits. 📝 Please follow instructions in the contributing guide to update your commits with the DCO Full details of the Developer Certificate of Origin can be found at developercertificate.org. The list of commits missing DCO signoff:
DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
|
Welcome @jsl9208! It looks like this is your first PR to Project-HAMi/HAMi 🎉 |
Summary of ChangesHello @jsl9208, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request introduces critical stability improvements for the NVIDIA device plugin, specifically addressing issues with GPUs that utilize a unified memory architecture, such as the NVIDIA GB10. The changes ensure that the device plugin can operate reliably on such hardware by providing a configurable fallback for device memory information or by gracefully handling unsupported memory queries, thus preventing application crashes and enhancing overall system robustness. Highlights
🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console. Ignored Files
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request introduces a crucial fix for GPUs with unified memory architecture by gracefully handling cases where memory information is not available, preventing panics in the device plugin. The addition of the defaultDeviceMemory configuration provides a flexible fallback mechanism.
The PR also includes a significant number of repository improvements, such as adding comprehensive documentation, issue templates, CI/CD configurations with golangci-lint and dependabot, and a new Helm chart. These changes greatly enhance the project's maintainability and developer experience.
I've added one comment regarding a potential division-by-zero issue in the metrics collection logic for safer metric reporting. Overall, this is an excellent and substantial contribution to the project.
| ch <- prometheus.MustNewConstMetric( | ||
| nodeGPUMemoryPercentage, | ||
| prometheus.GaugeValue, | ||
| float64(devs.Device.Usedmem)/float64(devs.Device.Totalmem), | ||
| nodeID, devs.Device.ID, fmt.Sprint(devs.Device.Index), | ||
| ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To prevent a potential division-by-zero if devs.Device.Totalmem is 0, it's safer to add a check before performing the division. A zero value could lead to NaN or Inf values in your metrics, which might cause issues with monitoring and alerting systems.
var memPercentage float64
if devs.Device.Totalmem > 0 {
memPercentage = float64(devs.Device.Usedmem) / float64(devs.Device.Totalmem)
}
ch <- prometheus.MustNewConstMetric(
nodeGPUMemoryPercentage,
prometheus.GaugeValue,
memPercentage,
nodeID, devs.Device.ID, fmt.Sprint(devs.Device.Index),
)
Summary
Fixes #1511
The NVIDIA GB10 (DGX Spark) uses a unified memory architecture where
nvmlDeviceGetMemoryInfo()returnsERROR_NOT_SUPPORTED. This causesregister.gotopanic(0), crashing the device plugin on these GPUs.Changes
register.go: HandleERROR_NOT_SUPPORTEDgracefully with aswitch/case. Fall back to a newDefaultDeviceMemoryconfig value (in MB). If not configured, skip the device with an error log instead of panicking.metrics.go: Skip memory metrics collection whenERROR_NOT_SUPPORTEDis returned, instead of treating it as a fatal error.device.go: AddDefaultDeviceMemory int64field toNvidiaConfig.defaultDeviceMemorythrough configmap andvalues.yaml(default:0).Usage
For unified memory GPUs, users must set
defaultDeviceMemoryto the total GPU memory in MB:Without this config, the device will be skipped (not registered) but the plugin won't crash.
Testing
Tested on real hardware: NVIDIA DGX Spark (GB10, ARM64, Driver 580.95.05, CUDA 13.0).
panic: 0atregister.go:115❌defaultDeviceMemory: 131072defaultDeviceMemory