Skip to content

Conversation

@jsl9208
Copy link

@jsl9208 jsl9208 commented Jan 29, 2026

Summary

Fixes #1511

The NVIDIA GB10 (DGX Spark) uses a unified memory architecture where nvmlDeviceGetMemoryInfo() returns ERROR_NOT_SUPPORTED. This causes register.go to panic(0), crashing the device plugin on these GPUs.

Changes

  • register.go: Handle ERROR_NOT_SUPPORTED gracefully with a switch/case. Fall back to a new DefaultDeviceMemory config value (in MB). If not configured, skip the device with an error log instead of panicking.
  • metrics.go: Skip memory metrics collection when ERROR_NOT_SUPPORTED is returned, instead of treating it as a fatal error.
  • device.go: Add DefaultDeviceMemory int64 field to NvidiaConfig.
  • Helm chart: Plumb defaultDeviceMemory through configmap and values.yaml (default: 0).

Usage

For unified memory GPUs, users must set defaultDeviceMemory to the total GPU memory in MB:

# values.yaml
devicePlugin:
  defaultDeviceMemory: 131072  # 128 GB for GB10

Without this config, the device will be skipped (not registered) but the plugin won't crash.

Testing

Tested on real hardware: NVIDIA DGX Spark (GB10, ARM64, Driver 580.95.05, CUDA 13.0).

Scenario Result
Unpatched v2.8.0 on GB10 panic: 0 at register.go:115
Patched v2.8.0 with defaultDeviceMemory: 131072 Device registered, pod scheduled, vGPU memory isolation works ✅
Patched v2.8.0 without defaultDeviceMemory Device skipped gracefully, no crash ✅

dependabot bot and others added 30 commits June 16, 2025 11:53
Bumps [aquasecurity/trivy-action](https://github.com/aquasecurity/trivy-action) from 0.30.0 to 0.31.0.
- [Release notes](https://github.com/aquasecurity/trivy-action/releases)
- [Commits](aquasecurity/trivy-action@0.30.0...0.31.0)

---
updated-dependencies:
- dependency-name: aquasecurity/trivy-action
  dependency-version: 0.31.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
fix golint-CI

Signed-off-by: limengxuan <mengxuan.li@dynamia.ai>
Signed-off-by: wawa0210 <xiao.zhang@dynamia.ai>
Support topology-awareness for Kunlunxin device

Signed-off-by: limengxuan <mengxuan.li@dynamia.ai>
Signed-off-by: Lei Guo <Lei.Guo@metax-tech.com>
Co-authored-by: Lei Guo <Lei.Guo@metax-tech.com>
Signed-off-by: lixd <xueduan.li@gmail.com>
…t-HAMi#1138)

* fix: override node socre failure for kunlun Project-HAMi#1137

Signed-off-by: ouyangluwei(riseunion) <ouyangluwei@riseunion.io>

* fix: ut

Signed-off-by: ouyangluwei(riseunion) <ouyangluwei@riseunion.io>

* fix: ut

Signed-off-by: ouyangluwei(riseunion) <ouyangluwei@riseunion.io>

* Merge branch 'master' into bug-fix

Signed-off-by: ouyangluwei(riseunion) <ouyangluwei@riseunion.io>

* fix: update

Signed-off-by: ouyangluwei(riseunion) <ouyangluwei@riseunion.io>

* fix: ut

Signed-off-by: ouyangluwei(riseunion) <ouyangluwei@riseunion.io>

* fix: ut

Signed-off-by: ouyangluwei(riseunion) <ouyangluwei@riseunion.io>

---------

Signed-off-by: ouyangluwei(riseunion) <ouyangluwei@riseunion.io>
Co-authored-by: ouyangluwei(riseunion) <ouyangluwei@riseunion.io>
Signed-off-by: Jifei Wang <poff2001@outlook.com>
Signed-off-by: ouyangluwei(riseunion) <ouyangluwei@riseunion.io>
Co-authored-by: ouyangluwei(riseunion) <ouyangluwei@riseunion.io>
Signed-off-by: antvirf <antti.viitala@icloud.com>
Signed-off-by: calvin chen <wen.chen@dynamia.ai>
* fix: Multi-node scoring nodes are inaccurate

Signed-off-by: ouyangluwei(riseunion) <ouyangluwei@riseunion.io>

* fix: ut

Signed-off-by: ouyangluwei(riseunion) <ouyangluwei@riseunion.io>

* fix: ut

Signed-off-by: ouyangluwei(riseunion) <ouyangluwei@riseunion.io>

* fix: ut

Signed-off-by: ouyangluwei(riseunion) <ouyangluwei@riseunion.io>

* fix: ut

Signed-off-by: ouyangluwei(riseunion) <ouyangluwei@riseunion.io>

---------

Signed-off-by: ouyangluwei(riseunion) <ouyangluwei@riseunion.io>
Co-authored-by: ouyangluwei(riseunion) <ouyangluwei@riseunion.io>
Signed-off-by: ouyangluwei(riseunion) <ouyangluwei@riseunion.io>
Co-authored-by: ouyangluwei(riseunion) <ouyangluwei@riseunion.io>
…scheduler roles + a namespace-scoped role for leader election

Signed-off-by: antvirf <antti.viitala@icloud.com>
Signed-off-by: antvirf <antti.viitala@icloud.com>
…-in-release-changelog

feat: Add new labels in .github/release.yml
…perms

feat(scheduler-role): use a scoped-down role for scheduler
Signed-off-by: Jifei wang <poff2001@outlook.com>

update vendor
…oject-HAMi#1161)

Bumps [aquasecurity/trivy-action](https://github.com/aquasecurity/trivy-action) from 0.31.0 to 0.32.0.
- [Release notes](https://github.com/aquasecurity/trivy-action/releases)
- [Commits](aquasecurity/trivy-action@0.31.0...0.32.0)

---
updated-dependencies:
- dependency-name: aquasecurity/trivy-action
  dependency-version: 0.32.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* feat(helm): optionally disable admission webhook

This simplifies the deployment considerably and makes HAMi less
intrusive inclusters where only a minority of workloads actually require
GPU scheduling.

Signed-off-by: antvirf <antti.viitala@icloud.com>

* fix(chart.yaml): keep original version from master

Signed-off-by: antvirf <antti.viitala@icloud.com>

* docs(helm): comment to explain impact of disabling admissionWebhook

Signed-off-by: antvirf <antti.viitala@icloud.com>

---------

Signed-off-by: antvirf <antti.viitala@icloud.com>
Signed-off-by: Yunlu Wen <yunlu.wen@transwarp.io>
Co-authored-by: Yunlu Wen <yunlu.wen@transwarp.io>
Fix e2e CI

Signed-off-by: limengxuan <mengxuan.li@dynamia.ai>
Signed-off-by: Shouren Yang <yangshouren@gmail.com>
Signed-off-by: Shouren Yang <yangshouren@gmail.com>
Signed-off-by: Shouren Yang <yangshouren@gmail.com>
Signed-off-by: Shouren Yang <yangshouren@gmail.com>
…5.0 to 1.17.8 (Project-HAMi#1170)

Signed-off-by: Shouren Yang <yangshouren@gmail.com>
…Mi#1189)

Signed-off-by: Shouren Yang <yangshouren@gmail.com>
Mirza-Samad-Ahmed-Baig and others added 19 commits December 30, 2025 17:24
Signed-off-by: Mirza-Samad-Ahmed-Baig <Mirzasamadahmedbaig@gmail.com>
Co-authored-by: Your Name <you@example.com>
* support dra

Signed-off-by: Jifei Wang <jifei.wang@dynamia.ai>

Signed-off-by: Jifei Wang <jifei.wang@dynamia.ai>

* update ci workflow to build helm dependencies

Signed-off-by: Jifei Wang <jifei.wang@dynamia.ai>

* update subchart

Signed-off-by: Jifei Wang <jifei.wang@dynamia.ai>

---------

Signed-off-by: Jifei Wang <jifei.wang@dynamia.ai>
Signed-off-by: Jifei Wang <jifei.wang@dynamia.ai>
Project-HAMi#1563)

* docs(DCU document update): Improve the relevant documentation for DCU virtualization.

Signed-off-by: zqwangadv <870462232@qq.com>

* docs(DCU document update): Improve the user manual for virtual DCU

Improve the user manual for virtual DCU

Signed-off-by: zqwangadv <870462232@qq.com>

---------

Signed-off-by: zqwangadv <870462232@qq.com>
* update

Signed-off-by: limengxuan <mengxuan.li@dynamia.ai>

* update

Signed-off-by: limengxuan <mengxuan.li@dynamia.ai>

* update chart for CDI

Signed-off-by: limengxuan <mengxuan.li@dynamia.ai>

* update docs

Signed-off-by: limengxuan <mengxuan.li@dynamia.ai>

* Update docs/config_cn.md

Signed-off-by: limengxuan <mengxuan.li@dynamia.ai>

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

* Update docs/config.md

Signed-off-by: limengxuan <mengxuan.li@dynamia.ai>

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

* update documents

Signed-off-by: limengxuan <mengxuan.li@dynamia.ai>

---------

Signed-off-by: limengxuan <mengxuan.li@dynamia.ai>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
* update

Signed-off-by: limengxuan <mengxuan.li@dynamia.ai>
Signed-off-by: james <open4pd@4paradigm.com>
…roject-HAMi#1553)

* add flags for leaderelection

Signed-off-by: houyuxi <yuxi.hou@transwarp.io>

* add leader manager

Signed-off-by: houyuxi <yuxi.hou@transwarp.io>

* feat: implement `Scheduler` with `LeaderManager`

Signed-off-by: houyuxi <yuxi.hou@transwarp.io>

* feat: implement `readyz` api and readiness probe

Signed-off-by: houyuxi <yuxi.hou@transwarp.io>

* refactor: rename `hostName` param or field to `hostname`

Signed-off-by: houyuxi <yuxi.hou@transwarp.io>

* fix: notify channel

1. update notify channel with buffer of size 1 in leaderManager
2. init notify chan for dummy

Signed-off-by: houyuxi <yuxi.hou@transwarp.io>

* refactor: rename `isHolder` method to `isHolderOf`

Signed-off-by: houyuxi <yuxi.hou@transwarp.io>

* feat: add  `observedTime` field and `isLeaseValid` method

Signed-off-by: houyuxi <yuxi.hou@transwarp.io>

* test: add test for leaderelection package

Signed-off-by: houyuxi <yuxi.hou@transwarp.io>

* test: add tests for deleting lease

Signed-off-by: houyuxi <yuxi.hou@transwarp.io>

* fix: update readinessProbe using http or https controled by value `.scheduler.admissionWebhook.enabled`

Signed-off-by: houyuxi <yuxi.hou@transwarp.io>

* fix: initializing leader manager with correct param

Signed-off-by: houyuxi <yuxi.hou@transwarp.io>

* add more comments

Signed-off-by: houyuxi <yuxi.hou@transwarp.io>

* add check for empty hostname

Signed-off-by: houyuxi <yuxi.hou@transwarp.io>

* test: add `GinkgoHelper` in assert helper to redirect to point to the correct call when failed

Signed-off-by: houyuxi <yuxi.hou@transwarp.io>

* refactor: update fields order of `Scheduler` type

Signed-off-by: houyuxi <yuxi.hou@transwarp.io>

* add license headers

Signed-off-by: houyuxi <yuxi.hou@transwarp.io>

* fix: do register after schduler started

Signed-off-by: houyuxi <yuxi.hou@transwarp.io>

* feat: add callbacks in leaderelection package and implement `WaitForCacheSynced` for Scheduler with callbacks

Signed-off-by: houyuxi <yuxi.hou@transwarp.io>

* remove useless comment

Signed-off-by: houyuxi <yuxi.hou@transwarp.io>

* feat: check lease valid while checking leader

Signed-off-by: houyuxi <yuxi.hou@transwarp.io>

* feat: add pod anti affinity in deployment when leaderelect is enabled

Signed-off-by: houyuxi <yuxi.hou@transwarp.io>

* fix: list lease in specified namespace

Signed-off-by: houyuxi <yuxi.hou@transwarp.io>

* feat: extract default resync constant

Signed-off-by: houyuxi <yuxi.hou@transwarp.io>

* fix: use RUnlcok after RLock

Signed-off-by: houyuxi <yuxi.hou@transwarp.io>

* trigger action

Signed-off-by: houyuxi <yuxi.hou@transwarp.io>

* ci: extend timeout for Helm deploy in e2e test

Signed-off-by: houyuxi <yuxi.hou@transwarp.io>

* Revert "ci: extend timeout for Helm deploy in e2e test"

This reverts commit e7d9e43.

Signed-off-by: houyuxi <yuxi.hou@transwarp.io>

* fix: use `app.kubernetes.io/component` label for podAntiAffinity to avoid pod failed scheduling in postStartHook when deploying on a single node

Signed-off-by: houyuxi <yuxi.hou@transwarp.io>

---------

Signed-off-by: houyuxi <yuxi.hou@transwarp.io>
Co-authored-by: houyuxi <yuxi.hou@transwarp.io>
* update

Signed-off-by: limengxuan <mengxuan.li@dynamia.ai>

* update

Signed-off-by: limengxuan <mengxuan.li@dynamia.ai>

* update chart for CDI

Signed-off-by: limengxuan <mengxuan.li@dynamia.ai>

* update docs

Signed-off-by: limengxuan <mengxuan.li@dynamia.ai>

* Update docs/config_cn.md

Signed-off-by: limengxuan <mengxuan.li@dynamia.ai>

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

* Update docs/config.md

Signed-off-by: limengxuan <mengxuan.li@dynamia.ai>

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

* update documents

Signed-off-by: limengxuan <mengxuan.li@dynamia.ai>

* fix kunlunxin vxpu issue

Signed-off-by: limengxuan <mengxuan.li@dynamia.ai>

* Update pkg/device/kunlun/vdevice.go

Signed-off-by: limengxuan <mengxuan.li@dynamia.ai>

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

* update

Signed-off-by: limengxuan <mengxuan.li@dynamia.ai>

* update Makefile for helm package

Signed-off-by: limengxuan <mengxuan.li@dynamia.ai>

---------

Signed-off-by: limengxuan <mengxuan.li@dynamia.ai>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
* feat: add certwatcher to watch certificate

Signed-off-by: houyuxi <yuxi.hou@transwarp.io>

* feat: modify webhook default failurePolicy from `Ignore` to `Fail`

Signed-off-by: houyuxi <yuxi.hou@transwarp.io>

* fix: ignore context.Canceled  error

Signed-off-by: houyuxi <yuxi.hou@transwarp.io>

* Revert "feat: modify webhook default failurePolicy from `Ignore` to `Fail`"

This reverts commit 36062b5.

Signed-off-by: houyuxi <yuxi.hou@transwarp.io>

---------

Signed-off-by: houyuxi <yuxi.hou@transwarp.io>
Co-authored-by: houyuxi <yuxi.hou@transwarp.io>
Signed-off-by: james <open4pd@4paradigm.com>
Signed-off-by: qinxiaowen <mail@qxw.im>
…here are multiple containers in one pod (Project-HAMi#1579)

* fix: fix resource quota

Signed-off-by: james <open4pd@4paradigm.com>

* fix: fix test case

Signed-off-by: james <open4pd@4paradigm.com>

---------

Signed-off-by: james <open4pd@4paradigm.com>
Signed-off-by: Jifei Wang <jifei.wang@dynamia.ai>
* add modernize check

Signed-off-by: dongjiang1989 <dongjiang1989@126.com>

* add unittest case

Signed-off-by: dongjiang1989 <dongjiang1989@126.com>

---------

Signed-off-by: dongjiang1989 <dongjiang1989@126.com>
…oject-HAMi#1584)

Bumps [github.com/onsi/ginkgo/v2](https://github.com/onsi/ginkgo) from 2.27.3 to 2.27.5.
- [Release notes](https://github.com/onsi/ginkgo/releases)
- [Changelog](https://github.com/onsi/ginkgo/blob/master/CHANGELOG.md)
- [Commits](onsi/ginkgo@v2.27.3...v2.27.5)

---
updated-dependencies:
- dependency-name: github.com/onsi/ginkgo/v2
  dependency-version: 2.27.5
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
…Mi#1586)

Bumps [golang.org/x/term](https://github.com/golang/term) from 0.38.0 to 0.39.0.
- [Commits](golang/term@v0.38.0...v0.39.0)

---
updated-dependencies:
- dependency-name: golang.org/x/term
  dependency-version: 0.39.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* update

Signed-off-by: limengxuan <mengxuan.li@dynamia.ai>

* update

Signed-off-by: limengxuan <mengxuan.li@dynamia.ai>

* update chart for CDI

Signed-off-by: limengxuan <mengxuan.li@dynamia.ai>

* update docs

Signed-off-by: limengxuan <mengxuan.li@dynamia.ai>

* Update docs/config_cn.md

Signed-off-by: limengxuan <mengxuan.li@dynamia.ai>

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

* Update docs/config.md

Signed-off-by: limengxuan <mengxuan.li@dynamia.ai>

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

* update documents

Signed-off-by: limengxuan <mengxuan.li@dynamia.ai>

* fix kunlunxin vxpu issue

Signed-off-by: limengxuan <mengxuan.li@dynamia.ai>

* Update pkg/device/kunlun/vdevice.go

Signed-off-by: limengxuan <mengxuan.li@dynamia.ai>

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

* update

Signed-off-by: limengxuan <mengxuan.li@dynamia.ai>

* update Makefile for helm package

Signed-off-by: limengxuan <mengxuan.li@dynamia.ai>

* update discord invitation

Signed-off-by: limengxuan <mengxuan.li@dynamia.ai>

* Update README.md

Signed-off-by: limengxuan <mengxuan.li@dynamia.ai>

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

* Update README_cn.md

Signed-off-by: limengxuan <mengxuan.li@dynamia.ai>

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

* Update README_ja.md

Signed-off-by: limengxuan <mengxuan.li@dynamia.ai>

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

---------

Signed-off-by: limengxuan <mengxuan.li@dynamia.ai>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
On NVIDIA GB10 (DGX Spark) and other unified memory architecture GPUs,
nvmlDeviceGetMemoryInfo() returns ERROR_NOT_SUPPORTED, causing the device
plugin to panic.

Changes:
- register.go: Handle ERROR_NOT_SUPPORTED by using DefaultDeviceMemory
  config as fallback. Skip device gracefully (continue) instead of
  panic when config is not set.
- metrics.go: Skip memory metrics collection for unsupported devices.
- device.go: Add DefaultDeviceMemory field to NvidiaConfig.
- charts: Plumb defaultDeviceMemory through Helm values and ConfigMap.

Fixes: Project-HAMi#1511
@hami-robot hami-robot bot requested a review from FouoF January 29, 2026 11:38
@hami-robot
Copy link
Contributor

hami-robot bot commented Jan 29, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: jsl9208
Once this PR has been reviewed and has the lgtm label, please assign wawa0210 for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@hami-robot
Copy link
Contributor

hami-robot bot commented Jan 29, 2026

Thanks for your pull request. Before we can look at it, you'll need to add a 'DCO signoff' to your commits.

📝 Please follow instructions in the contributing guide to update your commits with the DCO

Full details of the Developer Certificate of Origin can be found at developercertificate.org.

The list of commits missing DCO signoff:

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@hami-robot
Copy link
Contributor

hami-robot bot commented Jan 29, 2026

Welcome @jsl9208! It looks like this is your first PR to Project-HAMi/HAMi 🎉

@github-actions github-actions bot added the kind/bug Something isn't working label Jan 29, 2026
@hami-robot hami-robot bot added the size/XXL label Jan 29, 2026
@jsl9208 jsl9208 closed this Jan 29, 2026
@gemini-code-assist
Copy link
Contributor

Summary of Changes

Hello @jsl9208, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces critical stability improvements for the NVIDIA device plugin, specifically addressing issues with GPUs that utilize a unified memory architecture, such as the NVIDIA GB10. The changes ensure that the device plugin can operate reliably on such hardware by providing a configurable fallback for device memory information or by gracefully handling unsupported memory queries, thus preventing application crashes and enhancing overall system robustness.

Highlights

  • Unified Memory GPU Support: The device plugin now gracefully handles nvmlDeviceGetMemoryInfo() returning ERROR_NOT_SUPPORTED for unified memory GPUs (e.g., NVIDIA GB10), preventing crashes.
  • Configurable Default Memory: A new DefaultDeviceMemory configuration value (in MB) has been introduced. If ERROR_NOT_SUPPORTED occurs, this value is used as a fallback for device memory.
  • Graceful Device Skipping: If DefaultDeviceMemory is not configured for unified memory GPUs that return ERROR_NOT_SUPPORTED, the device will be skipped with an error log instead of causing a panic, ensuring plugin stability.
  • Metrics Collection Adjustment: Memory metrics collection is now skipped when ERROR_NOT_SUPPORTED is returned by nvmlDeviceGetMemoryInfo(), preventing fatal errors during monitoring.
  • Helm Chart Integration: The defaultDeviceMemory parameter has been integrated into the Helm chart's configmap and values.yaml for easy configuration.

🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console.

Ignored Files
  • Ignored by pattern: .github/workflows/** (16)
    • .github/workflows/auto-label-pr.yaml
    • .github/workflows/auto-release.yaml
    • .github/workflows/call-e2e-upgrade.yaml
    • .github/workflows/call-e2e.yaml
    • .github/workflows/call-release-helm.yaml
    • .github/workflows/call-release-image-hamicore.yaml
    • .github/workflows/call-release-image.yaml
    • .github/workflows/call-release-notes.yaml
    • .github/workflows/call-release-website.yaml
    • .github/workflows/ci-image-scanning.yaml
    • .github/workflows/ci.yaml
    • .github/workflows/codeql-analysis.yml
    • .github/workflows/issue-translate.yaml
    • .github/workflows/lint-chart.yaml
    • .github/workflows/stale.yaml
    • .github/workflows/test-self-hosted.yaml
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a crucial fix for GPUs with unified memory architecture by gracefully handling cases where memory information is not available, preventing panics in the device plugin. The addition of the defaultDeviceMemory configuration provides a flexible fallback mechanism.
The PR also includes a significant number of repository improvements, such as adding comprehensive documentation, issue templates, CI/CD configurations with golangci-lint and dependabot, and a new Helm chart. These changes greatly enhance the project's maintainability and developer experience.
I've added one comment regarding a potential division-by-zero issue in the metrics collection logic for safer metric reporting. Overall, this is an excellent and substantial contribution to the project.

Comment on lines +163 to +168
ch <- prometheus.MustNewConstMetric(
nodeGPUMemoryPercentage,
prometheus.GaugeValue,
float64(devs.Device.Usedmem)/float64(devs.Device.Totalmem),
nodeID, devs.Device.ID, fmt.Sprint(devs.Device.Index),
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

To prevent a potential division-by-zero if devs.Device.Totalmem is 0, it's safer to add a check before performing the division. A zero value could lead to NaN or Inf values in your metrics, which might cause issues with monitoring and alerting systems.

var memPercentage float64
if devs.Device.Totalmem > 0 {
	memPercentage = float64(devs.Device.Usedmem) / float64(devs.Device.Totalmem)
}
ch <- prometheus.MustNewConstMetric(
	nodeGPUMemoryPercentage,
	prometheus.GaugeValue,
	memPercentage,
	nodeID, devs.Device.ID, fmt.Sprint(devs.Device.Index),
)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug] Device plugin panics on NVIDIA GB10 (DGX Spark) - GetMemoryInfo returns "Not Supported"