@lx1036 lx1036 commented Jan 13, 2026

fix #79

Fix the build with Go 1.24.
Also add support for the DGX-Spark GB10 ARM GPU node: https://www.nvidia.com/en-us/products/workstations/dgx-spark/

Tested on a DGX-Spark ARM node and an L40S x86 GPU node; GPU sharing works on both.

The device plugin log looks like this, which is expected:

2026/01/13 09:07:57 Retrieving plugins.
2026/01/13 09:07:57 Loading configuration.
2026/01/13 09:07:57 share
2026/01/13 09:07:57 1024
2026/01/13 09:07:57
Running with config:
{
  "version": "v1beta1",
  "flags": {
    "GPUStrategy": "share",
    "GPUMemoryFactor": 1024
  }
}
2026/01/13 09:07:57 Loading NVML
W0113 09:07:57.105167       1 client_config.go:543] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
2026/01/13 09:07:57 Starting OS signal watcher.
I0113 09:07:57.115628       1 host_memory.go:49] get system memory total: 119GB, used: 91GB, free: 4GB
time="2026-01-13T09:07:57Z" level=info msg="set gpu memory: 122505" source="utils.go:61"
2026/01/13 09:07:57 Starting GRPC server for 'volcano.sh/gpu-memory'
2026/01/13 09:07:57 Starting to serve 'volcano.sh/gpu-memory' on /var/lib/kubelet/device-plugins/volcano.sock
2026/01/13 09:07:57 Registered device plugin for 'volcano.sh/gpu-memory' with Kubelet

@volcano-sh-bot
Collaborator

Welcome @lx1036! It looks like this is your first PR to volcano-sh/devices 🎉

@gemini-code-assist

Summary of Changes

Hello @lx1036, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly modernizes the project's build environment and NVIDIA GPU integration. It updates the core Go language version and switches to a newer NVIDIA NVML library, which is essential for supporting contemporary GPU architectures like the DGX-Spark GB10. The changes also bring improvements to GPU health monitoring and introduce NUMA topology reporting, contributing to more accurate resource management and stability within Kubernetes clusters utilizing these device plugins.

Highlights

  • Go Version Upgrade: The project's Go version has been upgraded from 1.14 to 1.23, and Dockerfiles for CentOS 7 and Ubuntu 20.04 now build with Go 1.23.7, paving the way for Go 1.24 compatibility.
  • NVIDIA NVML Library Update: The NVIDIA NVML binding library has been updated from github.com/NVIDIA/gpu-monitoring-tools to github.com/NVIDIA/go-nvml, bringing improved API interactions and better support for modern NVIDIA GPUs.
  • DGX-Spark GB10 GPU Support: Added explicit support for the NVIDIA DGX-Spark GB10 ARM GPU node, including logic to detect and utilize unified memory by querying host system memory when GPU memory information is unavailable via NVML.
  • Enhanced GPU Health Checking: The GPU health checking mechanism has been refactored to use the new go-nvml library, providing more robust event handling, better error reporting, and the ability to skip specific XID errors based on configuration.
  • NUMA Topology Awareness: Introduced functionality to detect and report the NUMA node affinity for each GPU, enhancing topology awareness for better resource scheduling.
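
The unified-memory fallback mentioned in the DGX-Spark highlight can be sketched roughly as below. This is only an illustration under stated assumptions, not the PR's code: `resolveGPUMemoryMiB` and its parameters are hypothetical names, the caller is assumed to have already queried NVML and to know whether that query succeeded, and reporting the result in MiB is an assumption based on the `set gpu memory: 122505` log line.

```go
package main

import "fmt"

// resolveGPUMemoryMiB is a hypothetical helper. It returns the memory size in
// MiB to advertise for a GPU, plus the source the value came from.
// If NVML reported a usable total, that value wins; otherwise the host total is
// used, on the assumption that the GPU shares unified memory with the CPU.
func resolveGPUMemoryMiB(nvmlTotalBytes uint64, nvmlOK bool, hostTotalBytes uint64) (uint64, string) {
	const mib = 1024 * 1024
	if nvmlOK && nvmlTotalBytes > 0 {
		return nvmlTotalBytes / mib, "nvml"
	}
	return hostTotalBytes / mib, "host"
}

func main() {
	// Discrete GPU: NVML reports its own memory (illustrative size).
	m, src := resolveGPUMemoryMiB(46068*1024*1024, true, 512<<30)
	fmt.Println(m, src)

	// Unified-memory node: NVML has no memory info, so fall back to the host total.
	m, src = resolveGPUMemoryMiB(0, false, 122505*1024*1024)
	fmt.Println(m, src)
}
```

Returning the source alongside the value makes it easy to log which path was taken, which helps when the same binary runs on both discrete-GPU and unified-memory nodes.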

@lx1036 force-pushed the feature/add-dgx-spark branch from beab240 to 86dccfa on January 13, 2026 09:26

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request updates the project to support a newer Go version and adds support for DGX-Spark GB10 GPUs. The changes include updating Dockerfiles, Go modules, and refactoring the NVIDIA device plugin to use the newer go-nvml library. The new implementation for handling device health and unified memory seems robust.

I've identified a few areas for improvement:

  • A bug in parsing memory information from the free command which could fail in certain locales.
  • Minor inconsistencies in Dockerfile practices, error messages, and logging.
  • An inconsistency in the usage of klog vs klog/v2 across different files. It would be best to standardize on klog/v2.

Also, please note that the PR title mentions Go 1.24, but the Go version is updated to 1.23 in go.mod and 1.23.7 in the Dockerfiles. Please ensure this is intentional.

Overall, this is a solid update. Addressing the identified issues will improve the code's correctness and maintainability.

Comment on lines +42 to +44
total, err1 := strconv.ParseUint(fields[1], 10, 64)
used, err2 := strconv.ParseUint(fields[2], 10, 64)
free, err3 := strconv.ParseUint(fields[3], 10, 64)

high

The check on line 37 correctly handles commas in the memory value by using strings.ReplaceAll. However, when parsing the total, used, and free values, this is omitted. This will cause a parsing error in locales where free -b uses commas as thousands separators. You should use strings.ReplaceAll here as well to prevent parsing failures.

Suggested change
- total, err1 := strconv.ParseUint(fields[1], 10, 64)
- used, err2 := strconv.ParseUint(fields[2], 10, 64)
- free, err3 := strconv.ParseUint(fields[3], 10, 64)
+ total, err1 := strconv.ParseUint(strings.ReplaceAll(fields[1], ",", ""), 10, 64)
+ used, err2 := strconv.ParseUint(strings.ReplaceAll(fields[2], ",", ""), 10, 64)
+ free, err3 := strconv.ParseUint(strings.ReplaceAll(fields[3], ",", ""), 10, 64)
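
One way to apply the comma handling uniformly is to route every field through a single helper, so no call site can forget it. The sketch below is an assumption about how such parsing might look, not the PR's implementation: `parseByteField` and `parseFreeMem` are hypothetical names, and the input is assumed to be the text output of `free -b` with a `Mem:` row.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseByteField strips thousands separators before parsing, so "1,024" and
// "1024" are both accepted.
func parseByteField(s string) (uint64, error) {
	return strconv.ParseUint(strings.ReplaceAll(s, ",", ""), 10, 64)
}

// parseFreeMem extracts total, used and free bytes from the "Mem:" row of
// `free -b` style output. Header and "Swap:" rows are skipped.
func parseFreeMem(out string) (total, used, free uint64, err error) {
	for _, line := range strings.Split(out, "\n") {
		fields := strings.Fields(line)
		if len(fields) < 4 || fields[0] != "Mem:" {
			continue
		}
		if total, err = parseByteField(fields[1]); err != nil {
			return 0, 0, 0, fmt.Errorf("parsing total %q: %w", fields[1], err)
		}
		if used, err = parseByteField(fields[2]); err != nil {
			return 0, 0, 0, fmt.Errorf("parsing used %q: %w", fields[2], err)
		}
		if free, err = parseByteField(fields[3]); err != nil {
			return 0, 0, 0, fmt.Errorf("parsing free %q: %w", fields[3], err)
		}
		return total, used, free, nil
	}
	return 0, 0, 0, fmt.Errorf("no Mem: row found in free output")
}

func main() {
	// Illustrative input only; the numbers are made up.
	sample := "       total   used   free\nMem:   1,024   512    256\nSwap:  0       0      0\n"
	total, used, free, err := parseFreeMem(sample)
	fmt.Println(total, used, free, err)
}
```

Wrapping each failure with the offending field also addresses the review's broader point about error reporting: a parse failure names the value that caused it.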

RUN wget -nv -O - https://storage.googleapis.com/golang/go${GOLANG_VERSION}.linux-amd64.tar.gz \
| tar -C /usr/local -xz
ARG TARGETARCH
RUN wget -qO- https://storage.googleapis.com/golang/go1.23.7.linux-${TARGETARCH}.tar.gz | tar -C /usr/local -zx

medium

For better maintainability, it's recommended to define the Go version as a build argument (ARG) instead of hardcoding it. Please add ARG GOLANG_VERSION=1.23.7 before this line and then use ${GOLANG_VERSION} here.

RUN wget -qO- https://storage.googleapis.com/golang/go${GOLANG_VERSION}.linux-${TARGETARCH}.tar.gz | tar -C /usr/local -zx

RUN wget -nv -O - https://storage.googleapis.com/golang/go${GOLANG_VERSION}.linux-amd64.tar.gz \
| tar -C /usr/local -xz
ARG TARGETARCH
RUN wget -qO- https://storage.googleapis.com/golang/go1.23.7.linux-${TARGETARCH}.tar.gz | tar -C /usr/local -zx

medium

For better maintainability, it's recommended to define the Go version as a build argument (ARG) instead of hardcoding it. Please add ARG GOLANG_VERSION=1.23.7 before this line and then use ${GOLANG_VERSION} here.

RUN wget -qO- https://storage.googleapis.com/golang/go${GOLANG_VERSION}.linux-${TARGETARCH}.tar.gz | tar -C /usr/local -zx


node, err := strconv.Atoi(string(bytes.TrimSpace(b)))
if err != nil {
return false, 0, fmt.Errorf("eror parsing value for NUMA node: %v", err)

medium

There is a typo in the error message. 'eror' should be 'error'.

Suggested change
- return false, 0, fmt.Errorf("eror parsing value for NUMA node: %v", err)
+ return false, 0, fmt.Errorf("error parsing value for NUMA node: %v", err)
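
For context, the surrounding NUMA lookup might look roughly like the sketch below. It is an assumption, not the PR's code: `parseNUMANode` is a hypothetical name, and the input bytes are assumed to come from the device's sysfs `numa_node` file (for example `/sys/bus/pci/devices/<bus id>/numa_node`), where Linux writes `-1` when the device has no NUMA affinity.

```go
package main

import (
	"bytes"
	"fmt"
	"strconv"
)

// parseNUMANode interprets the contents of a sysfs numa_node file.
// It returns hasNUMA=false (and node 0) when the kernel reports a negative
// value, which means the device has no NUMA affinity.
func parseNUMANode(b []byte) (hasNUMA bool, node int, err error) {
	node, err = strconv.Atoi(string(bytes.TrimSpace(b)))
	if err != nil {
		return false, 0, fmt.Errorf("error parsing value for NUMA node: %v", err)
	}
	if node < 0 {
		return false, 0, nil
	}
	return true, node, nil
}

func main() {
	// Typical file contents: a node index, -1, or something unexpected.
	for _, raw := range []string{"0\n", "-1\n", "abc"} {
		has, node, err := parseNUMANode([]byte(raw))
		fmt.Printf("%q -> hasNUMA=%v node=%d err=%v\n", raw, has, node, err)
	}
}
```

Keeping the parsing separate from the file read makes the negative-value and malformed-value cases easy to test without a real device.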

}
xid, err := strconv.ParseUint(trimmed, 10, 64)
if err != nil {
log.Printf("Ignoring malformed Xid value %v: %v", trimmed, err)

medium

The logging in this function uses log.Printf, which is inconsistent with the rest of the file that uses klog. For consistent logging, please use klog.Warningf here.

Suggested change
- log.Printf("Ignoring malformed Xid value %v: %v", trimmed, err)
+ klog.Warningf("Ignoring malformed Xid value %v: %v", trimmed, err)
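
The loop this line sits in presumably builds the set of XIDs that the health check should ignore. A minimal sketch of that idea follows, assuming the skip list arrives as a comma-separated string; `parseSkippedXids` is a hypothetical name, and the standard `log` package is used only to keep the example dependency-free (the review's suggestion is `klog.Warningf`).

```go
package main

import (
	"fmt"
	"log"
	"strconv"
	"strings"
)

// parseSkippedXids is a hypothetical helper that turns a comma-separated list
// such as "13, 31,43" into a set of XID values to skip. Empty items are
// ignored and malformed items are logged and dropped rather than failing.
func parseSkippedXids(list string) map[uint64]bool {
	skipped := make(map[uint64]bool)
	for _, item := range strings.Split(list, ",") {
		trimmed := strings.TrimSpace(item)
		if trimmed == "" {
			continue
		}
		xid, err := strconv.ParseUint(trimmed, 10, 64)
		if err != nil {
			// In the plugin itself this would be klog.Warningf, per the review.
			log.Printf("Ignoring malformed Xid value %v: %v", trimmed, err)
			continue
		}
		skipped[xid] = true
	}
	return skipped
}

func main() {
	skipped := parseSkippedXids("13, 31,abc,,43")
	fmt.Println(len(skipped), skipped[31], skipped[99])
}
```

Logging and continuing on a bad entry means one typo in the configuration does not disable the remaining skip rules.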

sigs.k8s.io/yaml v1.2.0
)

require (
Member

Why not merge these dependencies into a single require block?

@volcano-sh-bot
Collaborator

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: hzxuzhonghu

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Development

Successfully merging this pull request may close these issues.

panics when built with go1.24

3 participants