-
Notifications
You must be signed in to change notification settings - Fork 49
fix issue79: Add support for golang 1.24 and dgx-spark GB10 GPU #80
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
|
Welcome @lx1036! It looks like this is your first PR to volcano-sh/devices 🎉 |
Summary of ChangesHello @lx1036, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request significantly modernizes the project's build environment and NVIDIA GPU integration. It updates the core Go language version and switches to a newer NVIDIA NVML library, which is essential for supporting contemporary GPU architectures like the DGX-Spark GB10. The changes also bring improvements to GPU health monitoring and introduce NUMA topology reporting, contributing to more accurate resource management and stability within Kubernetes clusters utilizing these device plugins. Highlights
🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console. Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
Signed-off-by: lx1036 <lx1036@126.com>
beab240 to
86dccfa
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request updates the project to support a newer Go version and adds support for DGX-Spark GB10 GPUs. The changes include updating Dockerfiles, Go modules, and refactoring the NVIDIA device plugin to use the newer go-nvml library. The new implementation for handling device health and unified memory seems robust.
I've identified a few areas for improvement:
- A bug in parsing memory information from the
freecommand which could fail in certain locales. - Minor inconsistencies in Dockerfile practices, error messages, and logging.
- An inconsistency in the usage of
klogvsklog/v2across different files. It would be best to standardize onklog/v2.
Also, please note that the PR title mentions Go 1.24, but the Go version is updated to 1.23 in go.mod and 1.23.7 in the Dockerfiles. Please ensure this is intentional.
Overall, this is a solid update. Addressing the identified issues will improve the code's correctness and maintainability.
| total, err1 := strconv.ParseUint(fields[1], 10, 64) | ||
| used, err2 := strconv.ParseUint(fields[2], 10, 64) | ||
| free, err3 := strconv.ParseUint(fields[3], 10, 64) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The check on line 37 correctly handles commas in the memory value by using strings.ReplaceAll. However, when parsing the total, used, and free values, this is omitted. This will cause a parsing error in locales where free -b uses commas as thousands separators. You should use strings.ReplaceAll here as well to prevent parsing failures.
| total, err1 := strconv.ParseUint(fields[1], 10, 64) | |
| used, err2 := strconv.ParseUint(fields[2], 10, 64) | |
| free, err3 := strconv.ParseUint(fields[3], 10, 64) | |
| total, err1 := strconv.ParseUint(strings.ReplaceAll(fields[1], ",", ""), 10, 64) | |
| used, err2 := strconv.ParseUint(strings.ReplaceAll(fields[2], ",", ""), 10, 64) | |
| free, err3 := strconv.ParseUint(strings.ReplaceAll(fields[3], ",", ""), 10, 64) |
| RUN wget -nv -O - https://storage.googleapis.com/golang/go${GOLANG_VERSION}.linux-amd64.tar.gz \ | ||
| | tar -C /usr/local -xz | ||
| ARG TARGETARCH | ||
| RUN wget -qO- https://storage.googleapis.com/golang/go1.23.7.linux-${TARGETARCH}.tar.gz | tar -C /usr/local -zx |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For better maintainability, it's recommended to define the Go version as a build argument (ARG) instead of hardcoding it. Please add ARG GOLANG_VERSION=1.23.7 before this line and then use ${GOLANG_VERSION} here.
RUN wget -qO- https://storage.googleapis.com/golang/go${GOLANG_VERSION}.linux-${TARGETARCH}.tar.gz | tar -C /usr/local -zx
| RUN wget -nv -O - https://storage.googleapis.com/golang/go${GOLANG_VERSION}.linux-amd64.tar.gz \ | ||
| | tar -C /usr/local -xz | ||
| ARG TARGETARCH | ||
| RUN wget -qO- https://storage.googleapis.com/golang/go1.23.7.linux-${TARGETARCH}.tar.gz | tar -C /usr/local -zx |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For better maintainability, it's recommended to define the Go version as a build argument (ARG) instead of hardcoding it. Please add ARG GOLANG_VERSION=1.23.7 before this line and then use ${GOLANG_VERSION} here.
RUN wget -qO- https://storage.googleapis.com/golang/go${GOLANG_VERSION}.linux-${TARGETARCH}.tar.gz | tar -C /usr/local -zx
|
|
||
| node, err := strconv.Atoi(string(bytes.TrimSpace(b))) | ||
| if err != nil { | ||
| return false, 0, fmt.Errorf("eror parsing value for NUMA node: %v", err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| } | ||
| xid, err := strconv.ParseUint(trimmed, 10, 64) | ||
| if err != nil { | ||
| log.Printf("Ignoring malformed Xid value %v: %v", trimmed, err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The logging in this function uses log.Printf, which is inconsistent with the rest of the file that uses klog. For consistent logging, please use klog.Warningf here.
| log.Printf("Ignoring malformed Xid value %v: %v", trimmed, err) | |
| klog.Warningf("Ignoring malformed Xid value %v: %v", trimmed, err) |
| sigs.k8s.io/yaml v1.2.0 | ||
| ) | ||
|
|
||
| require ( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why not merge the depend together
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: hzxuzhonghu The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
fix #79
Fix build with golang1.24 .
And add support for DXG-Spark GB10 ARM GPU node https://www.nvidia.com/en-us/products/workstations/dgx-spark/.
Test on DGX-Spark ARM node and GPU L40s X86 node, it's all ok for gpu share.
Device plugin log is like, which is expected: