
Conversation

@ZhengW22

What type of PR is this?

What this PR does / why we need it:
This PR adds the capability to disable GPUs at the node level by applying annotations to nodes. GPUs matching the specified UUIDs will no longer be allocated to any pods.

The implementation works by setting the used count of the corresponding node GPUs to their maximum capacity when calculating nodeUsage, effectively occupying those resources. This approach maintains compatibility with scheduling logic for different types of GPU cards.
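For illustration, a minimal client-go sketch of how such an annotation could be applied to a node; the annotation key hami.io/nouse-gpuuuid, the node name, and the UUIDs below are placeholders only, since the real keys are the vendor-specific constants used by the scheduler (e.g. nvidia.GPUNoUseUUID):

```go
// Minimal sketch, not part of this PR: the annotation key and values below are
// placeholders, not the actual vendor-specific keys (e.g. nvidia.GPUNoUseUUID).
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	// Merge-patch the node so the scheduler stops allocating the listed GPUs.
	patch := []byte(`{"metadata":{"annotations":{"hami.io/nouse-gpuuuid":"GPU-aaaa,GPU-bbbb"}}}`)
	node, err := clientset.CoreV1().Nodes().Patch(
		context.TODO(), "node1", types.MergePatchType, patch, metav1.PatchOptions{})
	if err != nil {
		panic(err)
	}
	fmt.Println("annotated node:", node.Name)
}
```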

Which issue(s) this PR fixes:
No.

Special notes for your reviewer:
No.

Does this PR introduce a user-facing change?:
No.

archlitchi and others added 30 commits December 24, 2024 17:02
* update documents for config

Signed-off-by: limengxuan <391013634@qq.com>
* Setting devicePlugin.compatWithCPUManager=true will set PASS_DEVICE_SPECS=true as an environment variable.

Signed-off-by: 张 驰 <919474320@qq.com>

* Change the parameter compatWithCPUManager for setting the PASS_DEVICE_SPECS ENV to passDeviceSpecsEnabled, and set the default value to true.

Signed-off-by: 张 驰 <919474320@qq.com>

---------

Signed-off-by: 张 驰 <919474320@qq.com>
Signed-off-by: learner0810 <zhongjun.li@daocloud.io>
…ler. (Project-HAMi#746)

Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
Signed-off-by: bin <bin.pan@daocloud.io>
Signed-off-by: elrondwong <gfengwong@gmail.com>
Signed-off-by: bin <bin.pan@daocloud.io>
* add star history to readme, fix typos and add more contributors and maintainers.

Signed-off-by: yangshiqi <yangshiqi1089@gmail.com>

* add spaces

Signed-off-by: yangshiqi <yangshiqi1089@gmail.com>

---------

Signed-off-by: yangshiqi <yangshiqi1089@gmail.com>
Signed-off-by: Fengyang <495160589@qq.com>
Signed-off-by: windsonsea <haifeng.yao@daocloud.io>
Signed-off-by: KubeKyrie <shaolong.qin@daocloud.io>
Signed-off-by: yxxhero <aiopsclub@163.com>
Signed-off-by: bin <bin.pan@daocloud.io>
…t-HAMi#767)

Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
Signed-off-by: KubeKyrie <shaolong.qin@daocloud.io>
Signed-off-by: KubeKyrie <shaolong.qin@daocloud.io>
Signed-off-by: learner0810 <zhongjun.li@daocloud.io>
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
Signed-off-by: wen.rui <wen.rui@daocloud.io>
Signed-off-by: Rei1010 <wen.rui@daocloud.io>
Signed-off-by: wen.rui <wen.rui@daocloud.io>
Signed-off-by: jinye <jinye.shi@daocloud.io>
Signed-off-by: jinye <jinye.shi@daocloud.io>
Bumps [docker/build-push-action](https://github.com/docker/build-push-action) from 6.10.0 to 6.11.0.
- [Release notes](https://github.com/docker/build-push-action/releases)
- [Commits](docker/build-push-action@v6.10.0...v6.11.0)

---
updated-dependencies:
- dependency-name: docker/build-push-action
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Signed-off-by: jinye <jinye.shi@daocloud.io>
…anagement (Project-HAMi#778)

Signed-off-by: haitwang-cloud <haitao_wht@outlook.com>
Signed-off-by: wen.rui <wen.rui@daocloud.io>
Signed-off-by: lixd <xueduan.li@gmail.com>
Signed-off-by: wen.rui <wen.rui@daocloud.io>
Signed-off-by: phoenixwu0229 <phoenixwu0229@icloud.com>
@hami-robot
Contributor

hami-robot bot commented Jan 29, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Shouren, ZhengW22

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ZhengW22
Author

/gemini review

@ZhengW22
Author

> @ZhengW22 Could you please take a look at the critical comments in gemini-code-assist?

I have already fixed the critical comments and updated the code to the new version.

@ZhengW22
Author

@wawa0210 Hello, could you please help review this PR?

Contributor

gemini-code-assist bot left a comment


Code Review

This pull request introduces a feature to disable GPUs at the node level using annotations, which is a valuable addition. However, my review has identified a few issues.

First, the PR is very large and bundles several unrelated changes. Besides the main feature, it includes a significant amount of new code for the NVIDIA device plugin and OCI runtime handling, as well as constants for scheduler policies. These additions are not mentioned in the PR description, making it difficult to review and understand their purpose and impact. It's highly recommended to split these unrelated changes into separate pull requests to follow best practices for code review.

Second, there's a discrepancy between the PR description and the implementation. The description states that disabled GPUs are handled by 'setting the used count... to their maximum capacity', but the code actually filters them out from the scheduler's list of available devices. Please update the description to match the implementation.

My review comments below focus on the 'disable GPU' feature. I've found a bug in the implementation and a major inconsistency in the added documentation that should be addressed.

Comment on lines 78 to 96
## Device UUID Selection

You can specify which GPU devices to use or exclude using annotations:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: poddemo
  annotations:
    # Use specific GPU devices (comma-separated list)
    enflame.com/use-gpuuuid: "node1-enflame-0,node1-enflame-1"
    # Or exclude specific GPU devices (comma-separated list)
    enflame.com/nouse-gpuuuid: "node1-enflame-2,node1-enflame-3"
spec:
  # ... rest of pod spec
```

> **NOTE:** The device ID format is `{node-name}-enflame-{index}`. You can find the available device IDs in the node status.

high

This "Device UUID Selection" section seems to describe a different feature than what is implemented in this pull request. The documentation describes pod-level annotations (enflame.com/use-gpuuuid, enflame.com/nouse-gpuuuid) for selecting GPUs, while the PR implements a node-level annotation for disabling GPUs from being scheduled. This is confusing. The documentation should be updated to accurately describe the new node-level feature. Additionally, the implementation in pkg/scheduler/nodes.go does not seem to include support for Enflame devices in the vendorNoUseAnnoKeyMap, so this feature wouldn't work for Enflame devices as documented.

	} else {
		m.nodes[nodeID] = nodeInfo
	}
	nodeInfo.Devices = rmDeviceByNodeAnnotation(m.nodes[nodeID])

high

There's a bug here. The nodeInfo.Devices is updated, but nodeInfo is the function argument. If the node already exists in m.nodes, this change will not be persisted in the node manager's state because you are modifying the argument, not the value stored in the map m.nodes. The change should be applied to m.nodes[nodeID].Devices.

Suggested change
nodeInfo.Devices = rmDeviceByNodeAnnotation(m.nodes[nodeID])
m.nodes[nodeID].Devices = rmDeviceByNodeAnnotation(m.nodes[nodeID])

	}
	newDeviceMap := make(map[string][]device.DeviceInfo)
	for deviceName, deviceList := range nodeInfo.Devices {
		newDeviceList := make([]device.DeviceInfo, 0)

medium

For better performance, it's a good practice to pre-allocate the slice with a known capacity to avoid multiple re-allocations when appending elements in the loop. You can initialize newDeviceList with a capacity equal to the length of deviceList.

Suggested change
newDeviceList := make([]device.DeviceInfo, 0)
newDeviceList := make([]device.DeviceInfo, 0, len(deviceList))

hami-robot bot added size/L and removed size/XXL labels Jan 29, 2026
@ZhengW22
Author

/gemini review

Contributor

gemini-code-assist bot left a comment


Code Review

This pull request introduces a feature to disable specific GPUs on a node using annotations. The implementation filters out the disabled GPUs from the scheduler's node cache. While the overall approach is sound, I've found a bug in how disabled GPU UUIDs are processed when there are spaces in the annotation value. I've also included suggestions to improve the test coverage and fix a redundant test case. Lastly, please note that the implementation differs from the PR description, which states that it works by maxing out the GPU's used resources; the code actually removes the device from the scheduler's view.

			if vendorWithDisableGPUUUIDMap[vendor] == nil {
				vendorWithDisableGPUUUIDMap[vendor] = make(map[string]bool)
			}
			vendorWithDisableGPUUUIDMap[vendor][disableGPUUUID] = true

high

There is a bug here. You are using disableGPUUUID as the map key, which is the raw value from splitting the annotation string. However, you should be using id, which is the result of strings.TrimSpace(disableGPUUUID). If the annotation value has spaces around the UUID (e.g., "uuid1, uuid2"), this will cause the lookup to fail.

Suggested change
vendorWithDisableGPUUUIDMap[vendor][disableGPUUUID] = true
vendorWithDisableGPUUUIDMap[vendor][id] = true
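For context, the surrounding parsing loop would then roughly look like the sketch below; the loop structure and variable names outside the quoted lines are assumptions based on the review discussion (split the annotation value on commas, trim each entry, then record it per vendor):

```go
// Sketch of the corrected annotation parsing; everything outside the quoted
// lines above is assumed. Trim each UUID before using it as the map key.
for annoKey, vendors := range vendorNoUseAnnoKeyMap {
	annoValue, ok := nodeInfo.Node.Annotations[annoKey]
	if !ok || annoValue == "" {
		continue
	}
	for _, disableGPUUUID := range strings.Split(annoValue, ",") {
		id := strings.TrimSpace(disableGPUUUID)
		if id == "" {
			continue
		}
		for _, vendor := range vendors {
			if vendorWithDisableGPUUUIDMap[vendor] == nil {
				vendorWithDisableGPUUUIDMap[vendor] = make(map[string]bool)
			}
			vendorWithDisableGPUUUIDMap[vendor][id] = true
		}
	}
}
```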

}
}

func Test_rmDeviceByNodeAnnotation(t *testing.T) {

medium

The current tests don't cover cases where the comma-separated UUID list in the annotation contains spaces (e.g., "uuid1, uuid2"). This could hide a bug where strings.TrimSpace is not being used correctly when populating the map of disabled UUIDs. Please consider adding a test case to cover this scenario to make the tests more robust.
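For example, a case along the following lines could be added, following the structure of the existing cases (id1 and id2 are the identifiers already used in the test file; this assumes the TrimSpace fix above, so both devices are removed despite the space after the comma):

```go
		{
			name: "Test removing devices when the annotation value contains spaces",
			args: args{
				nodeInfo: &device.NodeInfo{
					// Note the space after the comma in the annotation value.
					Node:    &corev1.Node{ObjectMeta: metav1.ObjectMeta{Annotations: map[string]string{nvidia.GPUNoUseUUID: id1 + ", " + id2}}},
					Devices: map[string][]device.DeviceInfo{nvidia.NvidiaGPUDevice: {{DeviceVendor: nvidia.NvidiaGPUDevice, ID: id1}, {DeviceVendor: nvidia.NvidiaGPUDevice, ID: id2}}},
				},
			},
			want: map[string][]device.DeviceInfo{nvidia.NvidiaGPUDevice: []device.DeviceInfo{}},
		},
```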

Comment on lines 411 to 421
name: "Test removing metax device, case2",
args: args{
nodeInfo: &device.NodeInfo{
Node: &corev1.Node{ObjectMeta: metav1.ObjectMeta{Annotations: map[string]string{metax.MetaxNoUseUUID: id1}}},
// Devices: []device.DeviceInfo{{DeviceVendor: metax.MetaxSGPUDevice, ID: id1}},
Devices: map[string][]device.DeviceInfo{metax.MetaxGPUDevice: []device.DeviceInfo{{DeviceVendor: metax.MetaxGPUDevice, ID: id1}}},
},
},
want: map[string][]device.DeviceInfo{metax.MetaxGPUDevice: []device.DeviceInfo{}},
},
}

medium

This test case, Test removing metax device, case2, is a duplicate of Test removing metax device, case1. Based on the vendorNoUseAnnoKeyMap and the commented-out code, the intention was likely to test the removal of a metax.MetaxSGPUDevice. The current test does not cover this. I've suggested a corrected test case to properly test this scenario.

		{
			name: "Test removing metax sgpu device",
			args: args{
				nodeInfo: &device.NodeInfo{
					Node:    &corev1.Node{ObjectMeta: metav1.ObjectMeta{Annotations: map[string]string{metax.MetaxNoUseUUID: id1}}},
					Devices: map[string][]device.DeviceInfo{metax.MetaxSGPUDevice: {{DeviceVendor: metax.MetaxSGPUDevice, ID: id1}}},
				},
			},
			want: map[string][]device.DeviceInfo{metax.MetaxSGPUDevice: []device.DeviceInfo{}},
		},

@ZhengW22
Author

/gemini review

Contributor

gemini-code-assist bot left a comment


Code Review

This pull request introduces a feature to disable specific GPUs on a node using annotations. The implementation filters out the disabled devices from the scheduler's cache when a node is added or updated. This is achieved by introducing a new function rmDeviceByNodeAnnotation and a map vendorNoUseAnnoKeyMap to handle vendor-specific annotations. The changes are accompanied by a comprehensive set of unit tests.

My review has identified a couple of areas for improvement:

  • A potential nil pointer dereference in rmDeviceByNodeAnnotation if it were ever called with a nil nodeInfo.
  • Some commented-out dead code in the new tests that should be cleaned up.

Also, please note that the implementation in the code (removing devices from the scheduler's cache) differs from the approach described in the PR description ("setting the used count... to their maximum capacity"). The implemented approach seems more effective, but it would be good to align the description with the code for future reference.

}

func rmDeviceByNodeAnnotation(nodeInfo *device.NodeInfo) map[string][]device.DeviceInfo {
vendorWithDisableGPUUUIDMap := make(map[string]map[string]bool)

high

The function rmDeviceByNodeAnnotation does not check if nodeInfo is nil. If a nil nodeInfo is passed, it could cause a panic at nodeInfo.Node. Although the current call site in addNode seems to prevent this, adding a nil check at the beginning of the function is a good practice for robustness and to make the function safer for future use.

	if nodeInfo == nil {
		return nil
	}
	vendorWithDisableGPUUUIDMap := make(map[string]map[string]bool)

Comment on lines 347 to 416
					// Devices: []device.DeviceInfo{{DeviceVendor: nvidia.NvidiaGPUDevice, ID: id1}, {DeviceVendor: nvidia.NvidiaGPUDevice, ID: id2}},
					Devices: map[string][]device.DeviceInfo{nvidia.NvidiaGPUDevice: []device.DeviceInfo{{DeviceVendor: nvidia.NvidiaGPUDevice, ID: id1}, {DeviceVendor: nvidia.NvidiaGPUDevice, ID: id2}}},
				},
			},
			want: map[string][]device.DeviceInfo{nvidia.NvidiaGPUDevice: []device.DeviceInfo{}},
		},
		{
			name: "Test remove one device",
			args: args{
				nodeInfo: &device.NodeInfo{
					Node: &corev1.Node{ObjectMeta: metav1.ObjectMeta{Annotations: map[string]string{nvidia.GPUNoUseUUID: id1}}},
					Devices: map[string][]device.DeviceInfo{nvidia.NvidiaGPUDevice: []device.DeviceInfo{{DeviceVendor: nvidia.NvidiaGPUDevice, ID: id1}}},
				},
			},
			want: map[string][]device.DeviceInfo{nvidia.NvidiaGPUDevice: []device.DeviceInfo{}},
		},
		{
			name: "Test remove two devices",
			args: args{
				nodeInfo: &device.NodeInfo{
					Node: &corev1.Node{ObjectMeta: metav1.ObjectMeta{Annotations: map[string]string{nvidia.GPUNoUseUUID: strings.Join([]string{id1, id2}, ",")}}},
					// Devices: []device.DeviceInfo{{DeviceVendor: nvidia.NvidiaGPUDevice, ID: id1}, {DeviceVendor: nvidia.NvidiaGPUDevice, ID: id2}},
					Devices: map[string][]device.DeviceInfo{nvidia.NvidiaGPUDevice: []device.DeviceInfo{{DeviceVendor: nvidia.NvidiaGPUDevice, ID: id1}, {DeviceVendor: nvidia.NvidiaGPUDevice, ID: id2}}},
				},
			},
			want: map[string][]device.DeviceInfo{nvidia.NvidiaGPUDevice: []device.DeviceInfo{}},
		},
		{
			name: "Test remove one device and keep one device",
			args: args{
				nodeInfo: &device.NodeInfo{
					Node: &corev1.Node{ObjectMeta: metav1.ObjectMeta{Annotations: map[string]string{nvidia.GPUNoUseUUID: strings.Join([]string{id2}, ",")}}},
					// Devices: []device.DeviceInfo{{DeviceVendor: nvidia.NvidiaGPUDevice, ID: id1}, {DeviceVendor: nvidia.NvidiaGPUDevice, ID: id2}},
					Devices: map[string][]device.DeviceInfo{nvidia.NvidiaGPUDevice: []device.DeviceInfo{{DeviceVendor: nvidia.NvidiaGPUDevice, ID: id1}, {DeviceVendor: nvidia.NvidiaGPUDevice, ID: id2}}},
				},
			},
			// want: []device.DeviceInfo{{DeviceVendor: nvidia.NvidiaGPUDevice, ID: id1}},
			want: map[string][]device.DeviceInfo{nvidia.NvidiaGPUDevice: []device.DeviceInfo{{DeviceVendor: nvidia.NvidiaGPUDevice, ID: id1}}},
		},
		{
			name: "Test no removing device, case1",
			args: args{
				nodeInfo: &device.NodeInfo{
					Node: &corev1.Node{ObjectMeta: metav1.ObjectMeta{Annotations: map[string]string{"test-key": ""}}},
					// Devices: []device.DeviceInfo{{DeviceVendor: nvidia.NvidiaGPUDevice, ID: id1}},
					Devices: map[string][]device.DeviceInfo{nvidia.NvidiaGPUDevice: []device.DeviceInfo{{DeviceVendor: nvidia.NvidiaGPUDevice, ID: id1}}},
				},
			},
			// want: []device.DeviceInfo{{DeviceVendor: nvidia.NvidiaGPUDevice, ID: id1}},
			want: map[string][]device.DeviceInfo{nvidia.NvidiaGPUDevice: []device.DeviceInfo{{DeviceVendor: nvidia.NvidiaGPUDevice, ID: id1}}},
		},
		{
			name: "Test no removing device, case2",
			args: args{
				nodeInfo: &device.NodeInfo{
					Node: &corev1.Node{ObjectMeta: metav1.ObjectMeta{Annotations: map[string]string{nvidia.GPUNoUseUUID: id2}}},
					// Devices: []device.DeviceInfo{{DeviceVendor: nvidia.NvidiaGPUDevice, ID: id1}},
					Devices: map[string][]device.DeviceInfo{nvidia.NvidiaGPUDevice: []device.DeviceInfo{{DeviceVendor: nvidia.NvidiaGPUDevice, ID: id1}}},
				},
			},
			// want: []device.DeviceInfo{{DeviceVendor: nvidia.NvidiaGPUDevice, ID: id1}},
			want: map[string][]device.DeviceInfo{nvidia.NvidiaGPUDevice: []device.DeviceInfo{{DeviceVendor: nvidia.NvidiaGPUDevice, ID: id1}}},
		},
		{
			name: "Test removing metax device, case1",
			args: args{
				nodeInfo: &device.NodeInfo{
					Node: &corev1.Node{ObjectMeta: metav1.ObjectMeta{Annotations: map[string]string{metax.MetaxNoUseUUID: id1}}},
					// Devices: []device.DeviceInfo{{DeviceVendor: metax.MetaxGPUDevice, ID: id1}},
					Devices: map[string][]device.DeviceInfo{metax.MetaxGPUDevice: []device.DeviceInfo{{DeviceVendor: metax.MetaxGPUDevice, ID: id1}},

medium

There are several commented-out lines of code within the new test cases (e.g., lines 347, 368, 379, 383, 415). These lines appear to be remnants from previous refactoring. To improve code clarity and maintainability, please remove this dead code.

Signed-off-by: ZhengW22 <s123zz123@gmail.com>
@ZhengW22
Author

/gemini review

Contributor

gemini-code-assist bot left a comment


Code Review

This pull request introduces a useful feature for disabling specific GPUs on a node via annotations. The implementation is clean and is supported by a good set of unit tests. The core logic in rmDeviceByNodeAnnotation correctly filters out devices based on the nouse-gpuuuid annotations for various vendors.

However, there is a discrepancy between the implementation and the pull request description. The description states that the feature works by setting the used count of disabled GPUs to their maximum capacity. The actual code removes these GPUs from the scheduler's cached list of devices for the node. While the outcome is similar (the GPU is not schedulable), the mechanism is different. Please update the PR description to accurately reflect the implementation, which will help future developers understand the code.

@ZhengW22
Author

/gemini review

Contributor

gemini-code-assist bot left a comment


Code Review

This pull request introduces a feature to disable specific GPUs on a node using annotations. The implementation works by filtering out the disabled devices from the scheduler's cache, preventing them from being allocated. This is a clean and effective approach. The code is well-tested with various scenarios.

I have a couple of suggestions for improvement:

  • The PR description seems to describe a different implementation strategy (marking devices as fully used). It would be beneficial to update it to accurately reflect the current implementation, which removes devices from the cache.
  • I've pointed out a design consideration regarding the new vendorNoUseAnnoKeyMap that could improve modularity and maintainability in the long run.

Overall, this is a solid contribution that adds valuable functionality.

Comment on lines +37 to +44
var vendorNoUseAnnoKeyMap = map[string][]string{
	nvidia.GPUNoUseUUID:        {nvidia.NvidiaGPUDevice},
	cambricon.MLUNoUseUUID:     {cambricon.CambriconMLUDevice},
	hygon.DCUNoUseUUID:         {hygon.HygonDCUDevice},
	mthreads.MthreadsNoUseUUID: {mthreads.MthreadsGPUDevice},
	metax.MetaxNoUseUUID:       {metax.MetaxGPUDevice, metax.MetaxSGPUDevice},
	kunlun.KunlunNoUseUUID:     {kunlun.KunlunGPUDevice},
}

medium

While this map works, it introduces a tight coupling between the scheduler package and all specific device packages (nvidia, cambricon, etc.). Every time a new device type with a 'no-use' annotation is added, this central map will need to be modified.

A more decoupled and maintainable approach would be to have each device implementation provide this information. For example, you could add a method to the device.Devices interface:

// In pkg/device/devices.go
type Devices interface {
    // ...
    GetNoUseAnnotation() (key string, vendors []string)
}

The scheduler could then build this map dynamically at startup by iterating over the registered device implementations. This would make the system more extensible.
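For illustration, the startup side of that idea could look roughly like the sketch below; GetNoUseAnnotation and the GetDevices registry accessor are assumptions for the sake of the example, not existing APIs:

```go
// Hypothetical sketch: assumes a GetNoUseAnnotation() method on device.Devices
// and a device.GetDevices() registry accessor, neither of which exists today.
func buildNoUseAnnoKeyMap() map[string][]string {
	noUseAnnoKeyMap := make(map[string][]string)
	for _, dev := range device.GetDevices() {
		key, vendors := dev.GetNoUseAnnotation()
		if key == "" {
			continue
		}
		noUseAnnoKeyMap[key] = append(noUseAnnoKeyMap[key], vendors...)
	}
	return noUseAnnoKeyMap
}
```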

			args: args{
				nodeInfo: &device.NodeInfo{
					Node: &corev1.Node{ObjectMeta: metav1.ObjectMeta{Annotations: map[string]string{metax.MetaxNoUseUUID: id1}}},
					// Devices: []device.DeviceInfo{{DeviceVendor: metax.MetaxGPUDevice, ID: id1}},

medium

This commented-out line appears to be a remnant from development and can be safely removed to improve code clarity.

@ZhengW22
Author

@wawa0210 Hello, I have already fixed all critical problems.

Signed-off-by: ZhengW22 <s123zz123@gmail.com>
@hami-robot
Contributor

hami-robot bot commented Jan 30, 2026

Thanks for your pull request. Before we can look at it, you'll need to add a 'DCO signoff' to your commits.

📝 Please follow instructions in the contributing guide to update your commits with the DCO

Full details of the Developer Certificate of Origin can be found at developercertificate.org.

The list of commits missing DCO signoff:


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@ZhengW22
Author

/gemini review

Contributor

gemini-code-assist bot left a comment


Code Review

This pull request introduces a useful feature to disable specific GPUs on a node via annotations. The implementation is clean and integrates well with the existing node management logic. The accompanying tests are thorough and cover important edge cases.

I've noticed a small discrepancy between the PR description and the code. The description states that disabled GPUs are marked as fully utilized, but the implementation filters them out from the scheduler's device list entirely. The implemented approach is cleaner and more efficient, so this is great!

I have a couple of suggestions for minor improvements to enhance performance and maintainability.

Comment on lines +109 to +122
	newDeviceMap := make(map[string][]device.DeviceInfo)
	for deviceName, deviceList := range nodeInfo.Devices {
		newDeviceList := make([]device.DeviceInfo, 0, len(deviceList))
		for _, d := range deviceList {
			if disableGPUUUIDMap, ok := vendorWithDisableGPUUUIDMap[d.DeviceVendor]; ok {
				if disabled := disableGPUUUIDMap[d.ID]; disabled {
					klog.V(5).Infof("Disable gpu uuid is : %s", d.ID)
					continue
				}
			}
			newDeviceList = append(newDeviceList, d)
		}
		newDeviceMap[deviceName] = newDeviceList
	}

medium

The logic for filtering devices is correct. However, for better performance and readability, the check for whether a vendor has devices to disable can be hoisted out of the inner loop. Since d.DeviceVendor will be the same as deviceName for all devices in deviceList, you can check for vendorWithDisableGPUUUIDMap[deviceName] once before iterating through the deviceList.

 newDeviceMap := make(map[string][]device.DeviceInfo, len(nodeInfo.Devices))
 for deviceName, deviceList := range nodeInfo.Devices {
  disableGPUUUIDMap, ok := vendorWithDisableGPUUUIDMap[deviceName]
  if !ok {
   newDeviceMap[deviceName] = deviceList
   continue
  }
  newDeviceList := make([]device.DeviceInfo, 0, len(deviceList))
  for _, d := range deviceList {
   if disabled := disableGPUUUIDMap[d.ID]; disabled {
    klog.V(5).Infof("Disable gpu uuid is : %s", d.ID)
    continue
   }
   newDeviceList = append(newDeviceList, d)
  }
  newDeviceMap[deviceName] = newDeviceList
 }

Comment on lines +426 to +428
			if got := rmDeviceByNodeAnnotation(tt.args.nodeInfo); !reflect.DeepEqual(got, tt.want) {
				t.Errorf("rmDeviceByNodeAnnotation() = %v, want %v", got, tt.want)
			}

medium

For consistency with other tests in this file (e.g., Test_addNode_ListNodes), it's better to use assert.DeepEqual for comparing the expected and actual results. This improves the maintainability and uniformity of the test suite.

   got := rmDeviceByNodeAnnotation(tt.args.nodeInfo)
   assert.DeepEqual(t, tt.want, got)
