fix(scheduler): account for fractional GPU usage in spread node placement#1099
Conversation
The spread strategy used `NonAllocatedResource` for GPU scoring, which only counted whole idle GPUs and ignored fractional capacity on shared GPUs. This caused nodes with different actual availability to score identically, leading to nondeterministic placement instead of even spreading. Use `GetSumOfIdleGPUs` + `GetSumOfReleasingGPUs` to correctly reflect partial GPU consumption in spread scores.
📝 Walkthrough

This PR fixes the spread node placement strategy to correctly handle fractional GPU usage on shared GPUs. The core fix modifies the scoring logic to compute non-allocated resources differently for GPU versus non-GPU resources, with early returns to prevent division by zero. Changes are accompanied by unit tests and end-to-end tests for validation.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~30 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
@coderabbitai review

✅ Actions performed: review triggered.
Actionable comments posted: 3
🧹 Nitpick comments (1)
test/e2e/modules/configurations/feature_flags/plugins.go (1)
7-17: Reorder imports into std / external / internal groups.

The import block currently mixes internal packages with external dependencies (`k8s.io/utils/ptr`) in the same group.

Proposed import grouping:

```diff
 import (
 	"context"
+
+	"k8s.io/utils/ptr"
+
 	kaiv1 "github.com/NVIDIA/KAI-scheduler/pkg/apis/kai/v1"
 	"github.com/NVIDIA/KAI-scheduler/pkg/common/constants"
 	"github.com/NVIDIA/KAI-scheduler/test/e2e/modules/configurations"
 	"github.com/NVIDIA/KAI-scheduler/test/e2e/modules/constant"
 	testContext "github.com/NVIDIA/KAI-scheduler/test/e2e/modules/context"
 	"github.com/NVIDIA/KAI-scheduler/test/e2e/modules/wait"
-	"k8s.io/utils/ptr"
 )
```

As per coding guidelines, `**/*.go`: Organize imports in three groups separated by blank lines: (1) Standard library, (2) External dependencies, (3) Internal packages.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/e2e/modules/configurations/feature_flags/plugins.go` around lines 7 - 17, Reorder the import block so imports are split into three groups (std lib, external, internal) separated by blank lines: move "context" into the standard library group; keep "k8s.io/utils/ptr" and any other non‑repo packages (if added) in the external group; and place internal packages like "github.com/NVIDIA/KAI-scheduler/pkg/..." and "github.com/NVIDIA/KAI-scheduler/test/..." into the internal group. Update the import block containing kaiv1, constants, configurations, constant, testContext, wait and k8s.io/utils/ptr accordingly so imports follow the std/external/internal ordering.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pkg/scheduler/actions/allocate/allocateFractionalGpu_test.go`:
- Around line 1666-1687: The current check only verifies per-node GPU-group
count (reservationsByNode) and misses asserting spread across nodes; update the
test around the reservationsByNode computation to also assert the total number
of distinct nodes used equals the expected count (e.g., 2) and that each touched
node has the expected number of pods (e.g., 2), using
ssn.ClusterInfo.PodGroupInfos and task.GPUGroups to compute per-node pod counts
and failing with t.Errorf including testNumber and
testMetadata.TestTopologyBasic.Name when the node count or per-node pod counts
differ from expectations.
In `@test/e2e/suites/allocate/node_order/fill_node_test.go`:
- Around line 58-67: The AfterAll block should restore whatever plugin and
placement settings were present before the suite ran instead of forcing
gpupack=false, gpuspread=false and DefaultPluginName; capture the initial values
at setup (e.g., in BeforeAll) by calling feature_flags.GetPluginEnabled or
equivalent for "gpupack" and "gpuspread" and feature_flags.GetPlacementStrategy
(store into variables like originalGpupackEnabled, originalGpuspreadEnabled,
originalPlacementStrategy) and then in AfterAll call
feature_flags.SetPluginEnabled(ctx, testCtx, "gpupack", originalGpupackEnabled),
feature_flags.SetPluginEnabled(ctx, testCtx, "gpuspread",
originalGpuspreadEnabled) and feature_flags.SetPlacementStrategy(ctx, testCtx,
originalPlacementStrategy), preserving the current error handling (Fail on
error) and referencing the existing symbols AfterAll,
feature_flags.SetPluginEnabled, feature_flags.SetPlacementStrategy,
DefaultPluginName, and testCtx.
- Around line 93-109: The test currently counts all pods in
constant.KaiReservationNamespace which can include unrelated reservations;
restrict to only pods created by this test run by adding a selector/filter: when
listing pods via testCtx.KubeClientset.CoreV1().Pods(...).List use a
metav1.ListOptions{LabelSelector: "<key>=<test-id>"} (or if a label isn’t
available, filter reservationPods.Items in-code by a unique test
label/annotation or by creationTimestamp >= the test start time stored on
testCtx). Update the reservationsByNode counting logic to only consider pods
that match that label/annotation or time, leaving the rest of the assertions
(reservationsByNode and loop over gpuNodes) unchanged.
---
ℹ️ Review info
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (7)
- CHANGELOG.md
- pkg/scheduler/actions/allocate/allocateFractionalGpu_test.go
- pkg/scheduler/plugins/nodeplacement/nodespread_test.go
- pkg/scheduler/plugins/nodeplacement/spread.go
- test/e2e/modules/configurations/feature_flags/plugins.go
- test/e2e/suites/allocate/node_order/fill_node_test.go
- test/e2e/suites/allocate/node_order/node_order_suite_test.go
AfterAll now deletes plugin keys entirely instead of setting enabled=false, letting the operator reconcile the correct defaults for the active strategy.
Merging this branch will increase overall coverage.
```go
idleGPUs, _ := node.GetSumOfIdleGPUs()
releasingGPUs, _ := node.GetSumOfReleasingGPUs()
nonAllocated = idleGPUs + releasingGPUs
```
Consider exporting that on the node
```go
func nodeResourceSpread(resourceName v1.ResourceName) api.NodeOrderFn {
	return func(task *pod_info.PodInfo, node *node_info.NodeInfo) (float64, error) {
		var resourceCount float64
		var nonAllocated float64
```
Should the same logic be applied to pack.go?
Description
The `spread` node placement strategy scored GPU availability using `NonAllocatedResource`, which only counted whole idle and releasing GPUs. It ignored fractional capacity remaining on shared GPUs, causing nodes with different actual availability to score identically. This led to nondeterministic pod placement instead of even spreading for fractional GPU workloads.

This PR fixes the spread scoring to use `GetSumOfIdleGPUs() + GetSumOfReleasingGPUs()`, which correctly accounts for partial GPU consumption. A node with 1.5 GPUs available now scores `1.5/2 = 0.75`, differentiating it from a fully idle node at `1.0`.

The `gpupack` plugin (GPU-level ordering) continues to handle the "pack onto the same GPU" concern independently — it operates at the `GpuOrderFn` level, which runs after node selection.

Example: Why the bug causes incorrect spreading
Setup: 2 nodes, each with 2 GPUs. 4 pods requesting 0.5 GPU each. Strategy: `spread` + `gpupack`.

Desired outcome: 2 pods per node, each pair packed onto a single GPU → 1 reservation pod per node.

- Pod 1 — both nodes empty, both score `2/2 = 1.0`. Tie → lands on node-A (GPU-0).
- Pod 2 — node-A has 1 idle GPU (`NonAllocatedResource = 1`), node-B has 2. Scores: `0.5` vs `1.0`. node-B wins. ✅
- Pod 3 — both nodes have 1 idle GPU each (`NonAllocatedResource = 1` for both — the 0.5 used on a shared GPU is invisible). Scores: `0.5` vs `0.5`. Tie → lands on either node (say node-A). `gpupack` packs it onto GPU-0. ✅
- Pod 4 — node-A now has 2 pods on GPU-0 (1.0 used), node-B has 1 pod on GPU-0 (0.5 used):

| Node | NonAllocatedResource | Score |
| --- | --- | --- |
| node-A | 1 | `1/2 = 0.5` |
| node-B | 1 | `1/2 = 0.5` |

Both nodes score identically. The scheduler doesn't see that node-B has 1.5 GPUs available while node-A only has 1.0. Pod 4 lands on either node nondeterministically. If it goes to node-A → node-A gets 3 pods and opens a second GPU (2 reservation pods instead of 1). ❌
With the fix, node-A scores `1.0/2 = 0.5` and node-B scores `1.5/2 = 0.75`, correctly directing pod 4 to node-B. ✅

Adding `gpusharingorder` doesn't help — it makes things worse

One might try enabling `gpusharingorder` to break the tie by preferring nodes with existing shared GPU groups. But this backfires on pod 2:

| Node | Spread score | `gpusharingorder` score |
| --- | --- | --- |
| node-A | `1/2 = 0.5` | `+1000` (GPU-0 has a shared group, pod fits) |
| node-B | `2/2 = 1.0` | `0` (no shared GPUs) |

Pod 2 goes to node-A instead of node-B — the +1000 score overwhelms the spread signal entirely, turning "spread across nodes" into "pack onto nodes that already have shared GPUs."
Changes
- `spread.go`: For GPU resources, compute `nonAllocated` as the sum of idle + releasing GPU fractions instead of using `NonAllocatedResource`.
- `nodespread_test.go`: Unit test verifying that a node with shared GPU consumption scores lower than a fully idle node.
- `allocateFractionalGpu_test.go`: Integration test reproducing the E2E scenario — 4 pods (0.5 GPU each) across 2 nodes (2 GPUs each) with spread + gpupack.
- `fill_node_test.go` (new): E2E test that submits `2×numNodes` fractional GPU pods under spread strategy and asserts exactly 1 reservation pod per GPU node.
- `plugins.go` (new): E2E helper to enable/disable scheduler plugins via SchedulingShard patches.
- `node_order_suite_test.go`: Register the new E2E test suite.

Related Issues
Fixes #
Checklist
Breaking Changes
None.
Additional Notes
The `gpusharingorder` plugin is intentionally not enabled in the E2E test. It adds a +1000 node-level score to nodes with existing shared GPU groups, which actively fights the spread strategy. GPU packing within a node is handled entirely by `gpupack` at the GPU-order level.
Release Notes
Bug Fixes
Tests