fix(scheduler): account for device count in multi-device GPU memory quota check#1370
fix(scheduler): account for device count in multi-device GPU memory quota check#1370
Conversation
|
Important Review skippedAuto reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Pro Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
22ea4f3 to
baef5f1
Compare
…rcement IsTaskAllocationOnNodeOverCapacity must account for all GPU devices when checking non-preemptible quota. Add tests for: - 2-device GPU memory request that exceeds the non-preemptible quota - 2-GPU whole request that exceeds the queue GPU limit Both tests fail RED before the fix. Related to #1369 Signed-off-by: Erez Freiberger <enoodle@gmail.com>
baef5f1 to
de7eee8
Compare
📊 Performance Benchmark ResultsComparing PR (
|
96ff369 to
7e01606
Compare
|
@coderabbitai review |
✅ Actions performedReview triggered.
|
…ta enforcement Add support for GpuFractionsNumDevices annotation in jobs_fake, enabling multi-device GPU memory requests in integration tests. Add integration test: a 2-device GPU memory job with DeservedGPUs=1 must stay Pending. The node has 2 GPUs so the physical fit check passes — only the quota check (IsTaskAllocationOnNodeOverCapacity) can block it. Test fails RED before the fix. Related to #1369 Signed-off-by: Erez Freiberger <enoodle@gmail.com>
…uota check GetRequiredInitQuota returned per-device GPU fraction instead of total across all devices. For a 2-device request at 60% each, it returned 0.6 instead of 1.2, allowing non-preemptible jobs to exceed deserved quota. Multiply by GetNumOfGpuDevices() to match how AcceptedResourceVector tracks the same quota. Fixes #1369 Signed-off-by: Erez Freiberger <enoodle@gmail.com>
Signed-off-by: Erez Freiberger <enoodle@gmail.com>
8372487 to
e91bce9
Compare
Merging this branch will not change overall coverage
Coverage by fileChanged files (no unit tests)
Please note that the "Total", "Covered", and "Missed" counts above refer to code statements instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code. Changed unit test files
|
2 similar comments
Merging this branch will not change overall coverage
Coverage by fileChanged files (no unit tests)
Please note that the "Total", "Covered", and "Missed" counts above refer to code statements instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code. Changed unit test files
|
Merging this branch will not change overall coverage
Coverage by fileChanged files (no unit tests)
Please note that the "Total", "Covered", and "Missed" counts above refer to code statements instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code. Changed unit test files
|
|
Backport failed for Please cherry-pick the changes locally and resolve any conflicts. git fetch origin v0.9
git worktree add -d .worktree/backport-1370-to-v0.9 origin/v0.9
cd .worktree/backport-1370-to-v0.9
git switch --create backport-1370-to-v0.9
git cherry-pick -x 07517ea31067b170e8b6b3110ef55d6b4739a03a |
|
Backport failed for Please cherry-pick the changes locally and resolve any conflicts. git fetch origin v0.6
git worktree add -d .worktree/backport-1370-to-v0.6 origin/v0.6
cd .worktree/backport-1370-to-v0.6
git switch --create backport-1370-to-v0.6
git cherry-pick -x 07517ea31067b170e8b6b3110ef55d6b4739a03a |
|
Backport failed for Please cherry-pick the changes locally and resolve any conflicts. git fetch origin v0.12
git worktree add -d .worktree/backport-1370-to-v0.12 origin/v0.12
cd .worktree/backport-1370-to-v0.12
git switch --create backport-1370-to-v0.12
git cherry-pick -x 07517ea31067b170e8b6b3110ef55d6b4739a03a |
|
Successfully created backport PR for |
|
Successfully created backport PR for |
…uota check (#1370) Signed-off-by: Erez Freiberger <enoodle@gmail.com>
Description
GetRequiredInitQuotareturned the per-device GPU fraction instead of the total across all devices for multi-device GPU memory requests. For a 2-device request at 60% each, it returned 0.6 instead of 1.2, allowing non-preemptible jobs to exceed the queue's deserved GPU quota.The fix multiplies by
GetNumOfGpuDevices()to match howAcceptedResourceVectortracks the same quota.Related Issues
Fixes #1369
Checklist
Breaking Changes
None
Additional Notes
Tests added at three levels:
capacity_policy_test.go— directly testsIsTaskAllocationOnNodeOverCapacitywith multi-device GPU memory PodInfoallocateGpuMemory_test.go— full allocate action with 2-device GPU memory job andDeservedGPUs=1E2e:- That is a bit too much, I can re-add it if requested.over_quota_specs.go— dynamically reads node GPU memory, requests 60% × 2 devices againstQuota=1🤖 Generated with Claude Code