
fix(scheduler): account for device count in multi-device GPU memory quota check #1370

Merged

enoodle merged 4 commits into main from erez/fix-check-over-limit on Apr 3, 2026

Conversation

@enoodle
Collaborator

@enoodle enoodle commented Mar 31, 2026

Description

GetRequiredInitQuota returned the per-device GPU fraction instead of the total across all devices for multi-device GPU memory requests. For a 2-device request at 60% each, it returned 0.6 instead of 1.2, allowing non-preemptible jobs to exceed the queue's deserved GPU quota.

The fix multiplies by GetNumOfGpuDevices() to match how AcceptedResourceVector tracks the same quota.
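The corrected accounting can be sketched as follows (a minimal sketch with hypothetical type and field names — the scheduler's real PodInfo API differs):

```go
package main

import "fmt"

// podRequest models the fields relevant to a multi-device GPU memory
// request. Field names are hypothetical; the real PodInfo type differs.
type podRequest struct {
	GpuMemoryFraction float64 // fraction of one GPU's memory, per device
	NumGpuDevices     int64   // number of requested GPU devices
}

// requiredInitQuota sketches the corrected accounting: charge the queue
// the per-device fraction times the device count, matching how
// AcceptedResourceVector tracks the same quota.
func requiredInitQuota(r podRequest) float64 {
	return r.GpuMemoryFraction * float64(r.NumGpuDevices)
}

func main() {
	// The bug scenario: 2 devices at 60% each. The buggy code returned
	// the per-device 0.6; the fix returns the total 1.2, which exceeds
	// a deserved quota of 1 GPU and correctly blocks the job.
	req := podRequest{GpuMemoryFraction: 0.6, NumGpuDevices: 2}
	fmt.Println(requiredInitQuota(req)) // prints 1.2
}
```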

Related Issues

Fixes #1369

Checklist

  • Self-reviewed
  • Added/updated tests (if needed)
  • Updated documentation (if needed)

Breaking Changes

None

Additional Notes

Tests added at three levels:

  • Unit: capacity_policy_test.go — directly tests IsTaskAllocationOnNodeOverCapacity with multi-device GPU memory PodInfo
  • Integration: allocateGpuMemory_test.go — full allocate action with 2-device GPU memory job and DeservedGPUs=1
  • E2e: over_quota_specs.go — dynamically reads node GPU memory and requests 60% × 2 devices against Quota=1. This may be more coverage than the e2e suite needs; I can re-add it if requested.
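The quota decision these tests exercise can be illustrated standalone (a simplified, hypothetical helper in the spirit of IsTaskAllocationOnNodeOverCapacity — not the real signature):

```go
package main

import "fmt"

// overCapacity is a simplified stand-in for the over-capacity decision
// made for non-preemptible jobs; names and signature are illustrative.
func overCapacity(perDeviceFraction float64, devices int64, allocated, deserved float64) bool {
	required := perDeviceFraction * float64(devices) // total across devices (post-fix)
	return allocated+required > deserved
}

func main() {
	// The integration-test scenario: 2 devices at 60% each against
	// DeservedGPUs=1. A 2-GPU node fits the job physically, so only
	// this quota check can keep it Pending.
	fmt.Println(overCapacity(0.6, 2, 0, 1.0)) // true: 1.2 > 1.0
	fmt.Println(overCapacity(0.6, 1, 0, 1.0)) // false: 0.6 <= 1.0
}
```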

🤖 Generated with Claude Code

@coderabbitai
Contributor

coderabbitai Bot commented Mar 31, 2026

Important

Review skipped

Auto reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.


@enoodle force-pushed the erez/fix-check-over-limit branch from 22ea4f3 to baef5f1 on March 31, 2026 at 22:34
…rcement

IsTaskAllocationOnNodeOverCapacity must account for all GPU devices when
checking non-preemptible quota. Add tests for:
- 2-device GPU memory request that exceeds the non-preemptible quota
- 2-GPU whole request that exceeds the queue GPU limit

Both tests fail RED before the fix.

Related to #1369

Signed-off-by: Erez Freiberger <enoodle@gmail.com>
@enoodle force-pushed the erez/fix-check-over-limit branch from baef5f1 to de7eee8 on March 31, 2026 at 22:35
@github-actions

github-actions Bot commented Mar 31, 2026

📊 Performance Benchmark Results

Comparing PR (erez/fix-check-over-limit) vs main branch
goos: linux
goarch: amd64
pkg: github.com/kai-scheduler/KAI-scheduler/pkg/scheduler/actions
cpu: Intel(R) Xeon(R) Platinum 8370C CPU @ 2.80GHz
                                    │ main-bench.txt │            pr-bench.txt            │
                                    │     sec/op     │    sec/op     vs base              │
AllocateAction_SmallCluster-4            107.9m ± 0%   107.8m ±  5%       ~ (p=1.000 n=6)
AllocateAction_MediumCluster-4           136.2m ± 4%   133.8m ±  1%  -1.74% (p=0.026 n=6)
AllocateAction_LargeCluster-4            216.2m ± 7%   207.6m ± 13%       ~ (p=0.065 n=6)
ReclaimAction_SmallCluster-4             103.4m ± 0%   103.0m ±  0%       ~ (p=0.065 n=6)
ReclaimAction_MediumCluster-4            106.4m ± 1%   106.0m ±  0%       ~ (p=0.394 n=6)
PreemptAction_SmallCluster-4             104.0m ± 0%   103.9m ±  0%       ~ (p=0.065 n=6)
PreemptAction_MediumCluster-4            112.8m ± 1%   112.7m ±  1%       ~ (p=0.485 n=6)
ConsolidationAction_SmallCluster-4       113.9m ± 1%   113.8m ±  0%       ~ (p=0.699 n=6)
ConsolidationAction_MediumCluster-4      216.7m ± 6%   202.7m ±  4%  -6.50% (p=0.002 n=6)
FullSchedulingCycle_SmallCluster-4       106.3m ± 0%   105.4m ±  0%  -0.83% (p=0.002 n=6)
FullSchedulingCycle_MediumCluster-4      121.5m ± 3%   120.4m ±  1%  -0.93% (p=0.002 n=6)
FullSchedulingCycle_LargeCluster-4       166.9m ± 4%   161.8m ±  1%  -3.01% (p=0.002 n=6)
ManyQueues_MediumCluster-4               140.1m ± 5%   136.9m ±  1%  -2.26% (p=0.002 n=6)
GangScheduling_MediumCluster-4           156.7m ± 2%   155.4m ±  1%       ~ (p=0.240 n=6)
geomean                                  131.8m        129.8m        -1.52%

                                    │ main-bench.txt │            pr-bench.txt            │
                                    │      B/op      │     B/op      vs base              │
AllocateAction_SmallCluster-4           2.254Mi ± 0%   2.255Mi ± 1%       ~ (p=0.515 n=6)
AllocateAction_MediumCluster-4          12.20Mi ± 0%   12.20Mi ± 0%       ~ (p=0.818 n=6)
AllocateAction_LargeCluster-4           42.06Mi ± 0%   42.05Mi ± 0%       ~ (p=0.394 n=6)
ReclaimAction_SmallCluster-4            962.2Ki ± 1%   960.4Ki ± 1%       ~ (p=0.818 n=6)
ReclaimAction_MediumCluster-4           3.173Mi ± 0%   3.173Mi ± 0%       ~ (p=0.937 n=6)
PreemptAction_SmallCluster-4            1.065Mi ± 0%   1.061Mi ± 0%       ~ (p=0.132 n=6)
PreemptAction_MediumCluster-4           4.322Mi ± 0%   4.322Mi ± 0%       ~ (p=0.937 n=6)
ConsolidationAction_SmallCluster-4      5.558Mi ± 0%   5.559Mi ± 0%       ~ (p=0.310 n=6)
ConsolidationAction_MediumCluster-4     46.72Mi ± 0%   46.73Mi ± 0%       ~ (p=0.065 n=6)
FullSchedulingCycle_SmallCluster-4      1.457Mi ± 0%   1.457Mi ± 0%       ~ (p=0.589 n=6)
FullSchedulingCycle_MediumCluster-4     7.185Mi ± 0%   7.185Mi ± 0%       ~ (p=0.485 n=6)
FullSchedulingCycle_LargeCluster-4      23.51Mi ± 0%   23.51Mi ± 0%       ~ (p=0.485 n=6)
ManyQueues_MediumCluster-4              16.66Mi ± 0%   16.66Mi ± 0%       ~ (p=0.065 n=6)
GangScheduling_MediumCluster-4          17.78Mi ± 0%   17.78Mi ± 0%       ~ (p=0.818 n=6)
geomean                                 6.604Mi        6.603Mi       -0.03%

                                    │ main-bench.txt │           pr-bench.txt            │
                                    │   allocs/op    │  allocs/op   vs base              │
AllocateAction_SmallCluster-4            36.32k ± 0%   36.32k ± 0%       ~ (p=0.595 n=6)
AllocateAction_MediumCluster-4           317.9k ± 0%   317.9k ± 0%       ~ (p=0.567 n=6)
AllocateAction_LargeCluster-4            1.351M ± 0%   1.351M ± 0%       ~ (p=0.909 n=6)
ReclaimAction_SmallCluster-4             8.961k ± 0%   8.960k ± 0%       ~ (p=0.621 n=6)
ReclaimAction_MediumCluster-4            29.06k ± 0%   29.06k ± 0%       ~ (p=0.288 n=6)
PreemptAction_SmallCluster-4             11.68k ± 0%   11.68k ± 0%       ~ (p=0.149 n=6)
PreemptAction_MediumCluster-4            40.99k ± 0%   40.99k ± 0%       ~ (p=1.000 n=6)
ConsolidationAction_SmallCluster-4       71.64k ± 0%   71.65k ± 0%       ~ (p=0.394 n=6)
ConsolidationAction_MediumCluster-4      675.7k ± 0%   675.8k ± 0%       ~ (p=0.143 n=6)
FullSchedulingCycle_SmallCluster-4       21.70k ± 0%   21.70k ± 0%       ~ (p=0.574 n=6)
FullSchedulingCycle_MediumCluster-4      172.3k ± 0%   172.3k ± 0%       ~ (p=0.825 n=6)
FullSchedulingCycle_LargeCluster-4       708.8k ± 0%   708.8k ± 0%       ~ (p=0.699 n=6)
ManyQueues_MediumCluster-4               355.7k ± 0%   355.8k ± 0%       ~ (p=0.180 n=6)
GangScheduling_MediumCluster-4           581.7k ± 0%   581.7k ± 0%       ~ (p=0.981 n=6)
geomean                                  112.4k        112.4k       +0.00%

pkg: github.com/kai-scheduler/KAI-scheduler/pkg/scheduler/actions/integration_tests/reclaim
                            │ main-bench.txt │            pr-bench.txt            │
                            │     sec/op     │   sec/op     vs base               │
ReclaimLargeJobs_10Node-4       105.2m ±  1%   105.1m ± 1%        ~ (p=0.485 n=6)
ReclaimLargeJobs_50Node-4       140.7m ±  1%   139.9m ± 0%   -0.56% (p=0.026 n=6)
ReclaimLargeJobs_100Node-4      289.7m ±  5%   270.2m ± 1%   -6.71% (p=0.002 n=6)
ReclaimLargeJobs_200Node-4     1107.2m ±  8%   989.2m ± 4%  -10.65% (p=0.004 n=6)
ReclaimLargeJobs_500Node-4       11.50 ± 17%    11.32 ± 5%        ~ (p=0.485 n=6)
ReclaimLargeJobs_1000Node-4      103.6 ±  5%    102.6 ± 1%        ~ (p=0.310 n=6)
geomean                          1.335          1.288        -3.51%

                            │ main-bench.txt │            pr-bench.txt            │
                            │      B/op      │     B/op      vs base              │
ReclaimLargeJobs_10Node-4       1.876Mi ± 3%   1.875Mi ± 3%       ~ (p=1.000 n=6)
ReclaimLargeJobs_50Node-4       17.61Mi ± 0%   17.61Mi ± 0%       ~ (p=0.589 n=6)
ReclaimLargeJobs_100Node-4      59.73Mi ± 0%   59.71Mi ± 0%       ~ (p=0.699 n=6)
ReclaimLargeJobs_200Node-4      234.0Mi ± 0%   233.8Mi ± 0%  -0.08% (p=0.002 n=6)
ReclaimLargeJobs_500Node-4      1.658Gi ± 0%   1.658Gi ± 0%       ~ (p=0.485 n=6)
ReclaimLargeJobs_1000Node-4     8.527Gi ± 0%   8.527Gi ± 0%       ~ (p=0.699 n=6)
geomean                         137.8Mi        137.7Mi       -0.02%

                            │ main-bench.txt │           pr-bench.txt            │
                            │   allocs/op    │  allocs/op   vs base              │
ReclaimLargeJobs_10Node-4        20.34k ± 2%   20.34k ± 2%       ~ (p=0.814 n=6)
ReclaimLargeJobs_50Node-4        234.3k ± 0%   234.3k ± 0%       ~ (p=0.485 n=6)
ReclaimLargeJobs_100Node-4       872.5k ± 0%   872.5k ± 0%       ~ (p=0.550 n=6)
ReclaimLargeJobs_200Node-4       3.691M ± 0%   3.690M ± 0%  -0.00% (p=0.041 n=6)
ReclaimLargeJobs_500Node-4       29.66M ± 0%   29.66M ± 0%       ~ (p=0.485 n=6)
ReclaimLargeJobs_1000Node-4      165.8M ± 0%   165.8M ± 0%       ~ (p=0.589 n=6)
geomean                          2.056M        2.056M       -0.00%

Legend

  • 📉 Negative delta = Performance improvement (faster)
  • 📈 Positive delta = Performance regression (slower)
  • p-value < 0.05 indicates statistically significant change
@enoodle force-pushed the erez/fix-check-over-limit branch from 96ff369 to 7e01606 on April 1, 2026 at 12:31
@enoodle
Collaborator Author

enoodle commented Apr 1, 2026

@coderabbitai review

@coderabbitai
Contributor

coderabbitai Bot commented Apr 1, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

enoodle added 3 commits April 1, 2026 14:33
…ta enforcement

Add support for GpuFractionsNumDevices annotation in jobs_fake, enabling
multi-device GPU memory requests in integration tests.

Add integration test: a 2-device GPU memory job with DeservedGPUs=1 must
stay Pending. The node has 2 GPUs so the physical fit check passes —
only the quota check (IsTaskAllocationOnNodeOverCapacity) can block it.

Test fails RED before the fix.

Related to #1369

Signed-off-by: Erez Freiberger <enoodle@gmail.com>
…uota check

GetRequiredInitQuota returned per-device GPU fraction instead of total
across all devices. For a 2-device request at 60% each, it returned 0.6
instead of 1.2, allowing non-preemptible jobs to exceed deserved quota.

Multiply by GetNumOfGpuDevices() to match how AcceptedResourceVector
tracks the same quota.

Fixes #1369

Signed-off-by: Erez Freiberger <enoodle@gmail.com>
Signed-off-by: Erez Freiberger <enoodle@gmail.com>
@enoodle force-pushed the erez/fix-check-over-limit branch 2 times, most recently from 8372487 to e91bce9, on April 1, 2026 at 12:33
@github-actions

github-actions Bot commented Apr 1, 2026

Merging this branch will not change overall coverage

Impacted Packages Coverage Δ 🤖
github.com/kai-scheduler/KAI-scheduler/pkg/scheduler/actions/allocate 94.87% (ø)
github.com/kai-scheduler/KAI-scheduler/pkg/scheduler/api/node_info 71.59% (ø)
github.com/kai-scheduler/KAI-scheduler/pkg/scheduler/plugins/proportion/capacity_policy 98.61% (ø)
github.com/kai-scheduler/KAI-scheduler/pkg/scheduler/test_utils/jobs_fake 0.00% (ø)

Coverage by file

Changed files (no unit tests)

Changed File Coverage Δ Total Covered Missed 🤖
github.com/kai-scheduler/KAI-scheduler/pkg/scheduler/api/node_info/node_info.go 73.15% (ø) 298 218 80
github.com/kai-scheduler/KAI-scheduler/pkg/scheduler/test_utils/jobs_fake/jobs.go 0.00% (ø) 147 (+2) 0 147 (+2)

Please note that the "Total", "Covered", and "Missed" counts above refer to code statements instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code.

Changed unit test files

  • github.com/kai-scheduler/KAI-scheduler/pkg/scheduler/actions/allocate/allocateGpuMemory_test.go
  • github.com/kai-scheduler/KAI-scheduler/pkg/scheduler/plugins/proportion/capacity_policy/capacity_policy_test.go

2 similar comments
@enoodle enoodle added this pull request to the merge queue Apr 3, 2026
@github-merge-queue Bot removed this pull request from the merge queue due to failed status checks Apr 3, 2026
@enoodle enoodle added this pull request to the merge queue Apr 3, 2026
Merged via the queue into main with commit 07517ea Apr 3, 2026
33 of 35 checks passed
@enoodle enoodle deleted the erez/fix-check-over-limit branch April 3, 2026 22:24
@KaiPilotBot

Backport failed for v0.9, because it was unable to cherry-pick the commit(s).

Please cherry-pick the changes locally and resolve any conflicts.

git fetch origin v0.9
git worktree add -d .worktree/backport-1370-to-v0.9 origin/v0.9
cd .worktree/backport-1370-to-v0.9
git switch --create backport-1370-to-v0.9
git cherry-pick -x 07517ea31067b170e8b6b3110ef55d6b4739a03a

@KaiPilotBot

Backport failed for v0.6, because it was unable to cherry-pick the commit(s).

Please cherry-pick the changes locally and resolve any conflicts.

git fetch origin v0.6
git worktree add -d .worktree/backport-1370-to-v0.6 origin/v0.6
cd .worktree/backport-1370-to-v0.6
git switch --create backport-1370-to-v0.6
git cherry-pick -x 07517ea31067b170e8b6b3110ef55d6b4739a03a

@KaiPilotBot

Backport failed for v0.12, because it was unable to cherry-pick the commit(s).

Please cherry-pick the changes locally and resolve any conflicts.

git fetch origin v0.12
git worktree add -d .worktree/backport-1370-to-v0.12 origin/v0.12
cd .worktree/backport-1370-to-v0.12
git switch --create backport-1370-to-v0.12
git cherry-pick -x 07517ea31067b170e8b6b3110ef55d6b4739a03a

github-actions Bot pushed a commit that referenced this pull request Apr 3, 2026
…uota check (#1370)

Signed-off-by: Erez Freiberger <enoodle@gmail.com>
(cherry picked from commit 07517ea)
@KaiPilotBot

Successfully created backport PR for v0.13:

github-actions Bot pushed a commit that referenced this pull request Apr 3, 2026
…uota check (#1370)

Signed-off-by: Erez Freiberger <enoodle@gmail.com>
(cherry picked from commit 07517ea)
@KaiPilotBot

Successfully created backport PR for v0.14:



Development

Successfully merging this pull request may close these issues.

Non-preemptible multi-device GPU memory jobs can exceed queue deserved quota
