
fix(scheduler): account for device count in multi-device GPU memory quota check #1370

Merged

enoodle merged 4 commits into main from erez/fix-check-over-limit on Apr 3, 2026

Conversation

@enoodle
Collaborator

@enoodle enoodle commented Mar 31, 2026

Description

GetRequiredInitQuota returned the per-device GPU fraction instead of the total across all devices for multi-device GPU memory requests. For a 2-device request at 60% each, it returned 0.6 instead of 1.2, allowing non-preemptible jobs to exceed the queue's deserved GPU quota.

The fix multiplies by GetNumOfGpuDevices() to match how AcceptedResourceVector tracks the same quota.
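The corrected accounting can be sketched as follows (a minimal sketch with hypothetical type and field names — the scheduler's real PodInfo API differs):

```go
package main

import "fmt"

// podRequest models the fields relevant to a multi-device GPU memory
// request. Field names are hypothetical; the real PodInfo type differs.
type podRequest struct {
	GpuMemoryFraction float64 // fraction of one GPU's memory, per device
	NumGpuDevices     int64   // number of requested GPU devices
}

// requiredInitQuota sketches the corrected accounting: charge the queue
// the per-device fraction times the device count, matching how
// AcceptedResourceVector tracks the same quota.
func requiredInitQuota(r podRequest) float64 {
	return r.GpuMemoryFraction * float64(r.NumGpuDevices)
}

func main() {
	// The bug scenario: 2 devices at 60% each. The buggy code returned
	// the per-device 0.6; the fix returns the total 1.2, which exceeds
	// a deserved quota of 1 GPU and correctly blocks the job.
	req := podRequest{GpuMemoryFraction: 0.6, NumGpuDevices: 2}
	fmt.Println(requiredInitQuota(req)) // prints 1.2
}
```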

Related Issues

Fixes #1369

Checklist

  • Self-reviewed
  • Added/updated tests (if needed)
  • Updated documentation (if needed)

Breaking Changes

None

Additional Notes

Tests added at three levels:

  • Unit: capacity_policy_test.go — directly tests IsTaskAllocationOnNodeOverCapacity with multi-device GPU memory PodInfo
  • Integration: allocateGpuMemory_test.go — full allocate action with 2-device GPU memory job and DeservedGPUs=1
  • E2e: over_quota_specs.go — dynamically reads node GPU memory and requests 60% × 2 devices against Quota=1. This may be more coverage than the e2e suite needs; I can re-add it if requested.
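The quota decision these tests exercise can be illustrated standalone (a simplified, hypothetical helper in the spirit of IsTaskAllocationOnNodeOverCapacity — not the real signature):

```go
package main

import "fmt"

// overCapacity is a simplified stand-in for the over-capacity decision
// made for non-preemptible jobs; names and signature are illustrative.
func overCapacity(perDeviceFraction float64, devices int64, allocated, deserved float64) bool {
	required := perDeviceFraction * float64(devices) // total across devices (post-fix)
	return allocated+required > deserved
}

func main() {
	// The integration-test scenario: 2 devices at 60% each against
	// DeservedGPUs=1. A 2-GPU node fits the job physically, so only
	// this quota check can keep it Pending.
	fmt.Println(overCapacity(0.6, 2, 0, 1.0)) // true: 1.2 > 1.0
	fmt.Println(overCapacity(0.6, 1, 0, 1.0)) // false: 0.6 <= 1.0
}
```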

🤖 Generated with Claude Code

@coderabbitai
Contributor

coderabbitai Bot commented Mar 31, 2026

Important

Review skipped

Auto reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.


@enoodle force-pushed the erez/fix-check-over-limit branch from 22ea4f3 to baef5f1 on March 31, 2026 at 22:34
…rcement

IsTaskAllocationOnNodeOverCapacity must account for all GPU devices when
checking non-preemptible quota. Add tests for:
- 2-device GPU memory request that exceeds the non-preemptible quota
- 2-GPU whole request that exceeds the queue GPU limit

Both tests fail RED before the fix.

Related to #1369

Signed-off-by: Erez Freiberger <enoodle@gmail.com>
@enoodle force-pushed the erez/fix-check-over-limit branch from baef5f1 to de7eee8 on March 31, 2026 at 22:35
@github-actions

github-actions Bot commented Mar 31, 2026

📊 Performance Benchmark Results

Comparing PR (erez/fix-check-over-limit) vs main branch
goos: linux
goarch: amd64
pkg: github.com/kai-scheduler/KAI-scheduler/pkg/scheduler/actions
cpu: Intel(R) Xeon(R) Platinum 8370C CPU @ 2.80GHz
                                    │ main-bench.txt │            pr-bench.txt            │
                                    │     sec/op     │    sec/op     vs base              │
AllocateAction_SmallCluster-4            107.9m ± 0%   107.8m ±  5%       ~ (p=1.000 n=6)
AllocateAction_MediumCluster-4           136.2m ± 4%   133.8m ±  1%  -1.74% (p=0.026 n=6)
AllocateAction_LargeCluster-4            216.2m ± 7%   207.6m ± 13%       ~ (p=0.065 n=6)
ReclaimAction_SmallCluster-4             103.4m ± 0%   103.0m ±  0%       ~ (p=0.065 n=6)
ReclaimAction_MediumCluster-4            106.4m ± 1%   106.0m ±  0%       ~ (p=0.394 n=6)
PreemptAction_SmallCluster-4             104.0m ± 0%   103.9m ±  0%       ~ (p=0.065 n=6)
PreemptAction_MediumCluster-4            112.8m ± 1%   112.7m ±  1%       ~ (p=0.485 n=6)
ConsolidationAction_SmallCluster-4       113.9m ± 1%   113.8m ±  0%       ~ (p=0.699 n=6)
ConsolidationAction_MediumCluster-4      216.7m ± 6%   202.7m ±  4%  -6.50% (p=0.002 n=6)
FullSchedulingCycle_SmallCluster-4       106.3m ± 0%   105.4m ±  0%  -0.83% (p=0.002 n=6)
FullSchedulingCycle_MediumCluster-4      121.5m ± 3%   120.4m ±  1%  -0.93% (p=0.002 n=6)
FullSchedulingCycle_LargeCluster-4       166.9m ± 4%   161.8m ±  1%  -3.01% (p=0.002 n=6)
ManyQueues_MediumCluster-4               140.1m ± 5%   136.9m ±  1%  -2.26% (p=0.002 n=6)
GangScheduling_MediumCluster-4           156.7m ± 2%   155.4m ±  1%       ~ (p=0.240 n=6)
geomean                                  131.8m        129.8m        -1.52%

                                    │ main-bench.txt │            pr-bench.txt            │
                                    │      B/op      │     B/op      vs base              │
AllocateAction_SmallCluster-4           2.254Mi ± 0%   2.255Mi ± 1%       ~ (p=0.515 n=6)
AllocateAction_MediumCluster-4          12.20Mi ± 0%   12.20Mi ± 0%       ~ (p=0.818 n=6)
AllocateAction_LargeCluster-4           42.06Mi ± 0%   42.05Mi ± 0%       ~ (p=0.394 n=6)
ReclaimAction_SmallCluster-4            962.2Ki ± 1%   960.4Ki ± 1%       ~ (p=0.818 n=6)
ReclaimAction_MediumCluster-4           3.173Mi ± 0%   3.173Mi ± 0%       ~ (p=0.937 n=6)
PreemptAction_SmallCluster-4            1.065Mi ± 0%   1.061Mi ± 0%       ~ (p=0.132 n=6)
PreemptAction_MediumCluster-4           4.322Mi ± 0%   4.322Mi ± 0%       ~ (p=0.937 n=6)
ConsolidationAction_SmallCluster-4      5.558Mi ± 0%   5.559Mi ± 0%       ~ (p=0.310 n=6)
ConsolidationAction_MediumCluster-4     46.72Mi ± 0%   46.73Mi ± 0%       ~ (p=0.065 n=6)
FullSchedulingCycle_SmallCluster-4      1.457Mi ± 0%   1.457Mi ± 0%       ~ (p=0.589 n=6)
FullSchedulingCycle_MediumCluster-4     7.185Mi ± 0%   7.185Mi ± 0%       ~ (p=0.485 n=6)
FullSchedulingCycle_LargeCluster-4      23.51Mi ± 0%   23.51Mi ± 0%       ~ (p=0.485 n=6)
ManyQueues_MediumCluster-4              16.66Mi ± 0%   16.66Mi ± 0%       ~ (p=0.065 n=6)
GangScheduling_MediumCluster-4          17.78Mi ± 0%   17.78Mi ± 0%       ~ (p=0.818 n=6)
geomean                                 6.604Mi        6.603Mi       -0.03%

                                    │ main-bench.txt │           pr-bench.txt            │
                                    │   allocs/op    │  allocs/op   vs base              │
AllocateAction_SmallCluster-4            36.32k ± 0%   36.32k ± 0%       ~ (p=0.595 n=6)
AllocateAction_MediumCluster-4           317.9k ± 0%   317.9k ± 0%       ~ (p=0.567 n=6)
AllocateAction_LargeCluster-4            1.351M ± 0%   1.351M ± 0%       ~ (p=0.909 n=6)
ReclaimAction_SmallCluster-4             8.961k ± 0%   8.960k ± 0%       ~ (p=0.621 n=6)
ReclaimAction_MediumCluster-4            29.06k ± 0%   29.06k ± 0%       ~ (p=0.288 n=6)
PreemptAction_SmallCluster-4             11.68k ± 0%   11.68k ± 0%       ~ (p=0.149 n=6)
PreemptAction_MediumCluster-4            40.99k ± 0%   40.99k ± 0%       ~ (p=1.000 n=6)
ConsolidationAction_SmallCluster-4       71.64k ± 0%   71.65k ± 0%       ~ (p=0.394 n=6)
ConsolidationAction_MediumCluster-4      675.7k ± 0%   675.8k ± 0%       ~ (p=0.143 n=6)
FullSchedulingCycle_SmallCluster-4       21.70k ± 0%   21.70k ± 0%       ~ (p=0.574 n=6)
FullSchedulingCycle_MediumCluster-4      172.3k ± 0%   172.3k ± 0%       ~ (p=0.825 n=6)
FullSchedulingCycle_LargeCluster-4       708.8k ± 0%   708.8k ± 0%       ~ (p=0.699 n=6)
ManyQueues_MediumCluster-4               355.7k ± 0%   355.8k ± 0%       ~ (p=0.180 n=6)
GangScheduling_MediumCluster-4           581.7k ± 0%   581.7k ± 0%       ~ (p=0.981 n=6)
geomean                                  112.4k        112.4k       +0.00%

pkg: github.com/kai-scheduler/KAI-scheduler/pkg/scheduler/actions/integration_tests/reclaim
                            │ main-bench.txt │            pr-bench.txt            │
                            │     sec/op     │   sec/op     vs base               │
ReclaimLargeJobs_10Node-4       105.2m ±  1%   105.1m ± 1%        ~ (p=0.485 n=6)
ReclaimLargeJobs_50Node-4       140.7m ±  1%   139.9m ± 0%   -0.56% (p=0.026 n=6)
ReclaimLargeJobs_100Node-4      289.7m ±  5%   270.2m ± 1%   -6.71% (p=0.002 n=6)
ReclaimLargeJobs_200Node-4     1107.2m ±  8%   989.2m ± 4%  -10.65% (p=0.004 n=6)
ReclaimLargeJobs_500Node-4       11.50 ± 17%    11.32 ± 5%        ~ (p=0.485 n=6)
ReclaimLargeJobs_1000Node-4      103.6 ±  5%    102.6 ± 1%        ~ (p=0.310 n=6)
geomean                          1.335          1.288        -3.51%

                            │ main-bench.txt │            pr-bench.txt            │
                            │      B/op      │     B/op      vs base              │
ReclaimLargeJobs_10Node-4       1.876Mi ± 3%   1.875Mi ± 3%       ~ (p=1.000 n=6)
ReclaimLargeJobs_50Node-4       17.61Mi ± 0%   17.61Mi ± 0%       ~ (p=0.589 n=6)
ReclaimLargeJobs_100Node-4      59.73Mi ± 0%   59.71Mi ± 0%       ~ (p=0.699 n=6)
ReclaimLargeJobs_200Node-4      234.0Mi ± 0%   233.8Mi ± 0%  -0.08% (p=0.002 n=6)
ReclaimLargeJobs_500Node-4      1.658Gi ± 0%   1.658Gi ± 0%       ~ (p=0.485 n=6)
ReclaimLargeJobs_1000Node-4     8.527Gi ± 0%   8.527Gi ± 0%       ~ (p=0.699 n=6)
geomean                         137.8Mi        137.7Mi       -0.02%

                            │ main-bench.txt │           pr-bench.txt            │
                            │   allocs/op    │  allocs/op   vs base              │
ReclaimLargeJobs_10Node-4        20.34k ± 2%   20.34k ± 2%       ~ (p=0.814 n=6)
ReclaimLargeJobs_50Node-4        234.3k ± 0%   234.3k ± 0%       ~ (p=0.485 n=6)
ReclaimLargeJobs_100Node-4       872.5k ± 0%   872.5k ± 0%       ~ (p=0.550 n=6)
ReclaimLargeJobs_200Node-4       3.691M ± 0%   3.690M ± 0%  -0.00% (p=0.041 n=6)
ReclaimLargeJobs_500Node-4       29.66M ± 0%   29.66M ± 0%       ~ (p=0.485 n=6)
ReclaimLargeJobs_1000Node-4      165.8M ± 0%   165.8M ± 0%       ~ (p=0.589 n=6)
geomean                          2.056M        2.056M       -0.00%

Legend

  • 📉 Negative delta = Performance improvement (faster)
  • 📈 Positive delta = Performance regression (slower)
  • p-value < 0.05 indicates statistically significant change
@enoodle force-pushed the erez/fix-check-over-limit branch from 96ff369 to 7e01606 on April 1, 2026 at 12:31
@enoodle
Collaborator Author

enoodle commented Apr 1, 2026

@coderabbitai review

@coderabbitai
Contributor

coderabbitai Bot commented Apr 1, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

enoodle added 3 commits April 1, 2026 14:33
…ta enforcement

Add support for GpuFractionsNumDevices annotation in jobs_fake, enabling
multi-device GPU memory requests in integration tests.

Add integration test: a 2-device GPU memory job with DeservedGPUs=1 must
stay Pending. The node has 2 GPUs so the physical fit check passes —
only the quota check (IsTaskAllocationOnNodeOverCapacity) can block it.

Test fails RED before the fix.

Related to #1369

Signed-off-by: Erez Freiberger <enoodle@gmail.com>
…uota check

GetRequiredInitQuota returned per-device GPU fraction instead of total
across all devices. For a 2-device request at 60% each, it returned 0.6
instead of 1.2, allowing non-preemptible jobs to exceed deserved quota.

Multiply by GetNumOfGpuDevices() to match how AcceptedResourceVector
tracks the same quota.

Fixes #1369

Signed-off-by: Erez Freiberger <enoodle@gmail.com>
Signed-off-by: Erez Freiberger <enoodle@gmail.com>
@enoodle force-pushed the erez/fix-check-over-limit branch 2 times, most recently from 8372487 to e91bce9, on April 1, 2026 at 12:33
@github-actions

github-actions Bot commented Apr 1, 2026

Merging this branch will not change overall coverage

Impacted Packages Coverage Δ 🤖
github.com/kai-scheduler/KAI-scheduler/pkg/scheduler/actions/allocate 94.87% (ø)
github.com/kai-scheduler/KAI-scheduler/pkg/scheduler/api/node_info 71.59% (ø)
github.com/kai-scheduler/KAI-scheduler/pkg/scheduler/plugins/proportion/capacity_policy 98.61% (ø)
github.com/kai-scheduler/KAI-scheduler/pkg/scheduler/test_utils/jobs_fake 0.00% (ø)

Coverage by file

Changed files (no unit tests)

Changed File Coverage Δ Total Covered Missed 🤖
github.com/kai-scheduler/KAI-scheduler/pkg/scheduler/api/node_info/node_info.go 73.15% (ø) 298 218 80
github.com/kai-scheduler/KAI-scheduler/pkg/scheduler/test_utils/jobs_fake/jobs.go 0.00% (ø) 147 (+2) 0 147 (+2)

Please note that the "Total", "Covered", and "Missed" counts above refer to code statements instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code.

Changed unit test files

  • github.com/kai-scheduler/KAI-scheduler/pkg/scheduler/actions/allocate/allocateGpuMemory_test.go
  • github.com/kai-scheduler/KAI-scheduler/pkg/scheduler/plugins/proportion/capacity_policy/capacity_policy_test.go

2 similar comments
@enoodle enoodle added this pull request to the merge queue Apr 3, 2026
@github-merge-queue Bot removed this pull request from the merge queue due to failed status checks Apr 3, 2026
@enoodle enoodle added this pull request to the merge queue Apr 3, 2026
Merged via the queue into main with commit 07517ea Apr 3, 2026
33 of 35 checks passed
@enoodle enoodle deleted the erez/fix-check-over-limit branch April 3, 2026 22:24
@KaiPilotBot

Backport failed for v0.9, because it was unable to cherry-pick the commit(s).

Please cherry-pick the changes locally and resolve any conflicts.

git fetch origin v0.9
git worktree add -d .worktree/backport-1370-to-v0.9 origin/v0.9
cd .worktree/backport-1370-to-v0.9
git switch --create backport-1370-to-v0.9
git cherry-pick -x 07517ea31067b170e8b6b3110ef55d6b4739a03a

@KaiPilotBot

Backport failed for v0.6, because it was unable to cherry-pick the commit(s).

Please cherry-pick the changes locally and resolve any conflicts.

git fetch origin v0.6
git worktree add -d .worktree/backport-1370-to-v0.6 origin/v0.6
cd .worktree/backport-1370-to-v0.6
git switch --create backport-1370-to-v0.6
git cherry-pick -x 07517ea31067b170e8b6b3110ef55d6b4739a03a

@KaiPilotBot

Backport failed for v0.12, because it was unable to cherry-pick the commit(s).

Please cherry-pick the changes locally and resolve any conflicts.

git fetch origin v0.12
git worktree add -d .worktree/backport-1370-to-v0.12 origin/v0.12
cd .worktree/backport-1370-to-v0.12
git switch --create backport-1370-to-v0.12
git cherry-pick -x 07517ea31067b170e8b6b3110ef55d6b4739a03a

github-actions Bot pushed a commit that referenced this pull request Apr 3, 2026
…uota check (#1370)

Signed-off-by: Erez Freiberger <enoodle@gmail.com>
(cherry picked from commit 07517ea)
@KaiPilotBot

Successfully created backport PR for v0.13:

github-actions Bot pushed a commit that referenced this pull request Apr 3, 2026
…uota check (#1370)

Signed-off-by: Erez Freiberger <enoodle@gmail.com>
(cherry picked from commit 07517ea)
@KaiPilotBot

Successfully created backport PR for v0.14:



Development

Successfully merging this pull request may close these issues.

Non-preemptible multi-device GPU memory jobs can exceed queue deserved quota
