fix(binder): check active BindRequests before deleting reservation pods #1104
Conversation
Create an integration test using envtest that starts the full binder controller and reproduces the race condition where SyncForGpuGroup prematurely deletes reservation pods when the informer cache lags behind (fraction pods lack GPU group labels). Setup: 4 nodes x 8 GPUs = 32 GPU groups with pre-created reservation pods and BindRequests. The binder's SyncForNode sees reservation pods with no matching labeled fraction pods and deletes them, confirming the bug exists before the fix.
…fraction binding

Creates many concurrent 0.5-fraction GPU pods under binpack mode across multiple rounds to reproduce the race where SyncForGpuGroup prematurely deletes reservation pods due to informer cache lag. Verified to fail reliably on the current main branch (5/64 pods stuck on the second run).
…ondition

Add reservation_race_test.go, which reproduces the race where SyncForGpuGroup prematurely deletes reservation pods when the informer cache hasn't propagated GPU group labels on recently-bound fraction pods. The test creates the exact preconditions (reservation pod + active BindRequest, no labeled fraction pod) and calls SyncForGpuGroup to verify the behavior. Also exports the resource reservation service in suite_test.go so the integration test can call SyncForGpuGroup directly.
Before deleting a reservation pod during SyncForGpuGroup, check if any non-terminal BindRequests reference the GPU group. This prevents premature deletion caused by informer cache lag where the GPU group label on a recently-bound fraction pod has not yet propagated to the cache. The BindRequest serves as a durable intent signal that survives the cache lag window, making the sync logic idempotent against concurrent binding operations.
Force-pushed from ee29169 to 97a03f4
Merging this branch will decrease overall coverage.

Coverage by file: changed files (no unit tests).

Please note that the "Total", "Covered", and "Missed" counts above refer to code statements rather than lines of code. The value in brackets refers to the test coverage of that file in the old version of the code.

Changed unit test files
davidLif left a comment:

Do we need tests both on the env-test and the e2e levels? How long do they take?
```go
	br.Status.Phase == schedulingv1alpha2.BindRequestPhaseFailed {
	continue
}
if slices.Contains(br.Spec.SelectedGPUGroups, gpuGroup) {
```
Maybe you should add an index to easily get all binding requests for the gpu group?
📊 Performance Benchmark Results: comparing the PR branch against main (legend and raw benchmark data omitted).
Description
Fixes a race condition where the binder's resource reservation sync logic (`syncForPods`) prematurely deletes GPU reservation pods during concurrent fractional GPU binding. The race occurs because the informer cache may show a newly-created reservation pod before the corresponding user pod's `runai-gpu-group` label patch has propagated, causing the sync to conclude that no fraction pods exist for the GPU group and delete the reservation.

The fix adds a check for active (non-succeeded, non-failed) BindRequests referencing the GPU group before deleting a reservation pod. Since BindRequests are created before binding starts and persist until completion, they serve as a durable intent signal that prevents premature cleanup during the cache lag window.
Related Issues
Fixes #1103
Checklist
Breaking Changes
None.
Additional Notes
Design Document
See `docs/developer/designs/fix-reservation-pod-race/design.md` for the full analysis.

Verification

On `origin/main` (before the fix): 31/32 reservation pods survive; 1 is prematurely deleted, confirming the race.

Commits
aafe0f2d366b3be266b41b87ee29169f