Skip to content

feat(api): add minSubGroup field to PodGroup and SubGroup API#1127

Open
omer-dayan wants to merge 8 commits intomainfrom
dev/issue-20-min-subgroups-api
Open

feat(api): add minSubGroup field to PodGroup and SubGroup API#1127
omer-dayan wants to merge 8 commits intomainfrom
dev/issue-20-min-subgroups-api

Conversation

@omer-dayan
Copy link
Collaborator

@omer-dayan omer-dayan commented Mar 4, 2026

Summary

Add minSubGroup field to PodGroup and SubGroup API to enable hierarchical elastic gang scheduling where only a minimum number of child SubGroups need to be ready.

Changes

  • Added MinSubGroup *int32 to both PodGroupSpec and SubGroup structs in podgroup_types.go
  • Extended the validating webhook (podgroup_webhook.go) with a new validatePodGroupSpec function that enforces mutual exclusivity between minMember/minSubGroup, prevents minSubGroup on leaf SubGroups, prevents minMember on mid-level SubGroups, and checks that minSubGroup doesn't exceed the actual child count
  • Updated zz_generated.deepcopy.go to properly deep-copy the new pointer fields

Tests

  • Existing TestValidateSubGroups tests preserved and extended with 7 new sub-cases covering minSubGroup on SubGroups
  • Added new TestValidatePodGroupSpec test function with 9 cases covering all PodGroup-level validation scenarios

Architecture/API Changes

Added minSubGroup *int32 optional field to PodGroupSpec and SubGroup API types — fully backward compatible as the field defaults to nil.

Summary by CodeRabbit

Release Notes

  • New Features
    • Added minSubGroup field to PodGroup and SubGroup specifications to specify the minimum number of required child SubGroups for scheduling
    • Implemented validation rules to prevent simultaneous use of minSubGroup with minMember fields, ensuring mutually exclusive configuration

KAI Dev Agent added 7 commits March 1, 2026 17:07
Adds the new MinSubGroup *int32 field to both PodGroupSpec and SubGroup
structs, enabling users to specify the minimum number of direct child
SubGroups required for elastic gang scheduling.

Validation rules enforced via the validating webhook:
- minMember and minSubGroup are mutually exclusive at both PodGroup
  and SubGroup level
- minSubGroup must be > 0 and <= number of direct child SubGroups
- Leaf SubGroups (no children) may not use minSubGroup
- Mid-level SubGroups (has children) may not use minMember

DeepCopy functions updated to handle the new pointer field.

Refs: #20
@coderabbitai
Copy link
Contributor

coderabbitai bot commented Mar 4, 2026

📝 Walkthrough

Walkthrough

This PR introduces a minSubGroup field to PodGroup and SubGroup APIs for specifying minimum child SubGroup requirements in elastic gang scheduling. The changes include CRD schema updates, type definitions, webhook validation logic to enforce mutual exclusivity with minMember, comprehensive test coverage, and generated deep copy code.

Changes

Cohort / File(s) Summary
Documentation
CHANGELOG.md
Added changelog entry documenting minSubGroup field introduction for v0.13.0.
CRD Schema
deployments/kai-scheduler/crds/scheduling.run.ai_podgroups.yaml
Added minSubGroup integer fields to PodGroupSpec and SubGroup items with int32 format and minimum value of 1; updated topology constraint descriptions to reference topology hierarchy.
API Type Definitions
pkg/apis/scheduling/v2alpha2/podgroup_types.go
Added MinSubGroup pointer-to-int32 fields to PodGroupSpec and SubGroup structs with omitempty JSON tags; documented mutual exclusivity with MinMember and updated topology constraint comments.
Validation & Helpers
pkg/apis/scheduling/v2alpha2/podgroup_webhook.go
Refactored validation logic with new top-level validatePodGroupSpec function, added per-SubGroup validateSubGroupMinFields helper, introduced buildChildrenMap and countRootSubGroups utilities, and wired deterministic sorted SubGroup validation.
Test Coverage
pkg/apis/scheduling/v2alpha2/podgroup_webhook_test.go
Expanded TestValidateSubGroups with minSubGroup scenarios; added new TestValidatePodGroupSpec suite covering valid configurations, mutual exclusivity enforcement, child count bounds validation, and multi-level hierarchies.
Generated Code
pkg/apis/scheduling/v2alpha2/zz_generated.deepcopy.go
Added nil-safe deep copy handling for new MinSubGroup pointer fields in PodGroupSpec and SubGroup types.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐰 A field hops in, minSubGroup by name,
To schedule gangs with elastic aim,
Validation guards with mutual care,
No mixing with minMember's fair share!
Tests flourish deep, our logic runs true. ✨

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Title check ✅ Passed The PR title clearly and concisely summarizes the main change: adding a minSubGroup field to PodGroup and SubGroup API, which aligns perfectly with the changeset.
Description check ✅ Passed The PR description is comprehensive and well-structured, covering Summary, Changes, Tests, and Architecture/API Changes, though it lacks explicit checklist confirmation and doesn't reference the related issue number.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
  • 📝 Generate docstrings (stacked PR)
  • 📝 Generate docstrings (commit on current branch)
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch dev/issue-20-min-subgroups-api

Comment @coderabbitai help to get the list of available commands and usage tips.

@enoodle
Copy link
Collaborator

enoodle commented Mar 4, 2026

@coderabbitai review

@coderabbitai
Copy link
Contributor

coderabbitai bot commented Mar 4, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
CHANGELOG.md (1)

7-11: ⚠️ Potential issue | 🟡 Minor

Changelog entry should be under [Unreleased] and reference the actual PR number.

The entry is placed under [v0.13.0] which is already released (2026-03-02), but this PR is still open. Per Keep a Changelog format, unreleased changes should go under the [Unreleased] section. Also, #TBD should be replaced with #1127.

📝 Proposed fix
 ## [Unreleased]
+### Added
+- Added `minSubGroup` field to PodGroup and SubGroup API to support specifying the minimum number of child SubGroups required for elastic gang scheduling, along with validation to prevent simultaneous use of `minSubGroup` and `minMember` fields [`#1127`](https://github.com/NVIDIA/KAI-Scheduler/pull/1127) by [KAI Dev Agent](https://github.com/run-ai/KAI-Agents)

 ## [v0.13.0] - 2026-03-02
 ### Added
-- Added `minSubGroup` field to PodGroup and SubGroup API to support specifying the minimum number of child SubGroups required for elastic gang scheduling, along with validation to prevent simultaneous use of `minSubGroup` and `minMember` fields (`#TBD`) by [KAI Dev Agent](https://github.com/run-ai/KAI-Agents)
 - Added `global.nodeSelector` propagation from Helm values to Config CR...
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@CHANGELOG.md` around lines 7 - 11, The changelog entry for the new
minSubGroup field is incorrectly placed under [v0.13.0] and uses a placeholder
PR number; move the bullet about "Added `minSubGroup` field to PodGroup and
SubGroup API..." from the [v0.13.0] section into the [Unreleased] section,
replace the `#TBD` token with the real PR number `#1127`, and keep the rest of
the text unchanged (including referencing minSubGroup/minMember validation) so
the Unreleased section accurately reflects this open PR.
🧹 Nitpick comments (1)
pkg/apis/scheduling/v2alpha2/podgroup_webhook.go (1)

129-158: Consider: mid-level SubGroup with neither minMember nor minSubGroup.

The validation allows a mid-level SubGroup (one with children) to have neither minMember nor minSubGroup set. This may be intentional for backward compatibility, but consider whether this should be flagged as a warning or if documentation should clarify the expected behavior.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/apis/scheduling/v2alpha2/podgroup_webhook.go` around lines 129 - 158, The
validateSubGroupMinFields function currently allows a mid-level SubGroup
(childrenMap lookup yields children, isLeaf==false) to have neither MinMember
nor MinSubGroup set; to enforce explicit intent, modify
validateSubGroupMinFields to return an error for non-leaf SubGroup when
MinSubGroup is nil (and MinMember must already be prohibited for non-leaf), i.e.
after computing isLeaf add a check like: if !isLeaf && subGroup.MinSubGroup ==
nil { return fmt.Errorf("subgroup %q: mid-level SubGroup must set minSubGroup",
subGroup.Name) }; update any callers/tests accordingly and include the
SubGroup.Name and clear message referencing MinSubGroup to make the requirement
explicit.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Outside diff comments:
In `@CHANGELOG.md`:
- Around line 7-11: The changelog entry for the new minSubGroup field is
incorrectly placed under [v0.13.0] and uses a placeholder PR number; move the
bullet about "Added `minSubGroup` field to PodGroup and SubGroup API..." from
the [v0.13.0] section into the [Unreleased] section, replace the `#TBD` token
with the real PR number `#1127`, and keep the rest of the text unchanged
(including referencing minSubGroup/minMember validation) so the Unreleased
section accurately reflects this open PR.

---

Nitpick comments:
In `@pkg/apis/scheduling/v2alpha2/podgroup_webhook.go`:
- Around line 129-158: The validateSubGroupMinFields function currently allows a
mid-level SubGroup (childrenMap lookup yields children, isLeaf==false) to have
neither MinMember nor MinSubGroup set; to enforce explicit intent, modify
validateSubGroupMinFields to return an error for non-leaf SubGroup when
MinSubGroup is nil (and MinMember must already be prohibited for non-leaf), i.e.
after computing isLeaf add a check like: if !isLeaf && subGroup.MinSubGroup ==
nil { return fmt.Errorf("subgroup %q: mid-level SubGroup must set minSubGroup",
subGroup.Name) }; update any callers/tests accordingly and include the
SubGroup.Name and clear message referencing MinSubGroup to make the requirement
explicit.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 294c1f15-a0ac-410c-b8f0-b6170a2c224c

📥 Commits

Reviewing files that changed from the base of the PR and between a38e0b8 and 35e57f8.

📒 Files selected for processing (6)
  • CHANGELOG.md
  • deployments/kai-scheduler/crds/scheduling.run.ai_podgroups.yaml
  • pkg/apis/scheduling/v2alpha2/podgroup_types.go
  • pkg/apis/scheduling/v2alpha2/podgroup_webhook.go
  • pkg/apis/scheduling/v2alpha2/podgroup_webhook_test.go
  • pkg/apis/scheduling/v2alpha2/zz_generated.deepcopy.go

@github-actions
Copy link

github-actions bot commented Mar 4, 2026

Merging this branch will increase overall coverage

Impacted Packages Coverage Δ 🤖
github.com/NVIDIA/KAI-scheduler/pkg/apis/scheduling/v2alpha2 25.91% (+11.06%) 🎉

Coverage by file

Changed files (no unit tests)

Changed File Coverage Δ Total Covered Missed 🤖
github.com/NVIDIA/KAI-scheduler/pkg/apis/scheduling/v2alpha2/podgroup_types.go 7.69% (ø) 13 1 12
github.com/NVIDIA/KAI-scheduler/pkg/apis/scheduling/v2alpha2/podgroup_webhook.go 77.06% (+15.53%) 109 (+44) 84 (+44) 25 🎉
github.com/NVIDIA/KAI-scheduler/pkg/apis/scheduling/v2alpha2/zz_generated.deepcopy.go 0.00% (ø) 205 (+8) 0 205 (+8)

Please note that the "Total", "Covered", and "Missed" counts above refer to code statements instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code.

Changed unit test files

  • github.com/NVIDIA/KAI-scheduler/pkg/apis/scheduling/v2alpha2/podgroup_webhook_test.go

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants