feat: make RayGrouper work with TopologyConstraints #1125

Open — rueian wants to merge 3 commits into NVIDIA:main from rueian:ray-grouper-topology

Conversation


@rueian rueian commented Mar 3, 2026

Description

This PR allows KubeRay users to set a topologyConstraint on each PodGroup subgroup through KAI scheduler annotations on the corresponding worker group's pod template.

Manual E2E Test

Given the following Topology and RayCluster:

apiVersion: kai.scheduler/v1alpha1
kind: Topology
metadata:
  name: test-topology
spec:
  levels:
  - nodeLabel: zone
  - nodeLabel: rack
  - nodeLabel: node
---
apiVersion: ray.io/v1
kind: RayCluster
metadata:
  labels:
    kai.scheduler/queue: default-queue
  name: kai-topology-verify
  namespace: default
spec:
  rayVersion: 2.54.0
  headGroupSpec:
    template:
      spec:
        containers:
        - image: rayproject/ray:2.54.0
          name: ray-head
          resources:
            requests:
              cpu: 200m
              memory: 512Mi
        schedulerName: kai-scheduler
  workerGroupSpecs:
  - groupName: rack1
    replicas: 16
    template:
      metadata:
        annotations:
          kai.scheduler/topology: test-topology
          kai.scheduler/topology-required-placement: rack
      spec:
        containers:
        - image: rayproject/ray:2.54.0
          name: ray-worker
          resources:
            requests:
              cpu: 200m
              memory: 512Mi
        schedulerName: kai-scheduler
  - groupName: rack2
    replicas: 16
    template:
      metadata:
        annotations:
          kai.scheduler/topology: test-topology
          kai.scheduler/topology-required-placement: rack
      spec:
        containers:
        - image: rayproject/ray:2.54.0
          name: ray-worker
          resources:
            requests:
              cpu: 200m
              memory: 512Mi
        schedulerName: kai-scheduler

Will result in this PodGroup:

apiVersion: scheduling.run.ai/v2alpha2
kind: PodGroup
metadata:
  annotations:
    kai.scheduler/last-start-timestamp: "2026-03-03T18:10:57Z"
    kai.scheduler/stale-podgroup-timestamp: "2026-03-04T23:29:40Z"
    kai.scheduler/top-owner-metadata: |
      name: kai-topology-verify
      uid: 792c99b0-3fc9-4b3e-82da-1335d72f84b3
      group: ray.io
      version: v1
      kind: RayCluster
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"ray.io/v1","kind":"RayCluster","metadata":{"annotations":{},"labels":{"kai.scheduler/queue":"default-queue"},"name":"kai-topology-verify","namespace":"default"},"spec":{"headGroupSpec":{"template":{"spec":{"containers":[{"image":"rayproject/ray:2.54.0","name":"ray-head","resources":{"requests":{"cpu":"200m","memory":"512Mi"}}}],"schedulerName":"kai-scheduler"}}},"rayVersion":"2.54.0","workerGroupSpecs":[{"groupName":"rack1","replicas":16,"template":{"metadata":{"annotations":{"kai.scheduler/topology":"test-topology","kai.scheduler/topology-required-placement":"rack"}},"spec":{"containers":[{"image":"rayproject/ray:2.54.0","name":"ray-worker","resources":{"requests":{"cpu":"200m","memory":"512Mi"}}}],"schedulerName":"kai-scheduler"}}},{"groupName":"rack2","replicas":16,"template":{"metadata":{"annotations":{"kai.scheduler/topology":"test-topology","kai.scheduler/topology-required-placement":"rack"}},"spec":{"containers":[{"image":"rayproject/ray:2.54.0","name":"ray-worker","resources":{"requests":{"cpu":"200m","memory":"512Mi"}}}],"schedulerName":"kai-scheduler"}}}]}}
  creationTimestamp: "2026-03-03T18:10:55Z"
  generation: 2
  labels:
    kai.scheduler/queue: default-queue
  name: pg-kai-topology-verify-792c99b0-3fc9-4b3e-82da-1335d72f84b3
  namespace: default
  ownerReferences:
  - apiVersion: ray.io/v1
    kind: RayCluster
    name: kai-topology-verify
    uid: 792c99b0-3fc9-4b3e-82da-1335d72f84b3
  resourceVersion: "2452011"
  uid: d32aa218-7225-42a9-a123-39b15b6cee81
spec:
  minMember: 33
  priorityClassName: train
  queue: default-queue
  subGroups:
  - minMember: 1
    name: headgroup
  - minMember: 16
    name: rack1
    topologyConstraint:
      requiredTopologyLevel: rack
      topology: test-topology
  - minMember: 16
    name: rack2
    topologyConstraint:
      requiredTopologyLevel: rack
      topology: test-topology
  topologyConstraint: {}
status:
  resourcesStatus:
    allocated:
      cpu: 800m
      memory: 2Gi
    requested:
      cpu: 7200m
      memory: 18Gi

Checklist

Note: Ensure your PR title follows the Conventional Commits format (e.g., feat(scheduler): add new feature)

  • Self-reviewed
  • Added/updated tests (if needed)
  • Updated documentation (if needed)

Breaking Changes

Additional Notes

Signed-off-by: Rueian Huang <rueiancsie@gmail.com>

coderabbitai bot commented Mar 3, 2026

Important

Review skipped

Auto reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.


@enoodle enoodle requested a review from itsomri March 3, 2026 08:08
Signed-off-by: Rueian Huang <rueiancsie@gmail.com>
@rueian rueian marked this pull request as ready for review March 3, 2026 18:22
@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 4f069567ad


Signed-off-by: Rueian Huang <rueiancsie@gmail.com>

github-actions bot commented Mar 4, 2026

Merging this branch will decrease overall coverage

| Impacted Packages | Coverage Δ |
| --- | --- |
| github.com/NVIDIA/KAI-scheduler/pkg/podgrouper/podgrouper/plugins/ray | 79.08% (-1.26%) 👎 |

Coverage by file

Changed files (no unit tests)

| Changed File | Coverage Δ | Total | Covered | Missed |
| --- | --- | --- | --- | --- |
| github.com/NVIDIA/KAI-scheduler/pkg/podgrouper/podgrouper/plugins/ray/ray_grouper.go | 78.23% (-1.05%) | 147 (+36) | 115 (+27) | 32 (+9) 👎 |

Please note that the "Total", "Covered", and "Missed" counts above refer to code statements instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code.

Changed unit test files

  • github.com/NVIDIA/KAI-scheduler/pkg/podgrouper/podgrouper/plugins/ray/ray_grouper_test.go


Required Configuration

1. **Queue Annotation**: Add `scheduling.run.ai/queue-name` annotation on the RayJob or RayCluster metadata to specify the scheduling queue
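A minimal illustration of that queue configuration. Both keys shown here appear in this PR: the `scheduling.run.ai/queue-name` annotation named in the comment above, and the `kai.scheduler/queue` label used in the E2E example earlier; which key applies depends on the KAI scheduler version, so treat this as a sketch rather than a definitive spec.

```yaml
apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: example-cluster
  annotations:
    scheduling.run.ai/queue-name: default-queue  # queue annotation per the comment above
  labels:
    kai.scheduler/queue: default-queue           # queue label as in the E2E example above
```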

Thanks for fixing this.
