feat: make RayGrouper work with TopologyConstraints #1125

Open — rueian wants to merge 3 commits into NVIDIA:main from rueian:ray-grouper-topology

Conversation


@rueian rueian commented Mar 3, 2026

Description

This PR allows KubeRay users to set a topologyConstraint on each PodGroup subgroup through KAI scheduler annotations on the corresponding worker group's pod template.

Manual E2E Test

Given the following Topology and RayCluster:

apiVersion: kai.scheduler/v1alpha1
kind: Topology
metadata:
  name: test-topology
spec:
  levels:
  - nodeLabel: zone
  - nodeLabel: rack
  - nodeLabel: node
---
apiVersion: ray.io/v1
kind: RayCluster
metadata:
  labels:
    kai.scheduler/queue: default-queue
  name: kai-topology-verify
  namespace: default
spec:
  rayVersion: 2.54.0
  headGroupSpec:
    template:
      spec:
        containers:
        - image: rayproject/ray:2.54.0
          name: ray-head
          resources:
            requests:
              cpu: 200m
              memory: 512Mi
        schedulerName: kai-scheduler
  workerGroupSpecs:
  - groupName: rack1
    replicas: 16
    template:
      metadata:
        annotations:
          kai.scheduler/topology: test-topology
          kai.scheduler/topology-required-placement: rack
      spec:
        containers:
        - image: rayproject/ray:2.54.0
          name: ray-worker
          resources:
            requests:
              cpu: 200m
              memory: 512Mi
        schedulerName: kai-scheduler
  - groupName: rack2
    replicas: 16
    template:
      metadata:
        annotations:
          kai.scheduler/topology: test-topology
          kai.scheduler/topology-required-placement: rack
      spec:
        containers:
        - image: rayproject/ray:2.54.0
          name: ray-worker
          resources:
            requests:
              cpu: 200m
              memory: 512Mi
        schedulerName: kai-scheduler

Will result in this PodGroup:

apiVersion: scheduling.run.ai/v2alpha2
kind: PodGroup
metadata:
  annotations:
    kai.scheduler/last-start-timestamp: "2026-03-03T18:10:57Z"
    kai.scheduler/stale-podgroup-timestamp: "2026-03-04T23:29:40Z"
    kai.scheduler/top-owner-metadata: |
      name: kai-topology-verify
      uid: 792c99b0-3fc9-4b3e-82da-1335d72f84b3
      group: ray.io
      version: v1
      kind: RayCluster
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"ray.io/v1","kind":"RayCluster","metadata":{"annotations":{},"labels":{"kai.scheduler/queue":"default-queue"},"name":"kai-topology-verify","namespace":"default"},"spec":{"headGroupSpec":{"template":{"spec":{"containers":[{"image":"rayproject/ray:2.54.0","name":"ray-head","resources":{"requests":{"cpu":"200m","memory":"512Mi"}}}],"schedulerName":"kai-scheduler"}}},"rayVersion":"2.54.0","workerGroupSpecs":[{"groupName":"rack1","replicas":16,"template":{"metadata":{"annotations":{"kai.scheduler/topology":"test-topology","kai.scheduler/topology-required-placement":"rack"}},"spec":{"containers":[{"image":"rayproject/ray:2.54.0","name":"ray-worker","resources":{"requests":{"cpu":"200m","memory":"512Mi"}}}],"schedulerName":"kai-scheduler"}}},{"groupName":"rack2","replicas":16,"template":{"metadata":{"annotations":{"kai.scheduler/topology":"test-topology","kai.scheduler/topology-required-placement":"rack"}},"spec":{"containers":[{"image":"rayproject/ray:2.54.0","name":"ray-worker","resources":{"requests":{"cpu":"200m","memory":"512Mi"}}}],"schedulerName":"kai-scheduler"}}}]}}
  creationTimestamp: "2026-03-03T18:10:55Z"
  generation: 2
  labels:
    kai.scheduler/queue: default-queue
  name: pg-kai-topology-verify-792c99b0-3fc9-4b3e-82da-1335d72f84b3
  namespace: default
  ownerReferences:
  - apiVersion: ray.io/v1
    kind: RayCluster
    name: kai-topology-verify
    uid: 792c99b0-3fc9-4b3e-82da-1335d72f84b3
  resourceVersion: "2452011"
  uid: d32aa218-7225-42a9-a123-39b15b6cee81
spec:
  minMember: 33
  priorityClassName: train
  queue: default-queue
  subGroups:
  - minMember: 1
    name: headgroup
  - minMember: 16
    name: rack1
    topologyConstraint:
      requiredTopologyLevel: rack
      topology: test-topology
  - minMember: 16
    name: rack2
    topologyConstraint:
      requiredTopologyLevel: rack
      topology: test-topology
  topologyConstraint: {}
status:
  resourcesStatus:
    allocated:
      cpu: 800m
      memory: 2Gi
    requested:
      cpu: 7200m
      memory: 18Gi

Checklist

Note: Ensure your PR title follows the Conventional Commits format (e.g., feat(scheduler): add new feature)

  • Self-reviewed
  • Added/updated tests (if needed)
  • Updated documentation (if needed)

Breaking Changes

Additional Notes

Signed-off-by: Rueian Huang <rueiancsie@gmail.com>

coderabbitai bot commented Mar 3, 2026

Important

Review skipped

Auto reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.


@enoodle enoodle requested a review from itsomri March 3, 2026 08:08
Signed-off-by: Rueian Huang <rueiancsie@gmail.com>
@rueian rueian marked this pull request as ready for review March 3, 2026 18:22
@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 4f069567ad


Signed-off-by: Rueian Huang <rueiancsie@gmail.com>

github-actions bot commented Mar 4, 2026

Merging this branch will decrease overall coverage

| Impacted Packages | Coverage Δ |
| --- | --- |
| github.com/NVIDIA/KAI-scheduler/pkg/podgrouper/podgrouper/plugins/ray | 79.08% (-1.26%) 👎 |

Coverage by file

Changed files (no unit tests)

| Changed File | Coverage Δ | Total | Covered | Missed |
| --- | --- | --- | --- | --- |
| github.com/NVIDIA/KAI-scheduler/pkg/podgrouper/podgrouper/plugins/ray/ray_grouper.go | 78.23% (-1.05%) | 147 (+36) | 115 (+27) | 32 (+9) 👎 |

Please note that the "Total", "Covered", and "Missed" counts above refer to code statements instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code.

Changed unit test files

  • github.com/NVIDIA/KAI-scheduler/pkg/podgrouper/podgrouper/plugins/ray/ray_grouper_test.go


Required Configuration

1. **Queue Annotation**: Add `scheduling.run.ai/queue-name` annotation on the RayJob or RayCluster metadata to specify the scheduling queue
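A minimal illustration of that queue configuration. Both keys shown here appear in this PR: the `scheduling.run.ai/queue-name` annotation named in the comment above, and the `kai.scheduler/queue` label used in the E2E example earlier; which key applies depends on the KAI scheduler version, so treat this as a sketch rather than a definitive spec.

```yaml
apiVersion: ray.io/v1
kind: RayCluster
metadata:
  name: example-cluster
  annotations:
    scheduling.run.ai/queue-name: default-queue  # queue annotation per the comment above
  labels:
    kai.scheduler/queue: default-queue           # queue label as in the E2E example above
```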

Thanks for fixing this.
