Releases: NVIDIA/KAI-Scheduler

v0.12.17

04 Mar 16:52
6f8c8d7

What's Changed

  • fix: skip runtimeClassName injection when gpuPodRuntimeClassName is e… by @enoodle in #1130

Full Changelog: v0.12.16...v0.12.17

v0.9.14

04 Mar 16:51
29b76ef

What's Changed

  • refactor: Represent podreferences as strings v0.9 by @itsomri in #985
  • fix(scheduler): bind plugin server to localhost by @gshaibi in #996
  • ci: add approval gatekeeper workflow for external contributor PRs by @KaiPilotBot in #1003
  • fix(queue-controller): use Spec.Queue field indexer for resource aggregation (#1049) by @gshaibi in #1053
  • chore: auto-resolve CHANGELOG.md merge conflicts with union strategy by @KaiPilotBot in #1054
  • fix: skip runtimeClassName injection when gpuPodRuntimeClassName is e… by @enoodle in #1131

Full Changelog: v0.9.13...v0.9.14

v0.13.0

02 Mar 10:32
5d35eca

What's Changed

Added

  • Added global.nodeSelector propagation from Helm values to Config CR, ensuring operator-created sub-component deployments (admission, binder, scheduler, pod-grouper, etc.) receive the configured nodeSelector #1102 yuanchen8911
  • Added plugins and actions fields to SchedulingShard spec, allowing per-shard customization of scheduler plugin/action enablement, priority, and arguments gshaibi
  • Added support for Kubeflow Trainer v2 TrainJob workloads via skipTopOwner grouper pattern
  • Added binder.cdiEnabled Helm value to allow explicit override of CDI auto-detection for environments without ClusterPolicy
  • Added metric for tracking evicted pods in pod groups, including nodepool, eviction action, and gang size
  • Block scheduling of pods with shared (non-template) DRA GPU claims that lack a queue label or have a mismatched queue label gshaibi
  • Added the option to disable Prometheus service monitor creation #810 itsomri
  • Fixed Prometheus instance deprecation - ensure single instance #779 itsomri
  • Added clear error messages for jobs referencing missing or orphan queues, reporting via events and conditions #820 gshaibi
  • Added rule selector for resource accounting Prometheus #818 itsomri
  • Made accounting labels configurable #818 itsomri
  • Added support for Grove hierarchical topology constraints in PodGroup subgroups
  • Added support for n-level queue hierarchies #858 gshaibi
  • Added labels and annotations propagation from topOwner in SkipTopOwner grouper #861 SiorMeir
  • Added scheduler name match conditions to admission webhooks to improve cluster stability
  • Added GPU DRA claim and resource slice accounting for resource management and quota guarantees. *** This change doesn't support shared GPU claims or GPU claims with FirstAvailable *** #900 davidLif
  • Added DRA resources recording to snapshot #830
  • Temporarily prevent device-plugin GPU pods on DRA-only nodes until translation between device-plugin notation and DRA is implemented
  • Implemented subgroups for pytorchjobs #935 itsomri
  • Made KAI images distroless #745 dttung2905
  • Allow setting empty gpuPodRuntimeClassName during helm install #972 steved
  • Created scale tests scenarios for running scale tests for KAI #967
  • Implemented block-level segmentation for pytorchjobs #938 itsomri
  • Added scale test environment setup script and updated service monitors for KAI scheduler #1031
  • Implemented subgroups for leaderworkerset #1046 davidLif
  • Added discovery data to snapshot for more accurate debugging #1047 itsomri
  • Implemented subgroup segmentation (with topology segment definitions) for leaderworkerset #1058 davidLif

Fixed

  • Fixed operator status conditions to be kstatus-compatible for Helm 4 --wait support: added Ready condition and fixed Reconciling condition to properly transition to false after reconciliation completes #1060
  • Fixed a bug where the node scale adjuster would not check if a pod was unschedulable before creating a scaling pod, leading to unnecessary node scaling #1094 slaupster
  • Fixed admission webhook to skip runtimeClassName injection when gpuPodRuntimeClassName is empty #1035
  • Fixed topology-migration helm hook failing on OpenShift due to missing kai-topology-migration service account in the kai-system SCC #1050
  • Fixed a bug where queue status did not reflect its podgroups resources correctly #1049
  • Fixed helm uninstall not removing webhooks #959 faizan-exe
  • Fixed security vulnerability where PodGang could reference pods in other namespaces, preventing cross-namespace manipulation
  • Fixed pod controller logging to use request namespace/name instead of empty pod object fields when pod is not found
  • Fixed a bug where topology constraints with equal required and preferred levels would cause the preferred level not to be found.
  • Fixed GPU memory pods Fair Share and Queue Order calculations
  • Interpret negative or zero half-life value as disabled #818 itsomri
  • Handle invalid CSI StorageCapacities gracefully #817 rich7420
  • Embed CRD definitions in binary for env-test and time-aware-simulations to allow binary portability #818 itsomri
  • Fixed missing podGrouper configuration in Helm template that prevented podgrouper values from being applied #860
  • Fixed rollback for failed bind attempts #847 itsomri
  • Fixed missing namespace, serviceAccountName, and appLabel fields in resourceReservation section of kai-config Helm template #860 dttung2905
  • If a preferred topology constraint is set, do not try to find a lowest common subtree (as part of the calculation optimizations) that is lower than the preferred level
  • Added dedicated usage-prometheus service for scheduler Prometheus access with configurable instance name #896 itsomri
  • Fixed ClusterPolicy CDI parsing for gpu-operator > v25.10.0
  • Fixed missing repository, tag, and pullPolicy fields in resourceReservationImage section of kai-config Helm template #895 dttung2905
  • Fixed a bug in ray gang scheduling where not all worker groups' minMember would be respected #924 itsomri
  • Fixed CPU-only nodes calculation in DRA-enabled clusters #944
  • Fixed enable DRA flag override in snapshot-tool #955
  • Fixed ConfigMap predicate to respect the Optional field and now considers ConfigMaps in projected volumes and ephemeral containers
  • Fixed simulations that failed due to pod capacity on node #969 itsomri
  • Fixed a bug where some resource claims would remain marked as bound to devices forever

Changed

  • Removed the constraint that prohibited direct nesting of subgroups alongside podsets within the same subgroupset.
  • Fixed plugin server (snapshot and job-order endpoints) listening on all interfaces by binding to localhost only.
  • Removed redundant connection field from GlobalConfig in favor of Prometheus.ExternalPrometheusUrl for external Prometheus URL configuration

Full Changelog: v0.12.0...v0.13.0

v0.12.16

02 Mar 12:50
9111f6d

What's Changed

Fixed

  • Fixed operator status conditions to be kstatus-compatible for Helm 4 --wait support: added Ready condition and fixed Reconciling condition to properly transition to false after reconciliation completes #1060

Full Changelog: v0.12.15...v0.12.16

v0.12.15

25 Feb 10:44
5e32fa4

What's Changed

Fixed

  • Fixed a bug where queue status did not reflect its podgroups resources correctly #1049
  • Fixed topology-migration helm hook failing on OpenShift due to missing kai-topology-migration service account in the kai-system SCC #1050

Full Changelog: v0.12.14...v0.12.15

v0.12.14

18 Feb 13:28
58322ee

What's Changed

  • Allow configuration of plugins/actions from helm #1026 itsomri

Full Changelog: v0.12.13...v0.12.14

v0.12.13

17 Feb 13:29
c08ed22

What's Changed

Added

  • Added plugins and actions fields to SchedulingShard spec, allowing per-shard customization of scheduler plugin/action enablement, priority, and arguments #966 gshaibi

Full Changelog: v0.12.12...v0.12.13

v0.12.12

12 Feb 14:32
166473f

What's Changed

Fixed

  • Fixed a bug in ray gang scheduling where not all worker groups' minMember would be respected #962 itsomri
  • Fixed plugin server (snapshot and job-order endpoints) listening on all interfaces by binding to localhost only.

Full Changelog: v0.12.11...v0.12.12

v0.12.11

04 Feb 09:38
5c3065f

What's Changed

Fixed

  • Added binder.cdiEnabled Helm value to allow explicit override of CDI auto-detection for environments without ClusterPolicy, fixing compatibility issues in OpenShift

Full Changelog: v0.12.10...v0.12.11

v0.12.10

26 Jan 13:24
0018270

What's Changed

  • fix(scheduler): Remove pod-name label from bindingRequests - v0.12 by @github-actions[bot] in #929

Full Changelog: v0.12.9...v0.12.10