Releases · NVIDIA/KAI-Scheduler · GitHub

04 Mar 16:52

enoodle

v0.12.17

What's Changed

fix: skip runtimeClassName injection when gpuPodRuntimeClassName is e… by @enoodle in #1130

Full Changelog: v0.12.16...v0.12.17

Contributors

enoodle

04 Mar 16:51

enoodle

v0.9.14

What's Changed

refactor: Represent podreferences as strings v0.9 by @itsomri in #985
fix(scheduler): bind plugin server to localhost by @gshaibi in #996
ci: add approval gatekeeper workflow for external contributor PRs by @KaiPilotBot in #1003
fix(queue-controller): use Spec.Queue field indexer for resource aggregation (#1049) by @gshaibi in #1053
chore: auto-resolve CHANGELOG.md merge conflicts with union strategy by @KaiPilotBot in #1054
fix: skip runtimeClassName injection when gpuPodRuntimeClassName is e… by @enoodle in #1131

Full Changelog: v0.9.13...v0.9.14

Contributors

enoodle, gshaibi, and 2 other contributors

02 Mar 10:32

gshaibi

v0.13.0 Latest

What's Changed

Added

Added global.nodeSelector propagation from Helm values to Config CR, ensuring operator-created sub-component deployments (admission, binder, scheduler, pod-grouper, etc.) receive the configured nodeSelector #1102 yuanchen8911
Added plugins and actions fields to SchedulingShard spec, allowing per-shard customization of scheduler plugin/action enablement, priority, and arguments gshaibi
Added support for Kubeflow Trainer v2 TrainJob workloads via skipTopOwner grouper pattern
Added binder.cdiEnabled Helm value to allow explicit override of CDI auto-detection for environments without ClusterPolicy
Added metric for tracking evicted pods in pod groups, including nodepool, eviction action, and gang size
Block scheduling of pods with shared (non-template) DRA GPU claims that lack a queue label or have a mismatched queue label gshaibi
Added the option to disable prometheus service monitor creation #810 itsomri
Fixed prometheus instance deprecation - ensure single instance #779 itsomri
Added clear error messages for jobs referencing missing or orphan queues, reporting via events and conditions #820 gshaibi
Added rule selector for resource accounting prometheus #818 itsomri
Made accounting labels configurable #818 itsomri
Added support for Grove hierarchical topology constraints in PodGroup subgroups
Added support for n-level queue hierarchies #858 gshaibi
Added labels and annotations propagation from topOwner in SkipTopOwner grouper #861 SiorMeir
Added scheduler name match conditions to admission webhooks to improve cluster stability
Added GPU DRA claims and resource slices accounting for resource management and quota guarantees. Note: this change does not support shared GPU claims or GPU claims with FirstAvailable. #900 davidLif
Added DRA resources recording to snapshot #830
Temporarily prevent device-plugin GPU pods on DRA-only nodes, until translation between device-plugin notation and DRA is implemented
Implemented subgroups for pytorchjobs #935 itsomri
Made KAI images distroless #745 dttung2905
Allow setting empty gpuPodRuntimeClassName during helm install #972 steved
Created scale test scenarios for KAI #967
Implemented block-level segmentation for pytorchjobs #938 itsomri
Added scale test environment setup script and updated service monitors for KAI scheduler #1031
Implemented subgroups for leaderworkerset #1046 davidLif
Added discovery data to snapshot for more accurate debugging #1047 itsomri
Implemented subgroup segmentation (with topology segment definitions) for leaderworkerset #1058 davidLif

Fixed

Fixed operator status conditions to be kstatus-compatible for Helm 4 --wait support: added Ready condition and fixed Reconciling condition to properly transition to false after reconciliation completes #1060
Fixed a bug where the node scale adjuster would not check whether a pod was unschedulable before creating a scaling pod, leading to unnecessary node scaling #1094 slaupster
Fixed admission webhook to skip runtimeClassName injection when gpuPodRuntimeClassName is empty #1035
Fixed topology-migration helm hook failing on OpenShift due to missing kai-topology-migration service account in the kai-system SCC #1050
Fixed a bug where queue status did not reflect its podgroups resources correctly #1049
Fixed helm uninstall not removing webhooks #959 faizan-exe
Fixed security vulnerability where PodGang could reference pods in other namespaces, preventing cross-namespace manipulation
Fixed pod controller logging to use request namespace/name instead of empty pod object fields when pod is not found
Fixed a bug where topology constraints with equal required and preferred levels would cause the preferred level not to be found.
Fixed Fair Share and Queue Order calculations for GPU memory pods
Interpret negative or zero half-life value as disabled #818 itsomri
Handle invalid CSI StorageCapacities gracefully #817 rich7420
Embed CRD definitions in binary for env-test and time-aware-simulations to allow binary portability #818 itsomri
Fixed missing podGrouper configuration in Helm template that prevented podgrouper values from being applied #860
Fixed rollback for failed bind attempts #847 itsomri
Fixed missing namespace, serviceAccountName, and appLabel fields in resourceReservation section of kai-config Helm template #860 dttung2905
If a preferred topology constraint is set, do not try to find a lowest common subtree (as part of the calculation optimizations) that is lower than the preferred level
Added dedicated usage-prometheus service for scheduler Prometheus access with configurable instance name #896 itsomri
Fixed ClusterPolicy CDI parsing for gpu-operator > v25.10.0
Fixed missing repository, tag, and pullPolicy fields in resourceReservationImage section of kai-config Helm template #895 dttung2905
Fixed a bug in ray gang scheduling where not all worker groups' minMember would be respected #924 itsomri
Fixed CPU-only nodes calculation in DRA-enabled clusters #944
Fixed the enable DRA flag override in snapshot-tool #955
Fixed ConfigMap predicate to respect the Optional field and to consider ConfigMaps in projected volumes and ephemeral containers
Fixed simulations that failed due to pod capacity on node #969 itsomri
Fixed a bug where some resource claims would remain marked as bound to devices forever
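
The half-life entry in the list above ("Interpret negative or zero half-life value as disabled") can be illustrated with a minimal sketch. It is not taken from the KAI Scheduler source: the function name decayed_usage and its parameters are hypothetical, and it assumes that "disabled" means historical usage is left undecayed.

```python
def decayed_usage(usage: float, age_seconds: float, half_life_seconds: float) -> float:
    """Apply exponential half-life decay to a historical usage sample.

    Hypothetical helper for illustration only. A zero or negative
    half-life is treated as "decay disabled" (an assumption about what
    the changelog entry means), so the usage is returned unchanged.
    """
    if half_life_seconds <= 0:
        return usage
    # The sample loses half of its weight every half_life_seconds.
    return usage * 0.5 ** (age_seconds / half_life_seconds)


print(decayed_usage(8.0, 3600, 3600))  # one half-life has elapsed
print(decayed_usage(8.0, 3600, 0))     # decay disabled
```

Treating non-positive values as "off" avoids a division by zero and gives operators a simple way to switch the feature off without a separate flag.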

Changed

Removed the constraint that prohibited direct nesting of subgroups alongside podsets within the same subgroupset.
Fixed plugin server (snapshot and job-order endpoints) listening on all interfaces by binding to localhost only.
Removed redundant connection field from GlobalConfig in favor of Prometheus.ExternalPrometheusUrl for external Prometheus URL configuration
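
The plugin server entry in the list above concerns the difference between listening on all interfaces and listening on loopback only. The sketch below shows that general technique with Python's standard socket module; it is not the scheduler's code, and the helper name open_local_listener is hypothetical.

```python
import socket


def open_local_listener(port: int = 0) -> socket.socket:
    """Open a TCP listener reachable only from the local host.

    Hypothetical helper for illustration. Binding to "0.0.0.0" would
    accept connections on every interface; binding to "127.0.0.1"
    restricts the listener to loopback. Port 0 lets the OS pick a free port.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind(("127.0.0.1", port))
    sock.listen()
    return sock


listener = open_local_listener()
host, port = listener.getsockname()
print(f"listening on {host}:{port}")
listener.close()
```

With a loopback-only bind, other processes on the same host can still reach the endpoints, while remote clients cannot connect directly.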

New Contributors

@rich7420 made their first contribution in #816
@Ronkahn21 made their first contribution in #821
@faizan-exe made their first contribution in #913
@lalitadithya made their first contribution in #954
@steved made their first contribution in #972
@yuanchen8911 made their first contribution in #1035
@Hagay-RunAI made their first contribution in #1115

Full Changelog: v0.12.0...v0.13.0

Contributors

steved, lalitadithya, and 5 other contributors

02 Mar 12:50

gshaibi

v0.12.16

What's Changed

Fixed

Fixed operator status conditions to be kstatus-compatible for Helm 4 --wait support: added Ready condition and fixed Reconciling condition to properly transition to false after reconciliation completes #1060

Full Changelog: v0.12.15...v0.12.16

25 Feb 10:44

gshaibi

v0.12.15

What's Changed

Fixed

Fixed a bug where queue status did not reflect its podgroups resources correctly #1049
Fixed topology-migration helm hook failing on OpenShift due to missing kai-topology-migration service account in the kai-system SCC #1050

Full Changelog: v0.12.14...v0.12.15

18 Feb 13:28

itsomri

v0.12.14

What's Changed

Allow configuration of plugins/actions from helm #1026 itsomri

Full Changelog: v0.12.13...v0.12.14

17 Feb 13:29

gshaibi

v0.12.13

What's Changed

Added

Added plugins and actions fields to SchedulingShard spec, allowing per-shard customization of scheduler plugin/action enablement, priority, and arguments #966 gshaibi

Full Changelog: v0.12.12...v0.12.13

12 Feb 14:32

gshaibi

v0.12.12

What's Changed

Fixed

Fixed a bug in ray gang scheduling where not all worker groups' minMember would be respected #962 itsomri
Fixed plugin server (snapshot and job-order endpoints) listening on all interfaces by binding to localhost only.

Full Changelog: v0.12.11...v0.12.12

04 Feb 09:38

enoodle

v0.12.11

What's Changed

Fixed

Added binder.cdiEnabled Helm value to allow explicit override of CDI auto-detection for environments without ClusterPolicy, fixing compatibility issues on OpenShift

Full Changelog: v0.12.10...v0.12.11

26 Jan 13:24

davidLif

v0.12.10

What's Changed

fix(scheduler): Remove pod-name label from bindingRequests - v0.12 by @github-actions[bot] in #929

Full Changelog: v0.12.9...v0.12.10
