Releases: NVIDIA/KAI-Scheduler
v0.12.17
v0.9.14
What's Changed
- refactor: Represent podreferences as strings v0.9 by @itsomri in #985
- fix(scheduler): bind plugin server to localhost by @gshaibi in #996
- ci: add approval gatekeeper workflow for external contributor PRs by @KaiPilotBot in #1003
- fix(queue-controller): use Spec.Queue field indexer for resource aggregation (#1049) by @gshaibi in #1053
- chore: auto-resolve CHANGELOG.md merge conflicts with union strategy by @KaiPilotBot in #1054
- fix: skip runtimeClassName injection when gpuPodRuntimeClassName is e… by @enoodle in #1131
Full Changelog: v0.9.13...v0.9.14
v0.13.0
What's Changed
Added
- Added `global.nodeSelector` propagation from Helm values to Config CR, ensuring operator-created sub-component deployments (admission, binder, scheduler, pod-grouper, etc.) receive the configured nodeSelector #1102 yuanchen8911
- Added `plugins` and `actions` fields to SchedulingShard spec, allowing per-shard customization of scheduler plugin/action enablement, priority, and arguments gshaibi
- Added support for Kubeflow Trainer v2 TrainJob workloads via skipTopOwner grouper pattern
- Added `binder.cdiEnabled` Helm value to allow explicit override of CDI auto-detection for environments without ClusterPolicy
- Added metric for tracking evicted pods in pod groups, including nodepool, eviction action, and gang size
- Block scheduling of pods with shared (non-template) DRA GPU claims that lack a queue label or have a mismatched queue label gshaibi
- Added the option to disable prometheus service monitor creation #810 itsomri
- Fixed prometheus instance deprecation - ensure single instance #779 itsomri
- Added clear error messages for jobs referencing missing or orphan queues, reporting via events and conditions #820 gshaibi
- Added rule selector for resource accounting prometheus #818 itsomri
- Made accounting labels configurable #818 itsomri
- Added support for Grove hierarchical topology constraints in PodGroup subgroups
- Added support for n-level queue hierarchies #858 gshaibi
- Added labels and annotations propagation from topOwner in SkipTopOwner grouper #861 SiorMeir
- Added scheduler name match conditions to admission webhooks to improve cluster stability
- Added GPU DRA claims and resource slices accounting for resource management and quota guarantees. *** This change doesn't support shared GPU claims or GPU claims with FirstAvailable *** #900 davidLif
- Added DRA resources recording to snapshot #830
- Temporarily prevent device-plugin GPU pods on DRA-only nodes until translation between device-plugin notation and DRA is implemented
- Implemented subgroups for pytorchjobs #935 itsomri
- Made KAI images distroless #745 dttung2905
- Allow setting empty gpuPodRuntimeClassName during helm install #972 steved
- Created scale test scenarios for KAI #967
- Implemented block-level segmentation for pytorchjobs #938 itsomri
- Added scale test environment setup script and updated service monitors for KAI scheduler #1031
- Implemented subgroups for leaderworkerset #1046 davidLif
- Added discovery data to snapshot for more accurate debugging #1047 itsomri
- Implemented subgroup segmentation (with topology segment definitions) for leaderworkerset #1058 davidLif
Fixed
- Fixed operator status conditions to be kstatus-compatible for Helm 4 `--wait` support: added `Ready` condition and fixed `Reconciling` condition to properly transition to false after reconciliation completes #1060
- Fixed a bug where the node scale adjuster would not check if a pod was unschedulable before creating a scaling pod leading to unnecessary node scaling #1094 slaupster
- Fixed admission webhook to skip runtimeClassName injection when gpuPodRuntimeClassName is empty #1035
- Fixed topology-migration helm hook failing on OpenShift due to missing `kai-topology-migration` service account in the `kai-system` SCC #1050
- Fixed a bug where queue status did not reflect its podgroups resources correctly #1049
- Fixed helm uninstall not removing webhooks #959 faizan-exe
- Fixed security vulnerability where PodGang could reference pods in other namespaces, preventing cross-namespace manipulation
- Fixed pod controller logging to use request namespace/name instead of empty pod object fields when pod is not found
- Fixed a bug where topology constraints with equal required and preferred levels would cause the preferred level not to be found.
- Fixed GPU memory pods Fair Share and Queue Order calculations
- Interpret negative or zero half-life value as disabled #818 itsomri
- Handle invalid CSI StorageCapacities gracefully #817 rich7420
- Embed CRD definitions in binary for env-test and time-aware-simulations to allow binary portability #818 itsomri
- Fixed missing `podGrouper` configuration in Helm template that prevented podgrouper values from being applied #860
- Fixed rollback for failed bind attempts #847 itsomri
- Fixed missing `namespace`, `serviceAccountName`, and `appLabel` fields in `resourceReservation` section of kai-config Helm template #860 dttung2905
- If a preferred topology constraint is set, do not try to find a lowest common subtree (as a part of the calculations optimizations) which is lower than the preferred level
- Added dedicated `usage-prometheus` service for scheduler Prometheus access with configurable instance name #896 itsomri
- ClusterPolicy CDI parsing for gpu-operator > v25.10.0
- Fixed missing `repository`, `tag`, and `pullPolicy` fields in `resourceReservationImage` section of kai-config Helm template #895 dttung2905
- Fixed a bug in ray gang scheduling where not all worker groups' minMember would be respected #924 itsomri
- cpu-only nodes calculation in DRA enabled clusters #944
- enable DRA flag override fix in snapshot-tool #955
- Fixed ConfigMap predicate to respect the Optional field and now considers ConfigMaps in projected volumes and ephemeral containers
- Fixed simulations that failed due to pod capacity on node #969 itsomri
- Fixed a bug where some resource claims would remain marked as bound to devices forever
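The half-life entry in the list above (#818) can be pictured with a minimal Python sketch. This is not KAI code: the function name, the units, and the reading of "disabled" as "no decay is applied" are all assumptions made for illustration.

```python
def decay_weight(age_seconds: float, half_life_seconds: float) -> float:
    """Weight given to a usage sample that is `age_seconds` old.

    Hypothetical helper: standard exponential half-life decay, where a
    sample loses half of its weight every `half_life_seconds`.
    """
    if half_life_seconds <= 0:
        # Assumption: a zero or negative half-life means the feature is
        # disabled, so every sample keeps its full weight.
        return 1.0
    return 0.5 ** (age_seconds / half_life_seconds)


if __name__ == "__main__":
    print(decay_weight(60, 60))   # one half-life old
    print(decay_weight(100, 0))   # decay disabled
    print(decay_weight(100, -5))  # decay disabled
```

Treating non-positive values as "off" avoids a division by zero and gives operators a single, unambiguous way to switch the decay off.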
Changed
- Removed the constraint that prohibited direct nesting of subgroups alongside podsets within the same subgroupset.
- Fixed plugin server (snapshot and job-order endpoints) listening on all interfaces by binding to localhost only.
- Removed redundant `connection` field from `GlobalConfig` in favor of `Prometheus.ExternalPrometheusUrl` for external Prometheus URL configuration
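The difference between listening on all interfaces and binding to localhost only, as in the plugin server change above, can be sketched generically. The example below uses Python's standard library purely for illustration; `SnapshotHandler` and `start_local_server` are hypothetical names and do not reflect the scheduler's actual implementation.

```python
import http.client
import http.server
import threading


class SnapshotHandler(http.server.BaseHTTPRequestHandler):
    """Hypothetical stand-in for an internal endpoint; not KAI code."""

    def do_GET(self):
        body = b"ok"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the example quiet


def start_local_server(port=0):
    # "127.0.0.1" restricts the listener to the loopback interface.
    # An empty host or "0.0.0.0" would accept connections on every interface.
    server = http.server.HTTPServer(("127.0.0.1", port), SnapshotHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


if __name__ == "__main__":
    server = start_local_server()
    host, port = server.server_address
    conn = http.client.HTTPConnection(host, port, timeout=5)
    conn.request("GET", "/")
    print(host, conn.getresponse().status)
    conn.close()
    server.shutdown()
    server.server_close()
```

With a loopback bind, the endpoint is reachable only from processes that share the same network namespace, which in Kubernetes generally means containers in the same pod.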
New Contributors
- @rich7420 made their first contribution in #816
- @Ronkahn21 made their first contribution in #821
- @faizan-exe made their first contribution in #913
- @lalitadithya made their first contribution in #954
- @steved made their first contribution in #972
- @yuanchen8911 made their first contribution in #1035
- @Hagay-RunAI made their first contribution in #1115
Full Changelog: v0.12.0...v0.13.0
v0.12.16
What's Changed
Fixed
- Fixed operator status conditions to be kstatus-compatible for Helm 4 `--wait` support: added `Ready` condition and fixed `Reconciling` condition to properly transition to false after reconciliation completes #1060
Full Changelog: v0.12.15...v0.12.16
v0.12.15
What's Changed
Fixed
- Fixed a bug where queue status did not reflect its podgroups resources correctly #1049
- Fixed topology-migration helm hook failing on OpenShift due to missing `kai-topology-migration` service account in the `kai-system` SCC #1050
Full Changelog: v0.12.14...v0.12.15
v0.12.14
v0.12.13
v0.12.12
What's Changed
Fixed
- Fixed a bug in ray gang scheduling where not all worker groups' minMember would be respected #962 itsomri
- Fixed plugin server (snapshot and job-order endpoints) listening on all interfaces by binding to localhost only.
Full Changelog: v0.12.11...v0.12.12
v0.12.11
What's Changed
Fixed
- Added `binder.cdiEnabled` Helm value to allow explicit override of CDI auto-detection for environments without ClusterPolicy, fixing compatibility issues in OpenShift
Full Changelog: v0.12.10...v0.12.11