Labels: bug
Description
What happened?
Ran into multiple issues attempting to upgrade. Some of them are likely bugs, some are on us, but we should probably have a migration guide to mitigate this for others. The impact was that all pods that should have been scheduled were stuck in Pending, with no events in the `kubectl describe` output.
Our values file:

```yaml
admission:
  replicas: 3
binder:
  replicas: 3
global:
  leaderElection: true
  requireDefaultPodAntiAffinityTerm: true
operator:
  replicaCount: 3
podgroupcontroller:
  replicas: 3
podgrouper:
  replicas: 3
queuecontroller:
  replicas: 3
scheduler:
  replicas: 3
```
Three buckets of issues:
- Errors from pod logs upon upgrade

kai-scheduler (kai bug):

```
E0302 17:29:19.940833 1 reflector.go:205] "Failed to watch" err="failed to list *v1.ResourceClaim: the server could not find the requested resource (get resourceclaims.resource.k8s.io)" logger="UnhandledError" reflector="pkg/mod/k8s.io/client-go@v0.34.3/tools/cache/reflector.go:290" type="*v1.ResourceClaim"
E0302 17:29:39.298676 1 reflector.go:205] "Failed to watch" err="failed to list *v1.ResourceSlice: the server could not find the requested resource (get resourceslices.resource.k8s.io)" logger="UnhandledError" reflector="pkg/mod/k8s.io/client-go@v0.34.3/tools/cache/reflector.go:290" type="*v1.ResourceSlice"
```
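For context: these errors look like the scheduler (built against client-go v0.34.3) listing the GA `resource.k8s.io/v1` DRA types, which Kubernetes 1.32 likely does not serve (1.32 ships that group at a beta version). A quick way to confirm what the apiserver actually exposes, assuming kubectl access:

```shell
# Which versions of the DRA API group does this apiserver serve?
kubectl api-versions | grep resource.k8s.io

# Which resources (ResourceClaim, ResourceSlice, ...) exist in the group?
kubectl api-resources --api-group=resource.k8s.io
```

If `resource.k8s.io/v1` is absent from the first command's output, the watch failures above are expected on this cluster version.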
queue-controller (kai bug):

```
flag provided but not defined: -queue-label-key
Usage of /workspace/app:
  -burst int
        Burst to the K8s API server (default 300)
  -enable-webhook
        Enable webhook for controller manager. (default true)
  -kubeconfig string
        Paths to a kubeconfig. Only required if out-of-cluster.
  -leader-elect
        Enable leader election for controller manager. Enabling this will ensure there is only one active controller manager.
  -metrics-listen-address string
        The address the metrics endpoint binds to. (default ":8080")
  -metrics-namespace string
        Metrics namespace. (default "kai")
  -qps int
        Queries per second to the K8s API server (default 50)
  -queue-label-to-default-metric-value value
        Map of queue label keys to default metric values, in case the label doesn't exist on the queue, e.g. 'foo=1,baz=0'.
  -queue-label-to-metric-label value
        Map of queue label keys to metric label keys, e.g. 'foo=bar,baz=qux'.
  -skip-controller-name-validation
        Skip controller name validation.
  -zap-devel
        Development Mode defaults(encoder=consoleEncoder,logLevel=Debug,stackTraceLevel=Warn). Production Mode defaults(encoder=jsonEncoder,logLevel=Info,stackTraceLevel=Error) (default true)
  -zap-encoder value
        Zap log encoding (one of 'json' or 'console')
  -zap-log-level value
        Zap Level to configure the verbosity of logging. Can be one of 'debug', 'info', 'error', 'panic'or any integer value > 0 which corresponds to custom debug levels of increasing verbosity
  -zap-stacktrace-level value
        Zap Level at and above which stacktraces are captured (one of 'info', 'error', 'panic').
  -zap-time-encoding value
        Zap time encoding (one of 'epoch', 'millis', 'nano', 'iso8601', 'rfc3339' or 'rfc3339nano'). Defaults to 'epoch'.
```
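`flag provided but not defined` usually indicates a chart/image version skew: the deployment passes a flag that the running binary doesn't know about. A way to compare the installed chart revision against the image the controller actually runs (deployment name and namespace here are assumptions; adjust to your install):

```shell
# Installed chart revision and app version
helm list -n kai-scheduler

# Image the queue-controller pod template actually references
kubectl -n kai-scheduler get deploy queue-controller \
  -o jsonpath='{.spec.template.spec.containers[0].image}'
```

If the image tag doesn't match the chart's expected app version, that mismatch would explain the unknown-flag crash.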
podgrouper (our issue, but should be in a migration guide):

```
trainjobs.trainer.kubeflow.org is forbidden: User \"system:serviceaccount:kai-scheduler:pod-grouper\" cannot list resource \"trainjobs\" in API group \"trainer.kubeflow.org\" at the cluster scope"}
jobsets.jobset.x-k8s.io is forbidden: User \"system:serviceaccount:kai-scheduler:pod-grouper\" cannot list resource \"jobsets\" in API group \"jobset.x-k8s.io\" at the cluster scope"}
```
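For anyone hitting the same RBAC errors, a minimal sketch of the extra permissions the logs ask for. The role/binding names are made up for illustration; the API groups, resources, and the `pod-grouper` service account in the `kai-scheduler` namespace come from the log lines above. Verify against your installed chart's RBAC before applying:

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: kai-pod-grouper-extra   # hypothetical name
rules:
  - apiGroups: ["trainer.kubeflow.org"]
    resources: ["trainjobs"]
    verbs: ["get", "list", "watch"]
  - apiGroups: ["jobset.x-k8s.io"]
    resources: ["jobsets"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: kai-pod-grouper-extra   # hypothetical name
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: kai-pod-grouper-extra
subjects:
  - kind: ServiceAccount
    name: pod-grouper
    namespace: kai-scheduler
```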
- Could not uninstall/reinstall due to hanging CRDs; had to manually delete:

```
configs.kai.scheduler
schedulingshards.kai.scheduler
topologies.kai.scheduler
```
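A rough sketch of the manual cleanup, in case it helps others (destructive; assumes kubectl access and that no custom resources of these types need to be kept). Stuck CRDs are commonly held by finalizers, so clearing those first may be needed:

```shell
for crd in configs.kai.scheduler schedulingshards.kai.scheduler topologies.kai.scheduler; do
  # Clear finalizers that can keep the CRD stuck in Terminating
  kubectl patch crd "$crd" --type merge -p '{"metadata":{"finalizers":[]}}'
  kubectl delete crd "$crd" --ignore-not-found
done
```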
- helm rollback left the install in a corrupted state. Perhaps the best advice is that KAI doesn't support helm rollbacks and prefers helm uninstall/reinstall? Either way, the docs should probably make that clear.
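If uninstall/reinstall really is the supported downgrade path, a sketch of what that might look like (release name, namespace, and chart reference are placeholders; substitute your own):

```shell
helm uninstall kai-scheduler -n kai-scheduler

# Delete any leftover kai.scheduler CRDs here if they block reinstall

helm install kai-scheduler <chart-ref> \
  -n kai-scheduler --version <previous-version> -f values.yaml
```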
What did you expect to happen?
No response
Environment
- Kubernetes version: 1.32.x
- KAI Scheduler version: 0.12.1
- Cloud provider or hardware configuration: bare metal
- Tools that you are using KAI together with: kyverno, kubeflow trainer v1/v2
- Anything else that is relevant: No response