
need migration guide for 0.12.x -> 0.13.x #1124

@sam-huang1223

Description


What happened?

ran into multiple issues attempting to upgrade; some of them are likely bugs, some of them are on us, but we should probably have a migration guide to mitigate this for others. The impact was that all pods that should have been scheduled were stuck in Pending, with no events in the describe output.

our values file:

```yaml
admission:
  replicas: 3
binder:
  replicas: 3
global:
  leaderElection: true
  requireDefaultPodAntiAffinityTerm: true
operator:
  replicaCount: 3
podgroupcontroller:
  replicas: 3
podgrouper:
  replicas: 3
queuecontroller:
  replicas: 3
scheduler:
  replicas: 3
```

3 buckets of issues:

  1. errors from pod logs upon upgrade

kai-scheduler (kai bug):

```
E0302 17:29:19.940833       1 reflector.go:205] "Failed to watch" err="failed to list *v1.ResourceClaim: the server could not find the requested resource (get resourceclaims.resource.k8s.io)" logger="UnhandledError" reflector="pkg/mod/k8s.io/client-go@v0.34.3/tools/cache/reflector.go:290" type="*v1.ResourceClaim"
E0302 17:29:39.298676       1 reflector.go:205] "Failed to watch" err="failed to list *v1.ResourceSlice: the server could not find the requested resource (get resourceslices.resource.k8s.io)" logger="UnhandledError" reflector="pkg/mod/k8s.io/client-go@v0.34.3/tools/cache/reflector.go:290" type="*v1.ResourceSlice"
```
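These watch failures look like the scheduler's informers now expect the v1 DRA API (`resource.k8s.io/v1`), which a 1.32 cluster does not serve (1.32 ships the beta versions of that group, behind a feature gate). A quick way to check which versions, if any, the cluster actually serves — the commands below are a generic sketch, not from the KAI docs:

```shell
# Show which resource.k8s.io API versions the API server advertises
# (empty/404 means the DRA group is not enabled at all on this cluster).
kubectl get --raw /apis/resource.k8s.io

# List the DRA resource kinds the cluster knows about.
kubectl api-resources --api-group=resource.k8s.io
```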

queue-controller (kai bug):

```
flag provided but not defined: -queue-label-key
Usage of /workspace/app:
  -burst int
    	Burst to the K8s API server (default 300)
  -enable-webhook
    	Enable webhook for controller manager. (default true)
  -kubeconfig string
    	Paths to a kubeconfig. Only required if out-of-cluster.
  -leader-elect
    	Enable leader election for controller manager. Enabling this will ensure there is only one active controller manager.
  -metrics-listen-address string
    	The address the metrics endpoint binds to. (default ":8080")
  -metrics-namespace string
    	Metrics namespace. (default "kai")
  -qps int
    	Queries per second to the K8s API server (default 50)
  -queue-label-to-default-metric-value value
    	Map of queue label keys to default metric values, in case the label doesn't exist on the queue, e.g. 'foo=1,baz=0'.
  -queue-label-to-metric-label value
    	Map of queue label keys to metric label keys, e.g. 'foo=bar,baz=qux'.
  -skip-controller-name-validation
    	Skip controller name validation.
  -zap-devel
    	Development Mode defaults(encoder=consoleEncoder,logLevel=Debug,stackTraceLevel=Warn). Production Mode defaults(encoder=jsonEncoder,logLevel=Info,stackTraceLevel=Error) (default true)
  -zap-encoder value
    	Zap log encoding (one of 'json' or 'console')
  -zap-log-level value
    	Zap Level to configure the verbosity of logging. Can be one of 'debug', 'info', 'error', 'panic' or any integer value > 0 which corresponds to custom debug levels of increasing verbosity
  -zap-stacktrace-level value
    	Zap Level at and above which stacktraces are captured (one of 'info', 'error', 'panic').
  -zap-time-encoding value
    	Zap time encoding (one of 'epoch', 'millis', 'nano', 'iso8601', 'rfc3339' or 'rfc3339nano'). Defaults to 'epoch'.
```

podgrouper (our issue, but it should be in the migration guide):

```
trainjobs.trainer.kubeflow.org is forbidden: User \"system:serviceaccount:kai-scheduler:pod-grouper\" cannot list resource \"trainjobs\" in API group \"trainer.kubeflow.org\" at the cluster scope"}
jobsets.jobset.x-k8s.io is forbidden: User \"system:serviceaccount:kai-scheduler:pod-grouper\" cannot list resource \"jobsets\" in API group \"jobset.x-k8s.io\" at the cluster scope"}
```
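These "forbidden" errors mean the pod-grouper service account lacks list/watch permissions on the Kubeflow Trainer and JobSet CRDs. A sketch of the missing RBAC, assuming the fix is applied out of band rather than through the chart — the ClusterRole/ClusterRoleBinding names are made up for illustration; only the service account subject comes from the error messages:

```yaml
# Hypothetical supplemental RBAC for the pod-grouper service account.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: pod-grouper-extra-crds   # assumed name
rules:
  - apiGroups: ["trainer.kubeflow.org"]
    resources: ["trainjobs"]
    verbs: ["get", "list", "watch"]
  - apiGroups: ["jobset.x-k8s.io"]
    resources: ["jobsets"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: pod-grouper-extra-crds   # assumed name
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: pod-grouper-extra-crds
subjects:
  - kind: ServiceAccount
    name: pod-grouper
    namespace: kai-scheduler
```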
  2. could not uninstall/reinstall due to hanging CRDs; we had to delete these manually:

```
configs.kai.scheduler
schedulingshards.kai.scheduler
topologies.kai.scheduler
```
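For anyone else hitting the hanging CRDs, a manual cleanup sketch (this force-clears finalizers, which is only safe once you've confirmed nothing still owns the custom resources):

```shell
# Remove leftover KAI CRDs after an uninstall that left them behind.
for crd in configs.kai.scheduler schedulingshards.kai.scheduler topologies.kai.scheduler; do
  # Start the delete without blocking; it will hang if a finalizer is stuck.
  kubectl delete crd "$crd" --ignore-not-found --wait=false
  # Clear any stuck finalizers so the delete can complete.
  kubectl patch crd "$crd" --type merge -p '{"metadata":{"finalizers":[]}}' 2>/dev/null || true
done
```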
  3. helm rollback left the install in a corrupted state. Perhaps the best advice is that kai doesn't support helm rollback and prefers helm uninstall/reinstall? Either way, the docs should make that clear.
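Until the rollback story is documented, a conservative recovery path is a clean uninstall followed by a pinned reinstall. The release name, namespace, and chart reference below are placeholders, not taken from the KAI docs:

```shell
# Roll "back" by reinstalling at a known-good pinned version
# (release/namespace names and <kai-chart-ref> are placeholders).
helm uninstall kai-scheduler -n kai-scheduler
helm upgrade --install kai-scheduler <kai-chart-ref> \
  -n kai-scheduler --create-namespace \
  --version <known-good-version> -f values.yaml
```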

What did you expect to happen?

No response

Environment

  • Kubernetes version
    1.32.x
  • KAI Scheduler version
    0.12.1
  • Cloud provider or hardware configuration
    bare metal
  • Tools that you are using KAI together with
    kyverno, kubeflow trainer v1/v2
  • Anything else that is relevant

Metadata


    Labels

    bug: Something isn't working
