Add Kubernetes SharedInformerFactory cache for Jobs, Pods, and Nodes
Problem
Skaha has no informer cache. Every incoming HTTP request makes direct K8s API calls (`listNamespacedJob`, `listNamespacedPod`, `listNode`), each returning the full set of objects serialized as JSON over HTTP. With 5000+ headless jobs and the Science Portal polling every few seconds per user, each poll triggers at least 3 API calls that each return thousands of objects. This creates significant load on the K8s API server and adds unnecessary latency to every session listing, stats query, and session creation (which checks existing session counts).
Affected code paths:
- `GET /v1/session`: calls `SessionDAO.getUserSessions()` → `listNamespacedJob`
- `GET /v1/session?view=stats`: calls `SessionDAO.getAllocatedPodResources()` → `listNamespacedPod` + `NodeDAO.getCapacity()` → `listNode`
- `GET /v1/session?view=interactive`: calls `SessionDAO.getUserSessions()` → `listNamespacedJob`
- `GET /v1/session/{id}`: calls `SessionDAO.getUserSessions()` → `listNamespacedJob`
- `POST /v1/session`: calls `SessionDAO.getUserSessions()` → `listNamespacedJob` (session limit check)

Solution
Introduce a `K8SInformerCache` singleton that uses the kubernetes-client-java `SharedInformerFactory` to maintain in-memory mirrors of Jobs, Pods, and Nodes via persistent watch streams. Write operations (create/delete jobs) remain direct API calls. `PodResourceUsage` (kubectl top / metrics API) is unchanged. All DAO methods fall back to direct API calls if the cache is not running.

Changes
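A minimal sketch of how the `K8SInformerCache` singleton might wire up the Job informer (Pod and Node informers are analogous). The class shape and field names here are assumptions, not the actual implementation, and the exact parameter list of `listNamespacedJobCall` varies between kubernetes-client-java versions:

```java
import io.kubernetes.client.informer.SharedIndexInformer;
import io.kubernetes.client.informer.SharedInformerFactory;
import io.kubernetes.client.openapi.ApiClient;
import io.kubernetes.client.openapi.apis.BatchV1Api;
import io.kubernetes.client.openapi.models.V1Job;
import io.kubernetes.client.openapi.models.V1JobList;
import java.util.List;

public final class K8SInformerCache {

    private final SharedInformerFactory factory;
    private final SharedIndexInformer<V1Job> jobInformer;

    public K8SInformerCache(final ApiClient client, final String namespace) {
        final BatchV1Api batchApi = new BatchV1Api(client);
        this.factory = new SharedInformerFactory(client);

        // Persistent list+watch on Jobs in the given namespace. The nulls
        // are the unused filter parameters of the generated list call;
        // their count differs by client version.
        this.jobInformer = factory.sharedIndexInformerFor(
            params -> batchApi.listNamespacedJobCall(
                namespace, null, null, null, null, null, null,
                params.resourceVersion, null, params.timeoutSeconds,
                params.watch, null),
            V1Job.class, V1JobList.class);

        factory.startAllRegisteredInformers();
    }

    // DAOs check this before reading; if false they fall back to direct API calls.
    public boolean isRunning() {
        return jobInformer.hasSynced();
    }

    // Snapshot of the in-memory Job mirror; callers filter in memory.
    public List<V1Job> listJobs() {
        return jobInformer.getIndexer().list();
    }
}
```

The factory maintains the watch stream and resyncs periodically, so reads from `listJobs()` never touch the API server.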
- `K8SInformerCache.java`: new `SharedInformerFactory` with informers for `V1Job` (namespaced), `V1Pod` (namespaced), and `V1Node` (cluster-scoped)
- `InitializationAction.java`: starts the cache in `doInit()`
- `SessionDAO.java`: `getUserSessions()`, `getJob()`, and `getAllocatedPodResources()` read from the cache with in-memory filtering; fall back to direct API calls
- `NodeDAO.java`: `getCapacities()` reads from the cache with in-memory filtering for `unschedulable` and label selectors; adds a `matchesLabelSelector()` utility

Measurement
To verify the reduction in API server load after deployment:
LIST request counts for jobs/pods/nodes should drop to near-zero (only the initial list + periodic resync).
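Because Node objects come out of the cache unfiltered, `NodeDAO` needs to apply label selectors in memory. The `matchesLabelSelector()` utility mentioned above might look roughly like this; this sketch handles only equality-based requirements (`key=value`, `key!=value`), not the set-based `in`/`notin`/`exists` grammar, and the actual implementation may differ:

```java
import java.util.Map;

public final class LabelSelectors {

    /**
     * Returns true if the given labels satisfy an equality-based selector
     * such as "role=worker,tier=gpu" or "role!=infra".
     * An empty or null selector matches everything.
     */
    public static boolean matchesLabelSelector(final Map<String, String> labels,
                                               final String selector) {
        if (selector == null || selector.isBlank()) {
            return true;
        }
        // A selector is a comma-separated list of requirements; all must hold.
        for (final String requirement : selector.split(",")) {
            final String r = requirement.trim();
            if (r.contains("!=")) {
                final String[] kv = r.split("!=", 2);
                if (kv[1].trim().equals(labels.get(kv[0].trim()))) {
                    return false;
                }
            } else {
                final String[] kv = r.split("=", 2);
                if (kv.length != 2
                        || !kv[1].trim().equals(labels.get(kv[0].trim()))) {
                    return false;
                }
            }
        }
        return true;
    }
}
```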