Informer cache#1034

Open
szautkin wants to merge 1 commit into opencadc:main from szautkin:skaha-informer-cache

Conversation


@szautkin szautkin commented Mar 5, 2026

Add Kubernetes SharedInformerFactory cache for Jobs, Pods, and Nodes

Problem

Skaha has no informer cache. Every incoming HTTP request makes direct K8s API calls — listNamespacedJob, listNamespacedPod, listNode — each returning the full set of objects serialized as JSON over HTTP.

With 5000+ headless jobs and the Science Portal polling every few seconds per user, each poll triggers at minimum 3 API calls that each return thousands of objects. This creates significant load on the K8s API server and adds unnecessary latency to every session listing, stats query, and session creation (which checks existing session counts).

Affected code paths:

  • GET /v1/session — calls SessionDAO.getUserSessions() → listNamespacedJob
  • GET /v1/session?view=stats — calls SessionDAO.getAllocatedPodResources() → listNamespacedPod and NodeDAO.getCapacity() → listNode
  • GET /v1/session?view=interactive — calls SessionDAO.getUserSessions() → listNamespacedJob
  • GET /v1/session/{id} — calls SessionDAO.getUserSessions() → listNamespacedJob
  • POST /v1/session — calls SessionDAO.getUserSessions() → listNamespacedJob (session limit check)

Solution

Introduce a K8SInformerCache singleton that uses the kubernetes-client-java SharedInformerFactory to maintain in-memory mirrors of Jobs, Pods, and Nodes via persistent watch streams.

  • On startup: One initial LIST + a persistent WATCH connection per resource type (3 total)
  • On each request: All read operations are served from the in-memory cache with zero network calls. Filtering by user, session ID, type, and pod status happens in-memory.
  • Resync: 30-second periodic resync to ensure consistency
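
A minimal sketch of the informer wiring described above, assuming the kubernetes-client-java `SharedInformerFactory` API (the exact parameter list of the generated `listNamespacedJobCall` builder varies by client version; `apiClient` and `namespace` are assumed to exist):

```java
// Sketch only: register a shared index informer for V1Job with a 30 s resync.
// Pod and Node informers follow the same shape (Node is cluster-scoped).
SharedInformerFactory factory = new SharedInformerFactory(apiClient);
BatchV1Api batchApi = new BatchV1Api(apiClient);

SharedIndexInformer<V1Job> jobInformer = factory.sharedIndexInformerFor(
        params -> batchApi.listNamespacedJobCall(
                namespace, null, null, null, null, null, null,
                params.resourceVersion, null, params.timeoutSeconds,
                params.watch, null),
        V1Job.class, V1JobList.class,
        30_000L); // periodic resync interval in milliseconds

factory.startAllRegisteredInformers();
// Subsequent reads come from jobInformer.getIndexer().list(), with no network call.
```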

Write operations (delete/create jobs) remain as direct API calls. PodResourceUsage (kubectl top / metrics API) is unchanged. All DAO methods fall back to direct API calls if the cache is not running.
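
The read-with-fallback behavior can be sketched independently of the Kubernetes client. The class and field names below are hypothetical, not the actual DAO code; the point is only the dispatch: serve from the cache when the informer is running, otherwise fall back to a direct API call.

```java
import java.util.List;
import java.util.function.Supplier;

// Hypothetical sketch of the fallback pattern: reads go to the in-memory
// informer store when the cache is running, and to the K8s API otherwise.
class CachedReader {
    private final Supplier<Boolean> cacheRunning;     // e.g. K8SInformerCache state
    private final Supplier<List<String>> cacheRead;   // in-memory informer store
    private final Supplier<List<String>> directRead;  // direct K8s LIST call

    CachedReader(Supplier<Boolean> cacheRunning,
                 Supplier<List<String>> cacheRead,
                 Supplier<List<String>> directRead) {
        this.cacheRunning = cacheRunning;
        this.cacheRead = cacheRead;
        this.directRead = directRead;
    }

    List<String> list() {
        return cacheRunning.get() ? cacheRead.get() : directRead.get();
    }
}
```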

Changes

| File | Change |
| --- | --- |
| `K8SInformerCache.java` | New singleton managing `SharedInformerFactory` with informers for `V1Job` (namespaced), `V1Pod` (namespaced), and `V1Node` (cluster-scoped) |
| `InitializationAction.java` | Starts the informer cache after K8s API client initialization in `doInit()` |
| `SessionDAO.java` | `getUserSessions()`, `getJob()`, and `getAllocatedPodResources()` read from the cache with in-memory filtering; fall back to direct API calls |
| `NodeDAO.java` | `getCapacities()` reads from the cache with in-memory filtering for unschedulable nodes and label selectors; adds a `matchesLabelSelector()` utility |
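
As an illustration of the `matchesLabelSelector()` utility, here is a hedged sketch assuming it handles equality-based selectors of the form `key=value,key2=value2` (the actual implementation in `NodeDAO.java` may support more selector syntax):

```java
import java.util.Map;

// Hypothetical sketch of an equality-based label-selector matcher, in the
// spirit of the matchesLabelSelector() utility added to NodeDAO.
final class LabelSelectors {
    static boolean matchesLabelSelector(Map<String, String> labels, String selector) {
        if (selector == null || selector.isBlank()) {
            return true; // empty selector matches everything
        }
        for (String clause : selector.split(",")) {
            String[] kv = clause.trim().split("=", 2);
            // reject malformed clauses and mismatched values
            if (kv.length != 2 || !kv[1].equals(labels.get(kv[0]))) {
                return false;
            }
        }
        return true;
    }
}
```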

Measurement

To verify the reduction in API server load after deployment:

kubectl get --raw /metrics | grep apiserver_request_total | grep LIST | grep -E 'jobs|pods|nodes'

LIST request counts for jobs/pods/nodes should drop to near-zero (only the initial list + periodic resync).
