cp-monitoing-srv deployment fails regularly (roughly once every couple of weeks) with java.net.SocketException: Too many open files
[INFO ] 2026-03-17 15:19:25.308 [taskScheduler-1] GpuUsageMonitoringService - Collecting gpu usages...
[ERROR] 2026-03-17 15:19:25.310 [taskScheduler-1] NodeReporterService - An error occurred while sending request to k8s
io.fabric8.kubernetes.client.KubernetesClientException: Operation: [list] for kind: [Node] with name: [null] in namespace: [default] failed.
at io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:64) ~[kubernetes-client-4.6.1.jar!/:?]
at io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:72) ~[kubernetes-client-4.6.1.jar!/:?]
at io.fabric8.kubernetes.client.dsl.base.BaseOperation.listRequestHelper(BaseOperation.java:157) ~[kubernetes-client-4.6.1.jar!/:?]
at io.fabric8.kubernetes.client.dsl.base.BaseOperation.list(BaseOperation.java:620) ~[kubernetes-client-4.6.1.jar!/:?]
at io.fabric8.kubernetes.client.dsl.base.BaseOperation.list(BaseOperation.java:69) ~[kubernetes-client-4.6.1.jar!/:?]
at com.epam.pipeline.monitor.service.k8s.KubernetesUtils.findNodesByLabel(KubernetesUtils.java:68) ~[classes!/:0.16.0.17505.d65f8b0b594c941d01b851f71e1c27fdedcfc18b]
at com.epam.pipeline.monitor.service.reporter.NodeReporterService.getGpuNodeNames(NodeReporterService.java:78) ~[classes!/:0.16.0.17505.d65f8b0b594c941d01b851f71e1c27fdedcfc18b]
at com.epam.pipeline.monitor.service.reporter.NodeReporterService.collectGpuUsages(NodeReporterService.java:66) ~[classes!/:0.16.0.17505.d65f8b0b594c941d01b851f71e1c27fdedcfc18b]
at com.epam.pipeline.monitor.monitoring.node.GpuUsageMonitoringService.monitor(GpuUsageMonitoringService.java:71) ~[classes!/:0.16.0.17505.d65f8b0b594c941d01b851f71e1c27fdedcfc18b]
at org.springframework.scheduling.support.DelegatingErrorHandlingRunnable.run(DelegatingErrorHandlingRunnable.java:54) [spring-context-5.0.6.RELEASE.jar!/:5.0.6.RELEASE]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [?:1.8.0_412]
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) [?:1.8.0_412]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) [?:1.8.0_412]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) [?:1.8.0_412]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_412]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_412]
at java.lang.Thread.run(Thread.java:750) [?:1.8.0_412]
Caused by: java.net.SocketException: Too many open files
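
"Too many open files" means the process has hit its file descriptor limit, and the failure only shows up after a couple of weeks, so descriptor usage may be growing slowly over time. The following is a minimal diagnostic sketch, not a fix, for logging the current descriptor count from inside the JVM. The class name `FdUsage` and the method `fdCounts` are hypothetical and not part of the service; the sketch assumes a Unix-like host where the JDK exposes `com.sun.management.UnixOperatingSystemMXBean`.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;

// Hypothetical helper, not part of cp-monitoing-srv.
public class FdUsage {

    /**
     * Returns {open, max} file descriptor counts for this JVM,
     * or {-1, -1} when the platform bean does not expose them.
     */
    static long[] fdCounts() {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        if (os instanceof com.sun.management.UnixOperatingSystemMXBean) {
            com.sun.management.UnixOperatingSystemMXBean unix =
                    (com.sun.management.UnixOperatingSystemMXBean) os;
            return new long[] {
                unix.getOpenFileDescriptorCount(),
                unix.getMaxFileDescriptorCount()
            };
        }
        // Assumption: non-Unix platforms are simply reported as unknown.
        return new long[] {-1L, -1L};
    }

    public static void main(String[] args) {
        long[] counts = fdCounts();
        System.out.println("open file descriptors: " + counts[0] + " / limit " + counts[1]);
    }
}
```

Logging these two numbers on each scheduled monitoring run would show whether the open count climbs steadily toward the limit between restarts, which would point to a descriptor leak rather than a limit that is simply too low.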