"monitoring-service" leaks file handles #4332

@sidoruka

The cp-monitoring-srv deployment fails regularly (roughly once every couple of weeks) with java.net.SocketException: Too many open files.

[INFO ] 2026-03-17 15:19:25.308 [taskScheduler-1] GpuUsageMonitoringService - Collecting gpu usages...
[ERROR] 2026-03-17 15:19:25.310 [taskScheduler-1] NodeReporterService - An error occurred while sending request to k8s
io.fabric8.kubernetes.client.KubernetesClientException: Operation: [list]  for kind: [Node]  with name: [null]  in namespace: [default]  failed.
        at io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:64) ~[kubernetes-client-4.6.1.jar!/:?]
        at io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:72) ~[kubernetes-client-4.6.1.jar!/:?]
        at io.fabric8.kubernetes.client.dsl.base.BaseOperation.listRequestHelper(BaseOperation.java:157) ~[kubernetes-client-4.6.1.jar!/:?]
        at io.fabric8.kubernetes.client.dsl.base.BaseOperation.list(BaseOperation.java:620) ~[kubernetes-client-4.6.1.jar!/:?]
        at io.fabric8.kubernetes.client.dsl.base.BaseOperation.list(BaseOperation.java:69) ~[kubernetes-client-4.6.1.jar!/:?]
        at com.epam.pipeline.monitor.service.k8s.KubernetesUtils.findNodesByLabel(KubernetesUtils.java:68) ~[classes!/:0.16.0.17505.d65f8b0b594c941d01b851f71e1c27fdedcfc18b]
        at com.epam.pipeline.monitor.service.reporter.NodeReporterService.getGpuNodeNames(NodeReporterService.java:78) ~[classes!/:0.16.0.17505.d65f8b0b594c941d01b851f71e1c27fdedcfc18b]
        at com.epam.pipeline.monitor.service.reporter.NodeReporterService.collectGpuUsages(NodeReporterService.java:66) ~[classes!/:0.16.0.17505.d65f8b0b594c941d01b851f71e1c27fdedcfc18b]
        at com.epam.pipeline.monitor.monitoring.node.GpuUsageMonitoringService.monitor(GpuUsageMonitoringService.java:71) ~[classes!/:0.16.0.17505.d65f8b0b594c941d01b851f71e1c27fdedcfc18b]
        at org.springframework.scheduling.support.DelegatingErrorHandlingRunnable.run(DelegatingErrorHandlingRunnable.java:54) [spring-context-5.0.6.RELEASE.jar!/:5.0.6.RELEASE]
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [?:1.8.0_412]
        at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) [?:1.8.0_412]
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) [?:1.8.0_412]
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) [?:1.8.0_412]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_412]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_412]
        at java.lang.Thread.run(Thread.java:750) [?:1.8.0_412]
Caused by: java.net.SocketException: Too many open files
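
The trace shows where the descriptor limit was finally hit, not which resource is leaking. One way to confirm steady growth before the failure is to log the JVM's open file descriptor count periodically. The sketch below is only an illustration: the class name `OpenFdProbe` is hypothetical and not part of the monitoring service, and it assumes a HotSpot/OpenJDK JVM on a Unix-like host, where the platform bean implements `com.sun.management.UnixOperatingSystemMXBean`.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;

// Hypothetical diagnostic helper; not taken from the monitoring service code base.
public class OpenFdProbe {

    // Returns the number of file descriptors currently open in this JVM,
    // or -1 when the platform bean does not expose that counter.
    public static long openFileDescriptors() {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        if (os instanceof com.sun.management.UnixOperatingSystemMXBean) {
            return ((com.sun.management.UnixOperatingSystemMXBean) os).getOpenFileDescriptorCount();
        }
        return -1L;
    }

    // Returns the per-process descriptor limit seen by the JVM, or -1 if unavailable.
    public static long maxFileDescriptors() {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        if (os instanceof com.sun.management.UnixOperatingSystemMXBean) {
            return ((com.sun.management.UnixOperatingSystemMXBean) os).getMaxFileDescriptorCount();
        }
        return -1L;
    }

    public static void main(String[] args) {
        // Print a single sample; a scheduled task could log this on every monitoring cycle.
        System.out.printf("open file descriptors: %d of %d%n",
                openFileDescriptors(), maxFileDescriptors());
    }
}
```

If the logged count climbs with every monitoring cycle and never drops, that would point to sockets or streams that are opened per run and not closed. Which component owns them still has to be established separately.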

Labels: kind/bug
