K8S: add client side connection time outs in calls to avoid hanging due to network outage #296
K8S: add client side connection time outs in calls to avoid hanging due to network outage #296
Conversation
…because of site issues
There was a problem hiding this comment.
Pull request overview
This PR adds client-side connection timeouts to all Kubernetes API calls to prevent operations from hanging indefinitely during network outages. The timeout values are set uniformly to (5, 20) seconds representing connect and read timeouts respectively.
Changes:
- Added
_request_timeout=(5, 20)parameter to all Kubernetes API client method calls - Applied consistently across job creation, pod/job listing, deletion operations, config map management, secret operations, and log retrieval
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| tmp_log.debug(f"creating job {yaml_content}") | ||
|
|
||
| rsp = self.batchv1.create_namespaced_job(body=yaml_content, namespace=self.namespace) | ||
| rsp = self.batchv1.create_namespaced_job(body=yaml_content, namespace=self.namespace, _request_timeout=(5, 20)) |
There was a problem hiding this comment.
The hardcoded timeout values (5, 20) should be made configurable. Consider defining these as class constants or configuration parameters to allow adjustment without code changes. This is especially important for production environments where network conditions may vary.
| tmp_log.debug(f"creating job {yaml_content}") | ||
|
|
||
| rsp = self.batchv1.create_namespaced_job(body=yaml_content, namespace=self.namespace) | ||
| rsp = self.batchv1.create_namespaced_job(body=yaml_content, namespace=self.namespace, _request_timeout=(5, 20)) |
There was a problem hiding this comment.
The timeout tuple format (5, 20) represents (connect_timeout, read_timeout) in the Kubernetes Python client. However, the 5-second connect timeout may be too aggressive for some network environments, especially when the Kubernetes API server is under heavy load. Consider increasing the connect timeout to at least 10 seconds or making it configurable based on deployment requirements.
| rsp = self.batchv1.create_namespaced_job(body=yaml_content, namespace=self.namespace, _request_timeout=(5, 20)) | |
| rsp = self.batchv1.create_namespaced_job(body=yaml_content, namespace=self.namespace, _request_timeout=(10, 20)) |
| tmp_log = core_utils.make_logger(base_logger, f"queue_name={self.queue_name}", method_name="get_pod_logs") | ||
| try: | ||
| rsp = self.corev1.read_namespaced_pod_log(name=pod_name, namespace=self.namespace, previous=previous) | ||
| rsp = self.corev1.read_namespaced_pod_log(name=pod_name, namespace=self.namespace, previous=previous, _request_timeout=(5, 20)) |
There was a problem hiding this comment.
The read_namespaced_pod_log operation can potentially return large amounts of data when retrieving pod logs. The 20-second read timeout may be insufficient for pods with extensive logs. Consider using a longer timeout for this specific operation or implementing streaming/pagination for log retrieval.
| rsp = self.corev1.read_namespaced_pod_log(name=pod_name, namespace=self.namespace, previous=previous, _request_timeout=(5, 20)) | |
| rsp = self.corev1.read_namespaced_pod_log( | |
| name=pod_name, | |
| namespace=self.namespace, | |
| previous=previous, | |
| _request_timeout=(5, 300), | |
| ) |
|
|
||
| # create the configmap object in K8s | ||
| api_response = self.corev1.create_namespaced_config_map(namespace=self.namespace, body=config_map) | ||
| api_response = self.corev1.create_namespaced_config_map(namespace=self.namespace, body=config_map, _request_timeout=(5, 20)) |
There was a problem hiding this comment.
Variable api_response is not used.
| except ApiException as e: | ||
| tmp_log.debug(f"Exception when patching pilots-starter config_map: {e} . Try to create it instead...") | ||
| api_response = self.corev1.create_namespaced_config_map(namespace=self.namespace, body=config_map) | ||
| api_response = self.corev1.create_namespaced_config_map(namespace=self.namespace, body=config_map, _request_timeout=(5, 20)) |
There was a problem hiding this comment.
Variable api_response is not used.
|
|
||
| try: | ||
| api_response = self.corev1.patch_namespaced_config_map(name=name, body=config_map, namespace=self.namespace) | ||
| api_response = self.corev1.patch_namespaced_config_map(name=name, body=config_map, namespace=self.namespace, _request_timeout=(5, 20)) |
No description provided.