Skip to content

K8S: add client side connection time outs in calls to avoid hanging due to network outage #296

Merged
fbarreir merged 1 commit intomasterfrom
k8s_timeouts
Jan 19, 2026
Merged

K8S: add client side connection time outs in calls to avoid hanging due to network outage #296
fbarreir merged 1 commit intomasterfrom
k8s_timeouts

Conversation

@fbarreir
Copy link
Collaborator

No description provided.

Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adds client-side connection timeouts to all Kubernetes API calls to prevent operations from hanging indefinitely during network outages. The timeout values are set uniformly to (5, 20) seconds representing connect and read timeouts respectively.

Changes:

  • Added _request_timeout=(5, 20) parameter to all Kubernetes API client method calls
  • Applied consistently across job creation, pod/job listing, deletion operations, config map management, secret operations, and log retrieval

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

tmp_log.debug(f"creating job {yaml_content}")

rsp = self.batchv1.create_namespaced_job(body=yaml_content, namespace=self.namespace)
rsp = self.batchv1.create_namespaced_job(body=yaml_content, namespace=self.namespace, _request_timeout=(5, 20))
Copy link

Copilot AI Jan 19, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The hardcoded timeout values (5, 20) should be made configurable. Consider defining these as class constants or configuration parameters to allow adjustment without code changes. This is especially important for production environments where network conditions may vary.

Copilot uses AI. Check for mistakes.
tmp_log.debug(f"creating job {yaml_content}")

rsp = self.batchv1.create_namespaced_job(body=yaml_content, namespace=self.namespace)
rsp = self.batchv1.create_namespaced_job(body=yaml_content, namespace=self.namespace, _request_timeout=(5, 20))
Copy link

Copilot AI Jan 19, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The timeout tuple format (5, 20) represents (connect_timeout, read_timeout) in the Kubernetes Python client. However, the 5-second connect timeout may be too aggressive for some network environments, especially when the Kubernetes API server is under heavy load. Consider increasing the connect timeout to at least 10 seconds or making it configurable based on deployment requirements.

Suggested change
rsp = self.batchv1.create_namespaced_job(body=yaml_content, namespace=self.namespace, _request_timeout=(5, 20))
rsp = self.batchv1.create_namespaced_job(body=yaml_content, namespace=self.namespace, _request_timeout=(10, 20))

Copilot uses AI. Check for mistakes.
tmp_log = core_utils.make_logger(base_logger, f"queue_name={self.queue_name}", method_name="get_pod_logs")
try:
rsp = self.corev1.read_namespaced_pod_log(name=pod_name, namespace=self.namespace, previous=previous)
rsp = self.corev1.read_namespaced_pod_log(name=pod_name, namespace=self.namespace, previous=previous, _request_timeout=(5, 20))
Copy link

Copilot AI Jan 19, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The read_namespaced_pod_log operation can potentially return large amounts of data when retrieving pod logs. The 20-second read timeout may be insufficient for pods with extensive logs. Consider using a longer timeout for this specific operation or implementing streaming/pagination for log retrieval.

Suggested change
rsp = self.corev1.read_namespaced_pod_log(name=pod_name, namespace=self.namespace, previous=previous, _request_timeout=(5, 20))
rsp = self.corev1.read_namespaced_pod_log(
name=pod_name,
namespace=self.namespace,
previous=previous,
_request_timeout=(5, 300),
)

Copilot uses AI. Check for mistakes.

# create the configmap object in K8s
api_response = self.corev1.create_namespaced_config_map(namespace=self.namespace, body=config_map)
api_response = self.corev1.create_namespaced_config_map(namespace=self.namespace, body=config_map, _request_timeout=(5, 20))
Copy link

Copilot AI Jan 19, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Variable api_response is not used.

Copilot uses AI. Check for mistakes.
except ApiException as e:
tmp_log.debug(f"Exception when patching pilots-starter config_map: {e} . Try to create it instead...")
api_response = self.corev1.create_namespaced_config_map(namespace=self.namespace, body=config_map)
api_response = self.corev1.create_namespaced_config_map(namespace=self.namespace, body=config_map, _request_timeout=(5, 20))
Copy link

Copilot AI Jan 19, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Variable api_response is not used.

Copilot uses AI. Check for mistakes.

try:
api_response = self.corev1.patch_namespaced_config_map(name=name, body=config_map, namespace=self.namespace)
api_response = self.corev1.patch_namespaced_config_map(name=name, body=config_map, namespace=self.namespace, _request_timeout=(5, 20))
Copy link

Copilot AI Jan 19, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This assignment to 'api_response' is unnecessary as it is redefined before this value is used.

Copilot uses AI. Check for mistakes.
@fbarreir fbarreir merged commit 920deb7 into master Jan 19, 2026
7 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant