Implements agent_sandbox_warmpool_size metric #358
Oneimu wants to merge 1 commit into kubernetes-sigs:main
igooch left a comment
This approach lists all pods on every Prometheus scrape (every 10-30s), which can cause unnecessary CPU spikes.
I recommend using a prometheus.NewGaugeVec registered via func init() { crmetrics.Registry.MustRegister(WarmPoolSize) } instead.
Since the SandboxWarmPool controller already lists all its pods during the Reconcile loop, you can reuse that list to update the metric there. This makes Prometheus scrapes O(1).
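A minimal sketch of the suggestion above, written against the standard library only so that it runs as-is. The `gauge` type is a hypothetical stand-in for a `prometheus.GaugeVec`, and `pod`, `labels`, and `reconcile` are simplified assumptions for illustration, not the controller's actual code. The idea it shows: the reconcile loop writes the counts it already has, so a scrape only reads stored values.

```go
package main

import (
	"fmt"
	"sync"
)

// pod is a simplified stand-in for corev1.Pod (assumption for this sketch).
type pod struct {
	Name  string
	Phase string
}

// labels identifies one time series: namespace, pool name, pod status.
type labels struct {
	Namespace, Pool, Status string
}

// gauge is a hypothetical stand-in for a prometheus.GaugeVec.
type gauge struct {
	mu     sync.Mutex
	values map[labels]float64
}

func newGauge() *gauge { return &gauge{values: map[labels]float64{}} }

// Set stores a value. The reconcile loop is the only writer.
func (g *gauge) Set(l labels, v float64) {
	g.mu.Lock()
	defer g.mu.Unlock()
	g.values[l] = v
}

// Get reads a stored value. A scrape only does this; it lists nothing.
func (g *gauge) Get(l labels) float64 {
	g.mu.Lock()
	defer g.mu.Unlock()
	return g.values[l]
}

// knownPhases are the pod phase values defined by the Kubernetes core API.
var knownPhases = []string{"Pending", "Running", "Succeeded", "Failed", "Unknown"}

// reconcile reuses the pod list the controller already holds and writes one
// value per known phase, so a phase that drops to zero is reset, not left stale.
func reconcile(g *gauge, namespace, pool string, pods []pod) {
	counts := map[string]int{}
	for _, p := range pods {
		counts[p.Phase]++
	}
	for _, phase := range knownPhases {
		g.Set(labels{namespace, pool, phase}, float64(counts[phase]))
	}
}

func main() {
	g := newGauge()
	reconcile(g, "default", "pool-a", []pod{{"a", "Running"}, {"b", "Pending"}})
	fmt.Println(g.Get(labels{"default", "pool-a", "Running"}))
}
```

With a real `GaugeVec`, the `Set` call would presumably become `WarmPoolSize.WithLabelValues(...).Set(...)`; the label names and order are left open here because the PR does not fix them.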
}

// Collect implements prometheus.Collector.
// This is called by Prometheus during each scrape.

// List all pods in the same namespace as the warmpool
podList := &corev1.PodList{}
This is a nested loop. If you have 10 warmpools in the default namespace, you list all pods in the default namespace 10 times.
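One way to avoid the repeated listing, sketched with standard-library types only: list the pods once, then bucket them by owning pool in a single pass. The `pod` and `poolKey` types and the `OwnerPool` field are hypothetical simplifications; in the real code the owner would come from the pod's owner references.

```go
package main

import "fmt"

// pod is a simplified stand-in for corev1.Pod (assumption for this sketch).
type pod struct {
	Namespace string
	Name      string
	OwnerPool string // name of the owning warm pool; empty if the pod has none
}

// poolKey identifies a warm pool by namespace and name.
type poolKey struct {
	Namespace, Name string
}

// groupByPool makes one pass over the pods and buckets them by owning pool,
// so the cost is proportional to the number of pods, not pools times pods.
func groupByPool(pods []pod) map[poolKey][]pod {
	out := make(map[poolKey][]pod)
	for _, p := range pods {
		if p.OwnerPool == "" {
			continue // not owned by any warm pool
		}
		k := poolKey{p.Namespace, p.OwnerPool}
		out[k] = append(out[k], p)
	}
	return out
}

func main() {
	pods := []pod{
		{"default", "a", "pool-a"},
		{"default", "b", "pool-b"},
		{"default", "c", "pool-a"},
	}
	for k, v := range groupByPool(pods) {
		fmt.Println(k.Namespace, k.Name, len(v))
	}
}
```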
// List all pods in the same namespace as the warmpool
podList := &corev1.PodList{}
if err := m.client.List(ctx, podList, &client.ListOptions{Namespace: wp.Namespace}); err != nil {
	return

// List all SandboxWarmPools across all namespaces
var warmPools extensionsv1alpha1.SandboxWarmPoolList
if err := m.client.List(ctx, &warmPools); err != nil {
	return
Is there a reason to ignore the error? We should at least log it even if not fatal.
// Verify ownership: Pod must be owned by this SandboxWarmPool
ownedByPool := false
for _, ref := range pod.OwnerReferences {
Can we use metav1.GetControllerOf(&pod) here?
// PodStatusOther indicates any other pod status.
PodStatusOther = "other"
// PodStatusAll indicates the total count of all pods in the pool.
PodStatusAll = "*"
You don't need this. The metric will automatically have an aggregate count.
// PodStatusFailed indicates the pod has failed.
PodStatusFailed = "failed"
// PodStatusOther indicates any other pod status.
PodStatusOther = "other"
This should be "unknown" per https://pkg.go.dev/k8s.io/api/core/v1#PodPhase. I recommend reusing that API's constants instead of redefining them.
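A sketch of deriving the status label from the pod phase rather than from a parallel set of constants. The `podPhase` type and its constants mirror the values of `corev1.PodPhase` so the example runs without Kubernetes dependencies; `statusLabel` and the lowercase label convention are assumptions based on the `"failed"` value shown in the diff.

```go
package main

import (
	"fmt"
	"strings"
)

// podPhase mirrors corev1.PodPhase; real code would use the corev1 constants.
type podPhase string

const (
	podPending   podPhase = "Pending"
	podRunning   podPhase = "Running"
	podSucceeded podPhase = "Succeeded"
	podFailed    podPhase = "Failed"
	podUnknown   podPhase = "Unknown"
)

// statusLabel maps a pod phase to a metric label value. Anything that is not
// a recognized phase, including an empty one, is reported as "unknown".
func statusLabel(p podPhase) string {
	switch p {
	case podPending, podRunning, podSucceeded, podFailed:
		return strings.ToLower(string(p))
	default:
		return strings.ToLower(string(podUnknown))
	}
}

func main() {
	for _, p := range []podPhase{podRunning, podFailed, ""} {
		fmt.Println(statusLabel(p))
	}
}
```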
Implements agent_sandbox_warmpool_size metric to monitor the status of warm pools in the agent-sandbox.
Key Changes
internal/metrics: Introduced a Metrics struct that implements the prometheus.Collector interface. It dynamically scrapes the current state of SandboxWarmPools and their associated Pods from the controller's cache.
cmd/agent-sandbox-controller: Initialized and registered the collector in main.go, passing the manager's cache for efficient lookups.