Implements agent_sandbox_warmpool_size metric #358

Open
Oneimu wants to merge 1 commit into kubernetes-sigs:main from Oneimu:warmpool-gauge-metrics

@Oneimu
Contributor

@Oneimu Oneimu commented Mar 2, 2026

Implements agent_sandbox_warmpool_size metric to monitor the status of warm pools in the agent-sandbox.

Key Changes

  • internal/metrics: Introduced a Metrics struct that implements the prometheus.Collector interface. It dynamically scrapes the current state of SandboxWarmPools and their associated Pods from the controller's cache.
  • cmd/agent-sandbox-controller: Initialized and registered the collector in main.go, passing the manager's cache for efficient lookups.
  • Testing: Added comprehensive unit tests in internal/metrics and updated controller tests to verify metric labels and counts using a fake Kubernetes client.

@netlify

netlify bot commented Mar 2, 2026

Deploy Preview for agent-sandbox canceled.

Name Link
🔨 Latest commit 5361a41
🔍 Latest deploy log https://app.netlify.com/projects/agent-sandbox/deploys/69a5f2f78535890008a18913

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: Oneimu
Once this PR has been reviewed and has the lgtm label, please assign soltysh for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot added the needs-ok-to-test label (Indicates a PR that requires an org member to verify it is safe to test.) Mar 2, 2026
@k8s-ci-robot
Contributor

Hi @Oneimu. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work.

Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot added the labels size/L (Denotes a PR that changes 100-499 lines, ignoring generated files.) and cncf-cla: yes (Indicates the PR's author has signed the CNCF CLA.) Mar 2, 2026
@Oneimu
Contributor Author

Oneimu commented Mar 2, 2026

/assign @igooch

@k8s-ci-robot
Contributor

@Oneimu: GitHub didn't allow me to assign the following users: igooch.

Note that only kubernetes-sigs members with read permissions, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time.
For more information please see the contributor guide

Details

In response to this:

/assign @igooch

@igooch igooch left a comment

This approach lists all pods on every Prometheus scrape (every 10-30s), which can cause unnecessary CPU spikes.

I recommend using a prometheus.NewGaugeVec registered via func init() { crmetrics.Registry.MustRegister(WarmPoolSize) } instead.

Since the SandboxWarmPool controller already lists all its pods during the Reconcile loop, you can reuse that list to update the metric there. This makes Prometheus scrapes O(1).

}

// Collect implements prometheus.Collector.
// This is called by Prometheus during setiap scrape.

Typo?

Comment on lines +87 to +88
// List all pods in the same namespace as the warmpool
podList := &corev1.PodList{}

This is a nested loop. If you have 10 warmpools in the default namespace, you list all pods in the default namespace 10 times.
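
One way to avoid the repeated listing, sketched with standard-library types only. `pod`, `warmPool` and `podLister` are hypothetical stand-ins for `corev1.Pod`, `SandboxWarmPool` and the client's `List` call; they are not names from the PR. The idea is to list each namespace at most once and index its pods by owner UID, so each pool becomes a map lookup.

```go
package main

import "fmt"

// pod and warmPool are simplified, hypothetical stand-ins for corev1.Pod
// and SandboxWarmPool, used only to keep this sketch self-contained.
type pod struct {
	OwnerUID string // UID of the controlling owner, empty if there is none
}

type warmPool struct {
	Namespace string
	Name      string
	UID       string
}

// podLister abstracts "list all pods in a namespace". The signature is an
// assumption for this sketch, not the controller-runtime client API.
type podLister func(namespace string) ([]pod, error)

// countPodsPerPool returns the number of pods owned by each pool, keyed by
// "namespace/name". Each namespace is listed at most once, and its pods are
// indexed by owner UID so every pool is resolved with one map access.
func countPodsPerPool(pools []warmPool, list podLister) (map[string]int, error) {
	ownedByNamespace := make(map[string]map[string]int)
	counts := make(map[string]int, len(pools))
	for _, wp := range pools {
		owned, seen := ownedByNamespace[wp.Namespace]
		if !seen {
			pods, err := list(wp.Namespace)
			if err != nil {
				// Surface the failure to the caller instead of dropping it.
				return nil, fmt.Errorf("listing pods in namespace %q: %w", wp.Namespace, err)
			}
			owned = make(map[string]int)
			for _, p := range pods {
				if p.OwnerUID != "" {
					owned[p.OwnerUID]++
				}
			}
			ownedByNamespace[wp.Namespace] = owned
		}
		counts[wp.Namespace+"/"+wp.Name] = owned[wp.UID]
	}
	return counts, nil
}
```

With 10 pools in one namespace this performs a single list call instead of 10, and the error is returned so the caller can log it.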

// List all pods in the same namespace as the warmpool
podList := &corev1.PodList{}
if err := m.client.List(ctx, podList, &client.ListOptions{Namespace: wp.Namespace}); err != nil {
return

Same as above comment.

// List all SandboxWarmPools across all namespaces
var warmPools extensionsv1alpha1.SandboxWarmPoolList
if err := m.client.List(ctx, &warmPools); err != nil {
return

Is there a reason to ignore the error? We should at least log it even if not fatal.


// Verify ownership: Pod must be owned by this SandboxWarmPool
ownedByPool := false
for _, ref := range pod.OwnerReferences {

Can we use metav1.GetControllerOf(&pod) here?

// PodStatusOther indicates any other pod status.
PodStatusOther = "other"
// PodStatusAll indicates the total count of all pods in the pool.
PodStatusAll = "*"

You don't need this. The metric will automatically have an aggregate count.

// PodStatusFailed indicates the pod has failed.
PodStatusFailed = "failed"
// PodStatusOther indicates any other pod status.
PodStatusOther = "other"

This should be "unknown" per https://pkg.go.dev/k8s.io/api/core/v1#PodPhase. I recommend reusing that API type instead of redefining the constants.

@janetkuo added the ok-to-test label (Indicates a non-member PR verified by an org member that is safe to test.) and removed the needs-ok-to-test label (Indicates a PR that requires an org member to verify it is safe to test.) Mar 4, 2026
@k8s-ci-robot
Contributor

@Oneimu: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name | Commit | Details | Required | Rerun command
presubmit-agent-sandbox-unit-test | 5361a41 | link | true | /test presubmit-agent-sandbox-unit-test

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

@k8s-ci-robot added the needs-rebase label (Indicates a PR cannot be merged because it has merge conflicts with HEAD.) Mar 6, 2026
@k8s-ci-robot
Contributor

PR needs rebase.

Labels

cncf-cla: yes (Indicates the PR's author has signed the CNCF CLA.)
needs-rebase (Indicates a PR cannot be merged because it has merge conflicts with HEAD.)
ok-to-test (Indicates a non-member PR verified by an org member that is safe to test.)
size/L (Denotes a PR that changes 100-499 lines, ignoring generated files.)

4 participants