feat: add custom Prometheus metrics for CPU allocation monitoring #76

Open
AutuSnow wants to merge 1 commit into kubernetes-sigs:main from AutuSnow:add_metrics

@AutuSnow
Contributor

AutuSnow commented Mar 9, 2026

Add 8 custom Prometheus metrics to the /metrics endpoint:

  • dra_cpu_allocated_cpus (gauge)
  • dra_cpu_available_cpus (gauge)
  • dra_cpu_reserved_cpus (gauge)
  • dra_cpu_resource_claims_active (gauge)
  • dra_cpu_prepare_claims_success_total (counter)
  • dra_cpu_prepare_claims_error_total (counter)
  • dra_cpu_unprepare_claims_total (counter)
  • dra_cpu_prepare_claim_duration_seconds (histogram)
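
To make the list concrete, here is a minimal stdlib-only sketch of how two of these metrics might appear when /metrics is scraped, using the Prometheus text exposition format. It does not reproduce the PR's implementation, which would more likely use a Prometheus client library; the `sample` type, the `render` helper, the help strings, and the values are all hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// sample is one metric sample: a name, a Prometheus type, help text, and a value.
// The help strings used in main are hypothetical; the PR text does not give them.
type sample struct {
	name  string
	kind  string // "gauge" or "counter"
	help  string
	value float64
}

// render writes samples in the Prometheus text exposition format:
// a "# HELP" line, a "# TYPE" line, then "name value".
func render(samples []sample) string {
	var b strings.Builder
	for _, s := range samples {
		fmt.Fprintf(&b, "# HELP %s %s\n", s.name, s.help)
		fmt.Fprintf(&b, "# TYPE %s %s\n", s.name, s.kind)
		fmt.Fprintf(&b, "%s %g\n", s.name, s.value)
	}
	return b.String()
}

func main() {
	// Values are made up for illustration.
	fmt.Print(render([]sample{
		{"dra_cpu_allocated_cpus", "gauge", "CPUs currently allocated (hypothetical help text).", 6},
		{"dra_cpu_prepare_claims_success_total", "counter", "Successful claim preparations (hypothetical help text).", 42},
	}))
}
```

Gauges such as dra_cpu_allocated_cpus can go up and down, while the `_total` counters only ever increase, which is why the two kinds are declared separately.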

#73

@k8s-ci-robot k8s-ci-robot added the do-not-merge/invalid-commit-message Indicates that a PR should not merge because it has an invalid commit message. label Mar 9, 2026
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: AutuSnow
Once this PR has been reviewed and has the lgtm label, please assign pohly for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot requested review from klueska and pohly March 9, 2026 13:32
@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Mar 9, 2026
@AutuSnow
Contributor Author

AutuSnow commented Mar 9, 2026

cc @ffromani

@ffromani
Contributor

ffromani commented Mar 9, 2026

> Add 8 custom Prometheus metrics to the /metrics endpoint:
>
> * dra_cpu_allocated_cpus (gauge)
> * dra_cpu_available_cpus (gauge)
> * dra_cpu_reserved_cpus (gauge)
> * dra_cpu_resource_claims_active (gauge)
> * dra_cpu_prepare_claims_success_total (counter)
> * dra_cpu_prepare_claims_error_total (counter)
> * dra_cpu_unprepare_claims_total (counter)
> * dra_cpu_prepare_claim_duration_seconds (histogram)

Please document the metrics in the README, adding another section.
It would be best to have a way to introspect the reported metrics programmatically (e.g. a new command-line flag).
Finally, I think it would be worthwhile to have a histogram exposing how many CPUs pods are using (e.g. 1-2, 3-4, 5-9, ...).
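
As a rough illustration of the CPU-count histogram idea, the sketch below counts pods into cumulative buckets, which is how Prometheus histograms represent ranges. The upper bounds 2, 4 and 9 come from the "1-2, 3-4, 5-9" example; the bounds 16 and 32, the function name, and the sample data are assumptions, not part of the PR.

```go
package main

import "fmt"

// upperBounds are inclusive bucket limits. 2, 4 and 9 follow the
// "1-2, 3-4, 5-9" example; 16 and 32 are assumed for illustration.
var upperBounds = []int{2, 4, 9, 16, 32}

// cumulativeCounts returns, for each upper bound, how many pods use at most
// that many CPUs, plus a final entry counting every observation (the "+Inf"
// bucket). Prometheus histogram buckets are cumulative in the same way.
func cumulativeCounts(cpusPerPod []int) []int {
	counts := make([]int, len(upperBounds)+1)
	for _, n := range cpusPerPod {
		for i, ub := range upperBounds {
			if n <= ub {
				counts[i]++
			}
		}
		counts[len(upperBounds)]++
	}
	return counts
}

func main() {
	// Hypothetical per-pod CPU counts.
	counts := cumulativeCounts([]int{1, 2, 4, 8, 64})
	for i, ub := range upperBounds {
		fmt.Printf("le=%d: %d\n", ub, counts[i])
	}
	fmt.Printf("le=+Inf: %d\n", counts[len(upperBounds)])
}
```

The per-range count (for example, pods using 3-4 CPUs) can be recovered by subtracting adjacent cumulative buckets.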

Last but not least, metrics are arguably a component API. It's so early in the life of the project that adding metrics is worth more than getting them right (if we can at all with the little data we have now), so please make sure to document that these metrics are provisional, not GA, and that we reserve the option to change them in the future.

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/invalid-commit-message Indicates that a PR should not merge because it has an invalid commit message. label Mar 9, 2026
@pohly

pohly commented Mar 9, 2026

> please make sure to document that these metrics are provisional, not GA, and that we reserve the option to change them in the future.

All metrics start out as alpha, are documented as such, and graduate once we have confidence that we got them right, independent of the feature which introduced them.
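
One way the two requests in this thread (programmatic introspection and an explicit alpha marking) could fit together is sketched below. The `--list-metrics` flag name, the `metricInfo` type, and the output format are hypothetical and are not defined by the PR; only the metric names are taken from the PR description.

```go
package main

import (
	"flag"
	"fmt"
	"strings"
)

// metricInfo describes one exported metric. The stability field reflects the
// convention mentioned above that metrics start out as alpha.
type metricInfo struct {
	name, kind, stability string
}

// metrics lists a subset of the names from the PR description.
var metrics = []metricInfo{
	{"dra_cpu_allocated_cpus", "gauge", "ALPHA"},
	{"dra_cpu_prepare_claims_error_total", "counter", "ALPHA"},
	{"dra_cpu_prepare_claim_duration_seconds", "histogram", "ALPHA"},
}

// listMetrics formats one metric per line as "[STABILITY] name (kind)".
func listMetrics(ms []metricInfo) string {
	var b strings.Builder
	for _, m := range ms {
		fmt.Fprintf(&b, "[%s] %s (%s)\n", m.stability, m.name, m.kind)
	}
	return b.String()
}

func main() {
	// --list-metrics is a hypothetical flag name, not one defined by the PR.
	list := flag.Bool("list-metrics", false, "print the exported metrics and exit")
	flag.Parse()
	if *list {
		fmt.Print(listMetrics(metrics))
		return
	}
	fmt.Println("driver would start here")
}
```

Keeping the descriptors in a single table means the README section and the flag output could be generated from the same source, so they are less likely to drift apart.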

@AutuSnow
Contributor Author

cc @ffromani @pravk03

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 17, 2026
@k8s-ci-robot
Contributor

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@AutuSnow
Contributor Author

AutuSnow commented Mar 18, 2026

@pravk03 @ffromani Hi, could you please take a look at this PR? I've added some monitoring metrics that I think are important, though they haven't been discussed yet, so I'd like your opinions. If any metrics are missing or unnecessary, I'm happy to update the PR.

Signed-off-by: qiuxue <liuyutao36@gmail.com>
@k8s-ci-robot
Contributor

@AutuSnow: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| pull-dra-driver-cpu-e2e-device-mode-individual-amd64 | c02bd6c | link | true | /test pull-dra-driver-cpu-e2e-device-mode-individual-amd64 |
| pull-dra-driver-cpu-e2e-device-mode-grouped-amd64 | c02bd6c | link | true | /test pull-dra-driver-cpu-e2e-device-mode-grouped-amd64 |
| pull-dra-driver-cpu-e2e-device-mode-individual-arm64 | c02bd6c | link | true | /test pull-dra-driver-cpu-e2e-device-mode-individual-arm64 |
| pull-dra-driver-cpu-e2e-device-mode-grouped-arm64 | c02bd6c | link | true | /test pull-dra-driver-cpu-e2e-device-mode-grouped-arm64 |

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
