
fix: skip KWOK PrometheusRule when CRD is not available #171

Merged
eliranw merged 4 commits into main from fix/kwok-prometheusrule-capability-check
Feb 23, 2026

Conversation

@eliranw
Contributor

@eliranw eliranw commented Feb 23, 2026

Summary

  • Add a .Capabilities.APIVersions check to the KWOK PrometheusRule template
  • Prevent Helm from rendering the PrometheusRule on clusters where the Prometheus Operator CRD is not installed
  • Fix integration test failures on clusters without the CRD
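
For readers unfamiliar with the pattern, a guard of this kind might look roughly like the sketch below. It is not the chart's actual template: the resource name and the exact API version string passed to `.Capabilities.APIVersions.Has` are assumptions, while the values path is the one named in the commit messages.

```yaml
{{- /* Sketch only: render the rule when the toggle is on AND the cluster advertises the Prometheus Operator API. */ -}}
{{- if and .Values.statusExporter.kwok.prometheusRule.enabled (.Capabilities.APIVersions.Has "monitoring.coreos.com/v1") }}
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: kwok-dcgm-recording-rules   # hypothetical name
  namespace: runai                  # namespace named in the commit message
spec:
  groups: []                        # rule groups omitted in this sketch
{{- end }}
```

With a guard like this, a cluster without the CRD produces no PrometheusRule manifest at all, so the install does not fail on an unknown kind. When rendering offline, `helm template` does not see a live cluster, so the API version has to be supplied with `--api-versions monitoring.coreos.com/v1` for the rule to appear.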

Test plan

  • Integration tests pass on clusters without Prometheus Operator
  • PrometheusRule is still created on clusters with Prometheus Operator

eliranw and others added 3 commits February 22, 2026 17:36
The centralized KWOK status-exporter produces DCGM metrics for virtual
GPU nodes, but RunAI's recording rules derive the node label by joining
with kube_pod_info on the exporter pod IP. Since the centralized pod
runs on a system node, KWOK GPU metrics were attributed to the wrong
node and never appeared in the UI.

Changes:
- Set Hostname label to the actual KWOK node name in multi_node_exporter
  so metrics can be correlated to the correct virtual node
- Add a dedicated Service for the KWOK exporter with component selector
  to prevent the DaemonSet service from also matching KWOK pods
- Add component label to DaemonSet pod template to distinguish from KWOK
- Create a PrometheusRule (deployed to runai namespace) that produces
  runai_dcgm_gpu_utilization, runai_dcgm_gpu_used_mebibytes, and
  runai_dcgm_gpu_total_mebibytes recording rules using Hostname as the
  node label, bypassing the kube_pod_info join
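
As an illustration of the last item, a recording rule that takes the node from the Hostname label could be sketched as follows. The rule name runai_dcgm_gpu_utilization comes from the list above; the resource name, group name, source metric (DCGM_FI_DEV_GPU_UTIL), target label name (node), and expression are assumptions, and a real rule would probably also restrict the selector to the KWOK exporter's series.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: kwok-dcgm-recording-rules   # hypothetical name
  namespace: runai
spec:
  groups:
    - name: kwok-dcgm.rules         # hypothetical group name
      rules:
        - record: runai_dcgm_gpu_utilization
          # Copy the exporter's Hostname label into "node" instead of
          # joining with kube_pod_info on the exporter pod IP.
          expr: |
            label_replace(DCGM_FI_DEV_GPU_UTIL, "node", "$1", "Hostname", "(.+)")
```

Because the node name is read directly from the metric, the result no longer depends on which node the centralized exporter pod is scheduled on.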

Fixes: RUN-36987

Add statusExporter.kwok.prometheusRule.enabled (default true) to allow
disabling the KWOK DCGM recording rules without affecting the rest of
the status-exporter.
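
A values override that turns the rules off might look like this. The nesting is inferred from the dotted key above; other keys under statusExporter are left out.

```yaml
# values override (sketch): disable only the KWOK DCGM recording rules
statusExporter:
  kwok:
    prometheusRule:
      enabled: false   # chart default is true, per this commit message
```

The same override can be passed on the command line as `--set statusExporter.kwok.prometheusRule.enabled=false`.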

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Add .Capabilities.APIVersions check so the KWOK PrometheusRule is only
rendered on clusters that have the Prometheus Operator installed. Fixes
integration test failures on clusters without the CRD.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@eliranw eliranw requested a review from a team as a code owner February 23, 2026 13:32
@eliranw eliranw merged commit c4cffd1 into main Feb 23, 2026
6 of 7 checks passed
@eliranw eliranw deleted the fix/kwok-prometheusrule-capability-check branch February 23, 2026 13:45