
fix: KWOK node GPU metrics not visible in RunAI UI#170

Merged
eliranw merged 2 commits into main from fix/kwok-metrics-service-monitor
Feb 23, 2026

Conversation


@eliranw eliranw commented Feb 22, 2026

Summary

  • Fix KWOK virtual GPU node metrics not appearing in RunAI UI (GPU utilization, memory, free devices)
  • Root cause: RunAI's dcgmMetricsRule recording rules join DCGM metrics with kube_pod_info on pod_ip to derive the node label. For the centralized KWOK exporter, this attributes metrics to the system node where the exporter pod runs, not to the virtual KWOK node.
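
The problematic join can be sketched in PromQL. This is a conceptual approximation only; the actual dcgmMetricsRule expression is not shown in this PR, and the metric name below is an assumption:

```promql
# Conceptual sketch, not the literal RunAI rule.
# `node` is copied from kube_pod_info, i.e. the node where the exporter
# POD runs. For the centralized KWOK exporter that is a system node,
# not the virtual KWOK node the metric actually describes.
DCGM_FI_DEV_GPU_UTIL
  * on (pod_ip) group_left (node)
  kube_pod_info
```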

Changes

  • multi_node_exporter.go: Override the Hostname label with the actual KWOK node name (instead of a hash) so metrics can be correlated to the correct virtual node
  • kwok-service.yaml (new): Dedicated Service for the KWOK exporter with component: status-exporter-kwok selector
  • kwok-prometheusrule.yaml (new): PrometheusRule deployed to runai namespace that creates runai_dcgm_gpu_utilization, runai_dcgm_gpu_used_mebibytes, and runai_dcgm_gpu_total_mebibytes recording rules using Hostname directly as the node label, bypassing the kube_pod_info join
  • service.yaml: Added component: status-exporter to selector to prevent the DaemonSet service from matching KWOK pods (avoids double-scraping)
  • _helpers.tpl: Added component: status-exporter label to DaemonSet pod template
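
A minimal sketch of what the dedicated KWOK Service could look like, based on the selector described above (the Service name, port, and other metadata are assumptions; the actual kwok-service.yaml may differ):

```yaml
# Hypothetical sketch of kwok-service.yaml.
apiVersion: v1
kind: Service
metadata:
  name: status-exporter-kwok   # assumed name
spec:
  selector:
    # Matches only the centralized KWOK exporter pod; the DaemonSet
    # service selects component: status-exporter instead, so the two
    # never scrape the same pods.
    component: status-exporter-kwok
  ports:
    - name: metrics
      port: 9400        # assumed metrics port
      targetPort: 9400
```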

How it works

The PrometheusRule uses label_replace to set node from the Hostname label, then joins with runai_node_nodepool_excluded to add the nodepool label. Regular (non-KWOK) metrics have a hash-based Hostname that doesn't match any real node, so they are naturally filtered out by the join.
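
The shape of such a recording rule can be sketched as follows. This is a hedged approximation: only the utilization rule is shown, the source metric name (DCGM_FI_DEV_GPU_UTIL) and object metadata are assumptions, and the exact expression in kwok-prometheusrule.yaml may differ:

```yaml
# Hypothetical sketch of one rule in kwok-prometheusrule.yaml.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: status-exporter-kwok   # assumed name
  namespace: runai
spec:
  groups:
    - name: kwok-dcgm-metrics
      rules:
        - record: runai_dcgm_gpu_utilization
          expr: |
            # Copy Hostname into node, then join on node to pick up the
            # nodepool label; hash-based Hostnames from regular exporters
            # match no real node and drop out of the join.
            label_replace(DCGM_FI_DEV_GPU_UTIL, "node", "$1", "Hostname", "(.+)")
              * on (node) group_left (nodepool)
            runai_node_nodepool_excluded
```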

Test plan

  • Deploy to test cluster with KWOK node
  • Verify runai_dcgm_gpu_utilization, runai_dcgm_gpu_used_mebibytes, runai_dcgm_gpu_total_mebibytes show correct values for KWOK nodes in Prometheus
  • Verify regular (non-KWOK) GPU node metrics are unaffected
  • Verify no duplicate metrics from double-scraping
  • Verify GPU type, Ready/total GPU devices, GPU memory visible in RunAI UI for KWOK nodes
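
For the Prometheus checks above, a query along these lines can be used in the Prometheus UI (the node name is a hypothetical example):

```promql
# Expect per-GPU series for the virtual node; an empty result would
# indicate the Hostname-to-node mapping is not working.
runai_dcgm_gpu_utilization{node="kwok-gpu-node-0"}
```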

The centralized KWOK status-exporter produces DCGM metrics for virtual
GPU nodes, but RunAI's recording rules derive the node label by joining
with kube_pod_info on the exporter pod IP. Since the centralized pod
runs on a system node, KWOK GPU metrics were attributed to the wrong
node and never appeared in the UI.

Changes:
- Set Hostname label to the actual KWOK node name in multi_node_exporter
  so metrics can be correlated to the correct virtual node
- Add a dedicated Service for the KWOK exporter with component selector
  to prevent the DaemonSet service from also matching KWOK pods
- Add component label to DaemonSet pod template to distinguish from KWOK
- Create a PrometheusRule (deployed to runai namespace) that produces
  runai_dcgm_gpu_utilization, runai_dcgm_gpu_used_mebibytes, and
  runai_dcgm_gpu_total_mebibytes recording rules using Hostname as the
  node label, bypassing the kube_pod_info join

Fixes: RUN-36987
@eliranw eliranw requested a review from a team as a code owner February 22, 2026 15:36

@eliranw eliranw left a comment


Clean, well-documented fix. The approach of creating parallel PrometheusRules with direct Hostname-to-node mapping is a good solution for the KWOK centralized exporter scenario. A few minor observations below.

Add statusExporter.kwok.prometheusRule.enabled (default true) to allow
disabling the KWOK DCGM recording rules without affecting the rest of
the status-exporter.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@eliranw eliranw merged commit 63b4db4 into main Feb 23, 2026
3 of 5 checks passed
@eliranw eliranw deleted the fix/kwok-metrics-service-monitor branch February 23, 2026 12:38