When running dcgm-exporter in a Kubernetes pod as a DaemonSet, we observed that in one cluster of H200 nodes running driver version 570.172.08, some of the nodes were exporting no metrics.
The startup logs:
time=2025-09-09T20:19:25.097Z level=INFO msg="Starting dcgm-exporter" Version=4.3.1-4.4.1
time=2025-09-09T20:19:25.100Z level=INFO msg="Attempting to initialize DCGM."
time=2025-09-09T20:19:25.194Z level=INFO msg="Initialized DCGM Fields module."
time=2025-09-09T20:19:25.194Z level=INFO msg="Attempting to initialize NVML library."
time=2025-09-09T20:19:25.195Z level=ERROR msg="Cannot init NVML library; err: Unknown Error"
time=2025-09-09T20:19:25.195Z level=INFO msg="DCGM successfully initialized!"
time=2025-09-09T20:19:25.195Z level=INFO msg="NVML provider successfully initialized!"
time=2025-09-09T20:19:26.616Z level=INFO msg="Not collecting DCP metrics: This request is serviced by a module of DCGM that is not currently loaded"
time=2025-09-09T20:19:26.616Z level=INFO msg="Falling back to metric file '/etc/dcgm-exporter/dcp-metrics.csv'"
time=2025-09-09T20:19:26.616Z level=WARN msg="Skipping line 44 ('DCGM_FI_PROF_NVLINK_TX_BYTES'): metric not enabled"
time=2025-09-09T20:19:26.616Z level=WARN msg="Skipping line 45 ('DCGM_FI_PROF_NVLINK_RX_BYTES'): metric not enabled"
time=2025-09-09T20:19:26.616Z level=WARN msg="Skipping line 55 ('DCGM_FI_PROF_GR_ENGINE_ACTIVE'): metric not enabled"
time=2025-09-09T20:19:26.616Z level=WARN msg="Skipping line 56 ('DCGM_FI_PROF_SM_ACTIVE'): metric not enabled"
time=2025-09-09T20:19:26.616Z level=WARN msg="Skipping line 57 ('DCGM_FI_PROF_SM_OCCUPANCY'): metric not enabled"
time=2025-09-09T20:19:26.616Z level=WARN msg="Skipping line 58 ('DCGM_FI_PROF_PIPE_TENSOR_ACTIVE'): metric not enabled"
time=2025-09-09T20:19:26.616Z level=WARN msg="Skipping line 59 ('DCGM_FI_PROF_DRAM_ACTIVE'): metric not enabled"
time=2025-09-09T20:19:26.616Z level=WARN msg="Skipping line 60 ('DCGM_FI_PROF_PIPE_FP64_ACTIVE'): metric not enabled"
time=2025-09-09T20:19:26.616Z level=WARN msg="Skipping line 61 ('DCGM_FI_PROF_PIPE_FP32_ACTIVE'): metric not enabled"
time=2025-09-09T20:19:26.616Z level=WARN msg="Skipping line 62 ('DCGM_FI_PROF_PIPE_FP16_ACTIVE'): metric not enabled"
time=2025-09-09T20:19:26.616Z level=WARN msg="Skipping line 63 ('DCGM_FI_PROF_PCIE_TX_BYTES'): metric not enabled"
time=2025-09-09T20:19:26.616Z level=WARN msg="Skipping line 64 ('DCGM_FI_PROF_PCIE_RX_BYTES'): metric not enabled"
time=2025-09-09T20:19:26.616Z level=INFO msg="Initializing system entities of type 'GPU'"
time=2025-09-09T20:19:26.616Z level=INFO msg="Not collecting GPU metrics; error getting devices count: Cannot perform the requested operation because NVML doesn't exist on this system."
time=2025-09-09T20:19:26.616Z level=INFO msg="Initializing system entities of type 'NvSwitch'"
time=2025-09-09T20:19:26.694Z level=INFO msg="Not collecting NvSwitch metrics; no switches to monitor"
time=2025-09-09T20:19:26.694Z level=INFO msg="Initializing system entities of type 'NvLink'"
time=2025-09-09T20:19:26.694Z level=INFO msg="Not collecting NvLink metrics; no switches to monitor"
time=2025-09-09T20:19:26.694Z level=INFO msg="Initializing system entities of type 'CPU'"
time=2025-09-09T20:19:35.915Z level=INFO msg="Not collecting CPU metrics; error retrieving DCGM CPU hierarchy: This request is serviced by a module of DCGM that is not currently loaded"
time=2025-09-09T20:19:35.915Z level=INFO msg="Initializing system entities of type 'CPU Core'"
time=2025-09-09T20:19:35.915Z level=INFO msg="Not collecting CPU Core metrics; error retrieving DCGM CPU hierarchy: This request is serviced by a module of DCGM that is not currently loaded"
time=2025-09-09T20:19:35.915Z level=INFO msg="Kubernetes metrics collection enabled!"
time=2025-09-09T20:19:35.916Z level=INFO msg="Starting webserver"
time=2025-09-09T20:19:35.994Z level=INFO msg="Listening on" address=[::]:9400
time=2025-09-09T20:19:35.994Z level=INFO msg="TLS is disabled." http2=false address=[::]:9400
time=2025-09-09T20:19:35.994Z level=INFO msg="Watching for changes in file" file=/etc/dcgm-exporter/dcp-metrics.csv
Restarting the exporter causes the metrics to appear. I haven't observed this behaviour before across many versions of dcgm-exporter and the NVIDIA drivers.
Could this be related to whatever is being investigated internally under #152?
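For anyone triaging the same symptom, here is a minimal sketch of how affected pods could be flagged from text that has already been collected. It assumes the pod's startup log and a scrape of the exporter's endpoint on port 9400 are available as strings. The function names (`find_nvml_failures`, `exports_gpu_metrics`) are hypothetical helpers and not part of dcgm-exporter, and the marker strings are copied from the log above, so they may differ in other versions.

```python
import re

# Substrings copied from the startup log in this report. They are an
# assumption about what a failed start looks like, not a documented contract.
NVML_FAILURE_MARKERS = (
    "Cannot init NVML library",
    "Not collecting GPU metrics",
)

# Matches the level=... msg="..." part of a dcgm-exporter log line.
MSG_RE = re.compile(r'level=(\w+) msg="((?:[^"\\]|\\.)*)"')


def find_nvml_failures(log_text):
    """Return (level, message) pairs for log lines that contain a failure marker."""
    hits = []
    for line in log_text.splitlines():
        match = MSG_RE.search(line)
        if match and any(marker in match.group(2) for marker in NVML_FAILURE_MARKERS):
            hits.append((match.group(1), match.group(2)))
    return hits


def exports_gpu_metrics(metrics_text):
    """Return True if the scrape contains at least one DCGM_FI_ sample line."""
    # Comment lines (# HELP, # TYPE) start with '#', so they do not count.
    return any(line.startswith("DCGM_FI_") for line in metrics_text.splitlines())
```

A pod whose log yields any hits and whose scrape returns `False` matches the failure described here, and so far restarting it has been enough to recover.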