nvml.Init() failing #164

@daveoy

When running dcgm-exporter as a Kubernetes DaemonSet, we observed that in one cluster of H200 nodes on driver version 570.172.08, some of the nodes were exporting no metrics.

The startup logs:

time=2025-09-09T20:19:25.097Z level=INFO msg="Starting dcgm-exporter" Version=4.3.1-4.4.1
time=2025-09-09T20:19:25.100Z level=INFO msg="Attempting to initialize DCGM."
time=2025-09-09T20:19:25.194Z level=INFO msg="Initialized DCGM Fields module."
time=2025-09-09T20:19:25.194Z level=INFO msg="Attempting to initialize NVML library."
time=2025-09-09T20:19:25.195Z level=ERROR msg="Cannot init NVML library; err: Unknown Error"
time=2025-09-09T20:19:25.195Z level=INFO msg="DCGM successfully initialized!"
time=2025-09-09T20:19:25.195Z level=INFO msg="NVML provider successfully initialized!"
time=2025-09-09T20:19:26.616Z level=INFO msg="Not collecting DCP metrics: This request is serviced by a module of DCGM that is not currently loaded"
time=2025-09-09T20:19:26.616Z level=INFO msg="Falling back to metric file '/etc/dcgm-exporter/dcp-metrics.csv'"
time=2025-09-09T20:19:26.616Z level=WARN msg="Skipping line 44 ('DCGM_FI_PROF_NVLINK_TX_BYTES'): metric not enabled"
time=2025-09-09T20:19:26.616Z level=WARN msg="Skipping line 45 ('DCGM_FI_PROF_NVLINK_RX_BYTES'): metric not enabled"
time=2025-09-09T20:19:26.616Z level=WARN msg="Skipping line 55 ('DCGM_FI_PROF_GR_ENGINE_ACTIVE'): metric not enabled"
time=2025-09-09T20:19:26.616Z level=WARN msg="Skipping line 56 ('DCGM_FI_PROF_SM_ACTIVE'): metric not enabled"
time=2025-09-09T20:19:26.616Z level=WARN msg="Skipping line 57 ('DCGM_FI_PROF_SM_OCCUPANCY'): metric not enabled"
time=2025-09-09T20:19:26.616Z level=WARN msg="Skipping line 58 ('DCGM_FI_PROF_PIPE_TENSOR_ACTIVE'): metric not enabled"
time=2025-09-09T20:19:26.616Z level=WARN msg="Skipping line 59 ('DCGM_FI_PROF_DRAM_ACTIVE'): metric not enabled"
time=2025-09-09T20:19:26.616Z level=WARN msg="Skipping line 60 ('DCGM_FI_PROF_PIPE_FP64_ACTIVE'): metric not enabled"
time=2025-09-09T20:19:26.616Z level=WARN msg="Skipping line 61 ('DCGM_FI_PROF_PIPE_FP32_ACTIVE'): metric not enabled"
time=2025-09-09T20:19:26.616Z level=WARN msg="Skipping line 62 ('DCGM_FI_PROF_PIPE_FP16_ACTIVE'): metric not enabled"
time=2025-09-09T20:19:26.616Z level=WARN msg="Skipping line 63 ('DCGM_FI_PROF_PCIE_TX_BYTES'): metric not enabled"
time=2025-09-09T20:19:26.616Z level=WARN msg="Skipping line 64 ('DCGM_FI_PROF_PCIE_RX_BYTES'): metric not enabled"
time=2025-09-09T20:19:26.616Z level=INFO msg="Initializing system entities of type 'GPU'"
time=2025-09-09T20:19:26.616Z level=INFO msg="Not collecting GPU metrics; error getting devices count: Cannot perform the requested operation because NVML doesn't exist on this system."
time=2025-09-09T20:19:26.616Z level=INFO msg="Initializing system entities of type 'NvSwitch'"
time=2025-09-09T20:19:26.694Z level=INFO msg="Not collecting NvSwitch metrics; no switches to monitor"
time=2025-09-09T20:19:26.694Z level=INFO msg="Initializing system entities of type 'NvLink'"
time=2025-09-09T20:19:26.694Z level=INFO msg="Not collecting NvLink metrics; no switches to monitor"
time=2025-09-09T20:19:26.694Z level=INFO msg="Initializing system entities of type 'CPU'"
time=2025-09-09T20:19:35.915Z level=INFO msg="Not collecting CPU metrics; error retrieving DCGM CPU hierarchy: This request is serviced by a module of DCGM that is not currently loaded"
time=2025-09-09T20:19:35.915Z level=INFO msg="Initializing system entities of type 'CPU Core'"
time=2025-09-09T20:19:35.915Z level=INFO msg="Not collecting CPU Core metrics; error retrieving DCGM CPU hierarchy: This request is serviced by a module of DCGM that is not currently loaded"
time=2025-09-09T20:19:35.915Z level=INFO msg="Kubernetes metrics collection enabled!"
time=2025-09-09T20:19:35.916Z level=INFO msg="Starting webserver"
time=2025-09-09T20:19:35.994Z level=INFO msg="Listening on" address=[::]:9400
time=2025-09-09T20:19:35.994Z level=INFO msg="TLS is disabled." http2=false address=[::]:9400
time=2025-09-09T20:19:35.994Z level=INFO msg="Watching for changes in file" file=/etc/dcgm-exporter/dcp-metrics.csv
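
Note that the exporter keeps running and starts its webserver even though NVML failed to initialize, so the pod looks fine from the outside. A minimal sketch for spotting affected pods from their logs, assuming the `key=value` log format shown above (the function names `parse_line` and `nvml_init_failed` are made up for illustration and are not part of dcgm-exporter):

```python
import re

# Matches key=value pairs; the value is either a double-quoted string
# (with backslash escapes) or a bare token without whitespace.
_PAIR = re.compile(r'(\w+)=("(?:[^"\\]|\\.)*"|\S+)')


def parse_line(line):
    """Parse one log line into a dict of its key=value fields."""
    fields = {}
    for key, value in _PAIR.findall(line):
        # Strip the surrounding quotes from quoted values.
        if len(value) >= 2 and value.startswith('"') and value.endswith('"'):
            value = value[1:-1]
        fields[key] = value
    return fields


def nvml_init_failed(log_text):
    """Return True if an ERROR line reports that NVML could not be initialized."""
    for line in log_text.splitlines():
        fields = parse_line(line)
        if fields.get("level") == "ERROR" and "Cannot init NVML library" in fields.get("msg", ""):
            return True
    return False
```

Feeding it the output of `kubectl logs <pod>` for each exporter pod should flag the ones that hit this failure at startup.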

Restarting the exporter makes the metrics appear. I haven't seen this behaviour before across many versions of dcgm-exporter and NVIDIA drivers.
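
Since a restart clears it, one workaround is to detect exporters that are serving no GPU metrics and restart only those. A rough sketch of such a check, assuming the exporter serves Prometheus text on port 9400 (the port is from the log above; the `/metrics` path and the `DCGM_FI_` prefix on every GPU sample are my assumptions, and the function names are hypothetical):

```python
from urllib.request import urlopen


def has_gpu_metrics(metrics_text):
    """Return True if the scrape output contains at least one DCGM_FI_* sample.

    Comment lines (# HELP / # TYPE) are ignored, so an exporter that only
    emits metadata still counts as exporting nothing.
    """
    for line in metrics_text.splitlines():
        line = line.strip()
        if line.startswith("DCGM_FI_"):
            return True
    return False


def exporter_is_healthy(url, timeout=5.0):
    """Fetch the metrics endpoint and check that GPU samples are present.

    The URL is supplied by the caller, e.g. http://<pod-ip>:9400/metrics
    (assumed path). Network errors propagate to the caller.
    """
    with urlopen(url, timeout=timeout) as resp:
        body = resp.read().decode("utf-8", errors="replace")
    return has_gpu_metrics(body)
```

This only papers over the problem; it doesn't explain why `nvml.Init()` fails on the first start.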

Could this be related to whatever is being investigated internally under #152?
