Description
Hi,
I am trying to simulate a node with MIG instances and monitor them using the integrated dcgm-exporter. While I can successfully make the MIG instances appear on the node and be available for scheduling, I cannot get any Prometheus metrics for them.
What I Tried
I configured `values.yaml` to define MIG instances using the `otherDevices` key, which I discovered in the source code. For example:
```yaml
# values.yaml
topology:
  nodePools:
    default:
      gpuProduct: "NVIDIA-A100-SXM4-80GB"
      gpuCount: 1
      gpuMemory: 80000
      otherDevices:
        - name: "nvidia.com/mig-1g.10gb"
          count: 7
```
I applied this configuration via Helm.
Actual Behavior (The Problem)
The MIG resources are correctly advertised on the node, and pods can be scheduled on them. However, the dcgm-exporter's `/metrics` endpoint shows no metrics for the `nvidia.com/mig-1g.10gb` instances. It only exports metrics for full GPUs, and only when `gpuCount` is greater than 0.
What I Found
After digging into the source code, I found that in `internal/status-exporter/export/metrics/exporter.go` the export function contains the following loop:
```go
for gpuIdx, gpu := range nodeTopology.Gpus {
    // ... exports metrics for the full GPU ...
}
```
This code only iterates over the `gpus` field of the node topology, so devices defined under `otherDevices` are never exported.
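To illustrate where the metrics are lost, here is a minimal, self-contained sketch of that loop and of a possible extension that also walks `otherDevices`. The struct definitions below are simplified stand-ins I wrote for this example, not the actual types from the fake-gpu-operator repo, and the label strings are placeholders rather than real dcgm-exporter metric names.

```go
package main

import "fmt"

// Simplified stand-ins for the node-topology types (the real structs in the
// fake-gpu-operator repo have more fields).
type OtherDevice struct {
	Name  string // e.g. "nvidia.com/mig-1g.10gb"
	Count int
}

type GpuDetails struct {
	ID string
}

type NodeTopology struct {
	Gpus         []GpuDetails
	OtherDevices []OtherDevice
}

// exportLabels mimics the current exporter behavior (iterating only over
// Gpus) and then, as a hypothetical fix, emits one label set per device
// listed under OtherDevices as well.
func exportLabels(t NodeTopology) []string {
	var labels []string
	for gpuIdx, gpu := range t.Gpus {
		// This mirrors the existing loop: one metric series per full GPU.
		labels = append(labels, fmt.Sprintf("gpu=%d id=%s", gpuIdx, gpu.ID))
	}
	// Hypothetical extension: one metric series per fake MIG instance.
	for _, dev := range t.OtherDevices {
		for i := 0; i < dev.Count; i++ {
			labels = append(labels, fmt.Sprintf("device=%s idx=%d", dev.Name, i))
		}
	}
	return labels
}

func main() {
	t := NodeTopology{
		Gpus:         []GpuDetails{{ID: "GPU-0"}},
		OtherDevices: []OtherDevice{{Name: "nvidia.com/mig-1g.10gb", Count: 7}},
	}
	for _, l := range exportLabels(t) {
		fmt.Println(l)
	}
}
```

With the example topology from my `values.yaml` (1 full GPU, 7 MIG devices), such a loop would produce 8 label sets instead of today's 1, which is the behavior I expected from the exporter.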
So I labeled the node with `node-role.kubernetes.io/runai-dynamic-mig=true` and `node-role.kubernetes.io/runai-mig-enabled=true`, and also added the following annotation:
```yaml
run.ai/mig.config: |-
  version: v1
  mig-configs:
    selected:
      - devices: [0]
        mig-enabled: true
        mig-devices:
          - name: 1g.10gb
            position: 0
            size: 1
          - name: 1g.10gb
            position: 1
            size: 1
          - name: 1g.10gb
            position: 2
            size: 1
          - name: 1g.10gb
            position: 3
            size: 1
          - name: 1g.10gb
            position: 4
            size: 1
          - name: 1g.10gb
            position: 5
            size: 1
          - name: 1g.10gb
            position: 6
            size: 1
```
After this, the node got the `run.ai/mig-mapping` annotation and the `nvidia.com/mig.config.state=success` label, but the GPU instances were still not registered under the `gpus` section of the node topology. The pod logs show no errors; in fact, mig-faker reports "Successfuly updated MIG config". I don't know how to get the GPU instances into the `gpus` section of the topology rather than only into `otherDevices`.
Question
Am I misunderstanding something? Is there a different, undocumented workflow for enabling MIG monitoring?
Thanks for the great work done on this project!