Description
I have a 4-way H100 SXM setup, and one of our tools hits a strange issue when it destroys MIG instances. Destroying the CI works fine (it is the default compute instance, which uses up the whole GPU instance). Destroying its GI via the API then also reports success, but I still see remnants of the GI (notice the *):
:sudo nvidia-smi mig -lcip
+--------------------------------------------------------------------------------------+
| Compute instance profiles: |
| GPU GPU Name Profile Instances Exclusive Shared |
| Instance ID Free/Total SM DEC ENC OFA |
| ID CE JPEG |
|======================================================================================|
| 2 3 MIG 1c.2g.20gb 0 2/2 16 2 0 0 |
| 2 2 |
+--------------------------------------------------------------------------------------+
| 2 3 MIG 1c.2g.20gb 7 1/1 26 2 0 0 |
| 2 2 |
+--------------------------------------------------------------------------------------+
| 2 3 MIG 2g.20gb 1* 1/1 32 2 0 0 |
| 2 2 |
+--------------------------------------------------------------------------------------+
| 2 4 MIG 1c.2g.20gb 0 0/2 16 2 0 0 |
| 2 2 |
+--------------------------------------------------------------------------------------+
| 2 4 MIG 1c.2g.20gb 7 0/1 26 2 0 0 |
| 2 2 |
+--------------------------------------------------------------------------------------+
| 2 4 MIG 2g.20gb 1* 0/1 32 2 0 0 |
| 2 2 |
+--------------------------------------------------------------------------------------+
| 2 2 MIG 1c.3g.40gb 0 0/3 16 3 0 0 |
| 3 3 |
+--------------------------------------------------------------------------------------+
| 2 2 MIG 1c.3g.40gb 7 0/2 26 3 0 0 |
| 3 3 |
+--------------------------------------------------------------------------------------+
| 2 2 MIG 2c.3g.40gb 1 0/1 32 3 0 0 |
| 3 3 |
+--------------------------------------------------------------------------------------+
| 2 2 MIG 3g.40gb 2* 0/1 60 3 0 0 |
| 3 3 |
+--------------------------------------------------------------------------------------+
| 3 3 MIG 1c.2g.20gb 0 0/2 16 2 0 0 |
| 2 2 |
+--------------------------------------------------------------------------------------+
| 3 3 MIG 1c.2g.20gb 7 0/1 26 2 0 0 |
| 2 2 |
+--------------------------------------------------------------------------------------+
| 3 3 MIG 2g.20gb 1* 0/1 32 2 0 0 |
| 2 2 |
+--------------------------------------------------------------------------------------+
| 3 4 MIG 1c.2g.20gb 0 0/2 16 2 0 0 |
| 2 2 |
+--------------------------------------------------------------------------------------+
| 3 4 MIG 1c.2g.20gb 7 0/1 26 2 0 0 |
| 2 2 |
+--------------------------------------------------------------------------------------+
| 3 4 MIG 2g.20gb 1* 0/1 32 2 0 0 |
| 2 2 |
+--------------------------------------------------------------------------------------+
| 3 2 MIG 1c.3g.40gb 0 0/3 16 3 0 0 |
| 3 3 |
+--------------------------------------------------------------------------------------+
| 3 2 MIG 1c.3g.40gb 7 0/2 26 3 0 0 |
| 3 3 |
+--------------------------------------------------------------------------------------+
| 3 2 MIG 2c.3g.40gb 1 0/1 32 3 0 0 |
| 3 3 |
+--------------------------------------------------------------------------------------+
| 3 2 MIG 3g.40gb 2* 0/1 60 3 0 0 |
| 3 3 |
+--------------------------------------------------------------------------------------+
:sudo nvidia-smi mig -lgi
+-------------------------------------------------------+
| GPU instances: |
| GPU Name Profile Instance Placement |
| ID ID Start:Size |
|=======================================================|
| 2 MIG 2g.20gb 14 3 0:2 |
+-------------------------------------------------------+
| 2 MIG 2g.20gb 14 4 2:2 |
+-------------------------------------------------------+
| 2 MIG 3g.40gb 9 2 4:4 |
+-------------------------------------------------------+
| 3 MIG 2g.20gb 14 3 0:2 |
+-------------------------------------------------------+
| 3 MIG 2g.20gb 14 4 2:2 |
+-------------------------------------------------------+
| 3 MIG 3g.40gb 9 2 4:4 |
+-------------------------------------------------------+
:sudo nvidia-smi mig -lci
+--------------------------------------------------------------------+
| Compute instances: |
| GPU GPU Name Profile Instance Placement |
| Instance ID ID Start:Size |
| ID |
|====================================================================|
| 2 4 MIG 2g.20gb 1 0 0:2 |
+--------------------------------------------------------------------+
| 2 2 MIG 3g.40gb 2 0 0:4 |
+--------------------------------------------------------------------+
| 3 3 MIG 2g.20gb 1 0 0:2 |
+--------------------------------------------------------------------+
| 3 4 MIG 2g.20gb 1 0 0:2 |
+--------------------------------------------------------------------+
| 3 2 MIG 3g.40gb 2 0 0:4 |
+--------------------------------------------------------------------+
Notice that GPU instance 3 on GPU 2 is still listed, with no compute instance left on it, even though the API call returned successfully. Destroying it by hand fails as well:
:sudo nvidia-smi mig -dgi -gi 3 -i 2
Unable to destroy GPU instance ID 3 from GPU 2: In use by another client
Failed to destroy GPU instances: In use by another client
I also don't see anything holding the capability device for this GPU instance open (I don't want to run the same check on /dev/nvidia2, since that GPU has other MIG instances that would have processes running on them):
:grep gpu2/gi3 /proc/driver/nvidia-caps/mig-minors
gpu2/gi3/access 300
gpu2/gi3/ci0/access 301
gpu2/gi3/ci1/access 302
gpu2/gi3/ci2/access 303
gpu2/gi3/ci3/access 304
gpu2/gi3/ci4/access 305
gpu2/gi3/ci5/access 306
gpu2/gi3/ci6/access 307
gpu2/gi3/ci7/access 308
:lsof /dev/nvidia-caps/nvidia-cap300
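The lsof call above only covers minor 300, the GI access node, and not the ci0..ci7 nodes (301 to 308). As a rough sketch of how the check could be extended to all of them, the Go snippet below maps every mig-minors entry under a prefix to its /dev/nvidia-caps/nvidia-cap<minor> path. The function name capDevicePaths is hypothetical, and the path scheme is assumed from the nvidia-cap300 device used above.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strconv"
	"strings"
)

// capDevicePaths (hypothetical helper) parses the text of
// /proc/driver/nvidia-caps/mig-minors and returns the capability device path
// for every entry below prefix, e.g. "gpu2/gi3".
// Assumption: minor N corresponds to /dev/nvidia-caps/nvidia-capN.
func capDevicePaths(minors, prefix string) []string {
	var paths []string
	sc := bufio.NewScanner(strings.NewReader(minors))
	for sc.Scan() {
		// Each line looks like "gpu2/gi3/ci0/access 301".
		fields := strings.Fields(sc.Text())
		if len(fields) != 2 {
			continue
		}
		// Require the trailing slash so "gpu2/gi3" does not match "gpu2/gi30".
		if !strings.HasPrefix(fields[0], prefix+"/") {
			continue
		}
		minor, err := strconv.Atoi(fields[1])
		if err != nil {
			continue
		}
		paths = append(paths, fmt.Sprintf("/dev/nvidia-caps/nvidia-cap%d", minor))
	}
	return paths
}

func main() {
	data, err := os.ReadFile("/proc/driver/nvidia-caps/mig-minors")
	if err != nil {
		// No MIG-capable driver on this machine: nothing to list.
		fmt.Fprintln(os.Stderr, "read mig-minors:", err)
		return
	}
	for _, p := range capDevicePaths(string(data), "gpu2/gi3") {
		fmt.Println(p)
	}
}
```

Each printed path can then be passed to lsof in the same way as nvidia-cap300 above, which would show whether any of the compute-instance nodes is still held open.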
What is happening here?
I'm using NVIDIA driver 550.144.03, CUDA runtime 12.2.140, go-nvml v0.12.4-0, and Go 1.22.5.
Thanks in advance.