[Bug] MIG destroys are not working correctly at times #146

@bergentruckung

Description

I have a 4-way H100 SXM setup, and one of our tools is hitting a strange issue when it destroys MIG instances. Destroying the compute instance (CI) works fine (it is the default compute instance, which uses up the whole GPU instance). Destroying its GPU instance (GI) via the API then also reports success, but I still see remnants of the GI (notice the *):

:sudo nvidia-smi mig -lcip                                                                                                                                                                                                                                        
+--------------------------------------------------------------------------------------+
| Compute instance profiles:                                                           |
| GPU     GPU       Name             Profile  Instances   Exclusive       Shared       |
|       Instance                       ID     Free/Total     SM       DEC   ENC   OFA  |
|         ID                                                          CE    JPEG       |
|======================================================================================|
|   2      3       MIG 1c.2g.20gb       0      2/2           16        2     0     0   |
|                                                                      2     2         |
+--------------------------------------------------------------------------------------+
|   2      3       MIG 1c.2g.20gb       7      1/1           26        2     0     0   |
|                                                                      2     2         |
+--------------------------------------------------------------------------------------+
|   2      3       MIG 2g.20gb          1*     1/1           32        2     0     0   |
|                                                                      2     2         |
+--------------------------------------------------------------------------------------+
|   2      4       MIG 1c.2g.20gb       0      0/2           16        2     0     0   |
|                                                                      2     2         |
+--------------------------------------------------------------------------------------+
|   2      4       MIG 1c.2g.20gb       7      0/1           26        2     0     0   |
|                                                                      2     2         |
+--------------------------------------------------------------------------------------+
|   2      4       MIG 2g.20gb          1*     0/1           32        2     0     0   |
|                                                                      2     2         |
+--------------------------------------------------------------------------------------+
|   2      2       MIG 1c.3g.40gb       0      0/3           16        3     0     0   |
|                                                                      3     3         |
+--------------------------------------------------------------------------------------+
|   2      2       MIG 1c.3g.40gb       7      0/2           26        3     0     0   |
|                                                                      3     3         |
+--------------------------------------------------------------------------------------+
|   2      2       MIG 2c.3g.40gb       1      0/1           32        3     0     0   |
|                                                                      3     3         |
+--------------------------------------------------------------------------------------+
|   2      2       MIG 3g.40gb          2*     0/1           60        3     0     0   |
|                                                                      3     3         |
+--------------------------------------------------------------------------------------+
|   3      3       MIG 1c.2g.20gb       0      0/2           16        2     0     0   |
|                                                                      2     2         |
+--------------------------------------------------------------------------------------+
|   3      3       MIG 1c.2g.20gb       7      0/1           26        2     0     0   |
|                                                                      2     2         |
+--------------------------------------------------------------------------------------+
|   3      3       MIG 2g.20gb          1*     0/1           32        2     0     0   |
|                                                                      2     2         |
+--------------------------------------------------------------------------------------+
|   3      4       MIG 1c.2g.20gb       0      0/2           16        2     0     0   |
|                                                                      2     2         |
+--------------------------------------------------------------------------------------+
|   3      4       MIG 1c.2g.20gb       7      0/1           26        2     0     0   |
|                                                                      2     2         |
+--------------------------------------------------------------------------------------+
|   3      4       MIG 2g.20gb          1*     0/1           32        2     0     0   |
|                                                                      2     2         |
+--------------------------------------------------------------------------------------+
|   3      2       MIG 1c.3g.40gb       0      0/3           16        3     0     0   |
|                                                                      3     3         |
+--------------------------------------------------------------------------------------+
|   3      2       MIG 1c.3g.40gb       7      0/2           26        3     0     0   |
|                                                                      3     3         |
+--------------------------------------------------------------------------------------+
|   3      2       MIG 2c.3g.40gb       1      0/1           32        3     0     0   |
|                                                                      3     3         |
+--------------------------------------------------------------------------------------+
|   3      2       MIG 3g.40gb          2*     0/1           60        3     0     0   |
|                                                                      3     3         |
+--------------------------------------------------------------------------------------+
:sudo nvidia-smi mig -lgi                                            
+-------------------------------------------------------+            
| GPU instances:                                        |            
| GPU   Name             Profile  Instance   Placement  |            
|                          ID       ID       Start:Size |            
|=======================================================|            
|   2  MIG 2g.20gb         14        3          0:2     |            
+-------------------------------------------------------+            
|   2  MIG 2g.20gb         14        4          2:2     |            
+-------------------------------------------------------+            
|   2  MIG 3g.40gb          9        2          4:4     |            
+-------------------------------------------------------+            
|   3  MIG 2g.20gb         14        3          0:2     |            
+-------------------------------------------------------+            
|   3  MIG 2g.20gb         14        4          2:2     |            
+-------------------------------------------------------+            
|   3  MIG 3g.40gb          9        2          4:4     |            
+-------------------------------------------------------+            
:sudo nvidia-smi mig -lci                                              
+--------------------------------------------------------------------+ 
| Compute instances:                                                 | 
| GPU     GPU       Name             Profile   Instance   Placement  | 
|       Instance                       ID        ID       Start:Size | 
|         ID                                                         | 
|====================================================================| 
|   2      4       MIG 2g.20gb          1         0          0:2     | 
+--------------------------------------------------------------------+ 
|   2      2       MIG 3g.40gb          2         0          0:4     | 
+--------------------------------------------------------------------+ 
|   3      3       MIG 2g.20gb          1         0          0:2     | 
+--------------------------------------------------------------------+ 
|   3      4       MIG 2g.20gb          1         0          0:2     | 
+--------------------------------------------------------------------+ 
|   3      2       MIG 3g.40gb          2         0          0:4     | 
+--------------------------------------------------------------------+ 

Notice that GPU instance 3 on GPU 2 is still present, even though the API call returned successfully. Destroying it by hand doesn't work either:

:sudo nvidia-smi mig -dgi -gi 3 -i 2                                      
Unable to destroy GPU instance ID  3 from GPU  2: In use by another client
Failed to destroy GPU instances: In use by another client                 
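
The mismatch described above (the API reports success, yet the GI is still listed) can be detected programmatically by re-reading the `nvidia-smi mig -lgi` table after each destroy. The following is a minimal sketch, assuming the table layout shown earlier in this issue; `gpuInstance`, `parseLGI`, and `giPresent` are hypothetical helper names, not part of go-nvml or nvidia-smi.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// gpuInstance identifies one data row of the `nvidia-smi mig -lgi` table.
type gpuInstance struct {
	GPU        int
	InstanceID int
	Name       string
}

// parseLGI extracts the data rows from `nvidia-smi mig -lgi` output.
// Assumption: data rows have the six fields seen in this issue:
// <gpu> MIG <profile name> <profile id> <instance id> <start:size>
func parseLGI(output string) []gpuInstance {
	var rows []gpuInstance
	for _, line := range strings.Split(output, "\n") {
		fields := strings.Fields(strings.Trim(strings.TrimSpace(line), "|"))
		if len(fields) != 6 || fields[1] != "MIG" {
			continue // border, header, or unrecognized line
		}
		gpu, errGPU := strconv.Atoi(fields[0])
		id, errID := strconv.Atoi(fields[4])
		if errGPU != nil || errID != nil {
			continue
		}
		rows = append(rows, gpuInstance{GPU: gpu, InstanceID: id, Name: fields[2]})
	}
	return rows
}

// giPresent reports whether the given GPU instance is still listed.
func giPresent(output string, gpu, id int) bool {
	for _, r := range parseLGI(output) {
		if r.GPU == gpu && r.InstanceID == id {
			return true
		}
	}
	return false
}

func main() {
	// Two rows copied from the -lgi output in this issue.
	sample := "|   2  MIG 2g.20gb         14        3          0:2     |\n" +
		"|   2  MIG 3g.40gb          9        2          4:4     |\n"
	fmt.Println("GPU 2 / GI 3 still listed:", giPresent(sample, 2, 3))
}
```

In a real tool the `output` string would come from running `nvidia-smi mig -lgi` right after the destroy call returns, so that a reported success can be cross-checked against what the driver actually lists.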

I also don't see anything holding the corresponding capability device for this GPU instance open (I don't want to run the same check against /dev/nvidia2, as that GPU has other MIG instances with processes running on them):

:grep gpu2/gi3 /proc/driver/nvidia-caps/mig-minors  
gpu2/gi3/access 300                                 
gpu2/gi3/ci0/access 301                             
gpu2/gi3/ci1/access 302                             
gpu2/gi3/ci2/access 303                             
gpu2/gi3/ci3/access 304                             
gpu2/gi3/ci4/access 305                             
gpu2/gi3/ci5/access 306                             
gpu2/gi3/ci6/access 307                             
gpu2/gi3/ci7/access 308                             
:lsof /dev/nvidia-caps/nvidia-cap300                
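
The check above only covers nvidia-cap300, the GI's own access node. The same file also lists compute-instance slots under gpu2/gi3 (minors 301 to 308), so it may be worth checking those device nodes as well. The following is a minimal sketch, assuming the `<path> <minor>` line format and the `/dev/nvidia-caps/nvidia-cap<minor>` naming seen in this issue; `parseMigMinors` and `giCapDevices` are hypothetical helper names.

```go
package main

import (
	"bufio"
	"fmt"
	"sort"
	"strconv"
	"strings"
)

// parseMigMinors parses text in the format of
// /proc/driver/nvidia-caps/mig-minors ("<capability path> <minor>" per line)
// into a map from capability path to minor number.
func parseMigMinors(text string) (map[string]int, error) {
	minors := make(map[string]int)
	sc := bufio.NewScanner(strings.NewReader(text))
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) == 0 {
			continue // skip blank lines
		}
		if len(fields) != 2 {
			return nil, fmt.Errorf("unexpected line %q", sc.Text())
		}
		minor, err := strconv.Atoi(fields[1])
		if err != nil {
			return nil, fmt.Errorf("bad minor in line %q: %w", sc.Text(), err)
		}
		minors[fields[0]] = minor
	}
	return minors, sc.Err()
}

// giCapDevices returns the sorted capability device paths belonging to one
// GPU instance: its own access node plus every compute-instance slot.
func giCapDevices(minors map[string]int, gpu, gi int) []string {
	prefix := fmt.Sprintf("gpu%d/gi%d/", gpu, gi)
	var paths []string
	for name, minor := range minors {
		if strings.HasPrefix(name, prefix) {
			paths = append(paths, fmt.Sprintf("/dev/nvidia-caps/nvidia-cap%d", minor))
		}
	}
	sort.Strings(paths)
	return paths
}

func main() {
	// Two lines copied from the mig-minors output in this issue.
	sample := "gpu2/gi3/access 300\ngpu2/gi3/ci0/access 301\n"
	minors, err := parseMigMinors(sample)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	for _, p := range giCapDevices(minors, 2, 3) {
		fmt.Println(p) // each path is a candidate argument for lsof
	}
}
```

On the affected host, the input would be the contents of /proc/driver/nvidia-caps/mig-minors, and each returned path could be passed to lsof in turn.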

What is happening here?

I'm using NVIDIA driver 550.144.03, CUDA runtime 12.2.140, go-nvml v0.12.4-0, and Go 1.22.5.

Thanks in advance.
