[Issue]: torch.cuda.mem_get_info function RuntimeError(HIP error: invalid argument) #330

@jaeyung2

Problem Description

I'm trying to run sample code in a container after installing the gpu-operator Helm chart on my cluster (Kubernetes version 1.32.8). My test code is below. How can I solve this problem? I'd appreciate it if you could let me know.

import torch
if torch.cuda.is_available():
    free_memory_bytes, total_memory_bytes = torch.cuda.mem_get_info()

    free_memory_gb = free_memory_bytes / (1024**3)
    total_memory_gb = total_memory_bytes / (1024**3)

    print(f"Free GPU memory: {free_memory_gb:.2f} GB")
    print(f"Total GPU memory: {total_memory_gb:.2f} GB")

else:
    print("CUDA is not available. Cannot get GPU memory info.")

The error message is:

Traceback (most recent call last):
  File "/var/lib/jenkins/test.py", line 4, in <module>
    free_memory_bytes, total_memory_bytes = torch.cuda.mem_get_info()
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/cuda/memory.py", line 738, in mem_get_info
    return torch.cuda.cudart().cudaMemGetInfo(device)
RuntimeError: HIP error: invalid argument
HIP kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing AMD_SERIALIZE_KERNEL=3
Compile with `TORCH_USE_HIP_DSA` to enable device-side assertions.
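
To narrow down whether the error comes from one particular device or from all of them, the memory query can be repeated per device index with the exception caught for each. The following is a minimal sketch, not a confirmed fix: `probe_devices` is a hypothetical helper name, and the query function is passed in so the loop can also be exercised without a GPU.

```python
from typing import Callable, Dict, Tuple, Union

MemInfo = Tuple[int, int]  # (free_bytes, total_bytes)


def probe_devices(
    device_count: int,
    mem_get_info: Callable[[int], MemInfo],
) -> Dict[int, Union[MemInfo, str]]:
    """Query each device index separately and record either the result or the error."""
    results: Dict[int, Union[MemInfo, str]] = {}
    for index in range(device_count):
        try:
            results[index] = mem_get_info(index)
        except RuntimeError as exc:
            # Keep going so one failing device does not hide the others.
            results[index] = f"error: {exc}"
    return results


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        torch = None

    if torch is not None and torch.cuda.is_available():
        report = probe_devices(torch.cuda.device_count(), torch.cuda.mem_get_info)
        for index, outcome in report.items():
            print(f"device {index}: {outcome}")
    else:
        print("CUDA/HIP is not available. Cannot probe devices.")
```

If only some indices report an error, that points at specific devices rather than at the PyTorch installation as a whole.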

My pod YAML:

apiVersion: v1
kind: Pod
metadata:
  name: rocm-pytorch-test
  namespace: kube-amd-gpu
  labels:
    app: rocm-pytorch-test
spec:
  restartPolicy: Never
  tolerations:
    - key: "amd.com/gpu"
      operator: "Exists"
      effect: "NoSchedule"
  containers:
    - name: rocm-pytorch
      image: rocm/pytorch:rocm6.4.2_ubuntu22.04_py3.10_pytorch_release_2.6.0
      imagePullPolicy: IfNotPresent
      command: ["sleep", "infinity"]
      resources:
        limits:
          "amd.com/gpu": 8

Here is my OS and GPU information:

OS: NAME="Ubuntu"
VERSION="22.04.5 LTS (Jammy Jellyfish)"
CPU: 
model name	: AMD EPYC 7413 24-Core Processor
GPU:
  Name:                    AMD EPYC 7413 24-Core Processor    
  Marketing Name:          AMD EPYC 7413 24-Core Processor    
  Name:                    AMD EPYC 7413 24-Core Processor    
  Marketing Name:          AMD EPYC 7413 24-Core Processor    
  Name:                    gfx90a                             
  Marketing Name:          AMD Instinct MI250X/MI250          
      Name:                    amdgcn-amd-amdhsa--gfx90a:sramecc+:xnack-
  Name:                    gfx90a                             
  Marketing Name:          AMD Instinct MI250X/MI250          
      Name:                    amdgcn-amd-amdhsa--gfx90a:sramecc+:xnack-
  Name:                    gfx90a                             
  Marketing Name:          AMD Instinct MI250X/MI250          
      Name:                    amdgcn-amd-amdhsa--gfx90a:sramecc+:xnack-
  Name:                    gfx90a                             
  Marketing Name:          AMD Instinct MI250X/MI250          
      Name:                    amdgcn-amd-amdhsa--gfx90a:sramecc+:xnack-
  Name:                    gfx90a                             
  Marketing Name:          AMD Instinct MI250X/MI250          
      Name:                    amdgcn-amd-amdhsa--gfx90a:sramecc+:xnack-
  Name:                    gfx90a                             
  Marketing Name:          AMD Instinct MI250X/MI250          
      Name:                    amdgcn-amd-amdhsa--gfx90a:sramecc+:xnack-
  Name:                    gfx90a                             
  Marketing Name:          AMD Instinct MI250X/MI250          
      Name:                    amdgcn-amd-amdhsa--gfx90a:sramecc+:xnack-
  Name:                    gfx90a                             
  Marketing Name:          AMD Instinct MI250X/MI250          
      Name:                    amdgcn-amd-amdhsa--gfx90a:sramecc+:xnack-
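
The listing above shows eight gfx90a agents. One way to check that the container sees matching device nodes is sketched below. It assumes ROCm user space reaches the GPUs through `/dev/kfd` and the `/dev/dri/renderD*` render nodes; `list_gpu_device_nodes` is a hypothetical helper name, not part of any ROCm or gpu-operator tooling.

```python
import glob
import os
from typing import Dict, List, Union


def list_gpu_device_nodes(dev_root: str = "/dev") -> Dict[str, Union[bool, List[str]]]:
    """Report whether the KFD node exists and which DRM render nodes are visible."""
    kfd_path = os.path.join(dev_root, "kfd")
    render_nodes = sorted(glob.glob(os.path.join(dev_root, "dri", "renderD*")))
    return {
        "kfd_present": os.path.exists(kfd_path),
        "render_nodes": render_nodes,
    }


if __name__ == "__main__":
    info = list_gpu_device_nodes()
    print(f"/dev/kfd present: {info['kfd_present']}")
    print(f"render nodes visible: {len(info['render_nodes'])}")
    for node in info["render_nodes"]:
        print(f"  {node}")
```

Run inside the pod, the number of render nodes can then be compared with the `amd.com/gpu` limit requested in the pod spec.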

Operating System

Ubuntu 22.04

CPU

AMD EPYC 7413 24-Core Processor

GPU

AMD Instinct MI250X/MI250

ROCm Version

ROCm 6.4.2

ROCm Component

HIP

Steps to Reproduce

No response

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

No response

Additional Information

No response

Labels

RCA done (Root Cause Analysis done), enhancement (New feature or request)
