Labels
RCA done (Root Cause Analysis done), enhancement (New feature or request)
Description
Problem Description
I'm trying to run sample code in a container after installing the gpu-operator Helm chart on my cluster (Kubernetes version 1.32.8). My test code is below. How can I solve this problem? I'd appreciate it if you could let me know.
import torch
if torch.cuda.is_available():
    free_memory_bytes, total_memory_bytes = torch.cuda.mem_get_info()
    free_memory_gb = free_memory_bytes / (1024**3)
    total_memory_gb = total_memory_bytes / (1024**3)
    print(f"Free GPU memory: {free_memory_gb:.2f} GB")
    print(f"Total GPU memory: {total_memory_gb:.2f} GB")
else:
    print("CUDA is not available. Cannot get GPU memory info.")
The error message is:
Traceback (most recent call last):
File "/var/lib/jenkins/test.py", line 4, in <module>
free_memory_bytes, total_memory_bytes = torch.cuda.mem_get_info()
File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/cuda/memory.py", line 738, in mem_get_info
return torch.cuda.cudart().cudaMemGetInfo(device)
RuntimeError: HIP error: invalid argument
HIP kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing AMD_SERIALIZE_KERNEL=3
Compile with `TORCH_USE_HIP_DSA` to enable device-side assertions.
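One way to narrow down an error like this is to query each visible device separately instead of only the current one. The following is a minimal sketch, not a confirmed fix: `probe_devices` is a hypothetical helper name (not part of PyTorch), and the module is passed in as an argument so the logic can be exercised without a GPU. It assumes `torch.cuda.device_count()` and `torch.cuda.mem_get_info(device)` behave as in PyTorch 2.x.

```python
def probe_devices(torch_module):
    """Query free/total memory for every visible device.

    Returns one dict per device. A failing device is recorded with its
    error text instead of aborting the whole loop.
    """
    results = []
    if not torch_module.cuda.is_available():
        return results
    for index in range(torch_module.cuda.device_count()):
        try:
            free_bytes, total_bytes = torch_module.cuda.mem_get_info(index)
            results.append({
                "device": index,
                "free_gib": free_bytes / 1024**3,
                "total_gib": total_bytes / 1024**3,
                "error": None,
            })
        except RuntimeError as exc:
            results.append({
                "device": index,
                "free_gib": None,
                "total_gib": None,
                "error": str(exc),
            })
    return results


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed in this environment.")
    else:
        rows = probe_devices(torch)
        if not rows:
            print("No CUDA/HIP device is available.")
        for row in rows:
            if row["error"] is None:
                print(f"device {row['device']}: "
                      f"{row['free_gib']:.2f} / {row['total_gib']:.2f} GiB free")
            else:
                print(f"device {row['device']}: FAILED: {row['error']}")
```

If only some indices fail, that points at specific devices exposed to the container rather than at the script itself. If every index fails, the problem is more likely in how the devices are passed into the pod.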
My pod YAML:
apiVersion: v1
kind: Pod
metadata:
  name: rocm-pytorch-test
  namespace: kube-amd-gpu
  labels:
    app: rocm-pytorch-test
spec:
  restartPolicy: Never
  tolerations:
    - key: "amd.com/gpu"
      operator: "Exists"
      effect: "NoSchedule"
  containers:
    - name: rocm-pytorch
      image: rocm/pytorch:rocm6.4.2_ubuntu22.04_py3.10_pytorch_release_2.6.0
      imagePullPolicy: IfNotPresent
      command: ["sleep", "infinity"]
      resources:
        limits:
          "amd.com/gpu": 8
Here is my OS and GPU information:
OS: NAME="Ubuntu"
VERSION="22.04.5 LTS (Jammy Jellyfish)"
CPU:
model name : AMD EPYC 7413 24-Core Processor
GPU:
Name: AMD EPYC 7413 24-Core Processor
Marketing Name: AMD EPYC 7413 24-Core Processor
Name: AMD EPYC 7413 24-Core Processor
Marketing Name: AMD EPYC 7413 24-Core Processor
Name: gfx90a
Marketing Name: AMD Instinct MI250X/MI250
Name: amdgcn-amd-amdhsa--gfx90a:sramecc+:xnack-
Name: gfx90a
Marketing Name: AMD Instinct MI250X/MI250
Name: amdgcn-amd-amdhsa--gfx90a:sramecc+:xnack-
Name: gfx90a
Marketing Name: AMD Instinct MI250X/MI250
Name: amdgcn-amd-amdhsa--gfx90a:sramecc+:xnack-
Name: gfx90a
Marketing Name: AMD Instinct MI250X/MI250
Name: amdgcn-amd-amdhsa--gfx90a:sramecc+:xnack-
Name: gfx90a
Marketing Name: AMD Instinct MI250X/MI250
Name: amdgcn-amd-amdhsa--gfx90a:sramecc+:xnack-
Name: gfx90a
Marketing Name: AMD Instinct MI250X/MI250
Name: amdgcn-amd-amdhsa--gfx90a:sramecc+:xnack-
Name: gfx90a
Marketing Name: AMD Instinct MI250X/MI250
Name: amdgcn-amd-amdhsa--gfx90a:sramecc+:xnack-
Name: gfx90a
Marketing Name: AMD Instinct MI250X/MI250
Name: amdgcn-amd-amdhsa--gfx90a:sramecc+:xnack-
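The listing above shows eight `gfx90a` entries, matching the eight `amd.com/gpu` devices requested in the pod spec. To compare that count between the host and the container, the agent names can be tallied from saved output. This is a minimal sketch, assuming the text contains `Name: gfx...` lines as in the listing above; `count_gpu_agents` is a hypothetical helper, not part of any ROCm tool.

```python
import re
from collections import Counter


def count_gpu_agents(listing_text):
    """Count agents whose 'Name:' value starts with 'gfx'.

    'Marketing Name:' lines and ISA names such as
    'amdgcn-amd-amdhsa--gfx90a:...' do not match this pattern.
    """
    pattern = r"^[ \t]*Name:[ \t]+(gfx\w+)[ \t]*$"
    return Counter(re.findall(pattern, listing_text, flags=re.MULTILINE))


if __name__ == "__main__":
    sample = """\
Name: AMD EPYC 7413 24-Core Processor
Marketing Name: AMD EPYC 7413 24-Core Processor
Name: gfx90a
Marketing Name: AMD Instinct MI250X/MI250
Name: amdgcn-amd-amdhsa--gfx90a:sramecc+:xnack-
Name: gfx90a
Marketing Name: AMD Instinct MI250X/MI250
Name: amdgcn-amd-amdhsa--gfx90a:sramecc+:xnack-
"""
    for name, count in count_gpu_agents(sample).items():
        print(f"{name}: {count}")
```

Running the same tally on the host and inside the pod shows whether the container sees as many GPU agents as the node does.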
Operating System
Ubuntu 22.04
CPU
AMD EPYC 7413 24-Core Processor
GPU
AMD Instinct MI250X/MI250
ROCm Version
ROCm 6.4.2
ROCm Component
HIP
Steps to Reproduce
No response
(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support
No response
Additional Information
No response
insukim1994