[FlowInsight] The error in obtaining the host machine pid in the container causes the physical view GPU information to be incorrect. #659

@zhouhansheng

Description

What happened + What you expected to happen

https://github.com/antgroup/ant-ray/blob/main/python/ray/dashboard/modules/reporter/reporter_agent.py#L640C12-L662C34

for container_pid in proc_dirs:
    sched_path = f"/proc/{container_pid}/sched"
    if os.path.exists(sched_path):
        try:
            with open(sched_path, "r") as f:
                first_line = f.readline()
                # Extract host PID using regex
                match = pattern.search(first_line)
                if match:
                    host_pid = int(match.group(1))
                    # Only store if it's one of the PIDs we're looking for
                    if host_pid in host_pids_to_find:
                        host_to_container_pid_map[host_pid] = int(
                            container_pid
                        )
                        # If we've found all the PIDs we need, we can stop searching
                        if len(host_to_container_pid_map) == len(
                            host_pids_to_find
                        ):
                            break
        except (IOError, ValueError):
            # Skip files we can't read or parse
            continue

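This code walks every /proc/{container_pid}/sched file and treats the PID on the first line as the host PID. That looks like a bug: in my environment, the PID recorded on the first line of /proc/{container_pid}/sched is the PID inside the container, not the host PID. As a result, the host PIDs being looked up are never matched and host_to_container_pid_map stays empty or incomplete.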
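A quick way to check this on a given machine is to compare the PID in each sched header with the /proc directory name it was read from. The sketch below is only a diagnostic, not a fix. The helper names (parse_sched_pid, sched_pid_mismatches) and the regex are my own assumptions; the regex assumes the first line has the form "comm (pid, #threads: N)", and the pattern used in reporter_agent.py may differ.

```python
import os
import re

# Assumed header format of /proc/<pid>/sched: "comm (pid, #threads: N)".
# Anchoring at the end of the line tolerates parentheses inside comm.
SCHED_HEADER = re.compile(r"\((\d+), #threads: \d+\)\s*$")


def parse_sched_pid(first_line):
    """Return the PID found in a sched header line, or None if it does not parse."""
    match = SCHED_HEADER.search(first_line)
    return int(match.group(1)) if match else None


def sched_pid_mismatches(proc_root="/proc"):
    """Map each /proc PID to its sched header PID wherever the two differ."""
    mismatches = {}
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as "self" or "cpuinfo"
        try:
            with open(os.path.join(proc_root, entry, "sched")) as f:
                sched_pid = parse_sched_pid(f.readline())
        except OSError:
            continue  # process exited or file not readable
        if sched_pid is not None and sched_pid != int(entry):
            mismatches[int(entry)] = sched_pid
    return mismatches


if __name__ == "__main__":
    # An empty dict means every sched header repeats the PID visible in /proc,
    # so the header cannot be used to recover a different (host) PID.
    print(sched_pid_mismatches())
```

If this prints an empty dict inside the container, the sched header only repeats the container-side PID, and the host_pid in host_pids_to_find check in the quoted loop cannot match the host PIDs.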

PID of the process using the GPU, as seen inside the container:
Image

PID of the same process as seen on the host:
Image

Because the wrong host PID is obtained inside the container, the GPU information is missing from the physical view:
Image

If the container is started with --pid=host, the GPU metrics are displayed correctly:
Image

Versions / Dependencies

ant-ray/main

Reproduction script

Container startup command that reproduces the bug:
docker run -p 28268:28268 --runtime=nvidia -itd --name own-dashboard --shm-size=16gb --privileged --network dockerBridge ant-ray:verl040-raymain bash

Container startup command that does not reproduce the bug:
docker run -p 28268:28268 --runtime=nvidia --pid=host -itd --name own-dashboard --shm-size=16gb --privileged --network dockerBridge ant-ray:verl040-raymain bash

The only difference between the two commands is the --pid=host flag.
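To confirm which of the two setups a running container is in, the PID namespace identifier can be compared between the host and the container. This is a minimal sketch, assuming a Linux system where /proc/self/ns/pid is a symlink of the form "pid:[<inode>]"; the function names are hypothetical and not part of ant-ray.

```python
import os
import re

# Link target format of /proc/<pid>/ns/pid on Linux, e.g. "pid:[4026531836]".
NS_LINK = re.compile(r"^pid:\[(\d+)\]$")


def parse_pid_ns(link_target):
    """Return the namespace inode number from a link target, or None."""
    match = NS_LINK.match(link_target)
    return int(match.group(1)) if match else None


def current_pid_ns(path="/proc/self/ns/pid"):
    """Return the PID namespace inode of the calling process, or None if unavailable."""
    try:
        return parse_pid_ns(os.readlink(path))
    except OSError:
        return None  # not Linux, or the link is not readable


if __name__ == "__main__":
    print(current_pid_ns())
```

Run it once on the host and once inside the container. With --pid=host the container shares the host's PID namespace, so the two numbers should be equal; without the flag they should differ, which is the case where the PID mapping above fails.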

Issue Severity

Low: It annoys or frustrates me.
