
Race condition in multi‑GPU allocation when multiple processes/containers share device visibility #73

@wirlessBrain

Description

When multiple GPUs are connected to a host, our algorithm selects a list of devices on which to run a network that requires multiple GPUs. The selected devices are then used for inference.

The problem arises when two processes run simultaneously (either directly on the host or inside separate containers) with visibility to all GPUs. Currently, there is no mechanism for one process to know that certain GPUs have already been picked and reserved by another process.

This leads to a race condition:

Process 1 selects a set of GPUs and begins inference.

Process 2, unaware of Process 1’s allocation, may also select overlapping GPUs.

Both processes attempt to use the same devices, causing conflicts, degraded performance, or failures.

Question: If we expose visibility of all GPUs to more than one container, what mechanisms exist to prevent race conditions in GPU allocation?
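One common mechanism (not specific to this project) is cross-process advisory locking: each GPU index is guarded by a lock file on a path shared by all processes and containers (e.g. a bind-mounted lock directory). A process only considers a GPU "free" if it can take an exclusive non-blocking lock on that GPU's file. A minimal sketch, assuming Linux and a shared lock directory; the directory path, file names, and function name are illustrative, not part of any existing API:

```python
# Sketch: cross-process GPU reservation via advisory file locks (fcntl.flock).
# Assumes all competing processes/containers see the same LOCK_DIR
# (e.g. bind-mount it into every container).
import fcntl
import os

LOCK_DIR = "/tmp/gpu_locks"  # hypothetical shared path

def try_reserve_gpus(candidates, needed):
    """Try to reserve `needed` GPUs from `candidates`.

    Returns (gpu_indices, lock_fds) on success, or (None, []) if not
    enough GPUs are free. Keep the fds open for as long as the GPUs
    are in use; the kernel releases the locks when they are closed
    (including on process crash), so no stale state is left behind.
    """
    os.makedirs(LOCK_DIR, exist_ok=True)
    reserved, fds = [], []
    for idx in candidates:
        fd = os.open(os.path.join(LOCK_DIR, f"gpu{idx}.lock"),
                     os.O_CREAT | os.O_RDWR)
        try:
            # Exclusive, non-blocking: fails immediately if another
            # process already holds this GPU's lock.
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            os.close(fd)  # GPU reserved elsewhere; try the next one
            continue
        reserved.append(idx)
        fds.append(fd)
        if len(reserved) == needed:
            return reserved, fds
    # Not enough free GPUs: release everything we grabbed.
    for fd in fds:
        fcntl.flock(fd, fcntl.LOCK_UN)
        os.close(fd)
    return None, []
```

With this in place, Process 2 skips any GPU whose lock Process 1 already holds, so the two selections cannot overlap. At the driver level, setting the GPUs to exclusive compute mode (`nvidia-smi -c EXCLUSIVE_PROCESS`) can serve as a backstop, since the driver then refuses a second CUDA context on a busy device, though that turns the overlap into a hard failure rather than preventing it during selection.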
