When multiple GPUs are attached to a host, our algorithm selects the set of devices on which to run a network that requires more than one GPU; the selected GPUs are then used for inference.
The problem arises when two processes run simultaneously (either directly on the host or inside separate containers) with visibility of all GPUs. Currently, there is no mechanism for one process to learn that certain GPUs have already been selected and reserved by another process.
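For concreteness, the selection logic is essentially a scan of per-device state followed by a pick. The sketch below is illustrative rather than our actual implementation: it assumes NVML via the `pynvml` bindings, and `select_gpus` and the free-memory threshold are made-up names and values. The point is that the check and the pick are not atomic across processes.

```python
# Illustrative sketch of a naive "scan and pick" GPU selector using NVML
# (pynvml). Nothing here prevents two processes from scanning at the same
# time and picking the same devices.
import pynvml

def select_gpus(count, min_free_bytes=8 * 1024**3):
    """Pick `count` GPUs that currently look free.

    This check-then-use pattern is not atomic: another process can run the
    same scan and choose the same devices before either starts inference.
    """
    pynvml.nvmlInit()
    try:
        candidates = []
        for i in range(pynvml.nvmlDeviceGetCount()):
            handle = pynvml.nvmlDeviceGetHandleByIndex(i)
            mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
            if mem.free >= min_free_bytes:
                candidates.append(i)
        if len(candidates) < count:
            raise RuntimeError("not enough free GPUs")
        return candidates[:count]
    finally:
        pynvml.nvmlShutdown()
```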
This leads to a race condition:
Process 1 selects a set of GPUs and begins inference.
Process 2, unaware of Process 1’s allocation, may also select overlapping GPUs.
Both processes attempt to use the same devices, causing conflicts, degraded performance, or failures.
Question: If all GPUs are visible to more than one container, what mechanisms exist to prevent race conditions in GPU allocation? A sketch of the kind of mechanism we have in mind follows.
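To make the question concrete, one candidate mechanism would be per-GPU advisory file locks in a directory that every container can see (e.g. a shared bind mount on the same filesystem), so that checking and reserving a GPU become a single atomic step. This is only a sketch under those assumptions; `LOCK_DIR`, `try_reserve`, and `reserve_gpus` are hypothetical names, not an existing API:

```python
# Sketch of per-GPU advisory file locks shared across containers.
# Assumes LOCK_DIR is the same filesystem in every container (bind mount).
import fcntl
import os

LOCK_DIR = "/var/run/gpu-locks"  # hypothetical shared lock directory

def try_reserve(gpu_index):
    """Atomically reserve one GPU, or return None if it is already taken.

    The kernel releases the flock automatically when the process exits,
    so a crashed holder does not leave a stale reservation behind.
    """
    os.makedirs(LOCK_DIR, exist_ok=True)
    fd = os.open(os.path.join(LOCK_DIR, f"gpu{gpu_index}.lock"),
                 os.O_CREAT | os.O_RDWR)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)  # non-blocking
        return fd  # keep this fd open for the lifetime of the inference job
    except BlockingIOError:
        os.close(fd)
        return None

def reserve_gpus(count, total):
    """Check-and-reserve in one step: only GPUs we could lock are used."""
    held = {}
    for i in range(total):
        fd = try_reserve(i)
        if fd is not None:
            held[i] = fd
            if len(held) == count:
                return held
    # Roll back partial reservations if not enough GPUs could be locked.
    for fd in held.values():
        os.close(fd)
    raise RuntimeError("not enough unreserved GPUs")
```

The appeal of `flock` here is that the reservation disappears automatically if the holder exits or crashes; the obvious limitation is that it is purely cooperative, i.e. it only helps if every process that can see the GPUs goes through the same reservation path.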