@denseishin

GPU scheduling

GPUs now have their own binary-tree-like data structure, which helps in selecting neighbouring GPUs.
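As a rough illustration (not the PR's actual code; the class and helper names are hypothetical), such a structure could be a complete binary tree with GPUs at the leaves: two leaves under one parent are neighbours, and the two halves of the tree map to the PCIe segments discussed below.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class GpuNode:
    gpu_id: Optional[int] = None        # set only on leaves (physical GPUs)
    left: Optional["GpuNode"] = None
    right: Optional["GpuNode"] = None

def build_gpu_tree(gpu_ids: list[int]) -> GpuNode:
    """Build a complete binary tree over the GPU ids (assumes a power-of-two count).

    Two leaves under the same parent are neighbours (e.g. an NVLink pair),
    and the two halves of the tree map to the two PCIe segments.
    """
    nodes = [GpuNode(gpu_id=g) for g in gpu_ids]
    while len(nodes) > 1:
        nodes = [GpuNode(left=a, right=b) for a, b in zip(nodes[0::2], nodes[1::2])]
    return nodes[0]
```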
The algorithm preferably selects neighbouring GPUs in pairs if the requested amount is even. If it is odd, it selects the last GPU independently. If multiple pairs are needed, it preferably chooses neighbouring pairs.
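A minimal sketch of that pairing rule, assuming GPUs 2k and 2k+1 are neighbours (the helper and its greedy strategy are illustrative only, not the PR's actual code):

```python
def pick_neighbour_pairs(free: list[int], n_gpus: int) -> list[int]:
    """Greedy sketch of the pairing rule (hypothetical, simplified).

    An even request is served entirely by neighbouring pairs; an odd request
    gets one extra, independently chosen GPU.
    """
    free_set = set(free)
    chosen: list[int] = []
    pairs, single = divmod(n_gpus, 2)
    for g in sorted(free):
        if pairs == 0:
            break
        if g % 2 == 0 and g in free_set and g + 1 in free_set:
            chosen += [g, g + 1]        # take a neighbouring pair
            free_set -= {g, g + 1}
            pairs -= 1
    if single and free_set:
        chosen.append(min(free_set))    # the leftover GPU is picked independently
    return chosen
```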
The algorithm assumes that either each half of the GPUs is on the same PCIe segment (assuming two PCIe segments) or that all GPUs are on one PCIe segment, and that GPUs might be paired up through NVLink. These cases apply quite often, and in them this selection method speeds up multi-GPU tasks by taking advantage of the better inter-GPU communication between the chosen GPUs.
If no GPU is already reserved when a job starts, the scheduler randomly picks a traversal direction (left or right). That determines which PCIe segment (or which half, if there is only one segment) the GPUs will be picked from.
If one or more GPUs on a segment are already in use, the scheduler starts from there (setting the traversal direction to that segment) and tries to fill up that segment first if it has enough GPUs left. The outermost available GPUs (or pairs) in the traversal direction are assigned first. If that segment does not have enough room for the request, the other segment is used instead. If the request is larger than a segment, both segments are automatically included in the search for pairs. If neither segment can provide neighbouring GPUs for the request, the scheduler picks leftover GPUs regardless of their segment and pairing, as long as enough GPUs are left in total.
If any of these assumptions are incorrect, the program still works, but it won't improve inter-GPU communication.
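As a rough illustration, here is a minimal sketch of the segment-selection policy described above; the function name, the two-segment encoding, and the counters are hypothetical simplifications, not the PR's actual code:

```python
import random

def choose_segments(seg_used: list[int], seg_free: list[int], n: int) -> list[int]:
    """Sketch of the segment policy for two PCIe segments, indexed 0 and 1.

    seg_used[i] / seg_free[i] count the occupied / free GPUs on segment i.
    """
    if seg_used[0] == 0 and seg_used[1] == 0:
        start = random.choice([0, 1])        # idle cluster: random traversal direction
    else:
        start = 0 if seg_used[0] > 0 else 1  # fill the partially used segment first
    if seg_free[start] >= n:
        return [start]                       # request fits on the preferred segment
    if seg_free[1 - start] >= n:
        return [1 - start]                   # otherwise use the other segment
    return [start, 1 - start]                # spans both: pairs are searched across
                                             # segments, falling back to any leftover GPUs
```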
There's also an option to reserve GPUs for outside applications. Internally these are marked as permanently occupied and won't be assigned by the queue system.
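A tiny sketch of that reservation idea (names hypothetical): reserved GPUs are simply excluded from the assignable set, as if permanently occupied:

```python
RESERVED_FOR_EXTERNAL = {6, 7}              # e.g. read from a config option

def assignable_gpus(all_gpus: set[int], in_use: set[int]) -> set[int]:
    """GPUs the queue may hand out: reserved ones count as permanently occupied."""
    return all_gpus - in_use - RESERVED_FOR_EXTERNAL
```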

Structuring

A lot of scheduling-related code is moved from qserver to scheduler.py. The job classes are now unified and type annotations have been added. I also added some test job scripts.

mgarbade and others added 7 commits January 14, 2025 14:32
The algorithm is added to improve inter-GPU communication and therefore the speed of multi-GPU jobs compared to the previous version.
The algorithm preferably selects neighbouring GPUs in pairs if the requested amount is even. If it is odd, it selects the last GPU independently.
The algorithm assumes that either each half of the GPUs is on the same NUMA node or that all GPUs are on the same NUMA node, and that GPUs might be paired up through NVLink. These cases apply quite often, and in them the algorithm accelerates computation by taking advantage of better inter-GPU communication when using multiple GPUs in one task.
If any of the assumptions are incorrect, the program still works, but it won't improve inter-GPU communication.

Move scheduling-related code in qserver to scheduler.py
Add test jobs
Further refactoring might follow