Improved GPU selection mechanism #8
Open
GPU scheduling
GPUs now have their own binary tree-like data structure, which helps in selecting neighbouring GPUs.
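As a rough illustration of how such a tree might look (a hedged sketch, not the actual scheduler.py code; names like `GPUNode` and `build_tree` are invented for this example): leaves hold single GPUs, and each internal node groups GPUs that share a fast interconnect, pairs at the bottom, segments higher up.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class GPUNode:
    """Node in a hypothetical GPU topology tree.

    Leaves carry a single GPU index; internal nodes group GPUs that
    share a fast interconnect (an NVLink pair or a PCIe segment).
    """
    gpu_id: Optional[int] = None          # set on leaves only
    left: Optional["GPUNode"] = None
    right: Optional["GPUNode"] = None

    def leaves(self) -> List[int]:
        """All GPU indices below this node, left to right."""
        if self.gpu_id is not None:
            return [self.gpu_id]
        out: List[int] = []
        if self.left:
            out.extend(self.left.leaves())
        if self.right:
            out.extend(self.right.leaves())
        return out

def build_tree(gpu_ids: List[int]) -> GPUNode:
    """Build a balanced tree over the GPU indices, so that siblings
    are physical neighbours (pairs low in the tree, segments at the top)."""
    if len(gpu_ids) == 1:
        return GPUNode(gpu_id=gpu_ids[0])
    mid = len(gpu_ids) // 2
    return GPUNode(left=build_tree(gpu_ids[:mid]),
                   right=build_tree(gpu_ids[mid:]))

# Example: 8 GPUs -> two 4-GPU PCIe segments, NVLink pairs at the bottom.
root = build_tree(list(range(8)))
```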
The algorithm preferably selects neighbouring GPUs in pairs if the requested amount is even. If it is odd, the last GPU is selected independently. If multiple pairs are needed, it preferably chooses neighbouring pairs.
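A minimal sketch of this pair-first selection, building on the hypothetical `GPUNode` tree above (illustrative only; the actual selection logic in scheduler.py may differ). Because the tree yields pairs in physical order, taking them in sequence also tends to keep multiple pairs adjacent:

```python
def free_pairs(node: GPUNode, free: set) -> list:
    """Collect sibling (a, b) pairs whose GPUs are both free,
    in left-to-right (physical) order."""
    pairs = []
    if node.left and node.right:
        l, r = node.left, node.right
        if l.gpu_id is not None and r.gpu_id is not None:
            if l.gpu_id in free and r.gpu_id in free:
                pairs.append((l.gpu_id, r.gpu_id))
        else:
            pairs.extend(free_pairs(l, free))
            pairs.extend(free_pairs(r, free))
    return pairs

def select(free: set, tree: GPUNode, n: int) -> list:
    """Pick n GPUs, preferring whole neighbouring pairs; an odd
    remainder (or a shortage of pairs) is served by single free GPUs."""
    chosen: list = []
    for a, b in free_pairs(tree, free):
        if len(chosen) + 2 > n:
            break
        chosen += [a, b]
    if len(chosen) < n:
        spare = [g for g in sorted(free) if g not in chosen]
        chosen += spare[: n - len(chosen)]
    return chosen
```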
The algorithm assumes that either each half of the GPUs sits on its own PCIe segment (assuming 2 PCIe segments) or that all GPUs are on one PCIe segment, and that GPUs might be paired up through NVLink. Both layouts are common. In these cases, this selection method accelerates computation by taking advantage of the better inter-GPU communication between the chosen GPUs when one task uses multiple GPUs.
If no GPU is reserved at the start of a job, the scheduler randomly picks a traversal direction (left or right). That determines which PCIe segment (or, if there is only one segment, which half of it) the GPUs are picked from.
If one or more GPUs on one segment are already in use, the scheduler starts from there (setting the traversal direction towards that segment) and tries to fill up that segment first, provided enough GPUs are left on it. The outermost available GPUs (or pairs) in the traversal direction are assigned first. If that segment does not have enough space left for the request, the other segment is used instead. If the number of requested GPUs is larger than a segment, both segments are automatically included in the search for pairs. If neither segment can provide neighbouring GPUs for the request, leftover GPUs are picked regardless of their segment and pairing, as long as enough GPUs are left in total.
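The segment choice could look roughly like this (again a sketch under the two-segment assumption; `choose_segment` and the reuse of the tree root's children as segments are inventions for this example):

```python
import random

def choose_segment(tree: GPUNode, free: set, n: int) -> GPUNode:
    """Pick the subtree to search for n GPUs. Assumes the root's two
    children stand in for the two PCIe segments."""
    segments = [tree.left, tree.right]
    # Prefer a segment that already has GPUs in use and can still
    # hold the whole request, so one segment fills up first.
    for seg in segments:
        leaves = set(seg.leaves())
        if (leaves - free) and len(leaves & free) >= n:
            return seg
    # Otherwise try the segments in a random traversal direction...
    for seg in random.sample(segments, k=2):
        if len(set(seg.leaves()) & free) >= n:
            return seg
    # ...and fall back to the whole tree (both segments) when the
    # request is larger than a segment or neither segment fits it.
    return tree
```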
If any of these assumptions are incorrect, the program still works; it just won't improve inter-GPU communication.
There is also an option to reserve GPUs for outside applications. Internally these are then marked as permanently occupied and won't be assigned by the queue system.
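One plausible way to model such reservations (a hypothetical `Reservations` class, not the actual implementation):

```python
EXTERNAL = object()  # sentinel marking a GPU as reserved externally

class Reservations:
    """Minimal sketch of reserving GPUs for outside applications."""

    def __init__(self, num_gpus: int) -> None:
        self.owner = {g: None for g in range(num_gpus)}

    def reserve_external(self, gpu_id: int) -> None:
        """Mark a GPU as permanently occupied; the queue skips it."""
        self.owner[gpu_id] = EXTERNAL

    def free_gpus(self) -> set:
        """GPUs the scheduler may still hand out."""
        return {g for g, o in self.owner.items() if o is None}
```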
Structuring
A lot of scheduling-related code has been moved from qserver to scheduler.py. The job classes are now unified, and type annotations have been added. I also added some test job scripts.