Consider a variable of type VariableCellArrayReal whose second dimension is large (for example 38400).
When this variable is synchronized between GPUs, the messages are packed and unpacked on the GPU using the Accelerator API:
    void _copyFrom(const RunQueue* queue, SmallSpan<const Int32> indexes, ...
When the number of items (nb_index) is low and the second dimension (sub_size) is very large, the _copyFrom and _copyTo methods are expensive, because parallelizing over nb_index alone does not expose enough parallelism to keep the GPU busy.
Would it be possible to parallelize over both nb_index and sub_size, for example with a RUNCOMMAND_LOOP2?
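
To make the request concrete, here is a minimal plain C++ sketch of the iteration space in question. It is not Arcane code: `packRows2D`, the `std::vector` types and the row-major layout are assumptions made for illustration only. It runs sequentially on the CPU and only shows that every (i, j) pair of the nb_index x sub_size space is independent, so each pair could in principle be handled by its own GPU thread.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not part of Arcane): packs the rows selected by
// `indexes` from a row-major [nb_item x sub_size] array into `buffer`.
// The two nested loops form an (nb_index x sub_size) iteration space in
// which every (i, j) pair writes a distinct buffer entry, so a 2D parallel
// loop could assign one pair per thread instead of one row per thread.
void packRows2D(const std::vector<std::int32_t>& indexes,
                const std::vector<double>& values,
                std::size_t sub_size,
                std::vector<double>& buffer)
{
  const std::size_t nb_index = indexes.size();
  buffer.resize(nb_index * sub_size);
  for (std::size_t i = 0; i < nb_index; ++i) {
    // Source row for the i-th packed entry.
    const std::size_t row = static_cast<std::size_t>(indexes[i]);
    for (std::size_t j = 0; j < sub_size; ++j)
      buffer[i * sub_size + j] = values[row * sub_size + j];
  }
}
```

With nb_index = 4 and sub_size = 38400, a loop over nb_index alone offers 4 independent work items, while the combined space offers 4 x 38400 = 153600.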