Scheduled Halo Exchange #980
base: main
Conversation
cscs-ci run default
cscs-ci run extra
class ExchangeResult(Protocol):
-    def wait(self) -> None: ...
+    def wait(self, stream: Any | type[NoStream] | None = NoStream) -> None:
In case you are interested in early feedback: what does `None` mean?
Yes, I am.
It means the default stream, inspired by `nullptr` in C++.
To me it feels more intuitive for `None` to mean wait (i.e. the current `NoStream`), and then use `Stream.null` for the default stream, see https://docs.cupy.dev/en/stable/reference/generated/cupy.cuda.Stream.html
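For reference, a minimal sketch of what `cupy.cuda.Stream.null` refers to, assuming CuPy and a CUDA device are available; this only illustrates the CuPy API linked above, not the `ExchangeResult` integration:

```python
import cupy as cp

# cupy.cuda.Stream.null is CuPy's handle to CUDA's legacy default stream.
default_stream = cp.cuda.Stream.null

# Work submitted without an explicit stream runs on this default stream.
x = cp.arange(10) * 2.0

# Blocks the host until all work queued on the default stream has finished.
default_stream.synchronize()

# .ptr exposes the raw stream handle (0 for the legacy default stream).
print(default_stream.ptr)
```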
I agree that `None` for the default stream is not really a good idea and is more motivated by C++.
I will change that in GHEX and also implement the protocol; thanks for the suggestions.
class ExchangeResult(Protocol):
-    def wait(self) -> None: ...
+    def wait(self, stream: Any | type[NoStream] | None = NoStream) -> None:
i.e.
-    def wait(self, stream: Any | type[NoStream] | None = NoStream) -> None:
+    def wait(self, stream: cuda.Stream | None = cuda.Stream.null) -> None:
And we have to figure out the right abstractions for the non-CUDA case...
I used `Any` more as a "please help me", so my question may sound naïve, but by `cuda.Stream` do you mean CuPy?
Should I import CuPy directly, or should I create a protocol somewhere?
@havogt
After some thinking I would do the following:
- Changing the meaning of `None` to "do not use the scheduling implementation"; I agree this is the better solution.
- Creating two protocols for the stream: one following the Nvidia protocol and one following the CuPy one for getting the CUDA stream.
- Creating a singleton object to represent the "default stream", i.e. turning `NoStream` into `DefaultStream`.

The good thing is that nothing GPU related needs to be present, so no strange import errors.
Furthermore, the user can use both CUDA streams and CuPy streams.
What do you think?
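For illustration only, a minimal sketch of what this proposal could look like. The names `CupyLikeStream`, `CudaStreamProtocolLike`, and `DefaultStream` come from this PR, but the member names (`ptr`, `__cuda_stream__`) are assumptions based on CuPy's and CUDA Python's conventions, not the actual GHEX/ICON4Py code:

```python
from typing import Any, Protocol, runtime_checkable


@runtime_checkable
class CupyLikeStream(Protocol):
    """A stream exposing its raw handle the way CuPy streams do (assumed: via `.ptr`)."""

    @property
    def ptr(self) -> int: ...


@runtime_checkable
class CudaStreamProtocolLike(Protocol):
    """A stream following the CUDA Python stream protocol (assumed: `__cuda_stream__`)."""

    def __cuda_stream__(self) -> Any: ...


class DefaultStream:
    """Singleton sentinel meaning 'synchronize with the default stream'."""

    _instance: "DefaultStream | None" = None

    def __new__(cls) -> "DefaultStream":
        if cls._instance is None:
            cls._instance = super().__new__(cls)
        return cls._instance
```

Because neither protocol imports CuPy or CUDA, the module would stay importable in a pure CPU setup, matching the "no strange import errors" goal above.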
I have now implemented my "interpretation of your suggestions".
However, for the default stream I do not use `cuda.Stream.null`, because this would mean that we have to import CuPy/CUDA all the time, which I do not like.
But maybe I am missing something.
cscs-ci run default
cscs-ci run dace
cscs-ci run extra
**NOTE:** This commit still follows the old nomenclature, where `None` means the default stream. Most likely this will change such that `None` means "not using the `schedule_*()` functions" and another singleton is used for the default stream.
cscs-ci run default
- There are now two protocols that describe how to extract the underlying address. They are probably at the wrong location.
- `stream=None` no longer means "default stream" but is now equivalent to "do not use the scheduled version".
- To indicate the default stream, the singleton `DefaultStream` is used. The `cupy.cuda.Stream.null` singleton was not used, because it would require that `cupy` is present.
- However, using the default stream is still the default behaviour.
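A hedged sketch of the `wait()` semantics described in this commit; the method names `blocking_wait` and `scheduled_wait`, and the `ptr` attribute, are hypothetical placeholders, and the real dispatch lives in the GHEX bindings:

```python
from typing import Any


class DefaultStream:
    """Sentinel meaning 'synchronize with the default stream' (see the sketch above)."""


def _raw_stream_handle(stream: Any) -> int:
    """Best-effort extraction of a raw stream handle from a CuPy-like stream (assumed `.ptr`)."""
    if hasattr(stream, "ptr"):
        return int(stream.ptr)
    raise TypeError(f"cannot extract a stream handle from {type(stream)!r}")


def wait(result: Any, stream: Any = DefaultStream()) -> None:
    """Illustrative dispatch for the three cases described in this commit."""
    if stream is None:
        result.blocking_wait()                  # hypothetical: block until unpacking is done
    elif isinstance(stream, DefaultStream):
        result.scheduled_wait(stream_handle=0)  # hypothetical: 0 is CUDA's legacy default stream
    else:
        result.scheduled_wait(stream_handle=_raw_stream_handle(stream))
```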
cscs-ci run default
cscs-ci run dace
cscs-ci run extra
cscs-ci run default
cscs-ci run dace
cscs-ci run extra
There is a failure; see this test PR: #982
cscs-ci run default
cscs-ci run dace
cscs-ci run default
cscs-ci run default
cscs-ci run dace
cscs-ci run extra
cscs-ci run default
cscs-ci run dace
cscs-ci run extra
cscs-ci run default
cscs-ci run dace
cscs-ci run extra
**Mandatory Tests**

Please make sure you run these tests via comment before you merge!
- `cscs-ci run default`

**Optional Tests**

To run benchmarks you can use:

To run tests and benchmarks with the DaCe backend you can use:
- `cscs-ci run dace`

To run test levels ignored by the default test suite (mostly simple datatests for static fields computations) you can use:
- `cscs-ci run extra`

For more detailed information please look at CI in the EXCLAIM universe.
cscs-ci run extra
cscs-ci run default
cscs-ci run dace
cscs-ci run default
cscs-ci run dace
cscs-ci run extra
This PR introduces the "scheduled" exchange feature from GHEX into ICON4Py.
Scheduled exchanges allow calling the exchange function (the function responsible for packing the data and sending it) before the work has concluded.
The packing will only start once all work that was previously submitted to a stream has finished.
In addition, the scheduled wait function (the function responsible for unpacking the data) does not wait until all data has been unpacked; instead it starts the unpacking and then synchronizes with the provided stream.
The default behaviour is to use the scheduled exchange function and synchronize with the default stream.
The feature extends ICON4Py's decomposition concepts and adds the keyword-only argument `stream` to the `exchange()` and `wait()` functions. This is the stream with which the exchange/wait should synchronize.
To deactivate the feature, i.e. send immediately and wait until unpacking has been done, one can pass `None`.
The PR introduces the `CupyLikeStream` and `CudaStreamProtocolLike` protocols, which allow extracting a C GPU stream from a Python object. In addition, the `DefaultStream` singleton is added, which indicates that the default stream should be used. The reason for this is to avoid importing CuPy/CUDA directly, so that ICON4Py also works in a pure CPU mode.
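A hedged usage sketch of the API described above; `exchange_runtime`, `dim`, and the field arguments are hypothetical placeholders for ICON4Py's actual decomposition objects, and only the `stream` keyword-only argument and the `None`/default-stream semantics are taken from this PR:

```python
from typing import Any

import cupy as cp  # only needed when GPU fields are exchanged


def halo_update(exchange_runtime: Any, dim: Any, *fields: Any) -> None:
    """Hypothetical helper illustrating the `stream` keyword-only argument."""
    comm_stream = cp.cuda.Stream(non_blocking=True)

    # Scheduled exchange: packing starts only after the work already queued on
    # `comm_stream` has finished.
    handle = exchange_runtime.exchange(dim, *fields, stream=comm_stream)

    # ... more kernels could be launched on other streams here ...

    # Scheduled wait: enqueues the unpacking and synchronizes with `comm_stream`.
    handle.wait(stream=comm_stream)

    # Opting out of scheduling instead: send immediately and block until the
    # unpacking has completed.
    # handle = exchange_runtime.exchange(dim, *fields, stream=None)
    # handle.wait(stream=None)
```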
Note:
If only CPU memory is exchanged, the behaviour is the same as before, i.e. `exchange()` starts sending immediately and `wait()` only returns after everything has been unpacked.
If CPU and GPU memory are exchanged, the behaviour is a kind of "intersection" between the two.
It is fine, but generally not recommended.
DONE:
TODO: