Modified sycl reduction algorithm #56
Conversation
uphoffc left a comment
    ntload(&buffer[id]);
Just use a regular load unless you have evidence that the specific "ntload" boosts performance across various platforms.
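For reference, a minimal sketch of the plain-load variant, reusing the names from the diff hunk below; this is only an illustration, not necessarily the final code:

```cpp
// plain (cached) load instead of the non-temporal ntload:
auto value = (id < size) ? static_cast<AccT>(buffer[id]) : DefaultValue;
```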
    idx.barrier(sycl::access::fence_space::local_space);
Not necessary, can be removed.
davschneller left a comment
Thanks for updating it like this. 10e7 plasticity cells might still occur at least in GTS. :D
So, ACPP (presumably) doesn't have a wrapper for CUDA yet in these min/max cases? Does it work with sm_75? (or with integers?)
Running the tests themselves is, alas, not possible right now; I've set up a self-hosted runner for that, which should manage running some basic NVIDIA/Intel tests, but it doesn't quite work yet (if it does, I'll also run the respective SeisSol tests).
algorithms/sycl/Reduction.cpp (Outdated)

    idx.barrier(sycl::access::fence_space::local_space);
    value = sycl::reduce_over_group(subgroup, value, operation);
    auto reducedValue = sycl::reduce_over_group(idx.get_group(), threadAcc, operation);
const
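Presumably something like this, based on the diff line above:

```cpp
const auto reducedValue = sycl::reduce_over_group(idx.get_group(), threadAcc, operation);
```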
algorithms/sycl/Reduction.cpp (Outdated)

    for (std::size_t i = currentWarp; i < warpsNeeded; i += warpCount) {
      const auto id = threadInWarp + i * sgSize;
      auto value = (id < size) ? static_cast<AccT>(ntload(&buffer[id])) : DefaultValue;
    size_t numWorkGroups = (size + (workGroupSize * itemsPerWorkItem) - 1)
const
    template <ReductionType Type, typename AccT, typename VecT, typename OpT> void launchReduction(AccT* result, const VecT *buffer, size_t size, OpT operation, bool overrideResult, void* streamPtr) {
      constexpr auto DefaultValue = neutral<Type, AccT>();
      constexpr size_t workGroupSize = 256;
Why 256 and not 1024? (Or are you already at the bandwidth limit like that?)
Yes, and for higher numbers, there is no real improvement on PVC at least.
algorithms/sycl/Reduction.cpp (Outdated)

    // Explicity pass MO to load
    // AccT expected = atomic.load(MO);
    while(true){
Turn `while (true) { if (condition) break; ... }` into `while (!condition) { ... }`.
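A minimal, self-contained sketch of that restructure, assuming a sycl::atomic_ref-based CAS loop; AccT, operation, and MO are placeholders mirroring the diff, not the exact PR code:

```cpp
#include <sycl/sycl.hpp>

// Hedged sketch: generic atomic update via a CAS loop, written as
// while (!condition) instead of while (true) { if (...) break; }.
template <typename AccT, typename OpT>
void atomicApply(AccT* result, AccT value, OpT operation) {
  constexpr auto MO = sycl::memory_order::relaxed;
  sycl::atomic_ref<AccT, sycl::memory_order::relaxed,
                   sycl::memory_scope::device,
                   sycl::access::address_space::global_space>
      atomic(*result);

  AccT expected = atomic.load(MO);
  AccT desired = operation(expected, value);
  // On failure, compare_exchange_strong writes the current stored value into
  // `expected`, so the loop condition doubles as the retry check.
  while (!atomic.compare_exchange_strong(expected, desired, MO)) {
    desired = operation(expected, value);
  }
}
```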
algorithms/sycl/Reduction.cpp (Outdated)

    // Using our own CAS loop
    // AccT expected = atomic.load(MO);
    while(true){
Same as in L48
The improvement numbers I mentioned are for vector sizes of 1e4 and 1e5. I did not talk about 1e8 because that sounds unrealistic. For vector sizes of 1e8 and 1e9 it shows roughly a 130x improvement, but that is in most cases not really applicable in production scenarios for us; these days we mostly do GTS only for scaling studies.
EDIT: I tried it on my workstation with
and it seems that
It seems to be present throughout the project. I will check these later with any relevant benchmarks, and if we decide to remove them, I will remove them across the project.
I seem to have missed this barrier. Removed it, thank you!
I know about the GTS thing. :) Ok; if that already helps with 1e4/1e5 (as... almost to be expected), then we should also adjust the CUDA/HIP kernels; probably the
Also, the failing test can be ignored; that's due to something running out of space on the GHA CI side. Not sure if we can really do anything about it at this point.
@davschneller, just FYI, I tried the acpp installation locally again, and with the updated acpp,
Hmm... since probably not many people use the SYCL variant for NVIDIA hardware anyway, I think we can make the switch.
I think my message was unclear. What I meant to say was that we need to update the acpp to something more recent. The recent version of it from their
Review is taken care of, and Carsten's comments are implemented.
Yes, recent versions of AdaptiveCpp have cmpxchg emulation when native fetch_min/max is not available.
Modified the SYCL reduction algorithm based on @uphoffc's recommendation. I tested the fetch_add implementation and benchmarked it against the current version on the PVC machine; the average runtime is around 1.5x-4x faster for the SYCL reduction alone on the PVC machines. I did not test it with SeisSol, and since reduction is only a small component of SeisSol, I do not anticipate any significant speed-up in SeisSol runs from this.
One problem was the fetch_max() and fetch_min() methods with sm60, which throw some register-related errors that I do not fully understand. I attempted to implement a manual CAS loop to work around this (a sketch of that kind of fallback is included at the end); if anyone has ideas on how to get past this, please let me know. The currently failing CI test is due to some Docker issue, which I am not clear about.
Can we also add the tests to the CI? There is a folder called tests whose tests are not being called in CI as of now, if I understand correctly.
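For illustration, here is a minimal sketch of the kind of CAS-based fetch_max fallback mentioned above, assuming a global-memory result, relaxed ordering, and SYCL 2020's sycl::atomic_ref; it is not the exact code from this PR:

```cpp
#include <sycl/sycl.hpp>

// Hedged sketch: emulate fetch_max via compare_exchange_strong for cases where
// a native atomic max is unavailable (e.g. floating point on older CUDA archs).
template <typename AccT>
void atomicMaxFallback(AccT* result, AccT value) {
  sycl::atomic_ref<AccT, sycl::memory_order::relaxed,
                   sycl::memory_scope::device,
                   sycl::access::address_space::global_space>
      atomic(*result);

  AccT expected = atomic.load();
  // Retry until the stored value is already >= value or our CAS wins;
  // a failed CAS reloads the current value into `expected`.
  while (expected < value &&
         !atomic.compare_exchange_strong(expected, value)) {
  }
}
```

The same pattern works for fetch_min by flipping the comparison.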