Modified sycl reduction algorithm #56
Conversation
uphoffc left a comment
    ntload(&buffer[id]);
Just use a regular load unless you have evidence that the specific "ntload" boosts performance across various platforms.
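For reference, a minimal sketch of the plain-load variant, reusing the names from the diff hunk below; this is only an illustration, not necessarily the final code:

```cpp
// plain (cached) load instead of the non-temporal ntload:
auto value = (id < size) ? static_cast<AccT>(buffer[id]) : DefaultValue;
```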
    idx.barrier(sycl::access::fence_space::local_space);
Not necessary, can be removed.
davschneller left a comment
Thanks for updating it like this. 10e7 plasticity cells might still occur at least in GTS. :D
So, ACPP (presumably) doesn't have a wrapper for CUDA yet in these min/max cases? Does it work with sm_75? (or with integers?)
Running the tests themselves is, alas, not possible right now; I've set up a self-hosted runner for that, which should manage running some basic NVIDIA/Intel tests, but it doesn't quite work yet (if it does, I'll also run the respective SeisSol tests).
algorithms/sycl/Reduction.cpp (Outdated)

    idx.barrier(sycl::access::fence_space::local_space);
    value = sycl::reduce_over_group(subgroup, value, operation);
    auto reducedValue = sycl::reduce_over_group(idx.get_group(), threadAcc, operation);
const
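Presumably something like this, based on the diff line above:

```cpp
const auto reducedValue = sycl::reduce_over_group(idx.get_group(), threadAcc, operation);
```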
algorithms/sycl/Reduction.cpp (Outdated)

    for (std::size_t i = currentWarp; i < warpsNeeded; i += warpCount) {
      const auto id = threadInWarp + i * sgSize;
      auto value = (id < size) ? static_cast<AccT>(ntload(&buffer[id])) : DefaultValue;
    size_t numWorkGroups = (size + (workGroupSize * itemsPerWorkItem) - 1)
const
    template <ReductionType Type, typename AccT, typename VecT, typename OpT> void launchReduction(AccT* result, const VecT *buffer, size_t size, OpT operation, bool overrideResult, void* streamPtr) {
      constexpr auto DefaultValue = neutral<Type, AccT>();
      constexpr size_t workGroupSize = 256;
Why 256 and not 1024? (Or are you already at the bandwidth limit like that?)
Yes, and for higher numbers, there is no real improvement on PVC at least.
algorithms/sycl/Reduction.cpp (Outdated)

    // Explicity pass MO to load
    // AccT expected = atomic.load(MO);
    while(true){
Turn `while (true) { if (condition) break; ... }` into `while (!condition) { ... }`.
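A minimal, self-contained sketch of that restructure, assuming a sycl::atomic_ref-based CAS loop; AccT, operation, and MO are placeholders mirroring the diff, not the exact PR code:

```cpp
#include <sycl/sycl.hpp>

// Hedged sketch: generic atomic update via a CAS loop, written as
// while (!condition) instead of while (true) { if (...) break; }.
template <typename AccT, typename OpT>
void atomicApply(AccT* result, AccT value, OpT operation) {
  constexpr auto MO = sycl::memory_order::relaxed;
  sycl::atomic_ref<AccT, sycl::memory_order::relaxed,
                   sycl::memory_scope::device,
                   sycl::access::address_space::global_space>
      atomic(*result);

  AccT expected = atomic.load(MO);
  AccT desired = operation(expected, value);
  // On failure, compare_exchange_strong writes the current stored value into
  // `expected`, so the loop condition doubles as the retry check.
  while (!atomic.compare_exchange_strong(expected, desired, MO)) {
    desired = operation(expected, value);
  }
}
```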
algorithms/sycl/Reduction.cpp (Outdated)

    // Using our own CAS loop
    // AccT expected = atomic.load(MO);
    while(true){
Same as in L48
The improvement numbers I mentioned are for vector sizes of 1e4 and 1e5. I did not talk about 1e8 because that sounds unrealistic. For vector sizes of 1e8 and 1e9 it shows roughly a 130x improvement, but that is in most cases not really applicable in production scenarios for us; these days we mostly do GTS only for scaling studies.
EDIT: I tried it on my workstation with
and it seems that
It seems to be present throughout the project. I will check these later with any relevant benchmarks, and if we decide to remove them, I will remove them across the project.
I seem to have missed this barrier. Removed it, thank you!
I know about the GTS thing. :) Ok; if that already helps with 1e4/1e5 (as... almost to be expected), then we should also adjust the CUDA/HIP kernels; probably the
Also, the failing test can be ignored; that's due to something running out of space on the GHA CI side. Not sure if we can really do anything about it at this point.
@davschneller, just FYI, I tried the acpp installation locally again, and with the updated acpp,
Hmm... since probably not many people use the SYCL variant for NVIDIA hardware anyway, I think we can make the switch.
I think my message was unclear. What I meant to say was that we need to update the acpp to something more recent. The recent version of it from their
Review is taken care of, and Carsten's comments are implemented.
Yes, recent versions of AdaptiveCpp have cmpxchg emulation when native fetch_min/max is not available.
Modified the SYCL reduction algorithm based on @uphoffc's recommendation. I tested the fetch_add implementation and benchmarked it against the current version on the PVC machine; the average runtime is around 1.5x-4x faster for the SYCL reduction alone on the PVC machines. I did not test it with SeisSol, and since reduction is only a small component of SeisSol, I do not anticipate any significant speed-up in SeisSol runs from this.
One problem was the fetch_max() and fetch_min() methods with sm60, which throw some register-related errors that I do not fully understand. I attempted to implement a manual CAS loop to work around this (a sketch of that kind of fallback is included at the end); if anyone has ideas on how to get past this, please let me know. The currently failing CI test is due to some Docker issue, which I am not clear about.
Can we also add the tests to the CI? There is a folder called tests whose tests are not being called in CI as of now, if I understand correctly.
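For illustration, here is a minimal sketch of the kind of CAS-based fetch_max fallback mentioned above, assuming a global-memory result, relaxed ordering, and SYCL 2020's sycl::atomic_ref; it is not the exact code from this PR:

```cpp
#include <sycl/sycl.hpp>

// Hedged sketch: emulate fetch_max via compare_exchange_strong for cases where
// a native atomic max is unavailable (e.g. floating point on older CUDA archs).
template <typename AccT>
void atomicMaxFallback(AccT* result, AccT value) {
  sycl::atomic_ref<AccT, sycl::memory_order::relaxed,
                   sycl::memory_scope::device,
                   sycl::access::address_space::global_space>
      atomic(*result);

  AccT expected = atomic.load();
  // Retry until the stored value is already >= value or our CAS wins;
  // a failed CAS reloads the current value into `expected`.
  while (expected < value &&
         !atomic.compare_exchange_strong(expected, value)) {
  }
}
```

The same pattern works for fetch_min by flipping the comparison.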