Replace manual 16x loop unrolling in _TallyMutationReferences_FAST with compiler-directed unrolling #596

andrewkern · 2025-12-18T03:59:53Z

Remove 30+ lines of hand-unrolled code in _TallyMutationReferences_FAST
GCC/MinGW: __attribute__((optimize("unroll-loops"))) on function
Clang: #pragma clang loop unroll(enable) on loop
Compilers auto-detect optimal unroll factor (~8x)
Targeted approach only affects this function, not all loops
Verified on Linux GCC, MinGW64, and Apple Clang; no performance regression

Replace the 16x manually unrolled loop in _TallyMutationReferences_FAST_FromMutationRunUsage() with compiler-directed unrolling via EIDOS_PRAGMA_UNROLL_16.

The new implementation:
- Adds EIDOS_PRAGMA_UNROLL_16 macro to eidos_globals.h (GCC 8+/Clang)
- Uses __restrict__ qualifiers to indicate no pointer aliasing
- Uses index-based loop with explicit count for clearer loop bounds
- Reduces code from 30+ lines to 11 lines

Verified that all three target compilers generate 16x unrolled assembly:
- Linux GCC 11 (x86_64)
- MinGW GCC 10 (Windows x86_64 cross-compile)
- Apple Clang 17 (ARM64)

Benchmarks show equivalent performance to manual unrolling (~1.1s for 500 generations with 10K individuals and ~22K mutations).
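For readers who have not seen this pattern, here is a minimal sketch of what a macro-plus-__restrict__ version of the loop could look like. It is not the PR's actual code: the macro name SKETCH_UNROLL_16, the function name sketch_tally, and the element types are assumptions made for illustration, since the thread does not show the definition of EIDOS_PRAGMA_UNROLL_16 or the real types. It also assumes a GCC or Clang toolchain, because __restrict__ is a compiler extension.

```cpp
#include <cstdint>

// Hypothetical macro (the real EIDOS_PRAGMA_UNROLL_16 definition is not shown in the thread).
// _Pragma lets a #pragma be produced by a macro expansion.
#if defined(__clang__)
  #define SKETCH_UNROLL_16 _Pragma("clang loop unroll_count(16)")
#elif defined(__GNUC__) && (__GNUC__ >= 8)
  #define SKETCH_UNROLL_16 _Pragma("GCC unroll 16")
#else
  #define SKETCH_UNROLL_16   // unknown compiler: expand to nothing
#endif

// Adds use_count to refcounts[indices[i]] for each of the `count` entries.
// __restrict__ promises the compiler that the two arrays do not overlap.
// The element types here are assumed, not taken from the source.
inline void sketch_tally(int32_t *__restrict__ refcounts,
                         const uint32_t *__restrict__ indices,
                         int32_t count, int32_t use_count)
{
    SKETCH_UNROLL_16
    for (int32_t i = 0; i < count; ++i)
        refcounts[indices[i]] += use_count;
}
```

The pragma is only a request; whether the compiler honors it still has to be checked in the generated assembly, as the commit message describes.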

…zation
- Rename EIDOS_PRAGMA_UNROLL_16 to EIDOS_UNROLL_AUTO
- Clang: use #pragma clang loop unroll(enable) to auto-choose factor
- GCC/MinGW: fall back to explicit #pragma GCC unroll 16
- Clang auto-chooses 8x; GCC/MinGW use 16x (both valid for this memory-bound loop)

- Add -funroll-loops for GCC/MinGW in CMakeLists.txt
- Add -mllvm -unroll-runtime for Clang in CMakeLists.txt
- Remove EIDOS_UNROLL_AUTO macro from eidos_globals.h
- Remove pragma from _TallyMutationReferences_FAST loop

Both compilers now auto-detect optimal unroll factor (~8x). Benchmarks confirm no performance regression.

bhaller · 2025-12-18T14:11:20Z

Hi @andrewkern! We don't want this to be a global compile setting. Unrolling loops is a mixed bag; sometimes it produces significant speedups, but sometimes it makes the code actually slower (primarily because the code size is larger, which can cause cache thrash, I think), and it balloons the size of the executable, which can be a problem. Usually the compiler makes good decisions about loop unrolling without needing to be instructed, and usually you don't want to interfere with that. But in this specific place, some compilers seem to make a poor choice, and thus the hand-unroll of the loop produced a speedup. So I think what we want is a directed fix centered on that one loop that used to be hand-unrolled; I think this compile flag should be turned on/off around that specific method. (Hopefully that works.)

It's an interesting question whether this is true for anywhere else in the code. It would be interesting, for example, to see whether this unroll flag would make your SIMD loops faster; you might try running your SIMD benchmarks with the flag on or off for that whole file, and see whether it makes a difference. But for this PR let's stay focused on _TallyMutationReferences_FAST_FromMutationRunUsage(). The __restrict__ changes you made might also be helpful, I'd suggest keeping those. Sound good?
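As an illustration of what "hand-unrolled" means in this discussion, the sketch below unrolls the tally loop 4x (the code being replaced was unrolled 16x; 4x keeps the example short). The function name, parameter names, and element types are assumptions for illustration and are not copied from population.cpp.

```cpp
#include <cstdint>

// Hand-unrolled tally loop, 4x for brevity. All names and types are hypothetical.
inline void sketch_tally_unrolled4(int32_t *refcounts,
                                   const uint32_t *indices,
                                   const uint32_t *indices_end,
                                   int32_t use_count)
{
    // Main body: four updates per iteration, so the loop test
    // and pointer increment run a quarter as often.
    while (indices_end - indices >= 4)
    {
        refcounts[indices[0]] += use_count;
        refcounts[indices[1]] += use_count;
        refcounts[indices[2]] += use_count;
        refcounts[indices[3]] += use_count;
        indices += 4;
    }

    // Remainder: handle the 0 to 3 leftover entries one at a time.
    while (indices != indices_end)
        refcounts[*indices++] += use_count;
}
```

The cost of this style is the one raised above: more source lines and more machine code per loop, which is why the thread prefers to apply it to a single hot loop rather than everywhere.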

andrewkern · 2025-12-18T14:51:50Z

okay so for this loop it sounds like we might go back to the #ifdef pragma switch I had iterated on yesterday.

So the code pattern would be like e.g.,

#if defined(__clang__)
  #pragma clang loop unroll(enable) //Clang auto selects
#elif defined(__GNUC__)
  #pragma GCC unroll 16  // GCC we set unroll factor 16
#endif
for (int32_t i = 0; i < mutrun_count; ++i)
    refcounts[indices[i]] += use_count;

alternatively I believe we could apply unroll-loops to this whole function alone in GCC using the following syntax, and use the pragma above for clang (clang doesn't have the __attribute__((optimize("unroll-loops"))) thing...)

#if defined(__GNUC__) && !defined(__clang__)
__attribute__((optimize("unroll-loops")))
#endif
void Population::_TallyMutationReferences_FAST_FromMutationRunUsage(bool p_clock_for_mutrun_experiments)
{
    // ... earlier code ...

    #if defined(__clang__)
      #pragma clang loop unroll(enable)
    #endif
    for (int32_t i = 0; i < mutrun_count; ++i)
        refcounts[indices[i]] += use_count;

    // ...
}

which of these appeal more to you @bhaller ?

bhaller · 2025-12-18T16:28:26Z

I like the second one more, because I don't want to specify the number of iterations to unroll. But I'm surprised that Clang doesn't support that syntax that GCC supports; I thought that one of Clang's design stakes was that they support all the same options that GCC supports, so that they're a drop-in replacement for GCC? But maybe that doesn't extend to every dusty corner like this. :-> So sure, if solution 2 is the cleanest syntax we can get for unrolling in both compilers, with no specified number of unrolls, then let's do that.

- Remove global -funroll-loops/-mllvm flags from CMakeLists.txt
- Add __attribute__((optimize("unroll-loops"))) for GCC/MinGW
- Add #pragma clang loop unroll(enable) for Clang
- Only affects _TallyMutationReferences_FAST, not all loops

Both compilers auto-detect optimal unroll factor (~8x). No performance regression.

bhaller · 2025-12-18T18:28:52Z

core/population.cpp

 // mutation tallies given that choice.
+#if defined(__GNUC__) && !defined(__clang__)
+__attribute__((optimize("unroll-loops")))
+#endif


This needs to get changed back after the function; it shouldn't remain switched on for the rest of the file

this is a function attribute - it only affects the single function it decorates, not subsequent functions in the file

wow, I didn't know there was such a thing! :->
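A small sketch of the scoping described in this thread: the GCC optimize attribute is attached to one function declaration, so a neighboring function in the same file is compiled with the ordinary command-line options. The function names are hypothetical. Note also that GCC's own documentation describes the optimize attribute as intended mainly for debugging, so its effect is worth confirming in the generated assembly.

```cpp
#include <cstdint>

#if defined(__GNUC__) && !defined(__clang__)
__attribute__((optimize("unroll-loops")))   // decorates sketch_sum_hot() only
#endif
int64_t sketch_sum_hot(const int32_t *values, int32_t count)
{
    int64_t total = 0;
    for (int32_t i = 0; i < count; ++i)
        total += values[i];
    return total;
}

// No attribute here: this function is compiled with whatever
// optimization options were given on the command line.
int64_t sketch_sum_plain(const int32_t *values, int32_t count)
{
    int64_t total = 0;
    for (int32_t i = 0; i < count; ++i)
        total += values[i];
    return total;
}
```

Both functions compute the same result; only the optimization options applied to the first one differ, and only under GCC.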

bhaller · 2025-12-18T18:29:34Z

core/population.cpp

+
+#if defined(__clang__)
+#pragma clang loop unroll(enable)
+#endif


Does this need to get switched off again also, or is it enabled just for the immediately following loop?

nope, this only applies to the immediately following loop - it doesn't persist

bhaller · 2025-12-18T18:31:44Z

core/population.cpp

+#pragma clang loop unroll(enable)
+#endif
+				for (int32_t i = 0; i < mutrun_count; ++i)
+					refcounts[indices[i]] += use_count;


Hmm. So, you've switched the loop to use an index i instead of the original pointer-range loop. The claim made by the C++ STL folks is that pointer-range loops are actually faster (and that therefore C++ for loops using std::iterator are faster than old-style for loops that count from 0 to size-1). So I'd suggest changing the style of this loop back, unless you have performance data indicating that this style is actually superior...?

okay, changed back!
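For reference, the two loop styles being compared in this thread look roughly like the sketch below. The function names and element types are assumptions for illustration. The sketch makes no claim about which form is faster; that has to be measured or checked in the assembly for the compiler at hand.

```cpp
#include <cstdint>

// Index-based form: counts from 0 to count-1.
inline void sketch_tally_indexed(int32_t *refcounts,
                                 const uint32_t *indices,
                                 int32_t count, int32_t use_count)
{
    for (int32_t i = 0; i < count; ++i)
        refcounts[indices[i]] += use_count;
}

// Pointer-range form: walks from a begin pointer to an end pointer,
// in the same spirit as an iterator-based loop.
inline void sketch_tally_pointer_range(int32_t *refcounts,
                                       const uint32_t *indices,
                                       const uint32_t *indices_end,
                                       int32_t use_count)
{
    while (indices != indices_end)
        refcounts[*(indices++)] += use_count;
}
```

Both functions perform the same updates in the same order, so switching between them changes only how the loop bounds are expressed to the compiler.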

- Restore the original pointer-range while loop instead of index-based for loop
- Update comment to clarify that the function attribute (GCC) only affects this function, and the pragma (Clang) only affects the immediately following loop
- Both loop styles produce identical 8x unrolled assembly with GCC; pointer-range preferred for consistency with STL iterator patterns

bhaller · 2025-12-18T20:11:03Z

LGTM. There was a conflict in VERSIONS since something else of yours got merged first; I resolved it. I'll merge this as soon as it's done verifying. Thanks!

bhaller · 2025-12-18T20:12:57Z

Actually maybe you beat me to it with the VERSIONS fix? Not sure what happened there, lol.

andrewkern · 2025-12-18T20:43:34Z

i think it's good now?

bhaller · 2025-12-18T20:51:29Z

Yep! Thanks, merged!

bhaller · 2025-12-23T18:47:48Z

Hi @andrewkern. I think I'm going to back out this change. I'm getting a warning on macOS, "Loop not unrolled: the optimizer was unable to perform the requested transformation". Clang just seems to struggle with unrolling this loop, which is why I had it hand-unrolled in the first place. I was optimistic that your changes would get it to co-operate, but it looks like that is not the case. :-< Sorry – I know you put some work into this!

andrewkern added 3 commits December 17, 2025 19:07

andrewkern marked this pull request as ready for review December 18, 2025 05:16

bhaller reviewed Dec 18, 2025


andrewkern and others added 3 commits December 18, 2025 11:55

update VERSIONS

8198ecc

Merge branch 'master' into TallyMutationReferences_FAST

07686d4

bhaller merged commit befa927 into MesserLab:master Dec 18, 2025
17 checks passed
