Skip to content

Conversation

@andrewkern
Copy link
Collaborator

@andrewkern andrewkern commented Dec 18, 2025

  • Remove 30+ lines of hand-unrolled code in _TallyMutationReferences_FAST
  • GCC/MinGW: attribute((optimize("unroll-loops"))) on function
  • Clang: #pragma clang loop unroll(enable) on loop
  • Compilers auto-detect optimal unroll factor (~8x)
  • Targeted approach only affects this function, not all loops
  • Verified on Linux GCC, MinGW64, and Apple Clang; no performance regression

Replace the 16x manually unrolled loop in _TallyMutationReferences_FAST_FromMutationRunUsage()
with compiler-directed unrolling via EIDOS_PRAGMA_UNROLL_16. The new implementation:

- Adds EIDOS_PRAGMA_UNROLL_16 macro to eidos_globals.h (GCC 8+/Clang)
- Uses __restrict__ qualifiers to indicate no pointer aliasing
- Uses index-based loop with explicit count for clearer loop bounds
- Reduces code from 30+ lines to 11 lines

Verified that all three target compilers generate 16x unrolled assembly:
- Linux GCC 11 (x86_64)
- MinGW GCC 10 (Windows x86_64 cross-compile)
- Apple Clang 17 (ARM64)

Benchmarks show equivalent performance to manual unrolling (~1.1s for 500
generations with 10K individuals and ~22K mutations).
…zation

- Rename EIDOS_PRAGMA_UNROLL_16 to EIDOS_UNROLL_AUTO
- Clang: use #pragma clang loop unroll(enable) to auto-choose factor
- GCC/MinGW: fall back to explicit #pragma GCC unroll 16
- Clang auto-chooses 8x; GCC/MinGW use 16x (both valid for this memory-bound loop)
- Add -funroll-loops for GCC/MinGW in CMakeLists.txt
- Add -mllvm -unroll-runtime for Clang in CMakeLists.txt
- Remove EIDOS_UNROLL_AUTO macro from eidos_globals.h
- Remove pragma from _TallyMutationReferences_FAST loop

Both compilers now auto-detect optimal unroll factor (~8x).
Benchmarks confirm no performance regression.
@andrewkern andrewkern marked this pull request as ready for review December 18, 2025 05:16
@bhaller
Copy link
Contributor

bhaller commented Dec 18, 2025

Hi @andrewkern! We don't want this to be a global compile setting. Unrolling loops is a mixed bag; sometimes it produces significant speeds, but sometimes it makes the code actually slower (primarily because the code size is larger, which can cause cache thrash, I think), and it balloons the size of the executable which can be a problem. Usually the compiler makes good decisions about loop unrolling without needing to be instructed, and usually you don't want to interfere with that. But in this specific place, some compilers seem to make a poor choice, and thus the hand-unroll of the loop produced a speedup. So I think what we want is a directed fix centered on that one loop that used to be hand-unrolled; I think this compile flag should be turned on/off around that specific method. (Hopefully that works.)

It's an interesting question whether this is true for anywhere else in the code. It would be interesting, for example, to see whether this unroll flag would make your SIMD loops faster; you might try running your SIMD benchmarks with the flag on or off for that whole file, and see whether it makes a difference. But for this PR let's stay focused on _TallyMutationReferences_FAST_FromMutationRunUsage(). The __restrict__ changes you made might also be helpful, I'd suggest keeping those. Sound good?

@andrewkern
Copy link
Collaborator Author

okay so for this loop is sounds like we might go back to the if def pragma switch I had iterated on yesterday.

So the code pattern would be like e.g.,

#if defined(__clang__)
  #pragma clang loop unroll(enable) //Clang auto selects
#elif defined(__GNUC__)
  #pragma GCC unroll 16  // GCC we set unroll factor 16
#endif
for (int32_t i = 0; i < mutrun_count; ++i)
    refcounts[indices[i]] += use_count;

alternatively I believe we could apply unroll-loops to this whole function alone in GCC using this following syntax, and use the pragma above for clang (clang doesn't have the __attribute__((optimize("unroll-loops"))) thing...)

#if defined(__GNUC__) && !defined(__clang__)
__attribute__((optimize("unroll-loops")))
#endif
void Population::_TallyMutationReferences_FAST_FromMutationRunUsage(bool p_clock_for_mutrun_experiments)
{
    // ... earlier code ...

    #if defined(__clang__)
      #pragma clang loop unroll(enable)
    #endif
    for (int32_t i = 0; i < mutrun_count; ++i)
        refcounts[indices[i]] += use_count;

    // ...
}

which of these appeal more to you @bhaller ?

@bhaller
Copy link
Contributor

bhaller commented Dec 18, 2025

I like the second one more, because I don't want to specify the number of iterations to unroll. But I'm surprised that Clang doesn't support that syntax that GCC supports; I thought that one of Clang's design stakes was that they support all the same options that GCC supports, so that they're a drop-in replacement for GCC? But maybe that doesn't extend to every dusty corner like this. :-> So sure, if solution 2 is the cleanest syntax we can get for unrolling in both compilers, with no specified number of unrolls, then let's do that.

- Remove global -funroll-loops/-mllvm flags from CMakeLists.txt
- Add __attribute__((optimize("unroll-loops"))) for GCC/MinGW
- Add #pragma clang loop unroll(enable) for Clang
- Only affects _TallyMutationReferences_FAST, not all loops

Both compilers auto-detect optimal unroll factor (~8x).
No performance regression.
// mutation tallies given that choice.
#if defined(__GNUC__) && !defined(__clang__)
__attribute__((optimize("unroll-loops")))
#endif
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs to get changed back after the function; it shouldn't remain switched on for the rest of the file

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is a function attribute - it only affects the single function it decorates, not subsequent functions in the file

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wow, I didn't know there was such a thing! :->


#if defined(__clang__)
#pragma clang loop unroll(enable)
#endif
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this need to get switched off again also, or is it enabled just for the immediately following loop?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nope, this only applies to the immediately following loop - it doesn't persist

#pragma clang loop unroll(enable)
#endif
for (int32_t i = 0; i < mutrun_count; ++i)
refcounts[indices[i]] += use_count;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm. So, you've switched the loop to use an index i instead of the original pointer-range loop. The claim made by the C++ STL folks is that pointer-range loops are actually faster (and that therefore C++ for loops using std::iterator are faster than old-style for loops that count from 0 to size-1). So I'd suggest changing the style of this loop back, unless you have performance data indicating that this style is actually superior...?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

okay, changed back!

andrewkern and others added 3 commits December 18, 2025 11:55
- Restore the original pointer-range while loop instead of index-based for loop
- Update comment to clarify that the function attribute (GCC) only affects this
  function, and the pragma (Clang) only affects the immediately following loop
- Both loop styles produce identical 8x unrolled assembly with GCC; pointer-range
  preferred for consistency with STL iterator patterns
@bhaller
Copy link
Contributor

bhaller commented Dec 18, 2025

LGTM. There was a conflict in VERSIONS since something else of yours got merged first; I resolved it. I'll merge this as soon as it's done verifying. Thanks!

@bhaller
Copy link
Contributor

bhaller commented Dec 18, 2025

Actually maybe you beat me to it with the VERSIONS fix? Not sure what happened there, lol.

@andrewkern
Copy link
Collaborator Author

i think it's good now?

@bhaller bhaller merged commit befa927 into MesserLab:master Dec 18, 2025
17 checks passed
@bhaller
Copy link
Contributor

bhaller commented Dec 18, 2025

Yep! Thanks, merged!

@bhaller
Copy link
Contributor

bhaller commented Dec 23, 2025

Hi @andrewkern. I think I'm going to back out this change. I'm getting a warning on macOS, "Loop not unrolled: the optimizer was unable to perform the requested transformation". Clang just seems to struggle with unrolling this loop, which is why I had it hand-unrolled in the first place. I was optimistic that your changes would get it to co-operate, but it looks like that is not the case. :-< Sorry – I know you put some work into this!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants