-
Notifications
You must be signed in to change notification settings - Fork 36
Replace manual 16x loop unrolling in _TallyMutationReferences_FAST with compiler-directed unrolling #596
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Replace manual 16x loop unrolling in _TallyMutationReferences_FAST with compiler-directed unrolling #596
Conversation
Replace the 16x manually unrolled loop in _TallyMutationReferences_FAST_FromMutationRunUsage() with compiler-directed unrolling via EIDOS_PRAGMA_UNROLL_16. The new implementation: - Adds EIDOS_PRAGMA_UNROLL_16 macro to eidos_globals.h (GCC 8+/Clang) - Uses __restrict__ qualifiers to indicate no pointer aliasing - Uses index-based loop with explicit count for clearer loop bounds - Reduces code from 30+ lines to 11 lines Verified that all three target compilers generate 16x unrolled assembly: - Linux GCC 11 (x86_64) - MinGW GCC 10 (Windows x86_64 cross-compile) - Apple Clang 17 (ARM64) Benchmarks show equivalent performance to manual unrolling (~1.1s for 500 generations with 10K individuals and ~22K mutations).
…zation - Rename EIDOS_PRAGMA_UNROLL_16 to EIDOS_UNROLL_AUTO - Clang: use #pragma clang loop unroll(enable) to auto-choose factor - GCC/MinGW: fall back to explicit #pragma GCC unroll 16 - Clang auto-chooses 8x; GCC/MinGW use 16x (both valid for this memory-bound loop)
- Add -funroll-loops for GCC/MinGW in CMakeLists.txt - Add -mllvm -unroll-runtime for Clang in CMakeLists.txt - Remove EIDOS_UNROLL_AUTO macro from eidos_globals.h - Remove pragma from _TallyMutationReferences_FAST loop Both compilers now auto-detect optimal unroll factor (~8x). Benchmarks confirm no performance regression.
|
Hi @andrewkern! We don't want this to be a global compile setting. Unrolling loops is a mixed bag; sometimes it produces significant speeds, but sometimes it makes the code actually slower (primarily because the code size is larger, which can cause cache thrash, I think), and it balloons the size of the executable which can be a problem. Usually the compiler makes good decisions about loop unrolling without needing to be instructed, and usually you don't want to interfere with that. But in this specific place, some compilers seem to make a poor choice, and thus the hand-unroll of the loop produced a speedup. So I think what we want is a directed fix centered on that one loop that used to be hand-unrolled; I think this compile flag should be turned on/off around that specific method. (Hopefully that works.) It's an interesting question whether this is true for anywhere else in the code. It would be interesting, for example, to see whether this unroll flag would make your SIMD loops faster; you might try running your SIMD benchmarks with the flag on or off for that whole file, and see whether it makes a difference. But for this PR let's stay focused on |
|
okay so for this loop is sounds like we might go back to the So the code pattern would be like e.g., #if defined(__clang__)
#pragma clang loop unroll(enable) //Clang auto selects
#elif defined(__GNUC__)
#pragma GCC unroll 16 // GCC we set unroll factor 16
#endif
for (int32_t i = 0; i < mutrun_count; ++i)
refcounts[indices[i]] += use_count;alternatively I believe we could apply #if defined(__GNUC__) && !defined(__clang__)
__attribute__((optimize("unroll-loops")))
#endif
void Population::_TallyMutationReferences_FAST_FromMutationRunUsage(bool p_clock_for_mutrun_experiments)
{
// ... earlier code ...
#if defined(__clang__)
#pragma clang loop unroll(enable)
#endif
for (int32_t i = 0; i < mutrun_count; ++i)
refcounts[indices[i]] += use_count;
// ...
}which of these appeal more to you @bhaller ? |
|
I like the second one more, because I don't want to specify the number of iterations to unroll. But I'm surprised that Clang doesn't support that syntax that GCC supports; I thought that one of Clang's design stakes was that they support all the same options that GCC supports, so that they're a drop-in replacement for GCC? But maybe that doesn't extend to every dusty corner like this. :-> So sure, if solution 2 is the cleanest syntax we can get for unrolling in both compilers, with no specified number of unrolls, then let's do that. |
- Remove global -funroll-loops/-mllvm flags from CMakeLists.txt
- Add __attribute__((optimize("unroll-loops"))) for GCC/MinGW
- Add #pragma clang loop unroll(enable) for Clang
- Only affects _TallyMutationReferences_FAST, not all loops
Both compilers auto-detect optimal unroll factor (~8x).
No performance regression.
| // mutation tallies given that choice. | ||
| #if defined(__GNUC__) && !defined(__clang__) | ||
| __attribute__((optimize("unroll-loops"))) | ||
| #endif |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This needs to get changed back after the function; it shouldn't remain switched on for the rest of the file
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is a function attribute - it only affects the single function it decorates, not subsequent functions in the file
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
wow, I didn't know there was such a thing! :->
|
|
||
| #if defined(__clang__) | ||
| #pragma clang loop unroll(enable) | ||
| #endif |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this need to get switched off again also, or is it enabled just for the immediately following loop?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nope, this only applies to the immediately following loop - it doesn't persist
core/population.cpp
Outdated
| #pragma clang loop unroll(enable) | ||
| #endif | ||
| for (int32_t i = 0; i < mutrun_count; ++i) | ||
| refcounts[indices[i]] += use_count; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm. So, you've switched the loop to use an index i instead of the original pointer-range loop. The claim made by the C++ STL folks is that pointer-range loops are actually faster (and that therefore C++ for loops using std::iterator are faster than old-style for loops that count from 0 to size-1). So I'd suggest changing the style of this loop back, unless you have performance data indicating that this style is actually superior...?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
okay, changed back!
- Restore the original pointer-range while loop instead of index-based for loop - Update comment to clarify that the function attribute (GCC) only affects this function, and the pragma (Clang) only affects the immediately following loop - Both loop styles produce identical 8x unrolled assembly with GCC; pointer-range preferred for consistency with STL iterator patterns
|
LGTM. There was a conflict in VERSIONS since something else of yours got merged first; I resolved it. I'll merge this as soon as it's done verifying. Thanks! |
|
Actually maybe you beat me to it with the |
|
i think it's good now? |
|
Yep! Thanks, merged! |
|
Hi @andrewkern. I think I'm going to back out this change. I'm getting a warning on macOS, "Loop not unrolled: the optimizer was unable to perform the requested transformation". Clang just seems to struggle with unrolling this loop, which is why I had it hand-unrolled in the first place. I was optimistic that your changes would get it to co-operate, but it looks like that is not the case. :-< Sorry – I know you put some work into this! |
Uh oh!
There was an error while loading. Please reload this page.