Replace manual 16x loop unrolling in _TallyMutationReferences_FAST with compiler-directed unrolling #596
Changes from all commits: 1b34527, 570ca91, b934175, be64598, aa95990, 8198ecc, 07686d4
@@ -7155,6 +7155,9 @@ void Population::TallyMutationReferencesAcrossHaplosomes(const Haplosome * const
 // the mutation run tallying itself, however; instead, the caller can tally mutation runs
 // across whatever set of subpops/haplosomes they wish, and then this method will provide
 // mutation tallies given that choice.
+#if defined(__GNUC__) && !defined(__clang__)
+__attribute__((optimize("unroll-loops")))
+#endif
 void Population::_TallyMutationReferences_FAST_FromMutationRunUsage(bool p_clock_for_mutrun_experiments)
 {
     // first zero out the refcounts in all registered Mutation objects

@@ -7189,36 +7192,21 @@ void Population::_TallyMutationReferences_FAST_FromMutationRunUsage(bool p_clock
             // to put the refcounts for different mutations into different memory blocks
             // according to the thread that manages each mutation.
 
-            const MutationIndex *mutrun_iter = mutrun->begin_pointer_const();
-            const MutationIndex *mutrun_end_iter = mutrun->end_pointer_const();
-
-            // I've gone back and forth on unrolling this loop. This ought to be done
-            // by the compiler, and the best unrolling strategy depends on the platform.
-            // But the compiler doesn't seem to do it, for my macOS system at least, or
-            // doesn't do it well; this increases speed by ~5% here. I'm not sure if
-            // clang is being dumb, or what, but it seems worthwhile.
-            while (mutrun_iter + 16 < mutrun_end_iter)
-            {
-                *(refcount_block_ptr + (*mutrun_iter++)) += use_count;
-                *(refcount_block_ptr + (*mutrun_iter++)) += use_count;
-                *(refcount_block_ptr + (*mutrun_iter++)) += use_count;
-                *(refcount_block_ptr + (*mutrun_iter++)) += use_count;
-                *(refcount_block_ptr + (*mutrun_iter++)) += use_count;
-                *(refcount_block_ptr + (*mutrun_iter++)) += use_count;
-                *(refcount_block_ptr + (*mutrun_iter++)) += use_count;
-                *(refcount_block_ptr + (*mutrun_iter++)) += use_count;
-                *(refcount_block_ptr + (*mutrun_iter++)) += use_count;
-                *(refcount_block_ptr + (*mutrun_iter++)) += use_count;
-                *(refcount_block_ptr + (*mutrun_iter++)) += use_count;
-                *(refcount_block_ptr + (*mutrun_iter++)) += use_count;
-                *(refcount_block_ptr + (*mutrun_iter++)) += use_count;
-                *(refcount_block_ptr + (*mutrun_iter++)) += use_count;
-                *(refcount_block_ptr + (*mutrun_iter++)) += use_count;
-                *(refcount_block_ptr + (*mutrun_iter++)) += use_count;
-            }
+            // Loop unrolling is enabled via the function attribute above (GCC) or
+            // pragma below (Clang). These are both scoped: the attribute applies only
+            // to this function, and the pragma applies only to the immediately
+            // following loop. The __restrict__ qualifiers indicate no pointer
+            // aliasing, helping the compiler optimize. This replaces previous manual
+            // 16x unrolling; the compiler now chooses the optimal unroll factor.
+            const MutationIndex * __restrict__ mutrun_iter = mutrun->begin_pointer_const();
+            const MutationIndex * __restrict__ mutrun_end_iter = mutrun->end_pointer_const();
+            slim_refcount_t * __restrict__ refcounts = refcount_block_ptr;
+
+#if defined(__clang__)
+#pragma clang loop unroll(enable)
+#endif

Contributor: Does this need to get switched off again also, or is it enabled just for the immediately following loop?

Author (Collaborator): nope, this only applies to the immediately following loop - it doesn't persist

             while (mutrun_iter != mutrun_end_iter)
-                *(refcount_block_ptr + (*mutrun_iter++)) += use_count;
+                *(refcounts + (*mutrun_iter++)) += use_count;
         }
     }
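The pattern adopted in the diff can also be shown in a minimal, self-contained form. The sketch below is an illustration only, not SLiM code (the names `tally_refcounts`, `counts`, and `indices` are made up): the GCC-only `optimize("unroll-loops")` attribute requests unrolling for the decorated function, the Clang pragma requests it for the loop that immediately follows, and `__restrict__` promises the compiler that the pointers do not alias.

```cpp
#include <cstddef>
#include <cstdint>

// GCC honors the optimize attribute; Clang does not implement it, hence the guard.
#if defined(__GNUC__) && !defined(__clang__)
__attribute__((optimize("unroll-loops")))
#endif
void tally_refcounts(std::uint32_t * __restrict__ counts,        // per-mutation refcounts (hypothetical)
                     const std::int32_t * __restrict__ indices,  // mutation indices in one run (hypothetical)
                     std::size_t n,
                     std::uint32_t use_count)
{
    // Clang: request unrolling for the immediately following loop only.
#if defined(__clang__)
#pragma clang loop unroll(enable)
#endif
    for (std::size_t i = 0; i < n; ++i)
        counts[indices[i]] += use_count;
}
```

Because neither compiler supports the other's mechanism, each hint is guarded by the corresponding compiler macro, mirroring the guards in the diff above.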
Contributor: This needs to get changed back after the function; it shouldn't remain switched on for the rest of the file

Author (Collaborator): this is a function attribute - it only affects the single function it decorates, not subsequent functions in the file

Contributor: wow, I didn't know there was such a thing! :->
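To make the scoping discussed in these threads concrete, here is a hedged toy example (function and variable names are hypothetical): the attribute decorates only `f`, and the pragma covers only the first loop inside it, so neither hint needs to be switched back off afterward.

```cpp
#include <cstddef>

#if defined(__GNUC__) && !defined(__clang__)
__attribute__((optimize("unroll-loops")))   // applies to f() only, not to g() below
#endif
void f(int * __restrict__ a, std::size_t n)
{
#if defined(__clang__)
#pragma clang loop unroll(enable)           // applies to the next loop only
#endif
    for (std::size_t i = 0; i < n; ++i)     // unrolling requested here
        a[i] += 1;

    for (std::size_t i = 0; i < n; ++i)     // no pragma: the compiler's normal heuristics apply
        a[i] *= 2;
}

void g(int *a, std::size_t n)               // no attribute: compiled with the usual settings
{
    for (std::size_t i = 0; i < n; ++i)
        a[i] -= 1;
}
```

This matches the replies above: the attribute is per-function and the pragma is per-loop.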