
Conversation

@andrewkern
Collaborator

SIMD Vectorization for Eidos Math Functions

Summary

  • Add eidos/eidos_simd.h with SSE4.2/AVX2 implementations for math operations
  • Add -DUSE_SIMD=ON/OFF CMake option (default: ON) with automatic CPU detection
  • Update math functions to use SIMD as the non-OpenMP fallback path

Vectorized functions: sqrt, abs, floor, ceil, trunc, round, sum, product

AVX2 processes 4 doubles/instruction; SSE4.2 processes 2 doubles/instruction.

Benchmark Suite

run_benchmarks.sh [num_runs]

Builds SLiM with and without SIMD, runs benchmarks, reports speedup.

./run_benchmarks.sh      # 3 runs (default)
./run_benchmarks.sh 10   # 10 runs

simd_benchmark.eidos

Tests math functions on 1M element arrays (100 iterations each).

slim_benchmark.slim

Full simulation benchmark: N=5000, 1Mb chromosome, 5000 generations with selection.

Benchmark Results

$ simd_benchmarks/run_benchmarks.sh 
============================================
SIMD Benchmark Runner
============================================
SLiM root: /home/adkern/SLiM
Runs per benchmark: 3

Building with SIMD enabled...
  Done.
Building with SIMD disabled...
  Done.

============================================
Eidos Math Function Benchmarks
============================================

SIMD Build:
  Running Eidos benchmark (SIMD)...
    sqrt():    0.105 sec
    abs():     0.171 sec
    floor():   0.164 sec
    ceil():    0.166 sec
    round():   0.164 sec
    trunc():   0.165 sec
    sum():     0.032 sec
    product(): 0.003 sec (1000 elements, 10000 iters)

Scalar Build:
  Running Eidos benchmark (Scalar)...
    sqrt():    0.108 sec
    abs():     0.166 sec
    floor():   0.231 sec
    ceil():    0.246 sec
    round():   0.473 sec
    trunc():   0.246 sec
    sum():     0.166 sec
    product(): 0.017 sec (1000 elements, 10000 iters)

============================================
SLiM Simulation Benchmark
(N=5000, 5000 generations, selection)
============================================

Running 3 iterations each...

SIMD Build:   12.756s (avg)
Scalar Build: 12.316s (avg)

Speedup: 0.96x

============================================
Benchmark complete
============================================

@andrewkern andrewkern marked this pull request as draft November 28, 2025 03:09
@bhaller
Contributor

bhaller commented Dec 5, 2025

Hi @andrewkern! I've just gotten around to reviewing this PR. Overall it looks great! Can I ask you to do a couple of things?

  • change eidos_simd.h to credit you, not me :->
  • add a README file to simd_benchmarks giving credit to yourself, explaining the purpose of the directory, and mentioning that the files in it are not used in the build of SLiM but are simply provided for internal development use
  • it looks like GitHub's CI failed for several builds above, for reasons that I don't understand but have not dug into, so that needs to be fixed
  • there are various places where the tab indentation of blank lines was changed for no particular reason, making the diffs harder to read/review; it'd be nice if those got cleaned up to make the diffs minimal.

Besides those to-do items, I have a few thoughts as well. First of all, a very nice thing about OpenMP is that the same code path is used for both single-threaded and multi-threaded builds; the #pragmas for OpenMP just modify the behavior of an existing chunk of code to parallelize it. This means that there is only one code path to maintain, you just need to think about it in two ways (single- and multi-threaded). With these SIMD changes that is no longer the case, which strikes me as unfortunate and bug-prone. The code could be restructured so that the SIMD case is treated as the special case with #ifdef and the non-SIMD case, for both OpenMP and non-OpenMP, uses the same shared code path. But that is probably pointless and counter-productive, really, since 99% of compilers/hardware these days supports SIMD, I imagine. Perhaps a better solution would be to have the OpenMP path use your SIMD implementation as well. At the moment the OpenMP code paths request SIMD optimizations from OpenMP, but I think (?) that could be removed and it could use your SIMD code instead, right? Then all the different possibilities (SIMD present/absent, OpenMP present/absent) would all be handled by the same code path. What are your thoughts on that?

I'm also pondering the way you needed to change the self-tests to be tolerant of small errors. (1) Maybe there ought to be a built-in way in Eidos to compare for equality with a given tolerance? It comes up again and again in the self-tests. Maybe there's an R function I ought to emulate for this. Hmm, looks like all.equal() is in base R, I ought to add something like that. I just opened #580 to track this. (2) This tolerance issue is a good reason to use your SIMD code in the parallel case as well, rather than relying on OpenMP's SIMD support, I think. As far as I recall, OpenMP only actually does SIMD optimizations when the result is guaranteed to be identical, otherwise it declines. I seem to recall getting pretty unsatisfactory results from its SIMD feature as a result. Your SIMD approach is nice in that it just forces the SIMD optimization to happen; it isn't up to the conservative black-box heuristics of OpenMP or the compiler to make it happen. (I tried making some non-OpenMP loops in Eidos be SIMD-compliant in the hopes that the compiler would optimize them with SIMD automatically, but I think that it, too, declines to do so unless the result is exact.)

And lastly: are these all of the SIMD optimizations that are straightforward to do in Eidos, or was this just a pilot project? If there are more, it'd be great to get them all in. That work would only benefit models that actually call the Eidos functions in question; but many models do. But the above questions ought to be resolved before doing that work, and maybe that additional work ought to be a new PR. :->

Thanks a bunch! I'll be very happy to merge this once the above has been addressed! :->

@bhaller
Contributor

bhaller commented Dec 6, 2025

OK, I've implemented #580 to test for equality within a given tolerance. If you want to use that you can, but no worries; at some point I'll go through and switch over to using isClose()/allClose() in all of the unit tests that use a tolerance value, so I'll fix your tests along with all the others.

@andrewkern
Collaborator Author

Progress update:

Hi @andrewkern! I've just gotten around to reviewing this PR. Overall it looks great! Can I ask you to do a couple of things?

  • change eidos_simd.h to credit you, not me :->
  • add a README file to simd_benchmarks giving credit to yourself, explaining the purpose of the directory, and mentioning that the files in it are not used in the build of SLiM but are simply provided for internal development use
  • it looks like GitHub's CI failed for several builds above, for reasons that I don't understand but have not dug into, so that needs to be fixed

This is a PITA. Apple Silicon doesn't support SIMD in the same way that x86_64-based processors do... I'm working on this. Should be doable... I'll hopefully get this done soon.

  • there are various places where the tab indentation of blank lines was changed for no particular reason, making the diffs harder to read/review; it'd be nice if those got cleaned up to make the diffs minimal.

More on your other questions/thoughts soon

@bhaller
Contributor

bhaller commented Dec 11, 2025

Progress update:

...This is a PITA. Apple Silicon doesn't support SIMD in the same way that x86_64-based processors do... I'm working on this. Should be doable... I'll hopefully get this done soon.

Apple has to be different. :-> Sorry, sounds like a PITA indeed. If it turns out not to be workable on Apple Silicon, it can just get #ifdef'ed, I suppose.

  • there are various places where the tab indentation of blank lines was changed for no particular reason, making the diffs harder to read/review; it'd be nice if those got cleaned up to make the diffs minimal.

If the whitespace issues are driving you crazy then forget about that. :->

More on your other questions/thoughts soon

OK. Thanks.

@andrewkern
Collaborator Author

this is the ARM SIMD extension (NEON): https://developer.arm.com/Architectures/Neon

working on it now

@bhaller
Contributor

bhaller commented Dec 11, 2025

@andrewkern I had no idea you're such a bad-ass programmer! Kudos! Ping me when you're ready for another review of this PR. Maybe you should quit that Director job of yours and come work on SLiM with me! :->

@andrewkern
Collaborator Author

andrewkern commented Dec 11, 2025

@andrewkern I had no idea you're such a bad-ass programmer! Kudos! Ping me when you're ready for another review of this PR. Maybe you should quit that Director job of yours and come work on SLiM with me! :->

nah, i'm no badass, just a bit of a hardware junkie and willing to google 😄

it looks like this now passes macos-15 tests, but the macos-13 runner has been retired by GH: actions/runner-images#13046

I'm going to edit the workflow to move the macos-13 runner to macos-14 -- this is something you'll want to include in the master branch for sure

@andrewkern
Collaborator Author

andrewkern commented Dec 11, 2025

Besides those to-do items, I have a few thoughts as well. First of all, a very nice thing about OpenMP is that the same code path is used for both single-threaded and multi-threaded builds; the #pragmas for OpenMP just modify the behavior of an existing chunk of code to parallelize it. This means that there is only one code path to maintain, you just need to think about it in two ways (single- and multi-threaded). With these SIMD changes that is no longer the case, which strikes me as unfortunate and bug-prone. The code could be restructured so that the SIMD case is treated as the special case with #ifdef and the non-SIMD case, for both OpenMP and non-OpenMP, uses the same shared code path. But that is probably pointless and counter-productive, really, since 99% of compilers/hardware these days supports SIMD, I imagine. Perhaps a better solution would be to have the OpenMP path use your SIMD implementation as well. At the moment the OpenMP code paths request SIMD optimizations from OpenMP, but I think (?) that could be removed and it could use your SIMD code instead, right? Then all the different possibilities (SIMD present/absent, OpenMP present/absent) would all be handled by the same code path. What are your thoughts on that?

Good point. The current structure does create multiple code paths to maintain. Your suggestion to have OpenMP use the SIMD functions makes sense. The OpenMP #pragma omp parallel for simd currently asks the compiler to auto-vectorize, but that's redundant if we have explicit SIMD. We could restructure to:

#ifdef _OPENMP
    EIDOS_THREAD_COUNT(...);
    #pragma omp parallel for ...  // no simd clause
    for (int chunk = 0; chunk < num_chunks; chunk++)
        Eidos_SIMD::sqrt_float64(&input[chunk*size], &output[chunk*size], size);
#else
    Eidos_SIMD::sqrt_float64(input, output, count);
#endif

This way:

  • SIMD functions handle vectorization (with scalar fallback built-in)
  • OpenMP just handles thread parallelism when enabled
  • One SIMD implementation serves all cases
  • The SIMD functions already have scalar fallbacks, so SIMD present/absent is handled internally.

Want me to refactor it this way?

I'm also pondering the way you needed to change the self-tests to be tolerant of small errors. (1) Maybe there ought to be a built-in way in Eidos to compare for equality with a given tolerance? It comes up again and again in the self-tests. Maybe there's an R function I ought to emulate for this. Hmm, looks like all.equal() is in base R, I ought to add something like that. I just opened #580 to track this. (2) This tolerance issue is a good reason to use your SIMD code in the parallel case as well, rather than relying on OpenMP's SIMD support, I think. As far as I recall, OpenMP only actually does SIMD optimizations when the result is guaranteed to be identical, otherwise it declines. I seem to recall getting pretty unsatisfactory results from its SIMD feature as a result. Your SIMD approach is nice in that it just forces the SIMD optimization to happen; it isn't up to the conservative black-box heuristics of OpenMP or the compiler to make it happen. (I tried making some non-OpenMP loops in Eidos be SIMD-compliant in the hopes that the compiler would optimize them with SIMD automatically, but I think that it, too, declines to do so unless the result is exact.)

Good points on both:

  • Agreed. I can have the OpenMP path call the explicit SIMD functions rather than relying on #pragma omp simd. That way we get guaranteed vectorization regardless of compiler conservatism, and the parallel threads each process their chunk with explicit SIMD.

And lastly: are these all of the SIMD optimizations that are straightforward to do in Eidos, or was this just a pilot project? If there are more, it'd be great to get them all in. That work would only benefit models that actually call the Eidos functions in question; but many models do. But the above questions ought to be resolved before doing that work, and maybe that additional work ought to be a new PR. :->

This was a pilot to test the approach. There are more candidates - trig functions (sin, cos, tan), their inverses, and possibly some vector operations. Happy to add those in a follow-up PR once we've settled the OpenMP integration question and the CI is green on this one.

@andrewkern
Collaborator Author

okay all tests now pass.

Adds compile-time SIMD detection (AVX2/SSE4.2/FMA) and vectorized
implementations for sqrt, abs, floor, ceil, round, trunc, sum, and
product. Benchmarks show 1.4-5.7x speedups on large float arrays.

- CMakeLists.txt: add USE_SIMD option and compiler flag detection
- eidos/eidos_simd.h: new header with SIMD intrinsic implementations
- eidos/eidos_functions_math.cpp: use SIMD paths when available
- eidos/eidos_test_*.{h,cpp}: use tolerance for float comparisons

Includes Eidos math function benchmark, SLiM simulation benchmark,
and a runner script that builds both SIMD and scalar versions and
compares performance.

@bhaller
Contributor

bhaller commented Dec 12, 2025

...Your suggestion to have OpenMP use the SIMD functions makes sense. The OpenMP #pragma omp parallel for simd currently asks the compiler to auto-vectorize, but that's redundant if we have explicit SIMD. We could restructure to:
...
Want me to refactor it this way?

Hmm. I'm going to say no. Your refactor is perfect and I'll do it eventually. The problem is that since the OpenMP stuff doesn't really build/run properly at present, it won't be possible (or easy, at least) to test those changes for correctness. So how about this? Just put a comment, // FIXME: refactor this parallel code to use the Eidos_SIMD code path, chunked; see github.com/MesserLab/SLiM/pull/578#issuecomment-3640288984, in the various spots and I'll clean it all up when I return to working on parallel SLiM down the road. Sound good?

Groovy. :->

  • Agreed. I can have the OpenMP path call the explicit SIMD functions rather than relying on #pragma omp simd. That way we get guaranteed vectorization regardless of compiler conservatism, and the parallel threads each process their chunk with explicit SIMD.

See above; let's pend that change. Definitely the right change, but the wrong time.

And lastly: are these all of the SIMD optimizations that are straightforward to do in Eidos, or was this just a pilot project? If there are more, it'd be great to get them all in. That work would only benefit models that actually call the Eidos functions in question; but many models do. But the above questions ought to be resolved before doing that work, and maybe that additional work ought to be a new PR. :->

This was a pilot to test the approach. There are more candidates - trig functions (sin, cos, tan), their inverses, and possibly some vector operations. Happy to add those in a follow-up PR once we've settled the OpenMP integration question and the CI is green on this one.

OK, great. It may be that the C++ implementation of all of those is not ideal at present; Eidos has undergone some design shifts, and some code paths got cleaned up more than others. If you hit a function that seems to be kind of a mess in how it's set up, and it makes it harder to SIMD-ize it, let me know and I can clean the code up first before you delve into it. Or provide me with a complete list here of the Eidos functions you intend to SIMD-ize, and I'll just check them all now. :->

So I think once you've got those comments about later refactoring of the OpenMP code paths inserted, this PR will be good to merge...?

@andrewkern
Collaborator Author

Okay @bhaller -- I put those FIXME comments in. Should be good to go, I think.

wrt further SIMD-izing Eidos math functions -- transcendental functions aren't handled natively by SIMD instructions, but there is a quite portable library that most people use called SLEEF that we could consider bringing in: https://sleef.org/

If you were open to that direction, SLEEF could be included as a git submodule and we could build it as a static library with all its bells and whistles turned off. Alternatively, SLEEF has a way to build inline headers, so we could include inline implementations really easily with a simple #include "sleef.h".

this is of course a down the line decision and doesn't affect this PR

@andrewkern andrewkern marked this pull request as ready for review December 13, 2025 01:26
@bhaller bhaller merged commit 6ed869e into MesserLab:master Dec 13, 2025
17 checks passed
@bhaller
Contributor

bhaller commented Dec 13, 2025

Okay @bhaller -- I put those FIXME comments in. Should be good to go, I think.

Thanks, merged!

wrt further SIMD-izing Eidos math functions -- transcendental functions aren't handled natively by SIMD instructions, but there is a quite portable library that most people use called SLEEF that we could consider bringing in: https://sleef.org/

If you were open to that direction, SLEEF could be included as a git submodule and we could build it as a static library with all its bells and whistles turned off. Alternatively, SLEEF has a way to build inline headers, so we could include inline implementations really easily with a simple #include "sleef.h".

Very interesting. I've just opened a new issue about SLEEF where we can ponder this. #586

this is of course a down the line decision and doesn't affect this PR

Yep. Thanks @andrewkern!

@bhaller
Contributor

bhaller commented Dec 13, 2025

@andrewkern Uh oh. Weird build failures after merging your PR:

  error: clangarm64: signature from "Christoph Reiter (MSYS2 development key) <reiter.christoph@gmail.com>" is invalid
  error: failed to synchronize all databases (invalid or corrupted database (PGP signature))

This might be a GitHub Actions problem that has nothing to do with you. I found msys2/msys2.github.io#226 and msys2/MSYS2-packages#2397. I have no idea what it all means. I will see whether it is magically fixed tomorrow morning, lol.

@andrewkern
Collaborator Author

yeah this has nothing to do with the code -- it's an error with msys2/setup-msys2@v2
