
Conversation

@andrewkern
Collaborator

SIMD Vectorization for Eidos Math Functions

Summary

  • Add eidos/eidos_simd.h with SSE4.2/AVX2 implementations for math operations
  • Add -DUSE_SIMD=ON/OFF CMake option (default: ON) with automatic CPU detection
  • Update math functions to use SIMD as the non-OpenMP fallback path

Vectorized functions: sqrt, abs, floor, ceil, trunc, round, sum, product

AVX2 processes 4 doubles/instruction; SSE4.2 processes 2 doubles/instruction.

Benchmark Suite

run_benchmarks.sh [num_runs]

Builds SLiM with and without SIMD, runs benchmarks, reports speedup.

./run_benchmarks.sh      # 3 runs (default)
./run_benchmarks.sh 10   # 10 runs

simd_benchmark.eidos

Tests math functions on 1M element arrays (100 iterations each).

slim_benchmark.slim

Full simulation benchmark: N=5000, 1Mb chromosome, 5000 generations with selection.

Benchmark Results

$ simd_benchmarks/run_benchmarks.sh 
============================================
SIMD Benchmark Runner
============================================
SLiM root: /home/adkern/SLiM
Runs per benchmark: 3

Building with SIMD enabled...
  Done.
Building with SIMD disabled...
  Done.

============================================
Eidos Math Function Benchmarks
============================================

SIMD Build:
  Running Eidos benchmark (SIMD)...
    sqrt():    0.105 sec
    abs():     0.171 sec
    floor():   0.164 sec
    ceil():    0.166 sec
    round():   0.164 sec
    trunc():   0.165 sec
    sum():     0.032 sec
    product(): 0.003 sec (1000 elements, 10000 iters)

Scalar Build:
  Running Eidos benchmark (Scalar)...
    sqrt():    0.108 sec
    abs():     0.166 sec
    floor():   0.231 sec
    ceil():    0.246 sec
    round():   0.473 sec
    trunc():   0.246 sec
    sum():     0.166 sec
    product(): 0.017 sec (1000 elements, 10000 iters)

============================================
SLiM Simulation Benchmark
(N=5000, 5000 generations, selection)
============================================

Running 3 iterations each...

SIMD Build:   12.756s (avg)
Scalar Build: 12.316s (avg)

Speedup: 0.96x

============================================
Benchmark complete
============================================

@andrewkern andrewkern marked this pull request as draft November 28, 2025 03:09
@bhaller
Contributor

bhaller commented Dec 5, 2025

Hi @andrewkern! I've just gotten around to reviewing this PR. Overall it looks great! Can I ask you to do a couple of things?

  • change eidos_simd.h to credit you, not me :->
  • add a README file to simd_benchmarks giving credit to yourself, explaining the purpose of the directory, and mentioning that the files in it are not used in the build of SLiM but are simply provided for internal development use
  • it looks like GitHub's CI failed for several builds above, for reasons that I don't understand but have not dug into, so that needs to be fixed
  • there are various places where the tab indentation of blank lines was changed for no particular reason, making the diffs harder to read/review; it'd be nice if those got cleaned up to make the diffs minimal.

Besides those to-do items, I have a few thoughts as well. First of all, a very nice thing about OpenMP is that the same code path is used for both single-threaded and multi-threaded builds; the #pragmas for OpenMP just modify the behavior of an existing chunk of code to parallelize it. This means that there is only one code path to maintain, you just need to think about it in two ways (single- and multi-threaded). With these SIMD changes that is no longer the case, which strikes me as unfortunate and bug-prone. The code could be restructured so that the SIMD case is treated as the special case with #ifdef and the non-SIMD case, for both OpenMP and non-OpenMP, uses the same shared code path. But that is probably pointless and counter-productive, really, since 99% of compilers/hardware these days supports SIMD, I imagine. Perhaps a better solution would be to have the OpenMP path use your SIMD implementation as well. At the moment the OpenMP code paths request SIMD optimizations from OpenMP, but I think (?) that could be removed and it could use your SIMD code instead, right? Then all the different possibilities (SIMD present/absent, OpenMP present/absent) would all be handled by the same code path. What are your thoughts on that?

I'm also pondering the way you needed to change the self-tests to be tolerant of small errors. (1) Maybe there ought to be a built-in way in Eidos to compare for equality with a given tolerance? It comes up again and again in the self-tests. Maybe there's an R function I ought to emulate for this. Hmm, looks like all.equal() is in base R, I ought to add something like that. I just opened #580 to track this. (2) This tolerance issue is a good reason to use your SIMD code in the parallel case as well, rather than relying on OpenMP's SIMD support, I think. As far as I recall, OpenMP only actually does SIMD optimizations when the result is guaranteed to be identical, otherwise it declines. I seem to recall getting pretty unsatisfactory results from its SIMD feature as a result. Your SIMD approach is nice in that it just forces the SIMD optimization to happen; it isn't up to the conservative black-box heuristics of OpenMP or the compiler to make it happen. (I tried making some non-OpenMP loops in Eidos be SIMD-compliant in the hopes that the compiler would optimize them with SIMD automatically, but I think that it, too, declines to do so unless the result is exact.)

And lastly: are these all of the SIMD optimizations that are straightforward to do in Eidos, or was this just a pilot project? If there are more, it'd be great to get them all in. That work would only benefit models that actually call the Eidos functions in question; but many models do. But the above questions ought to be resolved before doing that work, and maybe that additional work ought to be a new PR. :->

Thanks a bunch! I'll be very happy to merge this once the above has been addressed! :->

@bhaller
Contributor

bhaller commented Dec 6, 2025

OK, I've implemented #580 to test for equality within a given tolerance. If you want to use that you can, but no worries; at some point I'll go through and switch over to using isClose()/allClose() in all of the unit tests that use a tolerance value, so I'll fix your tests along with all the others.

@andrewkern
Collaborator Author

Progress update:

Hi @andrewkern! I've just gotten around to reviewing this PR. Overall it looks great! Can I ask you to do a couple of things?

  • change eidos_simd.h to credit you, not me :->
  • add a README file to simd_benchmarks giving credit to yourself, explaining the purpose of the directory, and mentioning that the files in it are not used in the build of SLiM but are simply provided for internal development use
  • it looks like GitHub's CI failed for several builds above, for reasons that I don't understand but have not dug into, so that needs to be fixed

This is a PITA. Apple Silicon doesn't support SIMD in the same way that x86_64-based processors do... I'm working on this. Should be doable... I'll hopefully get this done soon.

  • there are various places where the tab indentation of blank lines was changed for no particular reason, making the diffs harder to read/review; it'd be nice if those got cleaned up to make the diffs minimal.

More on your other questions/thoughts soon

@bhaller
Contributor

bhaller commented Dec 11, 2025

Progress update:

...This is a PITA. Apple Silicon doesn't support SIMD in the same way that x86_64-based processors do... I'm working on this. Should be doable... I'll hopefully get this done soon.

Apple has to be different. :-> Sorry, sounds like a PITA indeed. If it turns out not to be workable on Apple Silicon, it can just get #ifdef'ed, I suppose.

  • there are various places where the tab indentation of blank lines was changed for no particular reason, making the diffs harder to read/review; it'd be nice if those got cleaned up to make the diffs minimal.

If the whitespace issues are driving you crazy then forget about that. :->

More on your other questions/thoughts soon

OK. Thanks.

@andrewkern
Collaborator Author

this is the ARM SIMD extension (NEON): https://developer.arm.com/Architectures/Neon

working on it now

@bhaller
Contributor

bhaller commented Dec 11, 2025

@andrewkern I had no idea you're such a bad-ass programmer! Kudos! Ping me when you're ready for another review of this PR. Maybe you should quit that Director job of yours and come work on SLiM with me! :->

@andrewkern
Collaborator Author

andrewkern commented Dec 11, 2025

@andrewkern I had no idea you're such a bad-ass programmer! Kudos! Ping me when you're ready for another review of this PR. Maybe you should quit that Director job of yours and come work on SLiM with me! :->

nah, i'm no badass, just a bit of a hardware junkie and willing to google 😄

it looks like this now passes macos-15 tests, but the macos-13 runner has been retired by GH: actions/runner-images#13046

I'm going to edit the workflow to move the macos-13 runner to macos-14 -- this is something you'll want to include in the master branch for sure

@andrewkern
Collaborator Author

andrewkern commented Dec 11, 2025

Besides those to-do items, I have a few thoughts as well. First of all, a very nice thing about OpenMP is that the same code path is used for both single-threaded and multi-threaded builds; the #pragmas for OpenMP just modify the behavior of an existing chunk of code to parallelize it. This means that there is only one code path to maintain, you just need to think about it in two ways (single- and multi-threaded). With these SIMD changes that is no longer the case, which strikes me as unfortunate and bug-prone. The code could be restructured so that the SIMD case is treated as the special case with #ifdef and the non-SIMD case, for both OpenMP and non-OpenMP, uses the same shared code path. But that is probably pointless and counter-productive, really, since 99% of compilers/hardware these days supports SIMD, I imagine. Perhaps a better solution would be to have the OpenMP path use your SIMD implementation as well. At the moment the OpenMP code paths request SIMD optimizations from OpenMP, but I think (?) that could be removed and it could use your SIMD code instead, right? Then all the different possibilities (SIMD present/absent, OpenMP present/absent) would all be handled by the same code path. What are your thoughts on that?

Good point. The current structure does create multiple code paths to maintain. Your suggestion to have OpenMP use the SIMD functions makes sense. The OpenMP #pragma omp parallel for simd currently asks the compiler to auto-vectorize, but that's redundant if we have explicit SIMD. We could restructure to:

#ifdef _OPENMP
    EIDOS_THREAD_COUNT(...);
    #pragma omp parallel for ...  // no simd clause
    for (int chunk = 0; chunk < num_chunks; chunk++)
        Eidos_SIMD::sqrt_float64(&input[chunk*size], &output[chunk*size], size);
#else
    Eidos_SIMD::sqrt_float64(input, output, count);
#endif

This way:

  • SIMD functions handle vectorization (with scalar fallback built-in)
  • OpenMP just handles thread parallelism when enabled
  • One SIMD implementation serves all cases
  • The SIMD functions already have scalar fallbacks, so SIMD present/absent is handled internally.

Want me to refactor it this way?

I'm also pondering the way you needed to change the self-tests to be tolerant of small errors. (1) Maybe there ought to be a built-in way in Eidos to compare for equality with a given tolerance? It comes up again and again in the self-tests. Maybe there's an R function I ought to emulate for this. Hmm, looks like all.equal() is in base R, I ought to add something like that. I just opened #580 to track this. (2) This tolerance issue is a good reason to use your SIMD code in the parallel case as well, rather than relying on OpenMP's SIMD support, I think. As far as I recall, OpenMP only actually does SIMD optimizations when the result is guaranteed to be identical, otherwise it declines. I seem to recall getting pretty unsatisfactory results from its SIMD feature as a result. Your SIMD approach is nice in that it just forces the SIMD optimization to happen; it isn't up to the conservative black-box heuristics of OpenMP or the compiler to make it happen. (I tried making some non-OpenMP loops in Eidos be SIMD-compliant in the hopes that the compiler would optimize them with SIMD automatically, but I think that it, too, declines to do so unless the result is exact.)

Good points on both:

  • Agreed. I can have the OpenMP path call the explicit SIMD functions rather than relying on #pragma omp simd. That way we get guaranteed vectorization regardless of compiler conservatism, and the parallel threads each process their chunk with explicit SIMD.

And lastly: are these all of the SIMD optimizations that are straightforward to do in Eidos, or was this just a pilot project? If there are more, it'd be great to get them all in. That work would only benefit models that actually call the Eidos functions in question; but many models do. But the above questions ought to be resolved before doing that work, and maybe that additional work ought to be a new PR. :->

This was a pilot to test the approach. There are more candidates - trig functions (sin, cos, tan), their inverses, and possibly some vector operations. Happy to add those in a follow-up PR once we've settled the OpenMP integration question and the CI is green on this one.

@andrewkern
Collaborator Author

okay all tests now pass.

Adds compile-time SIMD detection (AVX2/SSE4.2/FMA) and vectorized
implementations for sqrt, abs, floor, ceil, round, trunc, sum, and
product. Benchmarks show 1.4-5.7x speedups on large float arrays.

- CMakeLists.txt: add USE_SIMD option and compiler flag detection
- eidos/eidos_simd.h: new header with SIMD intrinsic implementations
- eidos/eidos_functions_math.cpp: use SIMD paths when available
- eidos/eidos_test_*.{h,cpp}: use tolerance for float comparisons

Includes Eidos math function benchmark, SLiM simulation benchmark,
and a runner script that builds both SIMD and scalar versions and
compares performance.

@bhaller
Contributor

bhaller commented Dec 12, 2025

...Your suggestion to have OpenMP use the SIMD functions makes sense. The OpenMP #pragma omp parallel for simd currently asks the compiler to auto-vectorize, but that's redundant if we have explicit SIMD. We could restructure to:
...
Want me to refactor it this way?

Hmm. I'm going to say no. Your refactor is perfect and I'll do it eventually. The problem is that since the OpenMP stuff doesn't really build/run properly at present, it won't be possible (or easy, at least) to test those changes for correctness. So how about this? Just put a comment, // FIXME: refactor this parallel code to use the Eidos_SIMD code path, chunked; see github.com/MesserLab/SLiM/pull/578#issuecomment-3640288984, in the various spots and I'll clean it all up when I return to working on parallel SLiM down the road. Sound good?

Groovy. :->

  • Agreed. I can have the OpenMP path call the explicit SIMD functions rather than relying on #pragma omp simd. That way we get guaranteed vectorization regardless of compiler conservatism, and the parallel threads each process their chunk with explicit SIMD.

See above; let's pend that change. Definitely the right change, but the wrong time.

And lastly: are these all of the SIMD optimizations that are straightforward to do in Eidos, or was this just a pilot project? If there are more, it'd be great to get them all in. That work would only benefit models that actually call the Eidos functions in question; but many models do. But the above questions ought to be resolved before doing that work, and maybe that additional work ought to be a new PR. :->

This was a pilot to test the approach. There are more candidates - trig functions (sin, cos, tan), their inverses, and possibly some vector operations. Happy to add those in a follow-up PR once we've settled the OpenMP integration question and the CI is green on this one.

OK, great. It may be that the C++ implementation of all of those is not ideal at present; Eidos has undergone some design shifts, and some code paths got cleaned up more than others. If you hit a function that seems to be kind of a mess in how it's set up, and it makes it harder to SIMD-ize it, let me know and I can clean the code up first before you delve into it. Or provide me with a complete list here of the Eidos functions you intend to SIMD-ize, and I'll just check them all now. :->

So I think once you've got those comments about later refactoring of the OpenMP code paths inserted, this PR will be good to merge...?

@andrewkern
Collaborator Author

Okay @bhaller -- I put those FIXME comments in. Should be good to go, I think.

wrt further SIMD-izing Eidos math functions -- transcendental functions aren't handled natively by SIMD instructions, but there is a quite portable library that most people use called SLEEF that we could consider bringing in: https://sleef.org/

If you were open to that direction, SLEEF could be included as a git submodule and we could build it as a static library with all its bells and whistles turned off. Alternatively, SLEEF has a way to build inline headers, so we could include inline implementations really easily with a simple #include "sleef.h".

this is of course a down the line decision and doesn't affect this PR

@andrewkern andrewkern marked this pull request as ready for review December 13, 2025 01:26
@bhaller bhaller merged commit 6ed869e into MesserLab:master Dec 13, 2025
17 checks passed
@bhaller
Contributor

bhaller commented Dec 13, 2025

Okay @bhaller -- I put those FIXME comments in. Should be good to go, I think.

Thanks, merged!

wrt further SIMD-izing Eidos math functions -- transcendental functions aren't handled natively by SIMD instructions, but there is a quite portable library that most people use called SLEEF that we could consider bringing in: https://sleef.org/

If you were open to that direction, SLEEF could be included as a git submodule and we could build it as a static library with all its bells and whistles turned off. Alternatively, SLEEF has a way to build inline headers, so we could include inline implementations really easily with a simple #include "sleef.h".

Very interesting. I've just opened a new issue about SLEEF where we can ponder this. #586

this is of course a down the line decision and doesn't affect this PR

Yep. Thanks @andrewkern!

@bhaller
Contributor

bhaller commented Dec 13, 2025

@andrewkern Uh oh. Weird build failures after merging your PR:

  error: clangarm64: signature from "Christoph Reiter (MSYS2 development key) <reiter.christoph@gmail.com>" is invalid
  error: failed to synchronize all databases (invalid or corrupted database (PGP signature))

This might be a GitHub Actions problem that has nothing to do with you. I found msys2/msys2.github.io#226 and msys2/MSYS2-packages#2397. I have no idea what it all means. I will see whether it is magically fixed tomorrow morning, lol.

@andrewkern
Collaborator Author

yeah this has nothing to do with the code -- it's an error with msys2/setup-msys2@v2
