-
Notifications
You must be signed in to change notification settings - Fork 36
eidos function SIMD optimizations #578
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Hi @andrewkern! I've just gotten around to reviewing this PR. Overall it looks great! Can I ask you to do a couple of things?
Besides those to-do items, I have a few thoughts as well. First of all, a very nice thing about OpenMP is that the same code path is used for both single-threaded and multi-threaded builds; the I'm also pondering the way you needed to change the self-tests to be tolerant of small errors. (1) Maybe there ought to be a built-in way in Eidos to compare for equality with a given tolerance? It comes up again and again in the self-tests. Maybe there's an R function I ought to emulate for this. Hmm, looks like And lastly: are these all of the SIMD optimizations that are straightforward to do in Eidos, or was this just a pilot project? If there are more, it'd be great to get them all in. That work would only benefit models that actually call the Eidos functions in question; but many models do. But the above questions ought to be resolved before doing that work, and maybe that additional work ought to be a new PR. :-> Thanks a bunch! I'll be very happy to merge this once the above has been addressed! :-> |
|
OK, I've implemented #580 to test for equality within a given tolerance. If you want to use that you can, but no worries; at some point I'll go through and switch over to using |
|
Progress update:
This is a PITA. Apple silicon doesn't support SIMD in the same way that X86_64 based processors do.... I'm working on this. Should be doable.... I'll hopefully get this done soon
More on your other questions/thoughts soon |
Apple has to be different. :-> Sorry, sounds like a PITA indeed. If it turns out not to be workable on Apple Silicon if can just get #ifdef'ed I suppose.
If the whitespace issues are driving you crazy then forget about that. :->
OK. Thanks. |
|
this is the ARM SIMD: https://developer.arm.com/Architectures/Neon working on it now |
|
@andrewkern I had no idea you're such a bad-ass programmer! Kudos! Ping me when you're ready for another review of this PR. Maybe you should quit that Director job of yours and come work on SLiM with me! :-> |
nah, i'm no badass, just a bit of a hardware junkie and willing to google 😄 it looks like this now passes I'm going to edit the workflow to move the |
Good point. The current structure does create multiple code paths to maintain. Your suggestion to have OpenMP use the SIMD functions makes sense. The OpenMP #pragma omp parallel for simd currently asks the compiler to auto-vectorize, but that's redundant if we have explicit SIMD. We could restructure to: #ifdef _OPENMP
EIDOS_THREAD_COUNT(...);
#pragma omp parallel for ... // no simd clause
for (int chunk = 0; chunk < num_chunks; chunk++)
Eidos_SIMD::sqrt_float64(&input[chunk*size], &output[chunk*size], size);
#else
Eidos_SIMD::sqrt_float64(input, output, count);
#endifThis way:
Want me to refactor it this way?
Good points on both:
This was a pilot to test the approach. There are more candidates - trig functions (sin, cos, tan), their inverses, and possibly some vector operations. Happy to add those in a follow-up PR once we've settled the OpenMP integration question and the CI is green on this one. |
|
okay all tests now pass. |
Adds compile-time SIMD detection (AVX2/SSE4.2/FMA) and vectorized
implementations for sqrt, abs, floor, ceil, round, trunc, sum, and
product. Benchmarks show 1.4-5.7x speedups on large float arrays.
- CMakeLists.txt: add USE_SIMD option and compiler flag detection
- eidos/eidos_simd.h: new header with SIMD intrinsic implementations
- eidos/eidos_functions_math.cpp: use SIMD paths when available
- eidos/eidos_test_*.{h,cpp}: use tolerance for float comparisons
Includes Eidos math function benchmark, SLiM simulation benchmark, and a runner script that builds both SIMD and scalar versions and compares performance.
Hmm. I'm going to say no. Your refactor is perfect and I'll do it eventually. The problem is that since the OpenMP stuff doesn't really build/run properly at present, it won't be possible (or easy, at least) to test those changes for correctness. So how about this? Just put a comment,
Groovy. :->
See above; let's pend that change. Definitely the right change, but the wrong time.
OK, great. It may be that the C++ implementation of all of those is not ideal at present; Eidos has undergone some design shifts, and some code paths got cleaned up more than others. If you hit a function that seems to be kind of a mess in how it's set up, and it makes it harder to SIMD-ize it, let me know and I can clean the code up first before you delve into it. Or provide me with a complete list here of the Eidos functions you intend to SIMD-ize, and I'll just check them all now. :-> So I think once you've got those comments about later refactoring of the OpenMP code paths inserted, this PR will be good to merge...? |
|
Okay @bhaller -- I put those FIXME comments in. Should be good to go, I think. wrt further SIMDizing Eidos math functions-- transcendental functions aren't natively handled by SIMD instructions, but there is a quite portable library that most people use called If you were open to that direction, this is of course a down the line decision and doesn't affect this PR |
Thanks, merged!
Very interesting. I've just opened a new issue about SLEEF where we can ponder this. #586
Yep. Thanks @andrewkern! |
|
@andrewkern Uh oh. Weird build failures after merging your PR: This might be a GitHub Actions problem that has nothing to do with you. I found msys2/msys2.github.io#226 and msys2/MSYS2-packages#2397. I have no idea what it all means. I will see whether it is magically fixed tomorrow morning, lol. |
|
yeah this has nothing to do with the code- it's an error with |
SIMD Vectorization for Eidos Math Functions
Summary
eidos/eidos_simd.hwith SSE4.2/AVX2 implementations for math operations-DUSE_SIMD=ON/OFFCMake option (default: ON) with automatic CPU detectionVectorized functions:
sqrt,abs,floor,ceil,trunc,round,sum,productAVX2 processes 4 doubles/instruction; SSE4.2 processes 2 doubles/instruction.
Benchmark Suite
run_benchmarks.sh [num_runs]Builds SLiM with and without SIMD, runs benchmarks, reports speedup.
simd_benchmark.eidosTests math functions on 1M element arrays (100 iterations each).
slim_benchmark.slimFull simulation benchmark: N=5000, 1Mb chromosome, 5000 generations with selection.
Benchmark Results
$ simd_benchmarks/run_benchmarks.sh ============================================ SIMD Benchmark Runner ============================================ SLiM root: /home/adkern/SLiM Runs per benchmark: 3 Building with SIMD enabled... Done. Building with SIMD disabled... Done. ============================================ Eidos Math Function Benchmarks ============================================ SIMD Build: Running Eidos benchmark (SIMD)... sqrt(): 0.105 sec abs(): 0.171 sec floor(): 0.164 sec ceil(): 0.166 sec round(): 0.164 sec trunc(): 0.165 sec sum(): 0.032 sec product(): 0.003 sec (1000 elements, 10000 iters) Scalar Build: Running Eidos benchmark (Scalar)... sqrt(): 0.108 sec abs(): 0.166 sec floor(): 0.231 sec ceil(): 0.246 sec round(): 0.473 sec trunc(): 0.246 sec sum(): 0.166 sec product(): 0.017 sec (1000 elements, 10000 iters) ============================================ SLiM Simulation Benchmark (N=5000, 5000 generations, selection) ============================================ Running 3 iterations each... SIMD Build: 12.756s (avg) Scalar Build: 12.316s (avg) Speedup: .96x ============================================ Benchmark complete ============================================