Conversation

@andrewkern
Collaborator

Adds SIMD-optimized (AVX2/NEON) kernel functions for spatial interactions using SLEEF for transcendental functions. All kernel types except Fixed now use a two-pass approach (build distances, then batch transform) enabling vectorized strength calculations.

Summary Benchmark

50k individuals, ~2262 neighbors

| Kernel | Original | Final | Improvement |
|---|---|---|---|
| Fixed | 31.97s | 31.36s | -2% (special-cased) |
| Linear | 37.26s | 32.95s | -12% |
| Exponential | 59.58s | 34.88s | -41% |
| Normal | 56.37s | 35.15s | -38% |
| Cauchy | 40.04s | 33.00s | -18% |
| Student's T | 130.10s | 49.76s | -62% |
| **TOTAL** | **356.04s** | **217.80s** | **-39%** |

…ential)

- Add float SLEEF macros to sleef_config.h (AVX2: 8 floats, NEON: 4 floats)
- Add exp_kernel_float32() and normal_kernel_float32() to eidos_simd.h
- Use SIMD kernels in FillSparseVectorForReceiverStrengths()
- Add benchmark script for spatial interaction kernels

Modify FillSparseVectorForReceiverStrengths() to use the two-pass
distance-then-transform path for most kernel types in 2D, enabling
SIMD optimizations for Exponential and Normal kernels. The Fixed
kernel retains the original single-pass special-case path since it
doesn't benefit from SIMD (just assigns a constant).

Benchmarks show 22% overall speedup at high neighbor counts (~2200),
with Exponential and Normal kernels seeing 38-42% improvement.
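The two-pass pattern described above can be sketched in scalar C++ as follows. This is a hypothetical illustration, not the actual SLiM code: the helper names and the exponential parameterization (`fmax * exp(-d / lambda)`) are assumptions, and in the real SIMD build the batch-transform loop is what `exp_kernel_float32()` replaces with AVX2/NEON intrinsics plus SLEEF calls over lanes of 8 (AVX2) or 4 (NEON) floats.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Pass 2: transform a contiguous buffer of distances into strengths in one
// batch.  This scalar loop is the part a SIMD build vectorizes.
static void exp_kernel_scalar(const float *d, float *s, size_t n,
                              float fmax, float lambda)
{
    for (size_t i = 0; i < n; ++i)
        s[i] = fmax * std::exp(-d[i] / lambda);  // assumed parameterization
}

// Two-pass driver: the caller has already gathered neighbor distances into
// a contiguous vector (pass 1); here we batch-transform them (pass 2).
std::vector<float> strengths_two_pass(const std::vector<float> &distances,
                                      float fmax, float lambda)
{
    std::vector<float> s(distances.size());
    exp_kernel_scalar(distances.data(), s.data(), distances.size(),
                      fmax, lambda);
    return s;
}
```

The point of splitting the work this way is that pass 2 operates on a dense float array with no per-neighbor branching, which is exactly the shape SIMD (and SLEEF's vector `exp`) wants.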
Add SLEEF pow() function support (Sleef_powf8_u10avx2 for AVX2,
Sleef_powf4_u10advsimd for NEON) and implement tdist_kernel_float32()
to vectorize the Student's T distribution kernel calculation.

The kernel computes: strength = fmax / pow(1 + (d/tau)^2 / nu, (nu+1)/2)

Benchmarks show 62% speedup for Student's T kernel (130s -> 49s),
contributing to 38% overall speedup for spatial interaction benchmarks.
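A scalar reference for the Student's T formula quoted above, useful for checking the vectorized version. The function name is hypothetical; the shipped `tdist_kernel_float32()` evaluates the same expression per SIMD lane, with `pow()` replaced by `Sleef_powf8_u10avx2` / `Sleef_powf4_u10advsimd`.

```cpp
#include <cassert>
#include <cmath>

// strength = fmax / pow(1 + (d/tau)^2 / nu, (nu+1)/2)
float tdist_kernel_scalar(float d, float fmax, float tau, float nu)
{
    float r = d / tau;
    return fmax / std::pow(1.0f + (r * r) / nu, (nu + 1.0f) / 2.0f);
}
```

Note that at d = 0 the denominator is pow(1, …) = 1, so the kernel returns fmax, as expected for an interaction strength at zero distance.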
Implement cauchy_kernel_float32() using AVX2/NEON intrinsics for the
Cauchy kernel calculation: strength = fmax / (1 + (d/lambda)^2)

Unlike exp/normal/tdist kernels, Cauchy uses only basic arithmetic
operations (multiply, divide, add) so no SLEEF functions are needed.

Benchmarks show 18% speedup vs original (40s -> 33s).
Implement linear_kernel_float32() using AVX2/NEON intrinsics for the
Linear kernel calculation: strength = fmax * (1 - d / max_distance)

Rewritten as: strength = fmax - d * (fmax / max_distance)

Simple arithmetic (multiply + subtract) so gains are modest (~2%),
but provides consistency with other SIMD-optimized kernels.
Remove sleef_benchmark_spatial_interaction.slim (only tested Gaussian/Exponential)
and add benchmark_all_kernels.slim which tests all 6 kernel types:
Fixed, Linear, Exponential, Normal, Cauchy, and Student's T.

Add documentation for benchmark_all_kernels.slim script including:
- Entry in Contents section describing the 6 kernel types tested
- Performance results table showing SIMD speedups on AVX2
- Usage instructions for running with SIMD vs scalar builds
- Notes on adjusting neighbor density via W parameter

🤖 Generated with [Claude Code](https://claude.com/claude-code)
@bhaller
Contributor

bhaller commented Dec 17, 2025

This looks good to me. Awesome performance improvements. You are knocking it out of the park!

@bhaller bhaller merged commit 144b04a into MesserLab:master Dec 17, 2025
17 checks passed