
Conversation

@Anri-Lombard (Contributor) commented Jan 24, 2026

Proposed changes

Fixes #3047 - These functions were using float32 polynomial approximations for all inputs, causing precision loss for float64. Added proper double precision paths using element-wise std:: calls for exp/sin/cos/erf, and Julia's SpecialFunctions.jl rational approximation for erfinv.
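For context, an illustrative standalone sketch of the dispatch pattern described above; `Simd` and `fast_exp_f32` here are stand-ins for illustration, not the PR's actual code. Double inputs take an element-wise std::exp path, while other inputs keep the existing float32 approximation.

```cpp
#include <cmath>
#include <type_traits>

// Stand-in vector type and float path, for illustration only.
template <typename T, int N>
struct Simd {
  T v[N];
};

template <int N>
Simd<float, N> fast_exp_f32(Simd<float, N> in) {
  // placeholder for the existing float32 polynomial approximation
  for (int i = 0; i < N; ++i) in.v[i] = std::exp(in.v[i]);
  return in;
}

template <typename T, int N>
Simd<T, N> exp(Simd<T, N> in) {
  if constexpr (std::is_same_v<T, double>) {
    // double inputs: element-wise std::exp keeps full double precision
    for (int i = 0; i < N; ++i) in.v[i] = std::exp(in.v[i]);
    return in;
  } else {
    return fast_exp_f32<N>(in);  // other inputs keep the fast approximate path
  }
}
```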

Questions for @awni:

  1. For erfinv, I used coefficients from Julia's SpecialFunctions.jl (Blair et al. 1976, 3-region rational approximation, ~1e-13 relative error). Is this acceptable?
  2. I added float64 variants and a --cpu flag to the benchmark. Is it useful to keep these changes, or should I remove them?

Checklist

  • I have read the CONTRIBUTING document
  • I have run pre-commit run --all-files to format my code / installed pre-commit prior to committing changes
  • I have added tests that prove my fix is effective or that my feature works
  • I have updated the necessary documentation (if needed)

Comment on lines +30 to +32
```cpp
if constexpr (N == 1) {
  return Simd<T, 1>{std::exp(in.value)};
} else {
```
@awni (Member)
Is the check for N=1 here necessary?

@Anri-Lombard (Contributor, Author)

Right now Simd<T, 1> (in base_simd.h) doesn't have operator[]; only the general Simd<T, N> in accelerate_simd.h does. So the N == 1 check is needed to access .value instead of [i]. Adding operator[] there would let us drop this check, which is probably cleaner 🙏
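An illustrative standalone sketch of that idea; these are stand-ins, not the actual definitions in base_simd.h / accelerate_simd.h.

```cpp
// Illustrative stand-ins only, not MLX's actual Simd definitions.
template <typename T, int N>
struct Simd {
  T values[N];
  T operator[](int i) const { return values[i]; }
  T& operator[](int i) { return values[i]; }
};

template <typename T>
struct Simd<T, 1> {
  T value;
  // With this, generic loops can write s[i] for any N, including N == 1,
  // so a separate `if constexpr (N == 1)` branch becomes unnecessary.
  T operator[](int) const { return value; }
  T& operator[](int) { return value; }
};
```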

@awni (Member) commented Jan 24, 2026

The purpose of the issue (#3047), perhaps implicitly, was to add actual vectorized implementations rather than unrolled element-wise ones. Are you interested in making that update? Basically we'd need to find good double-precision approximations for exp, sin/cos, and erf and implement them with SIMD functions, and similarly erfinv should be translated into a SIMD version.
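To sketch the kind of structure that could take, here is a rough, illustrative lane-wise exp (not from this PR) using range reduction plus a short Taylor polynomial. Its accuracy is far below what a shipped kernel needs, and a real version would replace the scalar loop with Simd<double, N> operations or intrinsics and a properly fitted polynomial.

```cpp
#include <array>
#include <cmath>

// Illustrative sketch: lane-parallel double-precision exp via range
// reduction exp(x) = 2^k * exp(r) with |r| <= ln(2)/2, and a short Taylor
// polynomial for exp(r). Each scalar step would map onto one vector op.
template <int N>
std::array<double, N> exp_lanes(const std::array<double, N>& x) {
  constexpr double ln2 = 0.6931471805599453;
  constexpr double inv_ln2 = 1.4426950408889634;
  std::array<double, N> out;
  for (int i = 0; i < N; ++i) {  // a real kernel processes all lanes at once
    double k = std::nearbyint(x[i] * inv_ln2);  // nearest integer multiple of ln2
    double r = x[i] - k * ln2;                  // reduced argument, |r| <= ~0.347
    // exp(r) ~ 1 + r + r^2/2 + r^3/6 + r^4/24 + r^5/120, in Horner form.
    double p = 1.0 +
        r * (1.0 + r * (0.5 + r * (1.0 / 6 + r * (1.0 / 24 + r * (1.0 / 120)))));
    out[i] = std::ldexp(p, static_cast<int>(k));  // multiply by 2^k
  }
  return out;
}
```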

@Anri-Lombard (Contributor, Author)
Hey @awni, my original approach used Cephes polynomials, but when I ran the benchmark it was slower than the element-wise version for everything except exp (where it was only ~4% faster); for sin and cos it was about 50% slower. I'm happy to change the implementation if a vectorized one is preferred 👍 Perhaps there are reasons I'm not thinking of as well?

@awni (Member) commented Jan 25, 2026

> but I ran the benchmark and it was slower for everything except exp (where it was only ~4% faster). For sin and cos it was about 50% slower.

That's interesting and unexpected. What machine were you running benchmarks on?

@Anri-Lombard (Contributor, Author)
I'm using an M1 MacBook Pro 🙏



Development

Successfully merging this pull request may close these issues.

[BUG] [CPU] Some ops fallback to fp32 implementation even for fp64

2 participants