Description
I've been benchmarking RustFFT with criterion to compare it against my own FFT implementation. To my surprise, for small to medium f32 inputs (up to 2097152 elements, i.e. 2^21) RustFFT is faster when called through .process() than through .process_with_scratch(). The effect does not hold for f64, which is always faster with .process_with_scratch().
I can reproduce this on a Zen 4 CPU on Linux, but not on an Apple M4 on macOS, where .process_with_scratch() is neutral or beneficial even for f32.
The exact code used for the measurements can be found in QuState/PhastFT#81.
The command to run benchmarks is cargo bench --bench=bench RustFFT; you can run it before and after the PR linked above to reproduce the measurements.
I find this very surprising considering that the implementation of .process() is just this:
Lines 195 to 198 in 4758ab0
    fn process(&self, buffer: &mut [Complex<T>]) {
        let mut scratch = vec![Complex::zero(); self.get_inplace_scratch_len()];
        self.process_with_scratch(buffer, &mut scratch);
    }
I'm not sure what could be done about it, so feel free to close this. But it's an interesting enough and dramatic enough anomaly that I figured I should let you know.
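For anyone who wants to poke at the two call patterns without the full benchmark suite, here is a minimal std-only sketch of what is being compared: a fresh zeroed scratch buffer per call versus one scratch buffer reused across calls. ToyPlan and run_both are hypothetical stand-ins I made up for illustration. They are not RustFFT types, and the "transform" is just a buffer reversal, so this only mirrors the shape of the quoted process() wrapper and is not expected to reproduce the anomaly by itself. The Instant-based timing is also much cruder than criterion.

```rust
use std::time::Instant;

/// Hypothetical stand-in for an FFT plan; NOT RustFFT's real type.
struct ToyPlan {
    len: usize,
}

impl ToyPlan {
    /// Scratch length the toy transform needs (here: same as the buffer).
    fn get_inplace_scratch_len(&self) -> usize {
        self.len
    }

    /// Placeholder work: reverse the buffer, using the scratch as a copy.
    fn process_with_scratch(&self, buffer: &mut [f32], scratch: &mut [f32]) {
        let n = buffer.len();
        scratch[..n].copy_from_slice(buffer);
        for (i, x) in buffer.iter_mut().enumerate() {
            *x = scratch[n - 1 - i];
        }
    }

    /// Same shape as the quoted wrapper: allocate zeroed scratch on every call.
    fn process(&self, buffer: &mut [f32]) {
        let mut scratch = vec![0.0f32; self.get_inplace_scratch_len()];
        self.process_with_scratch(buffer, &mut scratch);
    }
}

/// Runs both call patterns `iters` times, prints rough timings,
/// and returns the two final buffers so they can be compared.
fn run_both(len: usize, iters: usize) -> (Vec<f32>, Vec<f32>) {
    let plan = ToyPlan { len };
    let input: Vec<f32> = (0..len).map(|i| i as f32).collect();

    // Pattern 1: fresh scratch allocation inside every call.
    let mut a = input.clone();
    let start = Instant::now();
    for _ in 0..iters {
        plan.process(&mut a);
    }
    let fresh = start.elapsed();

    // Pattern 2: one scratch buffer allocated up front and reused.
    let mut b = input.clone();
    let mut scratch = vec![0.0f32; plan.get_inplace_scratch_len()];
    let start = Instant::now();
    for _ in 0..iters {
        plan.process_with_scratch(&mut b, &mut scratch);
    }
    let reused = start.elapsed();

    println!(
        "len={}: fresh scratch {:?}, reused scratch {:?}",
        len, fresh, reused
    );
    (a, b)
}
```

Calling run_both(1 << 12, 1000) from a main function prints one timing line for that size. Swapping the toy body for a real plan's process() and process_with_scratch() calls should give a quick sanity check before reaching for the criterion benchmarks.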