Conversation
|
Oh, I was thinking much simpler. You can add a method to the trait with a default implementation, like here: Line 196 in 801d6a4 So I would add one that splits the scratch, copies the input, and calls the existing method (the one with mutable input). That way you just need to add it in one place. |
|
Well from a performance perspective, do you think it would be worth it to pass through the immutability to some of the process functions? There are a couple algorithms that actually don't modify the input, so we could eliminate some copies. Here is an example where the input isn't modified at all: RustFFT/src/algorithm/butterflies.rs Lines 20 to 51 in dae4c82 |
|
It's not obvious what way gives the best performance overall. The simple algorithms like the butterflies would benefit from a separate immutable method. But once you build more complicated FFTs that use inner FFTs, then you may end up with a mix of algorithms where some copy, and some don't. Here I would think that doing a single copy before starting the processing would be faster. |
|
Are you thinking of just something like this? fn process_outofplace_with_scratch_naive(
&self,
input: &[Complex<T>],
output: &mut [Complex<T>],
scratch: &mut [Complex<T>],
) {
let (mut input_scratch, scratch) = scratch.split_at_mut(input.len());
input_scratch.copy_from_slice(input);
self.process_outofplace_with_scratch(&mut input_scratch, output, scratch);
}
If you're looking at just the execution time of the FFT, I agree (but I suppose the compiler might be able to make some optimizations from the immutability). Although if you need the input after the FFT, I think there would be a big benefit to pass immutability through. For example: fn process_signal(
fft: Arc<dyn Fft>,
sig: &mut [Complex<f32>],
output: &mut [Complex<f32>],
scratch: &mut [Complex<f32>],
sig_clone: &mut [Complex<f32>],
) -> AudioOutput {
// Copy sig into sig_clone for use later
sig_clone.copy_from_slice(sig);
// Process FFT
fft.process_outofplace_with_scratch(sig, output, scratch);
// Look at FFT'd output to find locations in pre-FFT data to process
let found_audio = find_audio(output);
// Now that we know the locations to process, go back to pre-FFT data
process_audio(found_audio, sig_clone, output)
}Either way, I can run some benchmarks and get back to you. |
|
I ran benchmarks on sizes that I have already implemented: I had the following benchmark setup: const LENGTHS: [usize; 4] = [9, 16, 64, 128];
fn bench_naive(c: &mut Criterion) {
for len in &LENGTHS {
let mut planner = rustfft::FftPlanner::new();
let fft: Arc<dyn Fft<f32>> = planner.plan_fft_forward(*len);
assert_eq!(fft.len(), *len);
let input = vec![Complex::zero(); *len];
let mut output = vec![Complex::zero(); *len];
let mut scratch = vec![Complex::zero(); fft.get_inplace_scratch_len() + len];
c.bench_function(&format!("naive-{}", len), |b| b.iter(|| {
fft.process_outofplace_with_scratch_naive(&input, &mut output, &mut scratch);
}));
}
}
fn bench_immut(c: &mut Criterion) {
for len in &LENGTHS {
let mut planner = rustfft::FftPlanner::new();
let fft: Arc<dyn Fft<f32>> = planner.plan_fft_forward(*len);
assert_eq!(fft.len(), *len);
let input = vec![Complex::zero(); *len];
let mut output = vec![Complex::zero(); *len];
let mut scratch = vec![Complex::zero(); fft.get_inplace_scratch_len() + len];
c.bench_function(&format!("immut-{}", len), |b| b.iter(|| {
fft.process_outofplace_with_scratch_immut(&input, &mut output, &mut scratch);
}));
}
}These were my results: I also compared this against the current mutable input |
|
I also finished the implementation of mixed radix, and tested with It appears to me the immutable methods for the butterflies clearly beat out the "naive" default trait implementation. For mixed radix it does not provide the speed boost I was hoping for, although it does still edge out the naive default trait implementation. I'm not sure what you'd like to do with all this info. Perhaps the best path forward is to provide the default trait implementation, and override it for algorithms where the immutability would clearly provide a benefit (butterfly for example)? You're the maintainer, so whatever you think is best. You would also have more of an intuition to what algorithms would benefit from overriding the default trait implementation. |
Yes that's exactly what I had in mind!
That sounds like a good way forward. I suggested the benches to get an idea of how much performance the naive immutable approach costs, compared to doing it "properly" with much larger changes. I took a quick look at Radix-4 (arguably the most important algorithm since it handles all the very common power of twos too large for butterflies) and it doesn't look like it needs mutable input, so it would be a good candidate to add an override to.
No that is @ejmahler. I have spent quite some time working on this library and try to help out with issues etc whenever I can, but I'm just a contributor. |
My mistake; I noticed you were very active in this repo so thought maybe you were a maintainer as well. Regardless, I'll go down that path and look at Radix-4, unless @ejmahler prefers a different direction. |
|
One problem with the default impl approach is that the fft algorithms have to request scratch from the caller, and the default impl doesn't add that requested scratch. We could add another method to the fft trait that returns the required immutable out of place scratch, but I find it pretty difficult keeping even two in my head when writing new fft algorithms, so I have concerns about adding a third. I see a few options, each with their own tradeoffs. I don't really have a preferred option as I start writing this, I just want to get my thoughts out:
Now that I've written all of that out, I think if I could rank the approaches in order preference, I'd sort them #4, #5, #1, #2, #3. It's tempting to say I just prefer the status quo, but we've talked about radix4 and how it's A: the most common path, and B: already doesn't need to overwrite the input except in rare cases, so if we take approach #4 or #5 then we're making the library strictly better for the most common user, which I find to be a pretty compelling argument. So how does the following approach sound?
The last question is what the default implementation of fn process_immutable_with_scratch(
&self,
input: &[Complex<T>],
output: &mut [Complex<T>],
scratch: &mut [Complex<T>],
) {
output.copy_from_slice(input);
self.process_with_scratch(output, scratch);
}
fn get_immutable_scratch_len(&self) -> usize {
self.get_inplace_scratch_len()
}fn process_immutable_with_scratch(
&self,
input: &[Complex<T>],
output: &mut [Complex<T>],
scratch: &mut [Complex<T>],
) {
let (mut input_scratch, scratch) = scratch.split_at_mut(input.len());
input_scratch.copy_from_slice(input);
self.process_outofplace_with_scratch(&mut input_scratch, output, scratch);
}
fn get_immutable_scratch_len(&self) -> usize {
self.get_outofplace_scratch_len() + self.len()
} |
|
Actually, feel free to skip it for RadixNfor now, because i don't even know if it's gonna make it into the next release. I wish I had put it on a branch, in retrospect |
|
I think that suggested approach makes sense. Regarding the default impl, I prefer the default impl of an in place as well. The scratch length would then match what is already existing. |
|
Before I make too much progress, I just wanted to put this here to get any feedback before I get too far down this road to make sure you had the same idea in mind: master...michaelciraci:RustFFT:f632083c63b9aee50319946fa1c3b639298ccfd2 |
|
Yeah so far it looks good. I was imagining that the butterfly boilerplate could be a lot shorter, but since the data validation is woven in and the data validation mentions specific method names, that might be harder than i thought. |
|
Implementation in PR here: #157 |
I took at stab at implementing an immutable function call like this: #150
I didn't implement sse/neon/wasm yet, as I wanted to get feedback before I add those architectures as well.