Make input immutable by michaelciraci · Pull Request #151 · ejmahler/RustFFT

michaelciraci · 2025-03-09T04:42:48Z

I took at stab at implementing an immutable function call like this: #150

I didn't implement sse/neon/wasm yet, as I wanted to get feedback before I add those architectures as well.

HEnquist · 2025-03-09T06:14:41Z

Oh, I was thinking much simpler. You can add a method to the trait with a default implementation, like here:

Line 196 in 801d6a4

fn process(&self, buffer: &mut [Complex<T>]) {

So I would add one that splits the scratch, copies the input, and calls the existing method (the one with mutable input). That way you just need to add it in one place.

michaelciraci · 2025-03-09T06:35:54Z

Well from a performance perspective, do you think it would be worth it to pass through the immutability to some of the process functions? There are a couple algorithms that actually don't modify the input, so we could eliminate some copies.

Here is an example where the input isn't modified at all:

RustFFT/src/algorithm/butterflies.rs

Lines 20 to 51 in dae4c82

    
           fn process_outofplace_with_scratch_immut( 
        
               &self, 
        
               input: &[Complex<T>], 
        
               output: &mut [Complex<T>], 
        
               _scratch: &mut [Complex<T>], 
        
           ) { 
        
               if input.len() < self.len() || output.len() != input.len() { 
        
                   // We want to trigger a panic, but we want to avoid doing it in this function to reduce code size, so call a function marked cold and inline(never) that will do it for us 
        
                   fft_error_outofplace(self.len(), input.len(), output.len(), 0, 0); 
        
                   return; // Unreachable, because fft_error_outofplace asserts, but it helps codegen to put it here 
        
               } 
        
               let result = array_utils::iter_chunks_zipped( 
        
                   input, 
        
                   output, 
        
                   self.len(), 
        
                   |in_chunk, out_chunk| { 
        
                       unsafe { 
        
                           self.perform_fft_butterfly(DoubleBuf { 
        
                               input: in_chunk, 
        
                               output: out_chunk, 
        
                           }) 
        
                       }; 
        
                   }, 
        
               ); 
        
               if result.is_err() { 
        
                   // We want to trigger a panic, because the buffer sizes weren't cleanly divisible by the FFT size, 
        
                   // but we want to avoid doing it in this function to reduce code size, so call a function marked cold and inline(never) that will do it for us 
        
                   fft_error_outofplace(self.len(), input.len(), output.len(), 0, 0); 
        
               } 
        
           }

HEnquist · 2025-03-09T09:23:42Z

It's not obvious what way gives the best performance overall. The simple algorithms like the butterflies would benefit from a separate immutable method. But once you build more complicated FFTs that use inner FFTs, then you may end up with a mix of algorithms where some copy, and some don't. Here I would think that doing a single copy before starting the processing would be faster.
How about measuring to see if it's worth the effort? The best the immutable process method can do is to be just as fast as the mutable one. So if we just implement the simple default implementation and add some benches to compare mutable and immutable, that would give a good idea of how much could be gained.

michaelciraci · 2025-03-09T18:36:42Z

Are you thinking of just something like this?

fn process_outofplace_with_scratch_naive(
        &self,
        input: &[Complex<T>],
        output: &mut [Complex<T>],
        scratch: &mut [Complex<T>],
    ) {
        let (mut input_scratch, scratch) = scratch.split_at_mut(input.len());
        input_scratch.copy_from_slice(input);
        self.process_outofplace_with_scratch(&mut input_scratch, output, scratch);
    }

The best the immutable process method can do is to be just as fast as the mutable one.

If you're looking at just the execution time of the FFT, I agree (but I suppose the compiler might be able to make some optimizations from the immutability). Although if you need the input after the FFT, I think there would be a big benefit to pass immutability through. For example:

fn process_signal(
        fft: Arc<dyn Fft>,
        sig: &mut [Complex<f32>],
        output: &mut [Complex<f32>],
        scratch: &mut [Complex<f32>],
        sig_clone: &mut [Complex<f32>],
        ) -> AudioOutput {
    // Copy sig into sig_clone for use later
    sig_clone.copy_from_slice(sig);
    // Process FFT
    fft.process_outofplace_with_scratch(sig, output, scratch);
    
    // Look at FFT'd output to find locations in pre-FFT data to process
    let found_audio = find_audio(output);

    // Now that we know the locations to process, go back to pre-FFT data
    process_audio(found_audio, sig_clone, output)
}

Either way, I can run some benchmarks and get back to you.

michaelciraci · 2025-03-09T18:55:36Z

I ran benchmarks on sizes that I have already implemented: 9, 16, 64, 128

I had the following benchmark setup:

const LENGTHS: [usize; 4] = [9, 16, 64, 128];

fn bench_naive(c: &mut Criterion) {
    for len in &LENGTHS {
        let mut planner = rustfft::FftPlanner::new();
        let fft: Arc<dyn Fft<f32>> = planner.plan_fft_forward(*len);
        assert_eq!(fft.len(), *len);


        let input = vec![Complex::zero(); *len];
        let mut output = vec![Complex::zero(); *len];
        let mut scratch = vec![Complex::zero(); fft.get_inplace_scratch_len() + len];

        c.bench_function(&format!("naive-{}", len), |b| b.iter(|| {
            fft.process_outofplace_with_scratch_naive(&input, &mut output, &mut scratch);
        }));
    }
}

fn bench_immut(c: &mut Criterion) {
    for len in &LENGTHS {
        let mut planner = rustfft::FftPlanner::new();
        let fft: Arc<dyn Fft<f32>> = planner.plan_fft_forward(*len);
        assert_eq!(fft.len(), *len);


        let input = vec![Complex::zero(); *len];
        let mut output = vec![Complex::zero(); *len];
        let mut scratch = vec![Complex::zero(); fft.get_inplace_scratch_len() + len];

        c.bench_function(&format!("immut-{}", len), |b| b.iter(|| {
            fft.process_outofplace_with_scratch_immut(&input, &mut output, &mut scratch);
        }));
    }
}

These were my results:

naive-9                 time:   [11.833 ns 11.840 ns 11.850 ns]
immut-9                 time:   [9.2378 ns 9.2417 ns 9.2460 ns]

naive-16                time:   [12.321 ns 12.325 ns 12.331 ns]
immut-16                time:   [9.0788 ns 9.0825 ns 9.0867 ns]

naive-64                time:   [75.728 ns 75.768 ns 75.810 ns]
immut-64                time:   [69.823 ns 69.850 ns 69.879 ns]

naive-128               time:   [153.31 ns 153.37 ns 153.44 ns]
immut-128               time:   [142.04 ns 142.10 ns 142.17 ns]

I also compared this against the current mutable input process_outofplace_with_scratch, and you were right that there was no difference compared to the immutable function. The main benefit to immutability is if you need to reuse the input after the FFT is done.

michaelciraci · 2025-03-09T19:37:49Z

I also finished the implementation of mixed radix, and tested with 100, 125, 225:

naive-100               time:   [259.31 ns 259.60 ns 259.89 ns]
immut-100               time:   [251.03 ns 251.41 ns 251.81 ns]
mut-100                 time:   [250.04 ns 250.30 ns 250.60 ns]

naive-125               time:   [636.50 ns 636.91 ns 637.32 ns]
immut-125               time:   [625.15 ns 625.92 ns 626.81 ns]
mut-125                 time:   [622.76 ns 623.39 ns 624.38 ns]

naive-225               time:   [574.95 ns 575.23 ns 575.64 ns]
immut-225               time:   [553.02 ns 553.30 ns 553.68 ns]
mut-225                 time:   [554.14 ns 554.63 ns 555.32 ns]

It appears to me the immutable methods for the butterflies clearly beat out the "naive" default trait implementation. For mixed radix it does not provide the speed boost I was hoping for, although it does still edge out the naive default trait implementation.

I'm not sure what you'd like to do with all this info.

Perhaps the best path forward is to provide the default trait implementation, and override it for algorithms where the immutability would clearly provide a benefit (butterfly for example)? You're the maintainer, so whatever you think is best. You would also have more of an intuition to what algorithms would benefit from overriding the default trait implementation.

HEnquist · 2025-03-09T20:25:38Z

Are you thinking of just something like this?

Yes that's exactly what I had in mind!

Perhaps the best path forward is to provide the default trait implementation, and override it for algorithms where the immutability would clearly provide a benefit (butterfly for example)?

That sounds like a good way forward. I suggested the benches to get an idea of how much performance the naive immutable approach costs, compared to doing it "properly" with much larger changes.

I took a quick look at Radix-4 (arguably the most important algorithm since it handles all the very common power of twos too large for butterflies) and it doesn't look like it needs mutable input, so it would be a good candidate to add an override to.

You're the maintainer, so whatever you think is best.

No that is @ejmahler. I have spent quite some time working on this library and try to help out with issues etc whenever I can, but I'm just a contributor.

michaelciraci · 2025-03-09T20:55:20Z

No that is @ejmahler. I have spent quite some time working on this library and try to help out with issues etc whenever I can, but I'm just a contributor.

My mistake; I noticed you were very active in this repo so thought maybe you were a maintainer as well. Regardless, I'll go down that path and look at Radix-4, unless @ejmahler prefers a different direction.

ejmahler · 2025-03-10T00:22:01Z

One problem with the default impl approach is that the fft algorithms have to request scratch from the caller, and the default impl doesn't add that requested scratch. We could add another method to the fft trait that returns the required immutable out of place scratch, but I find it pretty difficult keeping even two in my head when writing new fft algorithms, so I have concerns about adding a third.

I see a few options, each with their own tradeoffs. I don't really have a preferred option as I start writing this, I just want to get my thoughts out:

Status quo. In the current approach, it's less convenient and less performant to compute a FFT where the input buffer isn't mutable.
Just make the existing out of place api take an immutable input, and adjust all existing algorithms to require extra scratch. The nice thing here is that some algorithms (including the most common usage of radix4 and most butterflies) wouldn't need to change at all, so the happy path would avoid having extra bugs introduced and would avoid a performance or memory requirement hit.
Process_inplace for some things like MixedRadix would have to request extra scratch, and actually use the out of place transform internally - and instead of bouncing data back and forth between the input and output, they'd have to bounce data back and forth between the output and scratch, and the overall effect would be much worse cache locality, and exploiting cache locality is one of the main reasons those algorithms exist, so this would hurt the non happy path of powers of two.
The approach currently implemented in the PR of adding an entirely new API endpoint. This would certainly do the best at preserving the performance of existing approaches, and would let each algorithm do the best possible job of handling the data provided to it. The downside is that we need an entirely new FFT implementation for a very significant number of algorithms, increasing the maintenance load.
It's not as bad as I originally thought, because AVX mixed radix is done mostly with macros, almost all butterflies don't need to change at all except for the boilerplate macros, our bluestein's out of place mutable implementations can just forward to the mutable versions because they don't modify the input. But it's still a significant amount of new code.
The default impl approach which requests scratch from the caller and copies, and then we can override it for radix4 and butterflies and other things that don't need the input. This is a great compromise as far as project complexity goes, but has the downside of the extra added complexity needed for the scratch method. Perhaps we could also add a default implementation of that, which just calls get_outofplace_scratch_len() and adds self.len()? And anything that overrides the default impl also overrides that method.
A default impl approach that just allocates a scratch vec internally. This would nicely mirror our process() default impl that allocates scratch internally. And then Radix4 and butterflies etc can override it. On the most common path of radix4, this wouldn't change perf at all, and on less common paths, it's essentially a convenience method for what you need to do anyways with the status quo, and we can provide the same guidance that if you're calling it multiple times and want to avoid the allocation, call get_outofplace_scratch_len instead. Although that might steer people doing power of twos towards it that don't benefit at all from that approach.

Now that I've written all of that out, I think if I could rank the approaches in order preference, I'd sort them #4, #5, #1, #2, #3.

It's tempting to say I just prefer the status quo, but we've talked about radix4 and how it's A: the most common path, and B: already doesn't need to overwrite the input except in rare cases, so if we take approach #4 or #5 then we're making the library strictly better for the most common user, which I find to be a pretty compelling argument.

So how does the following approach sound?

Give the method a default impl that copies data around and then calls an existing method. I think I'd like to call it process_immutable_with_scratch because our existing method names are already too long without adding another word.
Add another method - get_immutable_scratch_len, also with a default impl
Override both methods for the algorithms that need it. Theoretically we could override it for every single algorithm to have the optimal implementation, but I think we'll get the most bang for our buck by doing it for the following:
- Scalar, Sse, Neon, Wasm Radix4 - these would need subtly different versions for immutable vs mutable out of place, because of the way it uses the input as scratch for its inner fft
- Radix3, RadixN - same as above
- Scalar, Sse, Neon, Wasm, Avx Butterflies - these just need the mutable version to forward to the immutable version after data validation
- Scalar, Avx Bluestein's- (The mutable version should also just forward to the immutable version after doing data validation)

The last question is what the default implementation of process_immutable_with_scratch should be. Here are the two choices i see, and of the two, i prefer the inplace version, but let me know what you two think:

fn process_immutable_with_scratch(
        &self,
        input: &[Complex<T>],
        output: &mut [Complex<T>],
        scratch: &mut [Complex<T>],
    ) {
        output.copy_from_slice(input);
        self.process_with_scratch(output, scratch);
    }
fn get_immutable_scratch_len(&self) -> usize {
    self.get_inplace_scratch_len()
}

fn process_immutable_with_scratch(
        &self,
        input: &[Complex<T>],
        output: &mut [Complex<T>],
        scratch: &mut [Complex<T>],
    ) {
        let (mut input_scratch, scratch) = scratch.split_at_mut(input.len());
        input_scratch.copy_from_slice(input);
        self.process_outofplace_with_scratch(&mut input_scratch, output, scratch);
    }
fn get_immutable_scratch_len(&self) -> usize {
    self.get_outofplace_scratch_len() + self.len()
}

ejmahler · 2025-03-10T00:30:52Z

Actually, feel free to skip it for RadixNfor now, because i don't even know if it's gonna make it into the next release. I wish I had put it on a branch, in retrospect

michaelciraci · 2025-03-10T01:42:23Z

I think that suggested approach makes sense.

Regarding the default impl, I prefer the default impl of an in place as well. The scratch length would then match what is already existing.

michaelciraci · 2025-03-10T02:36:52Z

Before I make too much progress, I just wanted to put this here to get any feedback before I get too far down this road to make sure you had the same idea in mind: master...michaelciraci:RustFFT:f632083c63b9aee50319946fa1c3b639298ccfd2

ejmahler · 2025-03-10T05:32:46Z

Yeah so far it looks good. I was imagining that the butterfly boilerplate could be a lot shorter, but since the data validation is woven in and the data validation mentions specific method names, that might be harder than i thought.

michaelciraci · 2025-04-28T01:57:43Z

Implementation in PR here: #157

michaelciraci added 2 commits March 8, 2025 22:21

Changing array_utils

48b1873

Adding immutable api

dae4c82

michaelciraci mentioned this pull request Mar 9, 2025

Don't require process_outofplace_with_scratch input to be mutable #150

Closed

Adding neon

346a0d9

michaelciraci added 2 commits March 9, 2025 14:44

Adding mixed radix implementation

b37d080

Adding implementation for bluesteins and good thomas

2f346b4

michaelciraci added 5 commits March 9, 2025 15:56

Removing overridden impl

7eb2653

Adding more AVX instructions

edad1ca

Working avx

f2a9ca6

Removing blanket impl

0b0f66f

Correcting 100 test len

7a3b6c9

michaelciraci closed this Apr 28, 2025

michaelciraci mentioned this pull request Apr 28, 2025

Adding process_immutable_with_scratch method #157

Merged

michaelciraci mentioned this pull request Apr 28, 2025

Correct Uneven Buffer Lengths #158

Closed

Conversation

michaelciraci commented Mar 9, 2025

Uh oh!

HEnquist commented Mar 9, 2025

Uh oh!

michaelciraci commented Mar 9, 2025

Uh oh!

HEnquist commented Mar 9, 2025

Uh oh!

michaelciraci commented Mar 9, 2025

Uh oh!

michaelciraci commented Mar 9, 2025

Uh oh!

michaelciraci commented Mar 9, 2025

Uh oh!

HEnquist commented Mar 9, 2025

Uh oh!

michaelciraci commented Mar 9, 2025

Uh oh!

ejmahler commented Mar 10, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ejmahler commented Mar 10, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

michaelciraci commented Mar 10, 2025

Uh oh!

michaelciraci commented Mar 10, 2025

Uh oh!

ejmahler commented Mar 10, 2025

Uh oh!

michaelciraci commented Apr 28, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

ejmahler commented Mar 10, 2025 •

edited

Loading

ejmahler commented Mar 10, 2025 •

edited

Loading