Conversation

@xrvdg xrvdg (Collaborator) commented Jun 16, 2025

No description provided.

@xrvdg xrvdg marked this pull request as draft June 16, 2025 15:39
);

let intermediate_multiplication =
LockFreeArray::new(vec![(0, FieldElement::zero()); rhs.matrix.num_entries()]);
Contributor

It's unfortunate that it requires allocating a temporary vector larger than the output.

// wouldn't know the row a value belongs to. That's why the rows drive the
// iterator below.
// - Acquiring a mutex per column in the result was too expensive (even with
// parking_lot)
Contributor

What I had in mind was splitting the work over the mutable output vector and creating an iter_col method on the matrix. (Which is substantially more complicated than iter_row, but should be doable.)
Can we explore this option?
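For what it's worth, a column iterator over a CSR-style layout might look roughly like this. This is a minimal sketch with hypothetical field names (row_ptr, col_idx, values); the PR's actual matrix type and FieldElement are stood in for by plain u64.

```rust
/// Minimal CSR sketch (hypothetical layout; the PR's matrix type may differ).
struct Csr {
    row_ptr: Vec<usize>, // row i spans entries row_ptr[i]..row_ptr[i + 1]
    col_idx: Vec<usize>, // sorted column index of each entry within a row
    values: Vec<u64>,    // entry values (u64 stands in for FieldElement)
}

impl Csr {
    /// Yields (row, value) for every entry in column `col`.
    /// O(rows * log(nnz_per_row)): binary-search each row's sorted columns.
    fn iter_col(&self, col: usize) -> impl Iterator<Item = (usize, u64)> + '_ {
        (0..self.row_ptr.len() - 1).filter_map(move |row| {
            let range = self.row_ptr[row]..self.row_ptr[row + 1];
            self.col_idx[range.clone()]
                .binary_search(&col)
                .ok()
                .map(|off| (row, self.values[range.start + off]))
        })
    }
}

fn main() {
    // 2x3 matrix: row 0 has entries at columns 0 and 2; row 1 at column 2.
    let m = Csr {
        row_ptr: vec![0, 2, 3],
        col_idx: vec![0, 2, 2],
        values: vec![1, 3, 5],
    };
    let col2: Vec<(usize, u64)> = m.iter_col(2).collect();
    assert_eq!(col2, vec![(0, 3), (1, 5)]);
}
```

With such an iterator, each worker owns a disjoint slice of the output and pulls the columns it needs, so no locks or temporary arrays are required.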

@recmo recmo (Contributor) left a comment

LGTM after comments addressed.

result[j] += value * self[i];
}

let chunk_size = result.len().div_ceil(num_threads);
Contributor

If the perf delta is small (e.g. <5%), I think I prefer a solution that lets rayon decide the chunk size. I don't like hard dependencies on the number of available threads: the thread count does not say much, as we might be doing other work in parallel.

Instead we should pick the workload size such that it amortizes the overhead while still allowing parallelization for large problem sizes. To approximate this I like to pick "whatever subproblem fits in L1 cache" as the problem size, which in turn is approximated by the workload_size::<F>(result.len()) function.
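As a rough illustration of that heuristic (a sketch: the 32 KiB L1d size is an assumption, and the repo's actual workload_size may be defined differently):

```rust
/// Assumed L1 data cache size; real hardware varies (often 32-64 KiB).
const L1_CACHE_BYTES: usize = 32 * 1024;

/// One worker's chunk: as many elements as fit in L1, but never more than
/// the whole problem and never fewer than one element.
fn workload_size<T>(len: usize) -> usize {
    (L1_CACHE_BYTES / std::mem::size_of::<T>()).clamp(1, len.max(1))
}

fn main() {
    // 8-byte elements: 32 KiB / 8 B = 4096 elements per chunk.
    assert_eq!(workload_size::<u64>(1_000_000), 4096);
    // Small problems collapse into a single chunk, avoiding parallel overhead.
    assert_eq!(workload_size::<u64>(10), 10);
}
```

The chunk size then depends only on the cache and element size, not on how many threads happen to be free.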


let chunk_size = result.len().div_ceil(num_threads);

// In microbenchmarks par_iter_mut.chunks outperforms par_chunks_mut slightly.
Contributor

If it is a small difference this might be a compiler quirk. I'd stick with par_chunks_mut for simplicity and maintainability.

If .par_iter_mut().chunks(..) is faster, then that is basically a bug somewhere, since par_chunks_mut provides the library/compiler with strictly more information. Such a bug is likely to be fixed at some point.

.enumerate()
.for_each(|(chunk_number, mut chunk)| {
let base = chunk_number * chunk_size;
let col_range = base..base + chunk_size;
Contributor

let col_range = base..base + chunk.len();

Otherwise col_range will go out of bounds whenever chunk_size doesn't divide result.len(). You happen to be protected from this bug here because col only ever takes in-range values, but it's better to do it right.

Contributor

let base = chunk_number * chunk_size should still be correct, as only the last chunk can be shorter than chunk_size (but please confirm with the rayon docs, and maybe leave a comment explaining correctness).
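The claim is easy to check: with chunks/chunks_mut (and rayon's par_chunks_mut, which documents the same behavior), every chunk except possibly the last has exactly chunk_size elements, so base = chunk_number * chunk_size is always the chunk's starting index, while base + chunk.len() also bounds the short final chunk. A small sketch:

```rust
/// Returns the index range each chunk covers, computed the corrected way.
fn chunk_ranges(len: usize, chunk_size: usize) -> Vec<std::ops::Range<usize>> {
    let data = vec![0u8; len];
    data.chunks(chunk_size)
        .enumerate()
        .map(|(chunk_number, chunk)| {
            let base = chunk_number * chunk_size; // correct: prior chunks are full
            base..base + chunk.len() // bounds the short last chunk too
        })
        .collect()
}

fn main() {
    // 10 elements in chunks of 4: the last range ends at 10, not 12.
    assert_eq!(chunk_ranges(10, 4), vec![0..4, 4..8, 8..10]);
}
```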

@recmo recmo changed the title from "Sparse multiplication" to "Parallel sparse multiplication" Jun 27, 2025