Parallel sparse multiplication #125
Conversation
noir-r1cs/src/sparse_matrix.rs
Outdated
    );

    let intermediate_multiplication =
        LockFreeArray::new(vec![(0, FieldElement::zero()); rhs.matrix.num_entries()]);
It's unfortunate that it requires allocating a temporary vector larger than the output.
noir-r1cs/src/sparse_matrix.rs
Outdated
    // wouldn't know the row a value belongs to. That's why the rows drive the
    // iterator below.
    // - Acquiring a mutex per column in the result was too expensive (even with
    //   parking_lot)
What I had in mind was splitting the work over the mutable output vector and creating an iter_col method on the matrix (which is substantially more complicated than iter_row, but should be doable).
Can we explore this option?
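For concreteness, here is a minimal sketch of that idea. It is not the noir-r1cs code: it assumes a CSR-style layout with hypothetical fields (row_offsets, col_indices, values), uses u64 in place of FieldElement, and only illustrates splitting the work over the output vector and driving it with an iter_col.

```rust
use rayon::prelude::*;

struct SparseMatrix {
    row_offsets: Vec<usize>, // row i owns entries row_offsets[i]..row_offsets[i + 1]
    col_indices: Vec<usize>, // assumed sorted within each row
    values:      Vec<u64>,
}

impl SparseMatrix {
    /// Yield (row, value) for every stored entry in column `col`.
    /// More involved than iterating a row (the layout is row-major), but
    /// doable: each row is binary-searched for the column. A real
    /// implementation would likely precompute a column-major index instead.
    fn iter_col(&self, col: usize) -> impl Iterator<Item = (usize, u64)> + '_ {
        (0..self.row_offsets.len().saturating_sub(1)).filter_map(move |row| {
            let span = self.row_offsets[row]..self.row_offsets[row + 1];
            self.col_indices[span.clone()]
                .binary_search(&col)
                .ok()
                .map(|k| (row, self.values[span.start + k]))
        })
    }
}

/// result[j] = sum_i lhs[i] * M[i, j], with the work split over the output:
/// every output element is written by exactly one task, so no locks and no
/// temporary per-entry buffer are needed.
fn mul_into(lhs: &[u64], matrix: &SparseMatrix, result: &mut [u64]) {
    result.par_iter_mut().enumerate().for_each(|(col, out)| {
        *out = matrix
            .iter_col(col)
            .map(|(row, v)| lhs[row].wrapping_mul(v))
            .fold(0, u64::wrapping_add);
    });
}
```

Because every output element is written by exactly one rayon task, the cost moves into iter_col, which is why a real version would probably want a precomputed column-major index.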
recmo left a comment
LGTM after comments addressed.
        result[j] += value * self[i];
    }

    let chunk_size = result.len().div_ceil(num_threads);
If the perf delta is small (e.g. <5%) I think I prefer a solution that lets rayon decide the chunk size. I don't like hard dependencies on the number of available threads; the thread count does not say much, since we might be doing other work in parallel.
Instead we should pick the workload size so that it amortizes the per-task overhead while still allowing parallelization for large problem sizes. To approximate this I like to pick 'whatever subproblem fits in L1 cache' as the problem size, which in turn is approximated by the workload_size::<F>(result.len()) function.
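As a rough illustration of that sizing strategy (this is not the repo's actual workload_size helper; its definition and the 32 KiB L1 budget below are assumptions for the sketch):

```rust
use rayon::prelude::*;

/// Pick a chunk size so one chunk of `F`-sized elements roughly fits in an
/// L1 cache (assumed 32 KiB here), independent of how many threads exist.
fn workload_size<F>(len: usize) -> usize {
    const L1_CACHE_BYTES: usize = 32 * 1024;
    (L1_CACHE_BYTES / std::mem::size_of::<F>().max(1)).clamp(1, len.max(1))
}

fn scale_in_place(result: &mut [u64], factor: u64) {
    let chunk_size = workload_size::<u64>(result.len());
    // Rayon still balances these chunks over however many worker threads it
    // actually has; nothing here depends on the available thread count.
    result
        .par_chunks_mut(chunk_size)
        .for_each(|chunk| chunk.iter_mut().for_each(|x| *x = x.wrapping_mul(factor)));
}
```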
    let chunk_size = result.len().div_ceil(num_threads);

    // In microbenchmarks par_iter_mut.chunks outperforms par_chunks_mut slightly.
If it is a small difference, this might be a compiler quirk. I'd stick with par_chunks_mut for simplicity and maintainability.
If .par_iter_mut().chunks(..) is genuinely faster, that is basically a bug somewhere, since par_chunks_mut gives the library/compiler strictly more information. Such a bug is likely to be fixed at some point.
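For reference, the two spellings being compared, in a small self-contained form with u64 data standing in for the field elements:

```rust
use rayon::prelude::*;

// Preferred form: rayon hands out disjoint mutable sub-slices directly.
fn double_chunks(data: &mut [u64], chunk_size: usize) {
    data.par_chunks_mut(chunk_size)
        .for_each(|chunk| chunk.iter_mut().for_each(|x| *x *= 2));
}

// Benchmarked alternative: an element-wise parallel iterator is regrouped
// into Vec<&mut u64> chunks, which gives rayon strictly less structure to
// work with.
fn double_regrouped(data: &mut [u64], chunk_size: usize) {
    data.par_iter_mut()
        .chunks(chunk_size)
        .for_each(|chunk| chunk.into_iter().for_each(|x| *x *= 2));
}
```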
        .enumerate()
        .for_each(|(chunk_number, mut chunk)| {
            let base = chunk_number * chunk_size;
            let col_range = base..base + chunk_size;
Use let col_range = base..base + chunk.len(); instead.
Otherwise col_range will run out of bounds when chunk_size doesn't divide result.len(). You are protected from this bug here because col will only ever be in range, but it's better to do it right.
let base = chunk_number * chunk_size should still be correct, since only the last chunk can be shorter than chunk_size (but please confirm with the rayon docs, and maybe leave a comment explaining the correctness).
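A small sketch of what the fixed hunk could look like, with the requested correctness comment spelled out (illustrative names and a placeholder body, not the PR's actual code):

```rust
use rayon::prelude::*;

fn fill_columns(result: &mut [u64], chunk_size: usize) {
    result
        .par_chunks_mut(chunk_size)
        .enumerate()
        .for_each(|(chunk_number, chunk)| {
            // Correctness: per the rayon docs, par_chunks_mut yields slices of
            // exactly `chunk_size` elements, except that the last chunk may be
            // shorter. So `chunk_number * chunk_size` is the true offset of
            // `chunk` within `result` for every chunk, including the last.
            let base = chunk_number * chunk_size;
            // Use the actual chunk length so the range never runs past
            // result.len() when chunk_size does not divide it evenly.
            let col_range = base..base + chunk.len();
            for (offset, col) in col_range.enumerate() {
                // Hypothetical per-column work; a placeholder write here.
                chunk[offset] = col as u64;
            }
        });
}
```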