Implement Small-Value Sum-Check Optimization (Algorithm 6) #98
Introduce UdPoint, UdHatPoint, UdTuple, and ValueOneExcluded types in src/lagrange.rs for representing evaluation domains U_d and Û_d used in the small-value sumcheck optimization.
Implements LagrangeEvaluatedMultilinearPolynomial with
from_multilinear() factory method that extends evaluations from {0,1}^n
to U_d^n.
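As a rough illustration of what extending evaluations from {0,1}^n to U_d^n involves, the sketch below extends an evaluation table along its top variable only. The function name `extend_top_var`, the `i64` element type, and the slot order (∞ first, then 0, 1, ..., D-1) are assumptions made for this example, not the PR's API.

```rust
/// Extend evaluations over {0,1} in the top (most significant) variable to
/// U_D = {∞, 0, 1, ..., D-1}. The first half of `evals` holds p(0, rest) and
/// the second half p(1, rest). Output slot 0 is ∞; slot k + 1 is the point k.
fn extend_top_var<const D: usize>(evals: &[i64]) -> Vec<i64> {
    assert!(D >= 2 && evals.len() % 2 == 0);
    let half = evals.len() / 2;
    let (lo, hi) = evals.split_at(half);
    let mut out = vec![0i64; (D + 1) * half];
    for j in 0..half {
        // p is linear in the top variable, so the value "at ∞" is its slope.
        let slope = hi[j] - lo[j];
        out[j] = slope;
        for k in 0..D {
            out[(k + 1) * half + j] = lo[j] + (k as i64) * slope;
        }
    }
    out
}
```

Repeating this step once per prefix variable would give the full table over U_D^n; the values stay small integers as long as the inputs are small.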
sumcheck optimization (Algorithm 6)
Introduces RoundAccumulator and SmallValueAccumulators for the small-value sumcheck optimization. Uses flat Vec<[Scalar; D]> storage with const generic D for cache efficiency and vectorizable merge operations in parallel fold-reduce.
Parameterize UdPoint, UdHatPoint, UdTuple, and LagrangeEvaluatedMultilinearPolynomial with const generic D to enable:
- Compile-time enforcement that domain types match accumulator degree
- Debug assertions for bounds checking (v < D in constructors)
- Elimination of runtime base parameter from to_flat_index()
This prevents mixing domain sizes at compile time and catches out-of-bounds errors in debug builds.
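A minimal sketch of the idea, assuming a hypothetical `DomainPoint<D>` type (the PR's `UdPoint` may be laid out differently) and an assumed slot order with ∞ first. Because D is part of the type, the index base D + 1 is known at compile time and points from different domains cannot be mixed in one tuple.

```rust
/// A point of U_D = {∞, 0, 1, ..., D-1}. Hypothetical stand-in for the PR's
/// UdPoint; the const generic D ties the point to one domain size.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum DomainPoint<const D: usize> {
    Infinity,
    Finite(usize),
}

impl<const D: usize> DomainPoint<D> {
    /// Constructor with a debug-only bounds check (v < D).
    fn finite(v: usize) -> Self {
        debug_assert!(v < D);
        Self::Finite(v)
    }

    /// Slot within one coordinate: ∞ maps to 0, v maps to v + 1.
    fn slot(self) -> usize {
        match self {
            Self::Infinity => 0,
            Self::Finite(v) => v + 1,
        }
    }
}

/// Mixed-radix index of a tuple, first coordinate most significant.
/// No runtime base parameter is needed: the base is D + 1.
fn to_flat_index<const D: usize>(tuple: &[DomainPoint<D>]) -> usize {
    tuple.iter().fold(0, |acc, p| acc * (D + 1) + p.slot())
}
```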
Implement AccumulatorPrefixIndex and compute_idx4() which maps evaluation prefixes β ∈ U_d^ℓ₀ to accumulator contributions by decomposing β into prefix v, coordinate u ∈ Û_d, and binary suffix y.
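The decomposition can be sketched as follows. This follows only the description in the commit message (prefix v, coordinate u ∈ Û_d, binary suffix y); the exact conditions should be checked against Definition A.5 of the paper. The names `Pt`, `Contribution`, and `idx4_sketch` are hypothetical.

```rust
/// A coordinate of U_d: either ∞ or a finite value.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Pt {
    Inf,
    Val(u8),
}

/// One accumulator contribution: round index (0-based), prefix v,
/// coordinate u ∈ Û_d, and binary suffix y.
#[derive(Debug, PartialEq, Eq)]
struct Contribution {
    round: usize,
    v: Vec<Pt>,
    u: Pt,
    y: Vec<u8>,
}

/// For every position i, β contributes to round i when β_i lies in
/// Û_d = U_d \ {1} and all later coordinates are binary.
fn idx4_sketch(beta: &[Pt]) -> Vec<Contribution> {
    let mut out = Vec::new();
    for i in 0..beta.len() {
        if beta[i] == Pt::Val(1) {
            continue; // 1 is excluded from Û_d
        }
        // Collect the suffix only if every coordinate is 0 or 1.
        let suffix = beta[i + 1..]
            .iter()
            .map(|&p| match p {
                Pt::Val(b) if b <= 1 => Some(b),
                _ => None,
            })
            .collect::<Option<Vec<u8>>>();
        if let Some(y) = suffix {
            out.push(Contribution { round: i, v: beta[..i].to_vec(), u: beta[i], y });
        }
    }
    out
}
```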
Extracts strided polynomial evaluations for all binary prefixes b ∈
{0,1}^ℓ₀ given a fixed suffix, bridging full polynomials to Procedure 6
(Lagrange extension).
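A sketch of the strided gather, assuming MSB-first indexing (the prefix occupies the top ℓ₀ bits of the evaluation index). The function name and the `i64` element type are illustrative, not the PR's signature.

```rust
/// Gather p(b, suffix) for every binary prefix b in {0,1}^l0.
/// With MSB-first indexing, consecutive prefixes are `stride` entries apart.
fn gather_prefix_evals_sketch(evals: &[i64], l0: usize, suffix: usize) -> Vec<i64> {
    assert!(evals.len().is_power_of_two());
    let n = evals.len().trailing_zeros() as usize;
    assert!(l0 <= n);
    let stride = 1usize << (n - l0); // number of distinct suffixes
    assert!(suffix < stride);
    (0..1usize << l0).map(|b| evals[b * stride + suffix]).collect()
}
```

The gathered vector has 2^ℓ₀ entries and is the natural input to a Lagrange extension over the prefix variables.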
Added a parallel build_accumulators that binds suffixes, extends prefixes to the Ud domain, applies the ∞/Cz rule, and routes contributions via cached idx4 with E_in/E_out weighting. Expanded accumulator tests with a naive cross-check, ∞ handling, and binary-β zero behavior to validate correctness. Cleaned up dead-code allowances now that the code paths are used.
Added explicit MSB-first checks for eq table generation, gather_prefix_evals stride/pattern, and bind_poly_var_top to ensure “top” binds the MSB. These tests catch silent index/order regressions across components.
Compute ℓ_i(X) = eq̃(w[<i], r[<i]) · eq̃(w_i, X) values for sum-check rounds: ℓ_i(0) = α_i(1−w_i), ℓ_i(1) = α_i w_i, ℓ_i(∞) = α_i(2w_i−1), where α_i = eq̃(w[<i], r[<i]).
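The three values follow from expanding the single-variable eq polynomial, assuming (as is standard for this domain) that evaluation at ∞ means the coefficient of X:

```latex
\begin{aligned}
\widetilde{\mathrm{eq}}(w_i, X) &= w_i X + (1 - w_i)(1 - X) = (2w_i - 1)\,X + (1 - w_i), \\
\ell_i(X) &= \alpha_i \cdot \widetilde{\mathrm{eq}}(w_i, X), \qquad \alpha_i = \widetilde{\mathrm{eq}}(w_{<i}, r_{<i}), \\
\ell_i(0) &= \alpha_i (1 - w_i), \\
\ell_i(1) &= \alpha_i w_i, \\
\ell_i(\infty) &= \alpha_i (2w_i - 1) \quad \text{(coefficient of } X\text{)}.
\end{aligned}
```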
Replace range-indexed loops and a redundant closure with iterator forms
Add eq-round linear factor utilities and accumulator evaluation to derive t_i and build s_i polynomials.
Track R_i and ℓ_i state to compare accumulator evals with EqSumCheckInstance rounds.
indexing
Switch Spartan t_i to D=2 aliases/tests, precompute idx4 prefix/suffix data, and flatten accumulator caches to cut allocations.
Csr (Compressed Sparse Row) stores variable-length lists with 2 allocations instead of N+1, improving cache locality. Replaces ad-hoc offsets/entries arrays in build_accumulators
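A minimal sketch of the layout, assuming hypothetical constructor and accessor names (`from_rows`, `row`, `num_rows`); only the two-vector representation is taken from the commit message.

```rust
/// Compressed sparse row storage: every row lives in one `entries` vector,
/// and `offsets[i]..offsets[i + 1]` delimits row i. Two allocations in total,
/// regardless of the number of rows.
struct Csr<T> {
    offsets: Vec<usize>,
    entries: Vec<T>,
}

impl<T> Csr<T> {
    /// Build from an iterator of rows (hypothetical helper for illustration).
    fn from_rows(rows: impl IntoIterator<Item = Vec<T>>) -> Self {
        let mut offsets = vec![0];
        let mut entries = Vec::new();
        for row in rows {
            entries.extend(row);
            offsets.push(entries.len());
        }
        Self { offsets, entries }
    }

    /// Borrow row i as a contiguous slice.
    fn row(&self, i: usize) -> &[T] {
        &self.entries[self.offsets[i]..self.offsets[i + 1]]
    }

    fn num_rows(&self) -> usize {
        self.offsets.len() - 1
    }
}
```

Iterating over rows touches one contiguous buffer, which is where the cache-locality benefit comes from.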
- Add prove_cubic_with_three_inputs_small_value combining small-value optimization for the first ℓ₀ rounds with eq-poly optimization for the remaining rounds
- Introduce SPARTAN_T_DEGREE constant to centralize polynomial degree parameter
- Add sumcheck_sweep.rs examples for performance comparison
build_accumulators
The new from_boolean_evals_with_buffer_reusing method takes caller-provided scratch buffers and alternates between them during extension. This reduces allocations from O(num_x_in × num_x_out) per call to O(num_threads) buffers allocated once per thread.
variants
Spartan version (D=2) skips binary betas since satisfying witnesses have
Az·Bz = Cz on {0,1}^n. Generic version supports arbitrary polynomial
products.
Adds a new example that tests prove_cubic_with_three_inputs and prove_cubic_with_three_inputs_small_value produce identical proofs when used with a real SHA256 circuit (Algorithm 6 validation).
Changes:
- Add PartialEq, Eq derive to SumcheckProof for proof comparison
- Add extract_outer_sumcheck_inputs helper to SpartanSNARK
- Add examples/sumcheck_sha256_equivalence.rs
Implement the small × large multiplication optimization from "Speeding
Up Sum-Check Proving" using Barrett reduction for ~3× speedup over naive
field multiplication.
Key changes:
- Add SmallValueField trait for type-safe i32/i64 small-value operations
- Implement Barrett reduction for Pallas Fp and Fq (sl_mul, isl_mul)
- Add SpartanAccumulatorInput trait to unify field and i32 witness handling
- Make LagrangeEvaluatedMultilinearPolynomial generic over element type
- Update sumcheck prover to accept separate i32 witness polynomials
- Clean up MultilinearPolynomial<i32>: remove unused from_u32/from_u64/from_field
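To show the shape of a Barrett-reduced small × large product, here is a toy sketch over a single-limb modulus. The modulus 2^31 − 1, the constant names, and `sl_mul_sketch` are assumptions chosen so the example fits in `u64`/`u128`; the PR works with multi-limb Pallas field elements.

```rust
/// Toy modulus (the Mersenne prime 2^31 - 1), used for illustration only.
const P: u64 = (1 << 31) - 1;
/// Precomputed Barrett constant floor(2^64 / P).
const M: u128 = (1u128 << 64) / P as u128;

/// Reduce x modulo P with a multiply and a shift instead of a division.
fn barrett_reduce(x: u64) -> u64 {
    // q estimates x / P and never exceeds the true quotient.
    let q = ((x as u128 * M) >> 64) as u64;
    let mut r = x - q * P;
    // The estimate is off by a small constant, so this loop is short.
    while r >= P {
        r -= P;
    }
    r
}

/// Multiply a small signed value by an already reduced "large" element.
fn sl_mul_sketch(small: i32, large: u64) -> u64 {
    debug_assert!(large < P);
    // |small| <= 2^31 and large < 2^31, so the product fits in u64.
    let prod = barrett_reduce(small.unsigned_abs() as u64 * large);
    if small < 0 && prod != 0 { P - prod } else { prod }
}
```

The saving comes from the product being only one limb wider than the large operand, so a cheap approximate-quotient reduction is enough.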
Force-pushed from 2828f04 to 67674c4
evaluations
Replace raw arrays and ad-hoc structs with proper abstractions for U_d =
{∞, 0, 1, ..., D-1} and Û_d = U_d \ {1} evaluation domains. Remove
EqRoundValues in favor of UdEvaluations<F, 2>.
- Delete unused constructor/predicate methods from UdPoint and UdHatPoint
- Move test-only methods (alpha, prefix_len, suffix_len, extend_from_boolean) to cfg(test) impl blocks
- Add CachedPrefixIndex struct with From impl to accumulator_index.rs
- Remove unused QuadraticTAccumulatorPrefixIndex type alias
- Delete unused eq_factor_alpha method from sumcheck
Hoist scratch buffers to thread-local state in build_accumulators_spartan and build_accumulators. Previously, 5 vectors were allocated on every x_out iteration; now allocations happen once per Rayon thread subdivision.
- Add extend_in_place to LagrangeEvaluatedMultilinearPolynomial (avoids .to_vec())
- Add SpartanThreadState and GenericThreadState structs for buffer reuse
- Extract thread state structs to thread_state_accumulators module
Reduces allocations from O(num_x_out × num_x_in) to O(num_threads).
Move the witness polynomial abstraction trait from accumulators.rs to its own module for better code organization. Rename from SpartanAccumulatorInput to SpartanAccumulatorInputPolynomial to clarify that it abstracts over multilinear polynomial representations (field elements vs small values).
- compute_idx4: derive l0 from beta.len() instead of taking as parameter
- csr: remove unused new() and push_empty(), move test helpers to #[cfg(test)]
- accumulators: add #[inline] to num_prefixes()
- examples: switch to tracing and #[instrument] for cleaner logging
- accumulator_index: add phase comments explaining prefix/suffix computation
- accumulators: use filter() instead of continue for beta_has_infinity check
- lagrange: document stride calculations in extend_in_place
- small_field: extract try_field_to_small_impl to deduplicate Fp/Fq impls
- small_field: document Barrett reduction loop bound (at most 2 iterations)
Pull request overview
This PR implements Algorithm 6 ("Small-Value Sum-Check with Eq-Poly Optimization") from the paper "Speeding Up Sum-Check Proving" by Bagad, Dao, Domb, and Thaler. The optimization targets Spartan's first sum-check invocation where witness polynomial evaluations are small integers, achieving significant prover speedups (1.5-1.64×) by replacing expensive field multiplications with cheaper native integer operations.
Key changes:
- Introduces Barrett-optimized field arithmetic for multiplying small integers with field elements
- Implements Lagrange domain extension for efficient round polynomial computation
- Adds accumulator data structures for precomputing sum-check values
- Integrates the optimization into the existing sum-check protocol
Reviewed changes
Copilot reviewed 17 out of 17 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| src/small_field.rs | Barrett-optimized arithmetic trait for small-value × field-element operations |
| src/lagrange.rs | Lagrange domain types and multilinear polynomial extension logic |
| src/accumulators.rs | Accumulator data structures and Procedure 9 implementation |
| src/accumulator_index.rs | Index mapping for distributing evaluation prefixes to accumulators |
| src/sumcheck.rs | Integration of Algorithm 6 into the sum-check protocol |
| src/thread_state_accumulators.rs | Thread-local buffers to reduce allocations in parallel execution |
| src/spartan_accumulator_input_polynomial.rs | Trait abstraction for witness polynomials |
| src/polys/multilinear.rs | Generic multilinear polynomial type and prefix gathering |
| src/eq_linear.rs | Utilities for eq-polynomial round factors |
| src/csr.rs | Compressed sparse row storage for variable-length lists |
| examples/sumcheck_sweep.rs | Benchmark sweep across polynomial sizes |
| examples/sumcheck_sha256_equivalence.rs | Equivalence test with SHA-256 circuit |
//! We currently implement a non-preprocessing version of Spartan
//! that is generic over the polynomial commitment and evaluation argument (i.e., a PCS).
#![deny(
    warnings,
Copilot
AI
Dec 24, 2025
The unused lint has been removed from the deny list. This change allows unused code warnings to be suppressed, which could hide legitimate issues. Consider keeping unused in the deny list and using targeted #[allow(dead_code)] attributes where specific exceptions are needed.
Suggested change
-    warnings,
+    warnings,
+    unused,
src/small_field.rs
Outdated
#[inline]
fn small_from_u32(val: u32) -> i32 {
    val as i32
Copilot
AI
Dec 24, 2025
The conversion from u32 to i32 using as casting is unsafe when the value exceeds i32::MAX. This could lead to incorrect negative values. Consider using checked conversion or documenting the assumption that val <= i32::MAX.
Suggested change
-    val as i32
+    i32::try_from(val).expect("small_from_u32: value does not fit in i32")
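The difference between the two conversions can be checked directly. The helper name `small_from_u32_checked` is hypothetical; it returns an `Option` rather than panicking so the caller decides how to handle out-of-range input.

```rust
/// Checked conversion: values above i32::MAX are rejected instead of
/// silently wrapping to a negative number, as `val as i32` would.
fn small_from_u32_checked(val: u32) -> Option<i32> {
    i32::try_from(val).ok()
}
```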
src/small_field.rs
Outdated
    Self::from(val as u64)
} else {
    -Self::from((-val) as u64)
Copilot
AI
Dec 24, 2025
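The negation using (-val) as u64 is unsafe for i32::MIN (-2147483648) because -i32::MIN overflows. Use val.unsigned_abs() instead to handle this edge case correctly. Note that val.wrapping_neg() as u64 does not fix it: i32::MIN.wrapping_neg() is still i32::MIN, and the cast to u64 sign-extends it.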
Suggested change
-    Self::from(val as u64)
-} else {
-    -Self::from((-val) as u64)
+    Self::from(val.unsigned_abs() as u64)
+} else {
+    -Self::from(val.unsigned_abs() as u64)
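The i32::MIN edge case can be checked with two small helpers (both names are hypothetical). `unsigned_abs` returns the magnitude as a `u32`, so it has no overflow case, whereas `wrapping_neg` leaves i32::MIN unchanged and the widening cast then sign-extends it.

```rust
/// Magnitude of a signed small value, widened to u64. Correct for i32::MIN.
fn magnitude(val: i32) -> u64 {
    val.unsigned_abs() as u64
}

/// The `wrapping_neg() as u64` alternative, shown for comparison only.
fn wrapping_magnitude(val: i32) -> u64 {
    val.wrapping_neg() as u64
}
```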
Direct fixes:
- Remove redundant closures and unnecessary casts
- Replace manual_contains with .contains()
- Replace manual_is_multiple_of with .is_multiple_of()
- Replace useless vec! with array literal
Suppressions (intentional patterns):
- needless_range_loop: loop index serves dual purpose (indexing + computation)
- identity_op/erasing_op: operations like `0 * base * base` document index formulas
Typos:
- Rename `ein` to `e_in_eval` for clarity (eq evaluation at input point)
Replace per-iteration modular reductions with accumulated wide-integer arithmetic, reducing once per beta instead of once per x_in iteration.
Key changes:
- Add WideLimbs<N> for wide unsigned integer arithmetic (6/8 limbs)
- Refactor SmallValueField to be generic over small value type (i32/i64)
- Add UnreducedMontInt types for delayed reduction in Montgomery form
- Replace SpartanAccumulatorInputPolynomial with MatVecMLE trait
- Optimize eq polynomial table computation (1 mul instead of 2 per element)
- Update benchmark to compare i32/i64 vs i64/i128 variants
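The idea of delayed reduction can be sketched with a toy modulus and a `u128` accumulator standing in for the PR's wide-limb types. The function names and the modulus are assumptions for illustration; `large` entries are assumed to be already reduced.

```rust
/// Toy modulus for illustration; the PR works with 255-bit fields.
const MODULUS: u64 = (1 << 31) - 1;

/// Baseline: reduce after every product.
fn dot_eager(small: &[u32], large: &[u64]) -> u64 {
    small
        .iter()
        .zip(large)
        .fold(0u64, |acc, (&s, &l)| (acc + (s as u64 * l) % MODULUS) % MODULUS)
}

/// Delayed reduction: accumulate unreduced products in a wide integer
/// and reduce a single time at the end.
fn dot_delayed(small: &[u32], large: &[u64]) -> u64 {
    let wide: u128 = small
        .iter()
        .zip(large)
        .map(|(&s, &l)| s as u128 * l as u128)
        .sum();
    (wide % MODULUS as u128) as u64
}
```

Both functions return the same value; the delayed version performs one reduction per sum instead of one per term, at the cost of a wider accumulator.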
- Add mac() helper for fused multiply-accumulate, eliminating temporary arrays in unreduced_mont_int_mul_add (4 implementations)
- Subtract in limb space before reduction via sub_mag(), saving one Barrett reduction per signed accumulator
- Replace large e_out tables with JIT-computed eyx scratch buffers, reducing eq table memory 7× and improving cache locality
- Add unreduced_is_zero() fast path to skip expensive modular reduction
- Precompute betas_with_infty indices to avoid filter in inner loop
- Use barrett_reduce_6_* directly for i128 products instead of padding to 8 limbs (saves 8 wasted multiplications per isl_mul call)
propagation
Replace mac(acc, 0, 0, carry) calls with simple overflowing_add to avoid unnecessary u128 multiply-add pipeline for pure carry propagation. Also add #[inline(always)] to hot path functions to ensure full inlining.
- Apply rustfmt formatting fixes in accumulators.rs
- Fix clippy manual_is_multiple_of warning in test code
Introduce circuit gadgets optimized for small-value sumcheck optimization:
- SmallMultiEq: Batches equality constraints with bounded coefficients, flushing at MAX_COEFF_BITS (31) instead of bellpepper's ~237. This keeps constraint coefficients within i32 bounds for the small-value optimization.
- SmallUInt32: 32-bit unsigned integer gadget using SmallMultiEq for carry constraints in addmany operations.
- small_sha256: SHA-256 implementation using the above gadgets, producing circuits where Az and Bz values fit in i32.
- Update sumcheck_sha256_equivalence example to use bellpepper's Circuit trait for constraint counting, comparing SmallSha256 vs bellpepper SHA-256.
The tradeoff: SmallSha256 generates ~17% more R1CS constraints due to more frequent MultiEq flushing, but enables the small-value sumcheck optimization.
Add 16-bit limbed addition for i32 small-value optimization
SmallUInt32::addmany produces coefficients up to 2^34, exceeding i32 bounds. Splitting into 16-bit limbs reduces max coefficient to 2^18, enabling i32/i64 small-value sumcheck for SHA-256.
- Add SmallValueConfig trait with Small32 (i32/i64) and Small64 (i64/i128)
- Implement addmany_limbed using two constraints per addition
- Update SmallMultiEq to be generic over config
- Fix example to use config-specific bounds check
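The arithmetic behind limbed addition can be sketched outside the circuit. This is plain integer code, not the constraint gadget, and `addmany_limbed_sketch` is a hypothetical name: it sums the low and high 16-bit limbs separately and propagates one carry, which is what keeps every positional weight small.

```rust
/// Add 32-bit words modulo 2^32 via two 16-bit limbs. Each limb sum is
/// handled on its own, so no term is weighted by more than 2^16 at a time.
fn addmany_limbed_sketch(operands: &[u32]) -> u32 {
    let lo_sum: u64 = operands.iter().map(|&x| (x & 0xFFFF) as u64).sum();
    let hi_sum: u64 = operands.iter().map(|&x| (x >> 16) as u64).sum();
    let lo = lo_sum & 0xFFFF;
    let carry = lo_sum >> 16; // carry out of the low limb
    let hi = (hi_sum + carry) & 0xFFFF; // overflow past bit 32 is dropped
    ((hi << 16) | lo) as u32
}
```

The result agrees with ordinary wrapping addition of the operands.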
- Add examples/sha256_chain_benchmark.rs comparing original vs small-value sumcheck performance on SHA-256 hash chains
- CSV output includes witness synthesis time, sumcheck times, speedup, and witness percentage of total proving time
- CLI support: single <num_vars> for profiling, range-sweep for benchmarks
- Add small_sha256_with_prefix() for chaining multiple SHA-256 hashes with unique constraint namespaces
- Fix SmallValueField<i64> generic in lagrange.rs
- Fix unused variable warning in msm.rs
@wu-s-john please read the following Contributor License Agreement (CLA). If you agree with the CLA, please reply with the following information.
Contribution License Agreement
This Contribution License Agreement (“Agreement”) is agreed to by the party signing below (“You”),
Split SmallValueField into two traits for better separation of concerns:
- SmallValueField: core small-value operations (ss_mul, sl_mul, isl_mul)
- DelayedReduction: unreduced accumulator operations for hot paths
Rename types for clarity:
- UnreducedMontInt → UnreducedFieldInt (field × integer products)
- UnreducedMontMont → UnreducedFieldField (field × field products)
Add FieldReductionConstants trait to deduplicate Barrett/Montgomery reduction:
- Consolidates Fp/Fq constants (MODULUS, R256-R512, MONT_INV)
- Generic reduction functions monomorphized at compile time for zero overhead
- Comprehensive documentation explaining R constants (2^k mod p)
Performance and cleanup:
- Add ext_buf_idx scratch buffer to avoid Vec allocation in accumulator hot loop
- Remove unused OrderedVariable from shape_cs modules (~140 lines)
- Remove unused build_univariate_round_evals from sumcheck (~40 lines)
- Add log2_constraints column to benchmark CSV output
Split the 2,367-line small_field.rs into a proper module structure:
- small_field/small_value_field.rs: SmallValueField trait
- small_field/delayed_reduction.rs: DelayedReduction trait
- small_field/barrett.rs: Barrett/Montgomery reduction functions
- small_field/impls.rs: Fp/Fq implementations and tests
- small_field/mod.rs: re-exports and helper functions
Moved batching configuration types (NoBatching, Batching<K>, BatchingMode, SmallMultiEqConfig, I32NoBatch, I64Batch21) from small_field to gadgets/small_multi_eq.rs where they logically belong, since they're specifically for constraint batching in SmallMultiEq.
Added detailed documentation for I64Batch21 explaining why K=21 is the safe maximum: with SHA-256-like circuits having ~200 terms and 2^34 positional coefficients, batching 21 constraints keeps the worst-case magnitude (2^62) under the i64 signed limit (2^63).
contributions
Refactors shared logic between Spartan and generic accumulator builders.
Force-pushed from 878e7b0 to 406b59e
Improves type safety and self-documentation by replacing (bool, [u64; N]) with an explicit enum indicating whether the result is positive (a >= b) or negative (a < b).
Move wide_limbs.rs content and limb arithmetic from barrett.rs into a unified small_field/limbs.rs module for delayed modular reduction.
Split monolithic lagrange.rs (1667 lines) into focused submodules:
- domain.rs: LagrangePoint, LagrangeHatPoint, LagrangeIndex
- evals.rs: LagrangeEvals, LagrangeHatEvals
- basis.rs: LagrangeBasisFactory, LagrangeCoeff
- extension.rs: LagrangeEvaluatedMultilinearPolynomial
- accumulator.rs: RoundAccumulator, LagrangeAccumulators
- accumulator_builder.rs: build_accumulators_spartan,
build_accumulators
Consolidate related files into the module:
- accumulator_index.rs → index.rs
- thread_state_accumulators.rs → thread_state.rs
- eq_linear.rs → eq_round.rs
Simplify extend_in_place API: use std::mem::swap to ensure result is
always in first buffer, eliminating conditional buffer selection at
call sites. Rename buf_a/b to buf_curr/scratch for clarity.
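A sketch of the ping-pong pattern with `std::mem::swap`. The per-round transformation here is a placeholder (each value x expands to x and x + 1), and the function name is hypothetical; only the buffer handling mirrors the description above.

```rust
/// Apply `rounds` expansion steps using two caller-provided buffers.
/// After every step the buffers are swapped, so the result is always in
/// `buf_curr` no matter how many rounds ran.
fn extend_in_place_sketch(buf_curr: &mut Vec<i64>, scratch: &mut Vec<i64>, rounds: usize) {
    for _ in 0..rounds {
        scratch.clear();
        // Placeholder for one extension round: x expands to (x, x + 1).
        scratch.extend(buf_curr.iter().flat_map(|&x| [x, x + 1]));
        std::mem::swap(buf_curr, scratch);
    }
}
```

Swapping exchanges the two `Vec` handles without copying elements, and both allocations are reused on the next call.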
- Refactor SmallMultiEq from struct to trait with NoBatchEq and
BatchingEq<K> implementations
- Add addmany module with limbed (i32) and full (i64) addition
algorithms
- Deduplicate SHA-256 circuits into examples/circuits/sha256/ module
- Update small_uint32 and small_sha256 to use SmallMultiEq trait
phase
- Extend MatVecMLE trait with UnreducedFieldField type for F×F accumulation
- Add unreduced bucket accumulators to SpartanThreadState
- Replace eyx precomputation with direct e_y access and z_beta = ex * tA_red
- Keep unreduced across all x_out iterations and merge without reduction
- Pre-compute beta values to eliminate closure overhead in scatter loop
- Final Montgomery reduction only once per bucket after thread merge
This reduces Montgomery reductions from ~7000+ per x_out to ~26 total for typical parameters (l0=3, 128 x_outs).
savings
Replace asymmetric l/2 split with balanced ceil/floor split. This reduces precomputation cost (e.g., 36→24 for l=10, l0=3), enables an odd number of rounds, and improves cache utilization by making e_xout smaller.
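The quoted figures are consistent with a cost model in which the two eq tables hold 2^a and 2^b entries for a split of the remaining l − l0 variables into a and b. That model, and the helper names below, are assumptions inferred from the numbers rather than something the commit states.

```rust
/// Combined size of two eq tables over `a` and `b` variables (assumed model).
fn split_cost(a: usize, b: usize) -> usize {
    (1 << a) + (1 << b)
}

/// Old split: one half takes l / 2 variables, the other takes the rest.
fn asymmetric_split(l: usize, l0: usize) -> (usize, usize) {
    let a = l / 2;
    (a, l - l0 - a)
}

/// New split: divide the remaining l - l0 variables as evenly as possible.
fn balanced_split(l: usize, l0: usize) -> (usize, usize) {
    let rem = l - l0;
    ((rem + 1) / 2, rem / 2)
}
```

For l = 10 and l0 = 3 the asymmetric split is (5, 2) and the balanced split is (4, 3), which under this model gives the 36 and 24 quoted above.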
Implement Small-Value Sum-Check Optimization (Algorithm 6)
Summary
This PR implements Algorithm 6 ("Small-Value Sum-Check with Eq-Poly Optimization") from the paper "Speeding Up Sum-Check Proving" by Bagad, Dao, Domb, and Thaler. The optimization targets Spartan's first sum-check invocation where witness polynomial evaluations are small integers (fitting in i32/i64), enabling significant prover speedups by replacing expensive field multiplications with cheaper native integer operations.
Key Insight
In the sum-check protocol, round 1 computations involve only small values (the original witness evaluations). From round 2 onward, evaluations become "large" due to binding to random verifier challenges. Algorithm 6 delays this binding using Lagrange interpolation, computing accumulators over small values in the first ℓ₀ rounds before switching to the standard linear-time prover.
Multiplication Cost Hierarchy:
For Spartan with degree-2 polynomials, Algorithm 6 reduces ll multiplications from O(N) to O(N/2^ℓ₀) at the cost of O((3/2)^ℓ₀ · N) ss multiplications.
Benchmarks
Measured on M1 Max MacBook Pro (10 cores, 64GB RAM) with jemalloc.
Note: `halo2curves/asm` is not enabled (unavailable on Apple Silicon).
Key observations:
Delayed Modular Reduction (i32 vs i64)
Benchmarks comparing i32 and i64 small value types with delayed modular reduction:
Key observations:
SHA-256 Chain Benchmark
To demonstrate real-world applicability, we benchmark proving SHA-256 hash chains. This workload approximates a major component of Solana light client verification.
Key observations:
Solana Light Client Comparison
A Solana light client verifying block finality requires:
SHA-256 equivalent cost:
Implementation
Core Components
`SmallValueField` trait (`src/small_field.rs`)
- `SmallValue` (i32) and `IntermediateSmallValue` (i64) types
- `sl_mul` and `isl_mul` for BN254/BLS12-381 (~3× faster than ll)
Lagrange Domain Extension (`src/lagrange.rs`)
- `LagrangeEvaluatedMultilinearPolynomial<T, D>` for extending boolean evaluations to U_d = {∞, 0, 1, ..., d-1}
- `extend_in_place` with ping-pong buffers
- `gather_prefix_evals` for efficient prefix collection (Procedure 6)
Accumulator Data Structures (`src/accumulators.rs`, `src/accumulator_index.rs`)
- `SmallValueAccumulators<S, D>` storing A_i(v, u) with O(1) indexing via `UdTuple`
- `idx4` mapping (Definition A.5) for distributing products to correct accumulators
- `UdEvaluations` and `UdHatEvaluations` wrappers
Procedure 9 Implementation (`src/accumulators.rs`)
- `build_accumulators_spartan`: Optimized for Spartan's Az·Bz structure
- `build_accumulators`: Generic version for arbitrary polynomial products
Thread-Local Buffer Reuse (`src/thread_state_accumulators.rs`)
- `SpartanThreadState` and `GenericThreadState` eliminate O(num_x_out) allocations
Sum-Check Integration (`src/sumcheck.rs`)
- `SmallValueSumCheck::from_accumulators` factory method
Algorithm Flow
Test Plan
- `cargo test test_build_accumulators` - Verifies accumulator construction
- `cargo test test_small_value` - SmallValueField arithmetic correctness
- `cargo test lagrange` - Lagrange extension and interpolation
- `cargo test sumcheck` - Full sum-check protocol equivalence
- `cargo clippy` - No warnings
- `examples/sumcheck_sha256_equivalence.rs` - Verifies new method produces identical proofs to baseline
- `examples/sha256_chain_benchmark.rs` - SHA-256 chain proving with CSV output
References