@wu-s-john wu-s-john commented Dec 18, 2025

Implement Small-Value Sum-Check Optimization (Algorithm 6)

Summary

This PR implements Algorithm 6 ("Small-Value Sum-Check with Eq-Poly Optimization") from the paper "Speeding Up Sum-Check Proving" by Bagad, Dao, Domb, and Thaler. The optimization targets Spartan's first sum-check invocation where witness polynomial evaluations are small integers (fitting in i32/i64), enabling significant prover speedups by replacing expensive field multiplications with cheaper native integer operations.

Key Insight

In the sum-check protocol, round 1 computations involve only small values (the original witness evaluations). From round 2 onward, evaluations become "large" due to binding to random verifier challenges. Algorithm 6 delays this binding using Lagrange interpolation, computing accumulators over small values in the first ℓ₀ rounds before switching to the standard linear-time prover.

Multiplication Cost Hierarchy:

  • ss (small × small): Native i32/i64 multiplication (~1 cycle)
  • sl (small × large): Barrett-optimized multiplication (~9 base mults)
  • ll (large × large): Full Montgomery multiplication (~32 base mults)

For Spartan with degree-2 polynomials, Algorithm 6 reduces ll multiplications from O(N) to O(N/2^ℓ₀) at the cost of O((3/2)^ℓ₀ · N) ss multiplications.
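The cost hierarchy above can be illustrated with a minimal sketch. It assumes a 61-bit stand-in prime instead of the ~256-bit fields the PR targets, and the function names are hypothetical rather than taken from the PR. The baseline reduces after every multiply-add, while the small-value path uses native integer products and reduces once at the end:

```rust
/// Stand-in prime modulus (2^61 - 1); an assumption for illustration only.
const P: i128 = (1 << 61) - 1;

/// Baseline: lift every operand into the field and reduce after each step.
fn dot_reduce_each_step(a: &[i32], b: &[i32]) -> u64 {
    let mut acc: i128 = 0;
    for (&x, &y) in a.iter().zip(b) {
        let x = (x as i128).rem_euclid(P);
        let y = (y as i128).rem_euclid(P);
        acc = (acc + x * y) % P; // x * y < 2^122, so this fits in i128
    }
    acc as u64
}

/// Small-value path: each ss product is a native i64 multiply, the sum is
/// kept in i128, and a single modular reduction happens at the end.
fn dot_small_value(a: &[i32], b: &[i32]) -> u64 {
    let mut acc: i128 = 0;
    for (&x, &y) in a.iter().zip(b) {
        // |x * y| <= 2^62, so the i64 product cannot overflow.
        acc += (x as i64 * y as i64) as i128;
    }
    acc.rem_euclid(P) as u64
}
```

Both functions return the same field element; the second simply does far less modular work per term, which is the effect the ss/sl/ll distinction is meant to capture.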

Benchmarks

Measured on M1 Max MacBook Pro (10 cores, 64GB RAM) with jemalloc.
Note: halo2curves/asm is not enabled (unavailable on Apple Silicon).

cargo run --release --example sumcheck_sweep --features jem

| num_vars | n | original (µs) | small-value (µs) | speedup |
|---|---|---|---|---|
| 10 | 1,024 | 1,275 | 1,166 | 1.09× |
| 12 | 4,096 | 1,575 | 1,322 | 1.19× |
| 14 | 16,384 | 2,315 | 2,016 | 1.15× |
| 16 | 65,536 | 4,922 | 3,847 | 1.28× |
| 18 | 262,144 | 15,087 | 10,480 | 1.44× |
| 20 | 1,048,576 | 46,783 | 28,491 | 1.64× |
| 22 | 4,194,304 | 163,593 | 105,487 | 1.55× |
| 24 | 16,777,216 | 658,282 | 439,680 | 1.50× |

Key observations:

  • Speedup generally grows with problem size, peaking at 1.64× for n = 2²⁰
  • Roughly 1.5× speedup for large instances (n ≥ 2²²)
  • Small instances show modest gains due to fixed overhead of accumulator precomputation

Delayed Modular Reduction (i32 vs i64)

Benchmarks comparing i32 and i64 small value types with delayed modular reduction:

MAX_VARS=26 cargo run --release --example sumcheck_sweep

| num_vars | n | original (µs) | i32 small (µs) | i64 small (µs) | orig vs i32 | orig vs i64 |
|---|---|---|---|---|---|---|
| 10 | 1,024 | 1,293 | 1,116 | 1,059 | 1.16× | 1.22× |
| 12 | 4,096 | 1,644 | 1,597 | 1,582 | 1.03× | 1.04× |
| 14 | 16,384 | 2,556 | 2,277 | 2,110 | 1.12× | 1.21× |
| 16 | 65,536 | 5,363 | 4,027 | 4,193 | 1.33× | 1.28× |
| 18 | 262,144 | 15,246 | 9,260 | 9,314 | 1.65× | 1.64× |
| 20 | 1,048,576 | 47,629 | 27,180 | 29,157 | 1.75× | 1.63× |
| 22 | 4,194,304 | 181,004 | 103,033 | 102,690 | 1.76× | 1.76× |
| 24 | 16,777,216 | 700,294 | 415,251 | 441,350 | 1.69× | 1.59× |
| 26 | 67,108,864 | 2,943,038 | 1,764,047 | 1,789,063 | 1.67× | 1.65× |

Key observations:

  • Both i32 and i64 variants achieve similar speedups (~1.59-1.76×) for large instances (n ≥ 2¹⁸)
  • i32 is slightly faster at most sizes n ≥ 2¹⁶ (narrower loads/stores); n = 2²² is the exception, where i64 is marginally ahead
  • i64 shows marginal advantage for small instances (n ≤ 2¹⁴)
  • Peak speedup of 1.76× at n = 2²² for both variants

SHA-256 Chain Benchmark

To demonstrate real-world applicability, we benchmark proving SHA-256 hash chains. This workload approximates a major component of Solana light client verification.

cargo run --release --no-default-features --example sha256_chain_benchmark

| chain_length | num_vars | log₂(constraints) | num_constraints | witness_ms | orig_sumcheck_ms | small_sumcheck_ms | total_ms | speedup | witness_pct |
|---|---|---|---|---|---|---|---|---|---|
| 2 | 16 | 16 | 65,536 | 14 | 5 | 3 | 20 | 1.67× | 70.0% |
| 8 | 18 | 18 | 262,144 | 55 | 16 | 11 | 75 | 1.45× | 73.3% |
| 32 | 20 | 20 | 1,048,576 | 229 | 48 | 32 | 301 | 1.50× | 76.1% |
| 128 | 22 | 22 | 4,194,304 | 1,260 | 163 | 109 | 1,547 | 1.50× | 81.4% |
| 512 | 24 | 24 | 16,777,216 | 5,686 | 609 | 395 | 6,743 | 1.54× | 84.3% |
| 2048 | 26 | 26 | 67,108,864 | 17,015 | 2,857 | 1,677 | 22,116 | 1.70× | 76.9% |

Key observations:

  • 2048 SHA-256 hashes proven in ~22 seconds
  • Witness generation dominates at 70-84% of total proving time
  • Small-value sumcheck achieves consistent 1.45-1.70× speedup

Solana Light Client Comparison

A Solana light client verifying block finality requires:

| Component | Hash Function | Count |
|---|---|---|
| Vote signature verification | SHA-512 (Ed25519 internal) | ~21 to ~1,588 |
| Merkle shred verification | SHA-256 | ~108 to ~1,206 |

  • Ed25519 uses SHA-512 internally for challenge hashing
  • Finality requires ≥2/3 supermajority stake (~21-530 validators)
  • SHA-512 is ~1.5-2× more expensive than SHA-256 per hash

SHA-256 equivalent cost:

  • Solana SHA-256: ~1,206 hashes
  • Solana SHA-512: ~1,588 × 1.5-2 = ~2,382-3,176 SHA-256 equivalent
  • Total: ~3,588-4,382 SHA-256 equivalent
  • Our 2048-chain benchmark covers ~47-57% of Solana's worst-case proving requirement

Implementation

Core Components

  1. SmallValueField trait (src/small_field.rs)

    • Defines SmallValue (i32) and IntermediateSmallValue (i64) types
    • Barrett-optimized sl_mul and isl_mul for BN254/BLS12-381 (~3× faster than ll)
    • Overflow analysis ensuring correctness for typical witness bounds
  2. Lagrange Domain Extension (src/lagrange.rs)

    • LagrangeEvaluatedMultilinearPolynomial<T, D> for extending boolean evaluations to U_d = {∞, 0, 1, ..., d-1}
    • Zero-allocation extend_in_place with ping-pong buffers
    • gather_prefix_evals for efficient prefix collection (Procedure 6)
  3. Accumulator Data Structures (src/accumulators.rs, src/accumulator_index.rs)

    • SmallValueAccumulators<S, D> storing A_i(v, u) with O(1) indexing via UdTuple
    • idx4 mapping (Definition A.5) for distributing products to correct accumulators
    • Type-safe UdEvaluations and UdHatEvaluations wrappers
  4. Procedure 9 Implementation (src/accumulators.rs)

    • build_accumulators_spartan: Optimized for Spartan's Az·Bz structure
    • build_accumulators: Generic version for arbitrary polynomial products
    • Parallel fold-reduce with thread-local scratch buffers
  5. Thread-Local Buffer Reuse (src/thread_state_accumulators.rs)

    • SpartanThreadState and GenericThreadState eliminate O(num_x_out) allocations
    • Reduces allocator contention in parallel workloads
  6. Sum-Check Integration (src/sumcheck.rs)

    • SmallValueSumCheck::from_accumulators factory method
    • Round-by-round Lagrange coefficient multiplication (R_{i+1} = R_i ⊗ L_{U_d}(r_i))
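The round-by-round update R_{i+1} = R_i ⊗ L_{U_d}(r_i) can be sketched for D = 2 as follows. This is an illustration under stated assumptions, not the PR's code: it uses a 61-bit stand-in prime, orders the domain as [∞, 0, 1], and takes p(∞) to mean the leading coefficient, which gives the basis L_∞(r) = r(r − 1), L_0(r) = 1 − r, L_1(r) = r for degree-2 polynomials. All function names are hypothetical.

```rust
/// Stand-in prime modulus (2^61 - 1); inputs are assumed to be < P.
const P: u64 = (1 << 61) - 1;

fn add(a: u64, b: u64) -> u64 { ((a as u128 + b as u128) % P as u128) as u64 }
fn sub(a: u64, b: u64) -> u64 { ((a as u128 + P as u128 - b as u128) % P as u128) as u64 }
fn mul(a: u64, b: u64) -> u64 { ((a as u128 * b as u128) % P as u128) as u64 }

/// Lagrange coefficients over U_2 = {∞, 0, 1} at the challenge r,
/// returned in the order [∞, 0, 1].
fn lagrange_coeffs_u2(r: u64) -> [u64; 3] {
    [mul(r, sub(r, 1)), sub(1, r), r]
}

/// Evaluate a degree-2 polynomial at r from its values at [∞, 0, 1].
fn eval_from_u2(evals: [u64; 3], r: u64) -> u64 {
    let l = lagrange_coeffs_u2(r);
    add(add(mul(evals[0], l[0]), mul(evals[1], l[1])), mul(evals[2], l[2]))
}

/// One round of the tensor update: every entry of R_i is multiplied by each
/// of the three coefficients, so the vector grows by a factor of 3.
fn tensor_update(r_vec: &[u64], r: u64) -> Vec<u64> {
    let l = lagrange_coeffs_u2(r);
    r_vec
        .iter()
        .flat_map(|&v| l.iter().map(move |&c| mul(v, c)))
        .collect()
}
```

For example, s(X) = 3X² + 2X + 7 has values [3, 7, 12] on [∞, 0, 1], and `eval_from_u2([3, 7, 12], 5)` recovers s(5) = 92. Starting from R_1 = [1], each call to `tensor_update` appends one challenge.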

Algorithm Flow

┌─────────────────────────────────────────────────────────────────────────┐
│  Precomputation: Build accumulators A_i(v, u) for i ∈ [ℓ₀]              │
│                                                                         │
│  For each x_out ∈ {0,1}^{ℓ/2-ℓ₀}:                                       │
│    For each x_in ∈ {0,1}^{ℓ/2}:                                         │
│      ein = eq(w_R, x_in) · eq(w_L, x_out)                              │
│      Extend Az/Bz prefixes to U_d^{ℓ₀} via Lagrange                    │
│      Accumulate products weighted by ein into A_i(v, u)                │
└─────────────────────────────────────────────────────────────────────────┘
                                    │
                                    ▼
┌─────────────────────────────────────────────────────────────────────────┐
│  Rounds 1..ℓ₀: Compute s_i(X) = ⟨R_i, A_i(·, u)⟩ for u ∈ Û_d           │
│                R_{i+1} = R_i ⊗ (L_{U_d,k}(r_i))_{k∈U_d}                 │
└─────────────────────────────────────────────────────────────────────────┘
                                    │
                                    ▼
┌─────────────────────────────────────────────────────────────────────────┐
│  Round ℓ₀+1: Streaming round (Algorithm 2) to bind to r_{1:ℓ₀}         │
│  Rounds ℓ₀+2..ℓ: Standard linear-time sum-check (Algorithm 1)          │
└─────────────────────────────────────────────────────────────────────────┘

Test Plan

  • cargo test test_build_accumulators - Verifies accumulator construction
  • cargo test test_small_value - SmallValueField arithmetic correctness
  • cargo test lagrange - Lagrange extension and interpolation
  • cargo test sumcheck - Full sum-check protocol equivalence
  • cargo clippy - No warnings
  • examples/sumcheck_sha256_equivalence.rs - Verifies new method produces identical proofs to baseline
  • examples/sha256_chain_benchmark.rs - SHA-256 chain proving with CSV output

References

Introduce UdPoint, UdHatPoint, UdTuple, and ValueOneExcluded types in
src/lagrange.rs for representing evaluation domains U_d and Û_d used in
the small-value sumcheck optimization.
Implements LagrangeEvaluatedMultilinearPolynomial with
from_multilinear() factory method that extends evaluations from {0,1}^n
to U_d^n.
sumcheck optimization (Algorithm 6)

Introduces RoundAccumulator and SmallValueAccumulators for the
small-value sumcheck optimization. Uses flat Vec<[Scalar; D]> storage
with const generic D for cache efficiency and vectorizable merge
operations in parallel fold-reduce.
Parameterize UdPoint, UdHatPoint, UdTuple, and
LagrangeEvaluatedMultilinearPolynomial with const generic D to enable:

- Compile-time enforcement that domain types match accumulator degree
- Debug assertions for bounds checking (v < D in constructors)
- Elimination of runtime base parameter from to_flat_index()

This prevents mixing domain sizes at compile time and catches
out-of-bounds errors in debug builds.
Implement AccumulatorPrefixIndex and compute_idx4() which maps
evaluation prefixes β ∈ U_d^ℓ₀ to accumulator contributions by
decomposing β into prefix v, coordinate u ∈ Û_d, and binary suffix y.
Extracts strided polynomial evaluations for all binary prefixes b ∈
{0,1}^ℓ₀ given a fixed suffix, bridging full polynomials to Procedure 6
(Lagrange extension).
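A strided gather of this kind might look like the sketch below. It assumes an MSB-first layout in which the prefix occupies the top ℓ₀ bits of the evaluation index; the function name and signature are hypothetical and not the PR's actual gather_prefix_evals.

```rust
/// For a fixed suffix, collect evals[(b << suffix_bits) | suffix] for every
/// binary prefix b in {0,1}^l0, in increasing order of b.
fn gather_prefix_evals_sketch(evals: &[i64], l0: usize, suffix: usize) -> Vec<i64> {
    assert!(evals.len().is_power_of_two());
    let n = evals.len().trailing_zeros() as usize; // number of variables
    assert!(l0 <= n);
    let suffix_bits = n - l0;
    assert!(suffix < (1usize << suffix_bits));
    // Consecutive prefixes are `stride` entries apart in an MSB-first layout.
    let stride = 1usize << suffix_bits;
    (0..1usize << l0).map(|b| evals[b * stride + suffix]).collect()
}
```

With 16 evaluations, ℓ₀ = 2, and suffix 3, the gathered indices are 3, 7, 11, and 15.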
Added a parallel build_accumulators that binds suffixes, extends
prefixes to the Ud domain, applies the ∞/Cz rule, and routes
contributions via cached idx4 with E_in/E_out weighting. Expanded
accumulator tests with a naive cross-check, ∞ handling, and binary-β
zero behavior to validate correctness. Cleaned up dead-code allowances
now that the code paths are used.
Added explicit MSB-first checks for eq table generation,
gather_prefix_evals stride/pattern, and bind_poly_var_top to ensure
“top” binds the MSB. These tests catch silent index/order regressions
across components.
@wu-s-john wu-s-john changed the title Implement Algorithm 6 Foundation — Procedure 9 Accumulator Builder Implement Faster Sumcheck Algorithm — Procedure 9 Accumulator Builder Dec 18, 2025
Compute ℓ_i(X) = eq(w[<i], r[<i]) · eq(w_i, X) values for sum-check
rounds: ℓ_i(0) = α_i(1 − w_i), ℓ_i(1) = α_i·w_i, and
ℓ_i(∞) = α_i(2w_i − 1), where α_i = eq(w[<i], r[<i]).
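As a small worked sketch of these three evaluations: since eq(w_i, X) = (1 − w_i)(1 − X) + w_i·X is linear in X, its value at ∞ (read here as the leading coefficient, an assumption about the convention) is the difference ℓ_i(1) − ℓ_i(0). The code uses a 61-bit stand-in prime and hypothetical names.

```rust
/// Stand-in prime modulus (2^61 - 1); inputs are assumed to be < P.
const P: u64 = (1 << 61) - 1;

fn sub(a: u64, b: u64) -> u64 { ((a as u128 + P as u128 - b as u128) % P as u128) as u64 }
fn mul(a: u64, b: u64) -> u64 { ((a as u128 * b as u128) % P as u128) as u64 }

/// Evaluations of the linear factor alpha * eq(w, X) on [∞, 0, 1].
fn eq_round_evals(alpha: u64, w: u64) -> [u64; 3] {
    let at_zero = mul(alpha, sub(1, w)); // alpha * (1 - w)
    let at_one = mul(alpha, w);          // alpha * w
    let at_inf = sub(at_one, at_zero);   // slope: alpha * (2w - 1)
    [at_inf, at_zero, at_one]
}
```

Computing the ∞ entry as a difference costs one subtraction instead of a third multiplication.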
Replace range-indexed loops and a redundant closure with iterator forms
Add eq-round linear factor utilities and accumulator evaluation to
derive t_i and build s_i polynomials.
Track R_i and ℓ_i state to compare accumulator evals with
EqSumCheckInstance rounds.
indexing

Switch Spartan t_i to D=2 aliases/tests, precompute idx4 prefix/suffix
data, and flatten accumulator caches to cut allocations.
Csr (Compressed Sparse Row) stores variable-length lists with 2
allocations instead of N+1, improving cache locality. Replaces ad-hoc
offsets/entries arrays in build_accumulators
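A compressed-sparse-row container of this shape can be sketched as below. The field and method names are illustrative assumptions, not the PR's Csr API. All rows share one data buffer, and an offsets array marks where each row starts, so there are two backing vectors regardless of the number of rows.

```rust
/// Variable-length rows stored in two vectors instead of one Vec per row.
pub struct Csr<T> {
    /// offsets[i]..offsets[i + 1] is the range of row i inside `data`.
    offsets: Vec<usize>,
    data: Vec<T>,
}

impl<T> Csr<T> {
    /// Build from any iterator of rows, appending each row to the shared buffer.
    pub fn from_rows<I, R>(rows: I) -> Self
    where
        I: IntoIterator<Item = R>,
        R: IntoIterator<Item = T>,
    {
        let mut offsets = vec![0];
        let mut data = Vec::new();
        for row in rows {
            data.extend(row);
            offsets.push(data.len());
        }
        Csr { offsets, data }
    }

    /// Borrow row i as a contiguous slice.
    pub fn row(&self, i: usize) -> &[T] {
        &self.data[self.offsets[i]..self.offsets[i + 1]]
    }

    pub fn num_rows(&self) -> usize {
        self.offsets.len() - 1
    }
}
```

Because neighbouring rows are adjacent in memory, iterating rows in order walks `data` linearly, which is where the cache-locality benefit comes from.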
- Add prove_cubic_with_three_inputs_small_value combining small-value
  optimization for first ℓ₀ rounds with eq-poly optimization for
  remaining
- Introduce SPARTAN_T_DEGREE constant to centralize polynomial degree
  parameter
- Add sumcheck_sweep.rs examples for performance comparison
build_accumulators

The new from_boolean_evals_with_buffer_reusing method takes
caller-provided scratch buffers and alternates between them during
extension. This reduces allocations from O(num_x_in × num_x_out) per
call to O(num_threads) buffers allocated once per thread.
variants

Spartan version (D=2) skips binary betas since satisfying witnesses have
Az·Bz = Cz on {0,1}^n. Generic version supports arbitrary polynomial
products.
Adds a new example that tests prove_cubic_with_three_inputs and
prove_cubic_with_three_inputs_small_value produce identical proofs when
used with a real SHA256 circuit (Algorithm 6 validation).

Changes:
- Add PartialEq, Eq derive to SumcheckProof for proof comparison
- Add extract_outer_sumcheck_inputs helper to SpartanSNARK
- Add examples/sumcheck_sha256_equivalence.rs
Implement the small × large multiplication optimization from "Speeding
Up Sum-Check Proving" using Barrett reduction for ~3× speedup over naive
field multiplication.

Key changes:
  - Add SmallValueField trait for type-safe i32/i64 small-value
    operations
  - Implement Barrett reduction for Pallas Fp and Fq (sl_mul, isl_mul)
  - Add SpartanAccumulatorInput trait to unify field and i32 witness
    handling
  - Make LagrangeEvaluatedMultilinearPolynomial generic over element
    type
  - Update sumcheck prover to accept separate i32 witness polynomials
  - Clean up MultilinearPolynomial<i32>: remove unused
    from_u32/from_u64/from_field
@wu-s-john wu-s-john force-pushed the feat/procedure-9-accumulator branch from 2828f04 to 67674c4 Compare December 23, 2025 19:33
evaluations

Replace raw arrays and ad-hoc structs with proper abstractions for U_d =
{∞, 0, 1, ..., D-1} and Û_d = U_d \ {1} evaluation domains. Remove
EqRoundValues in favor of UdEvaluations<F, 2>.
- Delete unused constructor/predicate methods from UdPoint and
  UdHatPoint
- Move test-only methods (alpha, prefix_len, suffix_len,
  extend_from_boolean) to cfg(test) impl blocks
- Add CachedPrefixIndex struct with From impl to accumulator_index.rs
- Remove unused QuadraticTAccumulatorPrefixIndex type alias
- Delete unused eq_factor_alpha method from sumcheck
Hoist scratch buffers to thread-local state in
build_accumulators_spartan and build_accumulators. Previously, 5 vectors
were allocated on every x_out iteration; now allocations happen once per
Rayon thread subdivision.

- Add extend_in_place to LagrangeEvaluatedMultilinearPolynomial (avoids
  .to_vec())
- Add SpartanThreadState and GenericThreadState structs for buffer reuse
- Extract thread state structs to thread_state_accumulators module

Reduces allocations from O(num_x_out × num_x_in) to O(num_threads).
Move the witness polynomial abstraction trait from accumulators.rs to
its own module for better code organization. Rename from
SpartanAccumulatorInput to SpartanAccumulatorInputPolynomial to clarify
that it abstracts over multilinear polynomial representations (field
elements vs small values).
@wu-s-john wu-s-john changed the title Implement Faster Sumcheck Algorithm — Procedure 9 Accumulator Builder Implement Small-Value Sum-Check Optimization (Algorithm 6) Dec 23, 2025
@wu-s-john wu-s-john marked this pull request as ready for review December 23, 2025 23:56
- compute_idx4: derive l0 from beta.len() instead of taking as parameter
- csr: remove unused new() and push_empty(), move test helpers to
  #[cfg(test)]
- accumulators: add #[inline] to num_prefixes()
- examples: switch to tracing and #[instrument] for cleaner logging
- accumulator_index: add phase comments explaining prefix/suffix
  computation
- accumulators: use filter() instead of continue for beta_has_infinity
  check
- lagrange: document stride calculations in extend_in_place
- small_field: extract try_field_to_small_impl to deduplicate Fp/Fq
  impls
- small_field: document Barrett reduction loop bound (at most 2
  iterations)
Copilot AI left a comment

Pull request overview

This PR implements Algorithm 6 ("Small-Value Sum-Check with Eq-Poly Optimization") from the paper "Speeding Up Sum-Check Proving" by Bagad, Dao, Domb, and Thaler. The optimization targets Spartan's first sum-check invocation where witness polynomial evaluations are small integers, achieving significant prover speedups (1.5-1.64×) by replacing expensive field multiplications with cheaper native integer operations.

Key changes:

  • Introduces Barrett-optimized field arithmetic for multiplying small integers with field elements
  • Implements Lagrange domain extension for efficient round polynomial computation
  • Adds accumulator data structures for precomputing sum-check values
  • Integrates the optimization into the existing sum-check protocol

Reviewed changes

Copilot reviewed 17 out of 17 changed files in this pull request and generated 4 comments.

| File | Description |
|---|---|
| src/small_field.rs | Barrett-optimized arithmetic trait for small-value × field-element operations |
| src/lagrange.rs | Lagrange domain types and multilinear polynomial extension logic |
| src/accumulators.rs | Accumulator data structures and Procedure 9 implementation |
| src/accumulator_index.rs | Index mapping for distributing evaluation prefixes to accumulators |
| src/sumcheck.rs | Integration of Algorithm 6 into the sum-check protocol |
| src/thread_state_accumulators.rs | Thread-local buffers to reduce allocations in parallel execution |
| src/spartan_accumulator_input_polynomial.rs | Trait abstraction for witness polynomials |
| src/polys/multilinear.rs | Generic multilinear polynomial type and prefix gathering |
| src/eq_linear.rs | Utilities for eq-polynomial round factors |
| src/csr.rs | Compressed sparse row storage for variable-length lists |
| examples/sumcheck_sweep.rs | Benchmark sweep across polynomial sizes |
| examples/sumcheck_sha256_equivalence.rs | Equivalence test with SHA-256 circuit |

//! We currently implement a non-preprocessing version of Spartan
//! that is generic over the polynomial commitment and evaluation argument (i.e., a PCS).
#![deny(
warnings,

Copilot AI Dec 24, 2025

The unused lint has been removed from the deny list. This change allows unused code warnings to be suppressed, which could hide legitimate issues. Consider keeping unused in the deny list and using targeted #[allow(dead_code)] attributes where specific exceptions are needed.

Suggested change
warnings,
warnings,
unused,


#[inline]
fn small_from_u32(val: u32) -> i32 {
val as i32

Copilot AI Dec 24, 2025

The conversion from u32 to i32 using as casting is unsafe when the value exceeds i32::MAX. This could lead to incorrect negative values. Consider using checked conversion or documenting the assumption that val <= i32::MAX.

Suggested change
val as i32
i32::try_from(val).expect("small_from_u32: value does not fit in i32")
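The difference between the two conversions can be seen in a short sketch. The function names are hypothetical; the second variant returns an Option rather than panicking, which is an alternative to the `expect` in the suggestion above.

```rust
/// `as` reinterprets the bits, so values above i32::MAX become negative.
fn small_from_u32_wrapping(val: u32) -> i32 {
    val as i32
}

/// Checked conversion: values that do not fit in i32 are reported as None.
fn small_from_u32_checked(val: u32) -> Option<i32> {
    i32::try_from(val).ok()
}
```

Note that the `as` cast is not `unsafe` in the Rust sense; it is well defined but silently wraps, which is why a checked conversion or a documented precondition is worth having.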

Comment on lines 548 to 550
Self::from(val as u64)
} else {
-Self::from((-val) as u64)

Copilot AI Dec 24, 2025

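The negation using (-val) as u64 is incorrect for i32::MIN (-2147483648) because -i32::MIN overflows (a panic in debug builds, wraparound in release builds). Use val.unsigned_abs() to handle this edge case correctly; val.wrapping_neg() as u64 does not help here, because wrapping_neg returns i32::MIN again and the cast sign-extends it.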

Suggested change
Self::from(val as u64)
} else {
-Self::from((-val) as u64)
Self::from(val.unsigned_abs() as u64)
} else {
-Self::from(val.unsigned_abs() as u64)
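The i32::MIN edge case can be checked directly with a small sketch (hypothetical function names). `unsigned_abs` returns the magnitude as a u32, which widens to u64 without loss, whereas negating first and then casting sign-extends the still-negative value.

```rust
/// Magnitude of a signed value, correct for every i32 including i32::MIN.
fn magnitude_u64(val: i32) -> u64 {
    u64::from(val.unsigned_abs())
}

/// For comparison: wrapping negation followed by a cast. This agrees with
/// `magnitude_u64` for ordinary negative inputs but not for i32::MIN.
fn wrapping_neg_cast(val: i32) -> u64 {
    val.wrapping_neg() as u64
}
```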

Direct fixes:
- Remove redundant closures and unnecessary casts
- Replace manual_contains with .contains()
- Replace manual_is_multiple_of with .is_multiple_of()
- Replace useless vec! with array literal

Suppressions (intentional patterns):
- needless_range_loop: loop index serves dual purpose (indexing +
  computation)
- identity_op/erasing_op: operations like `0 * base * base` document
  index formulas

Typos:
- Rename `ein` to `e_in_eval` for clarity (eq evaluation at input point)
Replace per-iteration modular reductions with accumulated wide-integer
arithmetic, reducing once per beta instead of once per x_in iteration.

Key changes:
- Add WideLimbs<N> for wide unsigned integer arithmetic (6/8 limbs)
- Refactor SmallValueField to be generic over small value type (i32/i64)
- Add UnreducedMontInt types for delayed reduction in Montgomery form
- Replace SpartanAccumulatorInputPolynomial with MatVecMLE trait
- Optimize eq polynomial table computation (1 mul instead of 2 per
  element)
- Update benchmark to compare i32/i64 vs i64/i128 variants
- Add mac() helper for fused multiply-accumulate, eliminating temporary
  arrays in unreduced_mont_int_mul_add (4 implementations)
- Subtract in limb space before reduction via sub_mag(), saving one
  Barrett reduction per signed accumulator
- Replace large e_out tables with JIT-computed eyx scratch buffers,
  reducing eq table memory 7× and improving cache locality
- Add unreduced_is_zero() fast path to skip expensive modular reduction
- Precompute betas_with_infty indices to avoid filter in inner loop
- Use barrett_reduce_6_* directly for i128 products instead of padding
  to 8 limbs (saves 8 wasted multiplications per isl_mul call)
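The "1 mul instead of 2 per element" idea for the eq table can be sketched as follows, assuming a 61-bit stand-in prime and an MSB-first index order (w[0] selects the top bit of the table index). The names are hypothetical. Each existing entry t splits into t·w_i and t·(1 − w_i), and the second is obtained as t − t·w_i, so only one multiplication is needed per pair.

```rust
/// Stand-in prime modulus (2^61 - 1); inputs are assumed to be < P.
const P: u64 = (1 << 61) - 1;

fn sub(a: u64, b: u64) -> u64 { ((a as u128 + P as u128 - b as u128) % P as u128) as u64 }
fn mul(a: u64, b: u64) -> u64 { ((a as u128 * b as u128) % P as u128) as u64 }

/// Table of eq(w, x) for all x in {0,1}^n, indexed MSB-first.
fn eq_table(w: &[u64]) -> Vec<u64> {
    let mut table = vec![1u64];
    for &wi in w {
        let mut next = Vec::with_capacity(table.len() * 2);
        for &t in &table {
            let hi = mul(t, wi); // x_i = 1: t * w_i (the only multiplication)
            let lo = sub(t, hi); // x_i = 0: t * (1 - w_i) = t - t * w_i
            next.push(lo);
            next.push(hi);
        }
        table = next;
    }
    table
}
```

For w = [2, 3] the table is [2, −3, −4, 6] modulo P, and its entries sum to 1 as expected for an eq table.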
propagation

Replace mac(acc, 0, 0, carry) calls with simple overflowing_add to avoid
unnecessary u128 multiply-add pipeline for pure carry propagation. Also
add #[inline(always)] to hot path functions to ensure full inlining.
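A minimal sketch of the two approaches is shown below. The `mac` signature here is an assumption (a + b·c + carry returned as low and high words) and may differ from the PR's helper. Pure carry propagation needs no multiplication at all, so `overflowing_add` on each limb is enough.

```rust
/// General multiply-accumulate: a + b * c + carry, as (low word, high word).
/// The result always fits in 128 bits.
fn mac(a: u64, b: u64, c: u64, carry: u64) -> (u64, u64) {
    let wide = a as u128 + (b as u128) * (c as u128) + carry as u128;
    (wide as u64, (wide >> 64) as u64)
}

/// Carry-only propagation through little-endian limbs, stopping as soon as
/// the carry is absorbed. Returns the carry out of the top limb.
fn propagate_carry<const N: usize>(limbs: &mut [u64; N], mut carry: u64) -> u64 {
    for limb in limbs.iter_mut() {
        if carry == 0 {
            break;
        }
        let (sum, overflow) = limb.overflowing_add(carry);
        *limb = sum;
        carry = overflow as u64;
    }
    carry
}
```

`mac(limb, 0, 0, carry)` gives the same result as one step of `propagate_carry`, but routes through a 128-bit multiply-add that the carry-only case does not need.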
- Apply rustfmt formatting fixes in accumulators.rs
- Fix clippy manual_is_multiple_of warning in test code
Introduce circuit gadgets tailored to the small-value sumcheck
optimization:

- SmallMultiEq: Batches equality constraints with bounded coefficients,
  flushing at MAX_COEFF_BITS (31) instead of bellpepper's ~237. This
  keeps constraint coefficients within i32 bounds for the small-value
  optimization.

- SmallUInt32: 32-bit unsigned integer gadget using SmallMultiEq for
  carry constraints in addmany operations.

- small_sha256: SHA-256 implementation using the above gadgets,
  producing circuits where Az and Bz values fit in i32.

- Update sumcheck_sha256_equivalence example to use bellpepper's Circuit
  trait for constraint counting, comparing SmallSha256 vs bellpepper
  SHA-256.

The tradeoff: SmallSha256 generates ~17% more R1CS constraints due to
more frequent MultiEq flushing, but enables the small-value sumcheck
optimization.

Add 16-bit limbed addition for i32 small-value optimization

SmallUInt32::addmany produces coefficients up to 2^34, exceeding i32
bounds. Splitting into 16-bit limbs reduces max coefficient to 2^18,
enabling i32/i64 small-value sumcheck for SHA-256.
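The limb-splitting idea can be sketched on plain integers, outside any constraint system. This is only an arithmetic illustration with a hypothetical function name, not the addmany_limbed gadget itself: the low and high 16-bit halves are summed separately, and the carry out of the low sum is fed into the high sum, so each partial sum stays far below what a single 32-bit positional sum would reach.

```rust
/// Sum 32-bit operands modulo 2^32 using two 16-bit limb sums.
fn addmany_limbed_sketch(operands: &[u32]) -> u32 {
    // Low limbs: with k operands this sum is below k * 2^16.
    let lo_sum: u64 = operands.iter().map(|&x| (x & 0xFFFF) as u64).sum();
    // High limbs plus the carry out of the low limb sum.
    let hi_sum: u64 =
        operands.iter().map(|&x| (x >> 16) as u64).sum::<u64>() + (lo_sum >> 16);
    // Keep 16 bits from each limb; the overflow of the high limb is dropped,
    // which matches wrapping addition modulo 2^32.
    (((hi_sum & 0xFFFF) << 16) | (lo_sum & 0xFFFF)) as u32
}
```

The result agrees with folding the operands through `wrapping_add`, which is what a 32-bit addmany is expected to compute.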

- Add SmallValueConfig trait with Small32 (i32/i64) and Small64
  (i64/i128)
- Implement addmany_limbed using two constraints per addition
- Update SmallMultiEq to be generic over config
- Fix example to use config-specific bounds check
- Add examples/sha256_chain_benchmark.rs comparing original vs
  small-value sumcheck performance on SHA-256 hash chains
- CSV output includes witness synthesis time, sumcheck times, speedup,
  and witness percentage of total proving time
- CLI support: single <num_vars> for profiling, range-sweep for
  benchmarks
- Add small_sha256_with_prefix() for chaining multiple SHA-256 hashes
  with unique constraint namespaces
- Fix SmallValueField<i64> generic in lagrange.rs
- Fix unused variable warning in msm.rs
@microsoft-github-policy-service

@wu-s-john please read the following Contributor License Agreement(CLA). If you agree with the CLA, please reply with the following information.

@microsoft-github-policy-service agree [company="{your company}"]

Options:

  • (default - no company specified) I have sole ownership of intellectual property rights to my Submissions and I am not making Submissions in the course of work for my employer.
@microsoft-github-policy-service agree
  • (when company given) I am making Submissions in the course of work for my employer (or my employer has intellectual property rights in my Submissions by contract or applicable law). I have permission from my employer to make Submissions and enter into this Agreement on behalf of my employer. By signing below, the defined term “You” includes me and my employer.
@microsoft-github-policy-service agree company="Microsoft"
Split SmallValueField into two traits for better separation of concerns:
- SmallValueField: core small-value operations (ss_mul, sl_mul, isl_mul)
- DelayedReduction: unreduced accumulator operations for hot paths

Rename types for clarity:
- UnreducedMontInt → UnreducedFieldInt (field × integer products)
- UnreducedMontMont → UnreducedFieldField (field × field products)

Add FieldReductionConstants trait to deduplicate Barrett/Montgomery
reduction:
- Consolidates Fp/Fq constants (MODULUS, R256-R512, MONT_INV)
- Generic reduction functions monomorphized at compile time for zero
  overhead
- Comprehensive documentation explaining R constants (2^k mod p)

Performance and cleanup:
- Add ext_buf_idx scratch buffer to avoid Vec allocation in accumulator
  hot loop
- Remove unused OrderedVariable from shape_cs modules (~140 lines)
- Remove unused build_univariate_round_evals from sumcheck (~40 lines)
- Add log2_constraints column to benchmark CSV output
Split the 2,367-line small_field.rs into a proper module structure:
- small_field/small_value_field.rs: SmallValueField trait
- small_field/delayed_reduction.rs: DelayedReduction trait
- small_field/barrett.rs: Barrett/Montgomery reduction functions
- small_field/impls.rs: Fp/Fq implementations and tests
- small_field/mod.rs: re-exports and helper functions

Moved batching configuration types (NoBatching, Batching<K>,
BatchingMode, SmallMultiEqConfig, I32NoBatch, I64Batch21) from
small_field to gadgets/small_multi_eq.rs where they logically belong,
since they're specifically for constraint batching in SmallMultiEq.

Added detailed documentation for I64Batch21 explaining why K=21 is the
safe maximum: with SHA-256-like circuits having ~200 terms and 2^34
positional coefficients, batching 21 constraints keeps the worst-case
magnitude (2^62) under the i64 signed limit (2^63).
contributions

Refactors shared logic between Spartan and generic accumulator builders.
@wu-s-john wu-s-john force-pushed the feat/procedure-9-accumulator branch from 878e7b0 to 406b59e Compare January 9, 2026 06:00
Improves type safety and self-documentation by replacing (bool, [u64;
N]) with an explicit enum indicating whether the result is positive (a
>= b) or negative (a < b).
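A result type along these lines might look like the sketch below. The enum and function names are assumptions for illustration; only the idea of replacing a (bool, [u64; N]) pair with an explicit sign-carrying enum comes from the commit message. Limbs are little-endian.

```rust
use std::cmp::Ordering;

/// |a - b| together with the sign of a - b.
#[derive(Debug, PartialEq, Eq)]
enum SubMagResult<const N: usize> {
    /// a >= b; the payload is a - b.
    Positive([u64; N]),
    /// a < b; the payload is b - a.
    Negative([u64; N]),
}

fn sub_mag<const N: usize>(a: &[u64; N], b: &[u64; N]) -> SubMagResult<N> {
    // Compare from the most significant limb down.
    let a_ge_b = a.iter().rev().cmp(b.iter().rev()) != Ordering::Less;
    let (big, small) = if a_ge_b { (a, b) } else { (b, a) };
    let mut out = [0u64; N];
    let mut borrow = false;
    for i in 0..N {
        let (d1, b1) = big[i].overflowing_sub(small[i]);
        let (d2, b2) = d1.overflowing_sub(borrow as u64);
        out[i] = d2;
        borrow = b1 || b2;
    }
    if a_ge_b {
        SubMagResult::Positive(out)
    } else {
        SubMagResult::Negative(out)
    }
}
```

Callers then match on the variant instead of remembering which boolean value meant "negative".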
Move wide_limbs.rs content and limb arithmetic from barrett.rs into a
unified small_field/limbs.rs module for delayed modular reduction.
  Split monolithic lagrange.rs (1667 lines) into focused submodules:
  - domain.rs: LagrangePoint, LagrangeHatPoint, LagrangeIndex
  - evals.rs: LagrangeEvals, LagrangeHatEvals
  - basis.rs: LagrangeBasisFactory, LagrangeCoeff
  - extension.rs: LagrangeEvaluatedMultilinearPolynomial
  - accumulator.rs: RoundAccumulator, LagrangeAccumulators
  - accumulator_builder.rs: build_accumulators_spartan,
    build_accumulators

  Consolidate related files into the module:
  - accumulator_index.rs → index.rs
  - thread_state_accumulators.rs → thread_state.rs
  - eq_linear.rs → eq_round.rs

  Simplify extend_in_place API: use std::mem::swap to ensure result is
  always in first buffer, eliminating conditional buffer selection at
  call sites. Rename buf_a/b to buf_curr/scratch for clarity.
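The swap-based ping-pong pattern can be sketched for D = 2 as follows. This is an illustration under stated assumptions rather than the PR's extend_in_place: values are plain i64, the domain is ordered [∞, 0, 1] with p(∞) = p(1) − p(0) for a multilinear polynomial, variables are extended MSB-first, and the function name is hypothetical. After every pass the two buffers are swapped, so the latest result is always in the first buffer and the caller never has to check which one holds it.

```rust
/// Extend 2^n boolean evaluations to 3^n evaluations over U_2^n.
/// On return `buf_curr` holds the result; `scratch` is reusable workspace.
fn extend_to_u2(buf_curr: &mut Vec<i64>, scratch: &mut Vec<i64>, n: usize) {
    assert_eq!(buf_curr.len(), 1usize << n);
    let final_len = 3usize.pow(n as u32);
    buf_curr.resize(final_len, 0);
    scratch.clear();
    scratch.resize(final_len, 0);

    let mut prefixes = 1; // number of already-extended prefixes, 3^j
    for j in 0..n {
        let block = 1usize << (n - j); // boolean suffix entries per prefix
        let half = block / 2;
        for p in 0..prefixes {
            for s in 0..half {
                let lo = buf_curr[p * block + s];        // variable = 0
                let hi = buf_curr[p * block + half + s]; // variable = 1
                scratch[(3 * p) * half + s] = hi - lo;   // variable = ∞
                scratch[(3 * p + 1) * half + s] = lo;
                scratch[(3 * p + 2) * half + s] = hi;
            }
        }
        // Ping-pong: the freshly written buffer becomes the current one.
        std::mem::swap(buf_curr, scratch);
        prefixes *= 3;
    }
}
```

Both vectors can be kept in thread-local state and passed in again on the next call, so no allocation happens once they have reached their final capacity.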
  - Refactor SmallMultiEq from struct to trait with NoBatchEq and
    BatchingEq<K> implementations
  - Add addmany module with limbed (i32) and full (i64) addition
    algorithms
  - Deduplicate SHA-256 circuits into examples/circuits/sha256/ module
  - Update small_uint32 and small_sha256 to use SmallMultiEq trait
phase

- Extend MatVecMLE trait with UnreducedFieldField type for F×F
  accumulation
- Add unreduced bucket accumulators to SpartanThreadState
- Replace eyx precomputation with direct e_y access and z_beta = ex *
  tA_red
- Keep unreduced across all x_out iterations and merge without reduction
- Pre-compute beta values to eliminate closure overhead in scatter loop
- Final Montgomery reduction only once per bucket after thread merge

This reduces Montgomery reductions from ~7000+ per x_out to ~26 total
for typical parameters (l0=3, 128 x_outs).
savings

Replace asymmetric l/2 split with balanced ceil/floor split. This
reduces precomputation cost (e.g., 36→24 for l=10, l0=3), enables odd
number of rounds, and improves cache utilization by making e_xout
smaller.
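One way to read the 36→24 figure is as the combined number of eq-table entries for the two halves. The sketch below works under that assumption (it reproduces the quoted numbers, but the interpretation and the function names are not confirmed by the PR), and it arbitrarily gives the larger half to x_in in the balanced case.

```rust
/// Old split: x_in has l/2 bits and x_out has l/2 - l0 bits.
fn eq_entries_half_split(l: usize, l0: usize) -> usize {
    let x_in_bits = l / 2;
    let x_out_bits = l / 2 - l0;
    (1usize << x_in_bits) + (1usize << x_out_bits)
}

/// Balanced split: the remaining l - l0 bits are divided ceil/floor,
/// which also works when l - l0 is odd.
fn eq_entries_balanced(l: usize, l0: usize) -> usize {
    let rest = l - l0;
    let x_in_bits = (rest + 1) / 2; // ceil
    let x_out_bits = rest / 2;      // floor
    (1usize << x_in_bits) + (1usize << x_out_bits)
}
```

For l = 10 and l0 = 3 the old split gives 2⁵ + 2² = 36 entries, while the balanced split gives 2⁴ + 2³ = 24.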