This is a proof-of-concept implementation (in Nim) of the following idea:
We want to speed up the proof generation of Rate Limiting Nullifiers (RLN) by exploiting the following simple observation:
The witness of the Groth16 zero-knowledge proof is dominated by a part (namely, the Merkle inclusion proof) which changes much less often than proof generation normally happens; this fact, combined with how Groth16 proofs work, allows for a significant amount of precalculation. Details below.
Our circom circuit is very similar to the one in the PSE repo, and implements essentially the RLNv2 spec, but it's not exactly the same (see below for the details). However, this optimization should apply to those too, the same way.
TODO: proper benchmarking; but for some preliminary numbers, see below:
We used the default circuit parameters:
- `LIMIT_BITS = 16` (max 65536 messages per epoch per user)
- `MERKLE_DEPTH = 20` (max ~1 million registered users)
Circuit sizes:
- witness size = 5637 field elements
- unchanging = 5123 (can be precalculated)
- remaining = 514 (to be done each time)
So we can see that less than 10% of the witness changes at every proof generation; the rest changes only when a new user registers (and thus the Merkle tree changes).
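Concretely, the fraction that must be recomputed for every proof is:

$$\frac{514}{5637} \approx 9.1\%$$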
Single-threaded (macbook pro M2), excluding witness generation:
```
the quotient (FFTs)  took 0.0151 seconds
pi_A  (G1 MSM)       took 0.0203 seconds
rho   (G1 MSM)       took 0.0243 seconds
pi_B  (G2 MSM)       took 0.0869 seconds
pi_C  (2x G1 MSM)    took 0.0528 seconds
---------------------------------------
full proof took 0.2009 seconds
```
From this we can see that the MSMs dominate the proving time (about 0.18 of the 0.20 seconds), and most of that can be precalculated.
Some preliminary numbers for the partial proofs:
```
generating full witness    : 0.0013 seconds
generating full proof      : 0.2015 seconds
generating partial witness : 0.0021 seconds
generating partial proof   : 0.1362 seconds
finishing partial proof    : 0.0630 seconds
```
So we can already see a nice speedup of about 3x.
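That is, comparing a full proof to just finishing a precalculated partial one:

$$\frac{0.2015\ \mathrm{s}}{0.0630\ \mathrm{s}} \approx 3.2\times$$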
Note: This is very much just a quickly hacked together experiment, and there may be some further optimization opportunities (though at second glance, they appear to be very minor...).
Note that we are not generic in the choice of curve/field: we require the BN254 curve. This is only a limitation of this particular implementation; the idea would work exactly the same with e.g. BLS12-381.
Actual circuit differences:
- we use the Poseidon2 hash instead of `circomlib`'s Poseidon. The main reason for this is that we needed compatible CPU and circuit implementations, and this one was already implemented and ready to use. We need the CPU version to generate test inputs for the proof.
- we only use the Poseidon2 permutation with a fixed width `t=3` (for simplicity, and the above reason). In contrast, `circomlib` offers various widths (I believe `t=2,3,4` is used in PSE's circuit. A width `t` permutation can consume `t-1` field elements in one invocation)
- when computing `a1`, we use the formula `a1 := H(sk+j|ext)` instead of `H(sk|ext|j)`, as this results in one less `t=3` hash invocation (but this shouldn't really cause any issues, as `sk` is authenticated and `j` is range checked (and small))
- the Merkle root is an input, not an output (this way you cannot accidentally forget to check it for correctness externally, in the surrounding code)
- we input the Merkle leaf index directly, not as bits (you need to check them to be bits anyway, so this is essentially free; and it's somewhat simpler to use)
- no external dependencies (note that `circomlib` is LGPL, and used to be GPL a few years ago)
Remarks:
- if one feels "uneasy" with the `sk+j` hack in `a1`, an alternative, while still keeping `t=3`, would be to use `a1 = H(H(sk|ext)|j)`. This is more expensive (an extra hash permutation invocation), but the inner hash `H(sk|ext)` only changes once per epoch. So technically one could do a three-layered computation: precompute most of the things when the Merkle root changes; precompute this one hash at the start of the epoch; and finish at each message (see the sketch after this list). We don't implement these 3 layers here, as it would add a lot of extra complexity.
- one could also just use a wider `t=4` permutation as in PSE's version (that's less expensive than two invocations of the `t=3` one, but more expensive than the `sk+j` version)
- the computation `local_null = H(a1)` could be optimized by using a `t=2` Poseidon instance (instead of `t=3`), the same way as in the PSE version. We don't do that here simply because of the lack of a pre-made such Poseidon2 instance (see above).
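For concreteness, the three layers of that alternative would look like this (not implemented here):

$$
\begin{aligned}
\text{per Merkle root:} &\quad \text{precompute the constant part of the witness and the partial MSMs} \\
\text{per epoch:} &\quad h := H(sk \,\|\, ext) \\
\text{per message:} &\quad a_1 := H(h \,\|\, j), \ \text{etc.}
\end{aligned}
$$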
Public inputs:
"merkle_root"(rarely changes)"ext_null"(changes once per epoch)"msg_hash"(changes at each message)
Private inputs:
"secret_key"(never changes)"msg_limit"(never changes)"leaf_idx"(normally never changes, though I guess in theory the registry could "garbage collect" once a while)"merkle_path"(changes only when the Merkle root changes)"msg_idx"(changes at each message)
Public outputs:
"y_value""local_null"
We want to precalculate the part of the witness which depends only on the rarely changing inputs (including `"merkle_path"`), and also the part of the proof which depends only on that part of the witness.
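To illustrate, here is a hypothetical Nim-side grouping of the inputs by change frequency (the names mirror the lists above; `Fr` is just a stand-in for a BN254 field element type, not the actual API of this repo):

```nim
type
  Fr = uint64                    # stand-in for a BN254 scalar field element

  RarelyChanging = object        # redo the precalculation when these change
    secretKey, msgLimit, leafIdx: Fr
    merklePath: array[20, Fr]    # MERKLE_DEPTH = 20
    merkleRoot: Fr

  PerEpoch = object
    extNull: Fr

  PerMessage = object            # only these change for every single proof
    msgHash, msgIdx: Fr
```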
A first step is to figure out which part of the witness is constant (in practice it's of course not constant, just rarely changing, but we will call it constant for brevity).
As we prefer to use (relatively) high-level languages (e.g. circom) to write our circuits, this should be automated; and we also need to split the witness generation into a precalculation step and a finishing step.
This is not too hard to do: we only need to annotate which circuit inputs (both private and public) are constant, and then from the witness calculation graph we can derive all the elements of the witness which are constant.
In the same way, we can also easily split the witness generation into two parts.
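Below is a minimal sketch of both steps in Nim, assuming a simplified computation graph (the op set, node layout and `uint64` "field" arithmetic are toy stand-ins, not the actual circom-witnesscalc format):

```nim
type
  OpKind = enum opInput, opConst, opAdd, opMul
  Node = object
    kind: OpKind
    args: array[2, int]   # indices of earlier nodes (graph is topologically ordered)
    value: uint64         # for opConst (uint64 stands in for the field)
    inputIdx: int         # for opInput

# Step 1: propagate the "constant" flag from the annotated inputs.
proc markConstant(graph: seq[Node], constInput: seq[bool]): seq[bool] =
  result = newSeq[bool](graph.len)
  for i, node in graph:
    result[i] = case node.kind
      of opInput: constInput[node.inputIdx]
      of opConst: true
      else: result[node.args[0]] and result[node.args[1]]

# Step 2: evaluate one half of the graph over a shared value buffer;
# call it once with constPass = true (precalculation, cached), and
# then with constPass = false for each proof (finishing).
proc evalHalf(graph: seq[Node], inputs: seq[uint64], isConst: seq[bool],
              constPass: bool, values: var seq[uint64]) =
  for i, node in graph:
    if isConst[i] != constPass: continue
    values[i] = case node.kind
      of opInput: inputs[node.inputIdx]
      of opConst: node.value
      of opAdd:   values[node.args[0]] + values[node.args[1]]
      of opMul:   values[node.args[0]] * values[node.args[1]]
```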
The goal of circom-witnesscalc (a project associated with circom, though it kind of originated externally) is to "liberate" the witness generation from the WASM / C++ code generated by the "old circom", both of which are completely impenetrable.
Remark: circom actually allows the structure of the witness computation to depend on the inputs (not only on the circuit), which makes it more flexible (but harder to implement). This flexibility is however relatively rarely needed. Fortunately for us, because of the complexity, early versions of circom-witnesscalc couldn't handle this "dynamic" case, only the case where the witness computation is a completely static, linear program (no control flow).
So while recent versions of circom-witnesscalc implement a virtual machine to execute the witness generation process (instead of the old generated WASM or non-portable C++), the initial versions produced a so-called "computation graph" instead, which is essentially a static linear sequence of primitive operations, without any control flow.
This graph can be exported into a file via the `build-circuit` program, and then can be parsed and interpreted, compiled, or analysed.
The static structure makes it very easy to figure out exactly which elements of the witness depend only on the "constant" circuit inputs.
Recall the formulas to calculate a Groth16 proof (if you are reading this on github, well, unfortunately, github's LaTeX parsing is completely broken...):

$$
\begin{aligned}
\pi_A &= [\alpha]_1 + \sum_j z_j\,[A_j(\tau)]_1 + r\,[\delta]_1 \\
\pi_B &= [\beta]_2 + \sum_j z_j\,[B_j(\tau)]_2 + s\,[\delta]_2 \\
\rho  &= [\beta]_1 + \sum_j z_j\,[B_j(\tau)]_1 + s\,[\delta]_1 \\
\pi_C &= \sum_{j\notin\mathsf{pub}} z_j\,[K_j]_1 + \sum_i q_i\,[Z_i]_1 + s\,\pi_A + r\,\rho - rs\,[\delta]_1
\end{aligned}
$$

where $z_j$ are the elements of the witness, $q_i$ are the coefficients of the quotient polynomial, $r,s$ are random field elements, and the bracketed points come from the proving key (the snarkjs version is slightly different here).
Observe that this proof computation is dominated by 5 multi-scalar multiplications (MSMs), 4 of which depend only on the witness (the 5th is with the coefficients of the quotient polynomial, which unfortunately cannot be precalculated).
So we will simply precalculate the partial sums (MSM-s) of the form

$$S_{\mathsf{pre}} = \sum_{j\,\in\,\mathsf{const}} z_j\,[X_j]$$

where $\mathsf{const}$ is the set of witness indices depending only on the "constant" inputs, and $[X_j]$ is the corresponding proving key point ($[A_j(\tau)]_1$, $[B_j(\tau)]_1$, $[B_j(\tau)]_2$ or $[K_j]_1$, respectively). At proof time, only the terms for the remaining, changing part of the witness need to be added.
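A toy illustration of this split, with `int64` standing in for the curve group (the real implementation uses BN254 points and a proper MSM algorithm; the only point here is which terms can be summed ahead of time):

```nim
# offline, once per Merkle root change: sum the terms with "constant" scalars
proc precomputeMsm(z, pts: seq[int64], isConst: seq[bool]): int64 =
  for i in 0 ..< z.len:
    if isConst[i]:
      result += z[i] * pts[i]

# online, for each proof: add only the ~10% of terms that actually changed
proc finishMsm(pre: int64, z, pts: seq[int64], isConst: seq[bool]): int64 =
  result = pre
  for i in 0 ..< z.len:
    if not isConst[i]:
      result += z[i] * pts[i]
```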
The only other significant computation is the quotient polynomial; that's usually done with FFTs. Parts of that could possibly also be precomputed, but most probably it wouldn't give a significant speedup.
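For reference, in the usual QAP formulation (over a power-of-two evaluation domain) the quotient is

$$q(x) = \frac{A(x)\,B(x) - C(x)}{Z(x)}, \qquad Z(x) = x^N - 1,$$

where $A$, $B$, $C$ are the witness-dependent combinations of the constraint polynomials; computing the three of them from their evaluations is what costs the 3 IFFT / FFT pairs mentioned below.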
From some measurement experiments, it seems that the bulk (like 98%) of the Groth16 proof computation is spread over the following computations:
- $\pi_A$ is an MSM on $\mathbb{G}_1$ of size $M$
- $\rho$ is an MSM on $\mathbb{G}_1$ of size $M$
- $\pi_B$ is an MSM on $\mathbb{G}_2$ of size $M$ (note: $\mathbb{G}_2$ is significantly slower!)
- $\sum_j z_j\,[K_j]$, which is a $\mathbb{G}_1$ MSM of size $(M-\#\mathsf{pub})\approx M$
- computing the $q_i$, which is 3 IFFT / FFT pairs (basically 6 FFTs) of size $N$ (rounded up to the next power of two)
- $\sum_i q_i\,[Z_i]$, which is again a size $N$ MSM on $\mathbb{G}_1$

(here $M$ denotes the size of the witness, and $N$ the number of constraints).
The first 4 of these offer a significant opportunity for precalculation, but the last two don't really.
Given that these computations are pretty much what they are, measuring any reasonable implementation gives you a rough bound on how much speedup this idea can buy you. Based on this, unfortunately I don't expect any more low-hanging fruit (so, the 3x speedup it is).