Bench (AArch64 Cortex-A76, -O2, taskset core 2, `bench_bitcoin -filter='CHACHA20_.*' -min-time=1000`, 5 runs; median [min-max] ns/byte):
- clang 22: 1MB 2.20 [2.20-2.21], 256B 2.35 [2.35-2.35], 64B 2.59 [2.57-2.62]
- gcc 14: 1MB 2.45 [2.44-2.45], 256B 2.51 [2.51-2.51], 64B 2.69 [2.68-2.70]
CHACHA20_64BYTES is the single-block path, so it's a good sanity check for noise.
Assembly (scalar path): both compilers lower `std::rotl` to rotates and keep the round math in scalar registers. Example (gcc, quarterround fragment):
eor w3, w3, w7
ror w3, w3, #16
add w5, w5, w2
Delta vs base: no measurable change (this is a refactor to simplify later vector work).
Bench (AArch64 Cortex-A76, -O2, taskset core 2, `bench_bitcoin -filter='CHACHA20_.*' -min-time=1000`, 5 runs; median [min-max] ns/byte):
- clang 22: 1MB 1.80 [1.79-1.80], 256B 1.63 [1.63-1.64], 64B 2.59 [2.57-2.60]
- gcc 14: 1MB 5.37 [5.37-5.38], 256B 5.14 [5.13-5.15], 64B 2.70 [2.70-2.75]
The speedup/slowdown only shows up once we hit the multi-block path (1MB/256B).
Single-block (64B) remains scalar and stays ~unchanged.
Assembly highlights (AArch64):
- clang emits NEON-friendly rotates/shuffles (`shl`+`usra` and `ext`) with a small stack frame.
- gcc emits a very large stack frame and scalar pack/unpack sequences around shuffles.
Example prologue (gcc):
mov x13, #0x9160
sub sp, sp, x13
Example inner-sequence (gcc):
fmov x18, d18
bfxil x10, x18, #0, #32
Example inner-sequence (clang):
usra v25.4s, v16.4s, #25
ext v22.16b, v10.16b, v10.16b, #4
Delta vs previous commit:
- clang: ~18% faster at 1MB (2.20 -> 1.80 ns/B)
- gcc: ~2.2x slower at 1MB (2.45 -> 5.37 ns/B) due to poor multi-state codegen.
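For reference, the `shl`+`usra` pair clang emits is its typical lowering of a lane-wise rotate written with GNU vector extensions (NEON has no vector rotate instruction). A sketch, with hypothetical type and function names:

```cpp
#include <cstdint>

// 4 x uint32_t in one 128-bit vector (GNU vector extension syntax).
using vec128 = uint32_t __attribute__((__vector_size__(16)));

// Lane-wise rotate-left. On AArch64, clang typically compiles the shift-or
// pair to `shl` (shift left) followed by `usra` (unsigned shift right and
// accumulate), avoiding a separate `orr`.
static inline vec128 rotl_lanes(vec128 x, int n)
{
    return (x << n) | (x >> (32 - n));
}
```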
… `static_for` loops
Bench (AArch64 Cortex-A76, -O2, taskset core 2, `bench_bitcoin -filter='CHACHA20_.*' -min-time=1000`, 5 runs; median [min-max] ns/byte):
- clang 22: 1MB 1.80 [1.79-1.80], 256B 1.63 [1.63-1.64], 64B 2.59 [2.58-2.60]
- gcc 14: 1MB 6.66 [6.64-6.79], 256B 5.02 [5.02-5.03], 64B 2.70 [2.68-2.72]
This refactor keeps clang flat, but makes gcc's 1MB case substantially worse.
Assembly highlights (gcc): instruction count explodes (CHACHA20_1MB `ins/byte` ~43.7)
with many vector loads/stores and branches (lambda clones / `ld1`/`st1` heavy). Example
(from one of the inlined helper clones):
st1 {v26.16b-v27.16b}, [x4]
ldp q26, q27, [x2, #64]
Delta vs previous commit:
- gcc: 1MB 5.37 -> 6.66 ns/B (regression)
- clang: essentially unchanged.
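The `static_for` construct under discussion is a fold expression over an index sequence, so each iteration receives a constant-expression index; a minimal self-contained sketch (simplified relative to the patch, which forces inlining via an `ALWAYS_INLINE` attribute):

```cpp
#include <cstddef>
#include <type_traits>
#include <utility>

// Expands to fn(0), fn(1), ..., fn(N-1) at compile time; each call gets an
// integral_constant, so the index is usable in constant expressions.
template <typename Fn, std::size_t... Is>
inline void static_for_impl(Fn&& fn, std::index_sequence<Is...>)
{
    (fn(std::integral_constant<std::size_t, Is>{}), ...);
}

template <std::size_t N, typename Fn>
inline void static_for(Fn&& fn)
{
    static_for_impl(std::forward<Fn>(fn), std::make_index_sequence<N>{});
}
```

Because the body is stamped out once per index, a compiler that inlines the lambda clones per iteration can bloat code size, which is consistent with the gcc regression reported above.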
…ime iteration
Bench (AArch64 Cortex-A76, -O2, taskset core 2, `bench_bitcoin -filter='CHACHA20_.*' -min-time=1000`, 5 runs; median [min-max] ns/byte):
- clang 22: 1MB 1.85 [1.85-1.89], 256B 1.72 [1.72-1.73], 64B 2.59 [2.59-2.60]
- gcc 14: 1MB 4.51 [4.50-4.51], 256B 4.59 [4.58-4.59], 64B 2.72 [2.70-2.72]
This is the first refactor that materially helps gcc again: the multi-state path shrinks substantially (much less codegen bloat), reducing `ins/byte` (43.7 -> 25.5) for CHACHA20_1MB.
Assembly highlight (gcc): far less scalar shuffling glue and reduced stack pressure (stack allocation drops from ~0x16c0 to ~0x1530, and objdump size shrinks sharply).
Delta vs previous commit:
- gcc: 1MB 6.66 -> 4.51 ns/B (still slower than scalar baseline, but improved)
- clang: slight regression (1.80 -> 1.85 ns/B), consistent with less aggressive unrolling.
…efficiency
Bench (AArch64 Cortex-A76, -O2, taskset core 2, `bench_bitcoin -filter='CHACHA20_.*' -min-time=1000`, 5 runs; median [min-max] ns/byte):
- clang 22: 1MB 1.79 [1.79-1.80], 256B 1.63 [1.63-1.64], 64B 2.59 [2.58-2.60]
- gcc 14: 1MB 5.36 [5.35-5.36], 256B 5.16 [5.15-5.16], 64B 2.72 [2.69-2.73]
The additional unrolling helps clang but hurts gcc again. On gcc the multi-state function grows and spills more (large stack frame), pushing 1MB back near the original regression.
Delta vs previous commit:
- gcc: 1MB 4.51 -> 5.36 ns/B (regression)
- clang: 1MB 1.85 -> 1.79 ns/B (improvement)
…0 vector implementation
Bench (AArch64 Cortex-A76, -O2, taskset core 2, `bench_bitcoin -filter='CHACHA20_.*' -min-time=1000`, 5 runs; median [min-max] ns/byte):
- clang 22: 1MB 1.86 [1.86-1.87], 256B 1.73 [1.72-1.73], 64B 2.60 [2.58-2.60]
- gcc 14: 1MB 5.74 [5.73-5.74], 256B 5.29 [5.29-5.30], 64B 2.71 [2.69-2.73]
This reshuffle/loop consolidation ends up worsening both compilers slightly, but the impact is far larger on gcc. The gcc variant again has a huge stack frame and many extra instructions in the multi-state path (`ins/byte` ~35.7 for CHACHA20_1MB).
Assembly contrast (AArch64):
- clang: still uses `ext` for lane shuffles and keeps the stack relatively small.
- gcc: spills and uses scalar pack/unpack sequences; stack allocation is ~0x60a0.
Delta vs previous commit:
- clang: 1MB 1.79 -> 1.86 ns/B
- gcc: 1MB 5.36 -> 5.74 ns/B
…20 handling
Bench (AArch64 Cortex-A76, -O2, taskset core 2, `bench_bitcoin -filter='CHACHA20_.*' -min-time=1000`, 5 runs; median [min-max] ns/byte):
- clang 22: 1MB 1.86 [1.85-1.86], 256B 1.73 [1.72-1.73], 64B 2.59 [2.58-2.60]
- gcc 14: 1MB 5.74 [5.73-5.75], 256B 5.29 [5.28-5.29], 64B 2.71 [2.69-2.73]
On this Cortex-A76 benchmark, results are unchanged vs the prior commit (within measurement noise). The changes here primarily prepare/extend the generic logic for a broader set of targets.
Bench (AArch64 Cortex-A76, -O2, taskset core 2, `bench_bitcoin -filter='CHACHA20_.*' -min-time=1000`, 5 runs; median [min-max] ns/byte):
- clang 22: 1MB 1.86 [1.85-1.86], 256B 1.72 [1.72-1.73], 64B 2.59 [2.59-2.60]
- gcc 14: 1MB 5.79 [5.78-5.81], 256B 5.29 [5.28-5.29], 64B 2.71 [2.69-2.72]
This change is mostly about refining GCC gating on other architectures (e.g. x86 with/without AVX2). On AArch64 it doesn't improve GCC's multi-state codegen yet; GCC still emits a very large vectorized function (stack allocation ~0x5920) and high instruction counts.
…d paths
Bench (AArch64 Cortex-A76, -O2, taskset core 2, `bench_bitcoin -filter='CHACHA20_.*' -min-time=1000`, 5 runs; median [min-max] ns/byte):
- clang 22: 1MB 1.86 [1.86-1.86], 256B 1.73 [1.72-1.73], 64B 2.59 [2.58-2.60]
- gcc 14: 1MB 2.45 [2.44-2.45], 256B 2.53 [2.52-2.53], 64B 2.71 [2.69-2.72]
Key point: gcc's multi-state vectorized path was a regression on AArch64 (5.7 ns/B class). This commit avoids that by disabling all multi-state variants for gcc on AArch64, effectively falling back to the scalar implementation for multi-block inputs (bringing gcc back near baseline).
Also fix the build when all multi-state paths are disabled: avoid referencing `process_blocks<N>` from code that is preprocessor-disabled, so GCC can compile cleanly with a complete disable set.
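The disable described here amounts to a compiler/architecture gate; a sketch (macro name hypothetical, not the actual identifier used in the patch):

```cpp
// Hypothetical gate: GCC (but not clang, which also defines __GNUC__) on
// AArch64 gets the scalar fallback; everyone else keeps the multi-state
// vectorized variants.
#if defined(__GNUC__) && !defined(__clang__) && defined(__aarch64__)
#  define CHACHA20_USE_MULTISTATE 0
#else
#  define CHACHA20_USE_MULTISTATE 1
#endif
```

The `!defined(__clang__)` check matters because clang defines `__GNUC__` for compatibility, and clang's multi-state codegen is the fast case that must stay enabled.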
On AArch64/NEON, GCC's codegen for 256-bit `__builtin_shufflevector` patterns was the root cause of the large perf gap (scalar spills + `fmov`/`bfi`/`bfxil` sequences).
Keep Clang on the existing 256-bit vector path, but use a GCC-specific split-lane `vec256` representation (two 128-bit lanes) so GCC can use native NEON shuffles and keep the state in registers. This also enables a multi-state path for GCC again on AArch64 (use 8/4-state; keep 16/6 disabled to limit register pressure).
Bench (AArch64 Cortex-A76, -O2, taskset core 2, `bench_bitcoin -filter='CHACHA20_.*' -min-time=1000`, 5 runs; median ns/byte):
- GCC 14.2: 1MB 1.85, 256B 2.17, 64B 2.71
- Clang 22: 1MB 1.87, 256B 1.73, 64B 2.59
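A minimal sketch of the split-lane idea (names hypothetical; the real code wires this into the shared `vec256` abstraction and its shuffle helpers):

```cpp
#include <cstdint>

// One native NEON q-register worth of state.
using vec128 = uint32_t __attribute__((__vector_size__(16)));

// GCC-specific 256-bit representation: two explicit 128-bit lanes, so every
// operation stays a per-q-register NEON instruction instead of GCC spilling
// a synthetic 256-bit vector through scalar fmov/bfi/bfxil glue.
struct vec256_split {
    vec128 lo, hi;
};

static inline vec256_split operator+(vec256_split a, vec256_split b)
{
    return {a.lo + b.lo, a.hi + b.hi};
}
```

Shuffles that cross the 128-bit boundary become explicit lane swaps on this representation, which is exactly what NEON can express natively (`ext`, `rev`, etc.).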
On AArch64/NEON there are 32 128-bit vector registers. The "16-state" variant (8 half-states) needs ~64 128-bit lanes' worth of live state (because `vec256` lowers to two 128-bit lanes on NEON), so it spills heavily (notably on clang). Disable `STATES_16` on AArch64 to force the 8-state path, which fits in registers and is substantially faster. Also disable `STATES_6` on AArch64: it increases code size and hurts the common 8/4-state path on this target.
Make the per-half-state helpers compile-time sized (no runtime `half_states` argument). This lets compilers fully specialize the inner loops; GCC in particular stops generating extra control flow and spill glue around the multi-state path.
Finally, on AArch64/NEON clang's codegen for the aligned I/O fast path (`std::assume_aligned` + 32-byte memcpy) is slower than the plain unaligned variant. Prefer the unaligned path for clang.
Bench (AArch64 Cortex-A76, -O2, taskset core 2, `bench_bitcoin -filter='CHACHA20_.*' -min-time=10000`, 5 runs; median [min-max] ns/byte):
- clang 22: 1MB 1.47 [1.46-1.48], 256B 1.64 [1.64-1.65], 64B 2.59 [2.59-2.60]
- gcc 14: 1MB 1.71 [1.71-1.71], 256B 1.95 [1.95-1.97], 64B 2.70 [2.69-2.72]
Delta vs previous commit (CHACHA20_1MB, -min-time=10000):
- clang: 1.86 -> 1.47 ns/B (avoid 16-state spills; avoid aligned fast path)
- gcc: 1.85 -> 1.71 ns/B (tightened half-state loops)
l0rinc commented Feb 17, 2026, on:

  }

- inline void ChaCha20Aligned::Crypt(std::span<const std::byte> in_bytes, std::span<std::byte> out_bytes) noexcept
+ static inline void chacha20_crypt(std::span<const std::byte> in_bytes, std::span<std::byte> out_bytes, uint32_t input[12]) noexcept

  std::byte* c = out_bytes.data();
- size_t blocks = out_bytes.size() / BLOCKLEN;
- assert(blocks * BLOCKLEN == out_bytes.size());
+ size_t blocks = out_bytes.size() / ChaCha20Aligned::BLOCKLEN;
l0rinc commented Feb 17, 2026, on:

- inline void ChaCha20Aligned::Crypt(std::span<const std::byte> in_bytes, std::span<std::byte> out_bytes) noexcept
+ static inline void chacha20_crypt(std::span<const std::byte> in_bytes, std::span<std::byte> out_bytes, uint32_t input[12]) noexcept

  #include <crypto/common.h>
  #include <crypto/chacha20.h>
+ #include <crypto/chacha20_vec.h>
  #include <support/cleanse.h>

  #include <algorithm>
  #include <bit>
  #include <cassert>
  #include <limits>
direct comment from global
l0rinc commented Feb 17, 2026, on src/crypto/chacha20_vec.ipp (outdated):

  }

+ template <size_t N, typename Fn>
+ ALWAYS_INLINE void static_for(Fn&& fn)
l0rinc commented Feb 17, 2026, on src/crypto/chacha20_vec.ipp (outdated):

  using vec256 = uint32_t __attribute__((__vector_size__(32)));

+ // Like Bitcoin Core's `ALWAYS_INLINE` in other files, but kept local to avoid touching shared headers.

and on:

  {
- for (size_t i = 0; i < half_states; ++i) {
-     arr[i] = vec;
+ CHACHA20_VEC_UNROLL(8)
- inline void ChaCha20Aligned::Crypt(std::span<const std::byte> in_bytes, std::span<std::byte> out_bytes) noexcept
+ static inline void chacha20_crypt(std::span<const std::byte> in_bytes, std::span<std::byte> out_bytes, uint32_t input[12]) noexcept
  {
      assert(in_bytes.size() == out_bytes.size());

+ static_assert(ChaCha20Aligned::BLOCKLEN == CHACHA20_VEC_BLOCKLEN);

  #define QUARTERROUND(a,b,c,d) \
l0rinc commented Feb 17, 2026, on:

  #define QUARTERROUND(a,b,c,d) \
      a += b; d = std::rotl(d ^ a, 16); \
      c += d; b = std::rotl(b ^ c, 12); \

  blocks -= 1;
- c += BLOCKLEN;
  m += BLOCKLEN;
+ c += ChaCha20Aligned::BLOCKLEN;

and on:

  @@ -7,11 +7,15 @@

  #include <crypto/common.h>
  #include <crypto/chacha20.h>
src/crypto/chacha20_vec.ipp (outdated):

  }

+ template <size_t N, typename Fn, size_t... Is>
+ ALWAYS_INLINE void static_for_impl(Fn&& fn, std::index_sequence<Is...>)
+ {
+     (fn(std::integral_constant<size_t, Is>{}), ...);
+ }
lorinc@M4-Max bitcoin % for commit in ab23325 1cf4ca6 781876e e81ad4f 684c6b8; do
git fetch origin $commit >/dev/null 2>&1 && git checkout $commit >/dev/null 2>&1 && echo "" && git log -1 --pretty='%h %s' &&
rm -rf build >/dev/null 2>&1 && cmake -B build -DBUILD_BENCH=ON -DCMAKE_BUILD_TYPE=Release >/dev/null 2>&1 &&
cmake --build build -j$(nproc) >/dev/null 2>&1 &&
for _ in $(seq 5); do
sleep 5;
sudo taskpolicy -t 5 -l 5 nice -n -20 ./build/bin/bench_bitcoin -filter='CHACHA20_.*' -min-time=1000;
done;
done
ab23325 Merge bitcoin#33866: refactor: Let CCoinsViewCache::BatchWrite return void
1cf4ca6 chacha20: move single-block crypt to inline helper function
781876e chacha20: Add generic vectorized chacha20 implementation
e81ad4f refactor: replace recursive templates in ChaCha20 implementation with static_for loops
684c6b8 refactor: replace template-based static_for use in ChaCha20 with runtime iteration
3b47fec refactor: unroll ChaCha20 vector operations for improved clarity and efficiency
3e6fcae refactor: unify loop unrolling macros and refactor ChaCha20 vector operations for clarity
64471e2a64 refactor: modularize ChaCha20 vector operations and consolidate common patterns
[per-run CHACHA20_1MB/CHACHA20_256BYTES/CHACHA20_64BYTES table output was not captured in this scrape]