cross-language benchmark: elastic hash 2-3x faster on both x86 and Apple M4 #4
Open: joshuaisaact wants to merge 195 commits into main from autoresearch/cross-language-bench
195 commits
a4a2c7a
tier-first search order in get() for better cache locality
joshuaisaact 25c5de2
Revert "tier-first search order in get() for better cache locality"
joshuaisaact 330ddb6
cross-tier prefetching in get() to hide inter-tier latency
joshuaisaact d5d671a
Revert "cross-tier prefetching in get() to hide inter-tier latency"
joshuaisaact f9fc247
remove all prefetching from get() to test if it helps or hurts
joshuaisaact 18b4ce8
add tier-0 probe-0 fast path in get()
joshuaisaact 446e450
Revert "add tier-0 probe-0 fast path in get()"
joshuaisaact b48b0a3
reduce MAX_PROBES from 32 to 24
joshuaisaact 6bbdd09
reduce MAX_PROBES from 24 to 20
joshuaisaact 6ca9ddc
cap get() tier search to 8 tiers
joshuaisaact d42e9ee
reduce MAX_LOOKUP_TIERS from 8 to 6
joshuaisaact 33c9feb
Revert "reduce MAX_LOOKUP_TIERS from 8 to 6"
joshuaisaact 9530f6c
use fixed-size arrays for tier metadata instead of heap slices
joshuaisaact 1c52805
Revert "use fixed-size arrays for tier metadata instead of heap slices"
joshuaisaact 907f4f8
switch hash to Stafford variant 13 (splitmix64 finalizer)
joshuaisaact a8629ca
Revert "switch hash to Stafford variant 13 (splitmix64 finalizer)"
joshuaisaact bbf67f0
switch to linear probing for better cache locality
joshuaisaact 6e5b3dc
add early exit on empty bucket slots in get()
joshuaisaact e580524
Revert "add early exit on empty bucket slots in get()"
joshuaisaact 5f62379
combined fp+empty SIMD check with per-tier early exit
joshuaisaact ac8b602
Revert "combined fp+empty SIMD check with per-tier early exit"
joshuaisaact 6f57a0a
double tier 0 size to concentrate elements for faster lookup
joshuaisaact f7328f9
quadruple tier 0 size (2x capacity / BUCKET_SIZE)
joshuaisaact be1c046
Revert "quadruple tier 0 size (2x capacity / BUCKET_SIZE)"
joshuaisaact 710938e
reduce MAX_LOOKUP_TIERS from 8 to 4 with larger tier 0
joshuaisaact 3d427e3
reduce MAX_LOOKUP_TIERS from 4 to 3
joshuaisaact 6188a77
reduce MAX_LOOKUP_TIERS from 3 to 2
joshuaisaact c9ea0fe
reduce MAX_LOOKUP_TIERS to 1 (tier 0 only)
joshuaisaact 8c9b749
simplify get() to single tier 0 loop, no tier iteration
joshuaisaact eaf5bbe
Revert "simplify get() to single tier 0 loop, no tier iteration"
joshuaisaact a4c6c90
use inline for to unroll probe loop in get()
joshuaisaact db48705
Revert "use inline for to unroll probe loop in get()"
joshuaisaact 4841fba
reduce MAX_PROBES from 20 to 10
joshuaisaact 4260bd4
reduce MAX_PROBES from 10 to 8
joshuaisaact 004b0b7
reduce MAX_PROBES from 8 to 6
joshuaisaact 03b078a
Revert "reduce MAX_PROBES from 8 to 6"
joshuaisaact 2f6f0b8
increase BUCKET_SIZE to 32 for AVX2, generic mask types
joshuaisaact 94cd8b0
Revert "increase BUCKET_SIZE to 32 for AVX2, generic mask types"
joshuaisaact 784e8ca
try quadratic probing instead of linear
joshuaisaact eb3837d
Revert "try quadratic probing instead of linear"
joshuaisaact 6f15cd3
inline for unroll 8-probe loop and eliminate tier loop
joshuaisaact ab64ffb
Revert "inline for unroll 8-probe loop and eliminate tier loop"
joshuaisaact dd36b35
simplify get() to tier-0-only with comptime MAX_PROBES loop bound
joshuaisaact e070ea9
remove insert prefetch
joshuaisaact 9ab9cc0
delay batch transition to 90% fill (was 75%) to keep more in tier 0
joshuaisaact ace8f81
limit insertIntoTier probe depth to MAX_PROBES for lookup alignment
joshuaisaact 4a423da
try 4x tier 0 capacity with probe-limited insert
joshuaisaact dd35c71
Revert "try 4x tier 0 capacity with probe-limited insert"
joshuaisaact bd6cd0b
tighten batch threshold to 0.08 (92% fill)
joshuaisaact c6a1a44
optimize remove() to tier-0-only with comptime loop bound
joshuaisaact 5b88308
try faster 64-bit multiply hash
joshuaisaact 56af0b9
try single-multiply hash
joshuaisaact 97c8fa3
noinline findKeyInBucket to keep hot path tight
joshuaisaact 49e1c9c
Revert "noinline findKeyInBucket to keep hot path tight"
joshuaisaact 532520f
fast path for single FP match in findKeyInBucket
joshuaisaact 2ae4de7
cache tier0 metadata in struct fields for faster get/remove
joshuaisaact 022c478
hardcode tier0_start=0, eliminate addition in get/remove
joshuaisaact 6f82228
remove unused tier0_start field
joshuaisaact 7ab6486
try MAX_PROBES=6 again with all other optimizations
joshuaisaact 2ce441c
Revert "try MAX_PROBES=6 again with all other optimizations"
joshuaisaact c9e700c
cache bucket mask directly, eliminate subtraction per probe
joshuaisaact 740bc74
prefetch keys for probe 0 to hide key read latency
joshuaisaact 1e118e7
Revert "prefetch keys for probe 0 to hide key read latency"
joshuaisaact bfd53c3
branchless fingerprint using 7 bits + 1 (range 1-128)
joshuaisaact 2200af5
Revert "branchless fingerprint using 7 bits + 1 (range 1-128)"
joshuaisaact 23f2a59
reorder struct fields: hot path first for cache line alignment
joshuaisaact 37c7627
try inline for with precomputed probe array
joshuaisaact f6b581a
Revert "try inline for with precomputed probe array"
joshuaisaact 90f051a
use bits 32-39 for fingerprint, less correlation with bucket index
joshuaisaact 45de2ab
try bits 24-31 for fingerprint
joshuaisaact b3fdcef
ultra-fast path: check slot 0 with validity check
joshuaisaact e4a3802
separate probe 0 check to avoid redundant work in main loop
joshuaisaact bb13814
unroll probes 0 and 1 before the loop
joshuaisaact 4718a34
Revert "unroll probes 0 and 1 before the loop"
joshuaisaact 53ffd0a
prefetch values for probe 0 while SIMD runs
joshuaisaact ebb9f26
Revert "prefetch values for probe 0 while SIMD runs"
joshuaisaact e22f93b
branchless fingerprint using max/min clamp
joshuaisaact 97389fa
Revert "branchless fingerprint using max/min clamp"
joshuaisaact 6600b40
batch threshold 0.06 (94% fill)
joshuaisaact f0f51aa
Revert "batch threshold 0.06 (94% fill)"
joshuaisaact 28eded4
batch threshold 0.12 (88% fill)
joshuaisaact 4e2f019
batch threshold 0.15 (85% fill)
joshuaisaact f380498
Revert "batch threshold 0.15 (85% fill)"
joshuaisaact 7aa8cd6
add experiment results, benchmark harness, and autoresearch artifacts
joshuaisaact 1393871
add rigorous abseil comparison, correct stale README claims
joshuaisaact c886b94
add program-v2: optimization target is now abseil flat_hash_map
joshuaisaact 34e40e3
add AGENTS.md with non-discoverable repo guidance
joshuaisaact 7807ffe
interleave keys and values into entries array for cache locality
joshuaisaact 236afac
reduce MAX_PROBES 8 to 7
joshuaisaact 0675665
reduce MAX_PROBES 7 to 6
joshuaisaact 984921d
Revert "reduce MAX_PROBES 7 to 6"
joshuaisaact 7b36a0a
two-round multiply hash for better fingerprint distribution
joshuaisaact 83d89ba
Revert "two-round multiply hash for better fingerprint distribution"
joshuaisaact fb76b19
use bits 56-63 for fingerprint (highest byte, most independent)
joshuaisaact fbfbbf2
Revert "use bits 56-63 for fingerprint (highest byte, most independent)"
joshuaisaact fce62d7
batch threshold 0.08 (92% fill before tier 1)
joshuaisaact 4cb90c7
Revert "batch threshold 0.08 (92% fill before tier 1)"
joshuaisaact 93e4a08
early termination in get() on empty fingerprint slots
joshuaisaact f21570b
Revert "early termination in get() on empty fingerprint slots"
joshuaisaact 2cf2a29
return value directly from findValueInBucket to avoid re-indexing
joshuaisaact 749a1e9
mark get() as inline
joshuaisaact 6fa6f37
Revert "mark get() as inline"
joshuaisaact 68cb2f5
golden ratio hash constant 0x9E3779B97F4A7C15
joshuaisaact cedfe75
Revert "golden ratio hash constant 0x9E3779B97F4A7C15"
joshuaisaact 46f1ae9
minimal early termination via matchEmpty after findValueInBucket miss
joshuaisaact 6768f0c
Revert "minimal early termination via matchEmpty after findValueInBuc…
joshuaisaact c5b0caa
batch threshold 0.14 (86% fill before tier 1)
joshuaisaact 50b619b
Revert "batch threshold 0.14 (86% fill before tier 1)"
joshuaisaact f97dad2
always try tier 0 first in insert to maximize get() hit rate
joshuaisaact 5566394
Revert "always try tier 0 first in insert to maximize get() hit rate"
joshuaisaact 8d00aa7
prefetch entries for probe 0 before fingerprint check
joshuaisaact ad3bf7c
also prefetch entries for probe 1 while checking probe 0
joshuaisaact 5b39875
Revert "also prefetch entries for probe 1 while checking probe 0"
joshuaisaact 4cc0deb
prefetch with locality=1 (low temporal) for entries
joshuaisaact 97803b7
Revert "prefetch with locality=1 (low temporal) for entries"
joshuaisaact a355018
remove xor-shift from hash (just multiply)
joshuaisaact 758f93f
use upper hash bits for bucket index (better distribution without xor…
joshuaisaact dd25176
merge probe 0 back into loop (prefetch before loop)
joshuaisaact 713b666
reduce MAX_PROBES 7 to 6 (upper-bit hash has better distribution)
joshuaisaact e5a018e
Revert "reduce MAX_PROBES 7 to 6 (upper-bit hash has better distribut…
joshuaisaact 56102de
branchless fingerprint clamping with @max/@min
joshuaisaact 2722e1e
Revert "branchless fingerprint clamping with @max/@min"
joshuaisaact eba4f60
prefetch both fingerprints and entries for probe 0
joshuaisaact e431166
Revert "prefetch both fingerprints and entries for probe 0"
joshuaisaact 248089c
use for range instead of while in get() probe loop
joshuaisaact d1db70b
inline for in get() probe loop (comptime unroll)
joshuaisaact 927dc4a
Revert "inline for in get() probe loop (comptime unroll)"
joshuaisaact c51c9e9
batch threshold 0.10 (90% fill)
joshuaisaact c102884
Revert "batch threshold 0.10 (90% fill)"
joshuaisaact 269d9c4
stride-3 probing to reduce secondary clustering
joshuaisaact d8c6338
Revert "stride-3 probing to reduce secondary clustering"
joshuaisaact 56755e5
prefetch two cache lines of entries for probe 0
joshuaisaact ca489d3
Revert "prefetch two cache lines of entries for probe 0"
joshuaisaact df2c24a
try Murmur3 mixing constant 0xbf58476d1ce4e5b9
joshuaisaact 21ed7db
Revert "try Murmur3 mixing constant 0xbf58476d1ce4e5b9"
joshuaisaact fa8f1ec
rigorous benchmark: random keys, fair capacity, median of 10, miss test
joshuaisaact 573d1cf
abseil-style control encoding: empty=0x80, 7-bit fps, cheap early ter…
joshuaisaact 76f7903
Revert "abseil-style control encoding: empty=0x80, 7-bit fps, cheap e…
joshuaisaact 4c2dd16
early termination via matchEmpty (retry with random keys benchmark)
joshuaisaact 456a5f0
Revert "early termination via matchEmpty (retry with random keys benc…
joshuaisaact 4fe8fdb
move prefetch inside probe loop (abseil pattern: one per iteration)
joshuaisaact 7a37536
Revert "move prefetch inside probe loop (abseil pattern: one per iter…
joshuaisaact 5f0ba61
restore xor-shift in hash (may help random key distribution)
joshuaisaact 3b49a27
xor-shift >> 28 for more upper-bit mixing
joshuaisaact 86c8d0d
Revert "xor-shift >> 28 for more upper-bit mixing"
joshuaisaact 7d2318e
add v2 experiment log and insights
joshuaisaact 5e8eaeb
gitignore benchmark binaries and logs
joshuaisaact edf731b
independent fingerprint hash (second multiply, parallel on superscalar)
joshuaisaact 7e51903
Revert "independent fingerprint hash (second multiply, parallel on su…
joshuaisaact a950f10
test 7-bit fingerprint from top bits (abseil H2 style)
joshuaisaact 112a5c2
Revert "test 7-bit fingerprint from top bits (abseil H2 style)"
joshuaisaact b426a0c
update insights with honest benchmark findings
joshuaisaact b08912e
use findEmptyInBucket in insert (skip tombstone check during bulk ins…
joshuaisaact ac3797f
Revert "use findEmptyInBucket in insert (skip tombstone check during …
joshuaisaact 35aff5d
paper-faithful multi-tier get(): search all tiers, not just tier 0
joshuaisaact 178c3ad
limit multi-tier search to tier 0 + tier 1 only
joshuaisaact c5ae13d
tier 1: probe 0 only (minimal code, finds ~95% of tier-1 elements)
joshuaisaact 1129bce
Revert "tier 1: probe 0 only (minimal code, finds ~95% of tier-1 elem…
joshuaisaact 9188eba
Reapply "tier 1: probe 0 only (minimal code, finds ~95% of tier-1 ele…
joshuaisaact 818f9bf
Revert "Reapply "tier 1: probe 0 only (minimal code, finds ~95% of ti…
joshuaisaact 0df48a8
restore tier-0-only get() (multi-tier causes compiler cascading regre…
joshuaisaact d511474
update README with honest benchmark results and paper divergence notes
joshuaisaact aaa9168
add program-v3: paper-faithful multi-tier lookup research
joshuaisaact 83f9649
noinline getSlowPath for tier 1+ search (keep get() I-cache footprint…
joshuaisaact b725080
limit getSlowPath to tier 1 only (tiers 2+ are empty at 99% load)
joshuaisaact 65aee5b
Revert "limit getSlowPath to tier 1 only (tiers 2+ are empty at 99% l…
joshuaisaact 48345e4
Reapply "limit getSlowPath to tier 1 only (tiers 2+ are empty at 99% …
joshuaisaact 0c05a9b
restore tier-0-only get() as clean base for insert-side experiments
joshuaisaact 2696554
MAX_PROBES 7 -> 8 (more elements fit in tier 0)
joshuaisaact 93a99b8
batch threshold 0.05 (95% fill) with MAX_PROBES=8
joshuaisaact e5abd48
Revert "batch threshold 0.05 (95% fill) with MAX_PROBES=8"
joshuaisaact 3a03edf
Reapply "batch threshold 0.05 (95% fill) with MAX_PROBES=8"
joshuaisaact c6f6467
restore clean v3 baseline (MAX_PROBES=7, threshold=0.12)
joshuaisaact 7736f8b
paper-faithful get(): noinline cold-hinted getOtherTiers for tier 1+
joshuaisaact b2448b0
use opaque function pointer for tier 1+ overflow (prevent LLVM codege…
joshuaisaact 906ae65
limit overflow to tier 1 only (tiers 2+ empty at 99% load)
joshuaisaact 776cd39
log v3 experiments
joshuaisaact 6f7d750
early termination in overflow function (safe with 100% find rate)
joshuaisaact 32dd07a
early termination in tier-0 loop (jump to overflow on empty slot)
joshuaisaact 9837b5d
Revert "early termination in tier-0 loop (jump to overflow on empty s…
joshuaisaact 1c47413
log v3 early termination experiments
joshuaisaact 94c1d05
v3 insights: elastic hash 40-65% faster than abseil at normal loads
joshuaisaact ad1bfed
add verification program: are these results real?
joshuaisaact ab25178
verification checks 1-3: capacity, shuffled access, size independence
joshuaisaact 71f0fb9
complete verification: realistic workloads, hash cost, shuffled access
joshuaisaact 7e7a711
update README and PR with verified benchmark results
joshuaisaact 65683f7
verify: gcc vs clang abseil - no unfair compiler advantage
joshuaisaact f466836
string-key elastic hash: implementation, benchmarks, runner
joshuaisaact 27303d9
string key verification: 36-97% faster than abseil even with shuffled…
joshuaisaact 06ba5b9
comprehensive string key verification: advantage holds across lengths…
joshuaisaact d788da5
cross-language benchmark: elastic hash vs abseil, Rust hashbrown, Go …
joshuaisaact 6aecf81
fair cross-language benchmark: ahash for Rust, pre-allocated strings …
joshuaisaact 3be46ed
add M4 benchmark guide: test whether cache density advantage is x86-s…
joshuaisaact 93444af
M4 benchmark: advantage grows to 2.59x (up from 1.74x on x86)
joshuaisaact a18b161
M4 cross-language size sweep: elastic hash wins 16K-4M against all co…
joshuaisaact
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -1,3 +1,8 @@ | ||
| .zig-cache | ||
| zig-out | ||
| .claude | ||
| NOTES.md | ||
| bench.log | ||
| bench-abseil | ||
| abseil-v2.log | ||
| elastic-v2.log |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,17 @@ | ||
| # AGENTS.md | ||
|
|
||
| ## Frozen files | ||
|
|
||
| `src/simple.zig` and `src/bench.zig` are reference implementations. Do not modify them. | ||
|
|
||
| ## Autoresearch programs | ||
|
|
||
| `program.md` and `program-v2.md` are autonomous agent programs (not documentation). When asked to "start" or "run" one, read it fully and execute its loop. Each defines its own set of frozen files, editable files, and keep/revert criteria -- read before editing anything. | ||
|
|
||
| ## Zig skills | ||
|
|
||
| The skills `zig-perf`, `zig-quality`, `zig-safety`, `zig-style`, and `zig-testing` are available globally. | ||
|
|
||
| ## Abseil comparison benchmarks | ||
|
|
||
| The abseil benchmark (`bench-abseil.cpp`, created by program-v2) requires system-installed `abseil-cpp` with pkg-config modules: `absl_hash`, `absl_raw_hash_set`, `absl_hashtablez_sampler`. |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,170 @@ | ||
| # Running benchmarks on Apple Silicon M4 | ||
|
|
||
| ## What we're testing | ||
|
|
||
| On x86 with ~512KB L2, elastic hash beats abseil by 36-97% on string lookups because our tier-0 fingerprints (1MB) fit in L2 while abseil's control bytes (2MB) spill to L3. | ||
|
|
||
| M4 has ~16MB shared L2. Both arrays should fit in L2. If the advantage disappears, the result is cache-density-specific. If it persists, something deeper is happening. | ||
|
|
||
| ## Setup | ||
|
|
||
| ### Install dependencies | ||
|
|
||
| ```bash | ||
| # Zig | ||
| brew install zig | ||
|
|
||
| # Abseil | ||
| brew install abseil | ||
|
|
||
| # Rust | ||
| curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh | ||
|
|
||
| # Go | ||
| brew install go | ||
| ``` | ||
|
|
||
| ### Clone and checkout | ||
|
|
||
| ```bash | ||
| git clone https://github.com/joshuaisaact/elastic-hash.git | ||
| cd elastic-hash | ||
| git checkout autoresearch/cross-language-bench | ||
| ``` | ||
|
|
||
| ### Build everything | ||
|
|
||
| ```bash | ||
| # Abseil benchmark | ||
| # Note: pkg-config paths may differ on macOS. Try: | ||
| g++ -O3 -march=native -DNDEBUG -DABSL_HASHTABLEZ_SAMPLE_PARAMETER=0 \ | ||
| bench-abseil-strings.cpp -o bench-abseil-strings \ | ||
| $(pkg-config --cflags --libs absl_hash absl_raw_hash_set absl_hashtablez_sampler) | ||
|
|
||
| # If pkg-config doesn't work, try: | ||
| # g++ -O3 -march=native -DNDEBUG -DABSL_HASHTABLEZ_SAMPLE_PARAMETER=0 \ | ||
| # bench-abseil-strings.cpp -o bench-abseil-strings \ | ||
| # -I/opt/homebrew/include -L/opt/homebrew/lib \ | ||
| # -labsl_hash -labsl_raw_hash_set -labsl_hashtablez_sampler \ | ||
| # -labsl_city -labsl_low_level_hash -labsl_strings -labsl_int128 \ | ||
| # -labsl_base -labsl_throw_delegate -labsl_raw_logging_internal | ||
|
|
||
| # Elastic hash (Zig) | ||
| zig build test # verify tests pass | ||
| zig build autobench-strings -Doptimize=ReleaseFast # just to check it builds | ||
|
|
||
| # Rust | ||
| cd bench-rust && cargo build --release && cd .. | ||
|
|
||
| # Go | ||
| cd bench-go && go build -o bench-go . && cd .. | ||
| ``` | ||
|
|
||
| ## Run the benchmarks | ||
|
|
||
| ### Quick test (abseil vs elastic only, at 1M and 50% load) | ||
|
|
||
| ```bash | ||
| bash bench-strings.sh | ||
| ``` | ||
|
|
||
| ### Full cross-language comparison | ||
|
|
||
| Run each one and save the output: | ||
|
|
||
| ```bash | ||
| # Abseil | ||
| ./bench-abseil-strings > results-m4-abseil.log 2>/dev/null | ||
| cat results-m4-abseil.log | ||
|
|
||
| # Elastic hash | ||
| zig build autobench-strings -Doptimize=ReleaseFast 2> results-m4-elastic.log | ||
| cat results-m4-elastic.log | ||
|
|
||
| # Rust (with ahash) | ||
| ./bench-rust/target/release/bench-hashbrown > results-m4-rust.log 2>/dev/null | ||
| cat results-m4-rust.log | ||
|
|
||
| # Go | ||
| ./bench-go/bench-go > results-m4-go.log 2>/dev/null | ||
| cat results-m4-go.log | ||
| ``` | ||
|
|
||
| ### Shuffled verification (the most important test) | ||
|
|
||
| ```bash | ||
| # Abseil shuffled | ||
| g++ -O3 -march=native -DNDEBUG -DABSL_HASHTABLEZ_SAMPLE_PARAMETER=0 \ | ||
| bench-strings-verify.cpp -o bench-strings-verify \ | ||
| $(pkg-config --cflags --libs absl_hash absl_raw_hash_set absl_hashtablez_sampler) | ||
| ./bench-strings-verify | ||
|
|
||
| # Elastic hash shuffled (swap autobench temporarily) | ||
| cp src/autobench.zig src/autobench.zig.bak | ||
| cp src/autobench-strings-verify.zig src/autobench.zig | ||
| zig build autobench -Doptimize=ReleaseFast 2>&1 | grep ELASTIC | ||
| cp src/autobench.zig.bak src/autobench.zig | ||
| rm src/autobench.zig.bak | ||
| ``` | ||
|
|
||
| ## What to look for | ||
|
|
||
| ### Prediction: advantage shrinks or disappears on M4 | ||
|
|
||
| M4's ~16MB L2 fits both our 1MB fingerprints AND abseil's 2MB control bytes. The L2 vs L3 cache density advantage that drives our x86 results should not apply. | ||
|
|
||
| If the shuffled hit lookup gap (abseil time / elastic time) at 1M and 50% load is: | ||
| - **> 1.3x**: The advantage is NOT just cache density. Something else is going on. | ||
| - **1.0-1.3x**: Advantage shrinks as predicted. Cache density was the main factor. | ||
| - **< 1.0x**: Abseil wins on M4. Our architecture only helps on small-L2 x86. | ||
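The three outcomes above can be written as a small classifier. This is only a sketch: the function name `interpret_gap` and the short outcome labels are made up for illustration, and the ratio is assumed to be abseil time divided by elastic time, so values above 1.0 mean elastic hash is faster.

```python
def interpret_gap(ratio: float) -> str:
    """Map a shuffled hit-lookup gap (abseil time / elastic time) to an outcome.

    Thresholds follow the list above: > 1.3x, 1.0-1.3x, < 1.0x.
    """
    if ratio > 1.3:
        return "not just cache density"
    if ratio >= 1.0:
        return "cache density was the main factor"
    return "abseil wins on M4"


if __name__ == "__main__":
    # Example ratios chosen only to exercise each branch.
    for r in (1.5, 1.15, 0.9):
        print(f"{r:.2f}x -> {interpret_gap(r)}")
```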
|
|
||
| ### Also check | ||
|
|
||
| - Does the size-dependent pattern hold? (Advantage at 1M but not 100K or 4M?) | ||
| - Is Rust+ahash still faster than abseil on M4? | ||
| - Does Go's performance change relative to the native-compiled implementations? | ||
|
|
||
| ## Results | ||
|
|
||
| ### Shuffled hit lookup (the key test) | ||
|
|
||
| | Load | Elastic (Zig) | Abseil (C++) | M4 ratio (abseil/elastic) | x86 ratio (abseil/elastic) | | ||
| |------|--------------|-------------|----------|-----------| | ||
| | 10% | 719 | 2,861 | **3.98x** | 1.97x | | ||
| | 25% | 2,276 | 10,169 | **4.47x** | 1.86x | | ||
| | 50% | 8,863 | 22,984 | **2.59x** | 1.74x | | ||
| | 75% | 15,972 | 33,624 | **2.11x** | 1.61x | | ||
| | 90% | 22,118 | 41,671 | **1.88x** | 1.50x | | ||
| | 99% | 25,748 | 46,543 | **1.81x** | 1.36x | | ||
|
|
||
| ### Verdict | ||
|
|
||
| The prediction was wrong. The advantage is **not** cache-density-specific. At 50% load the gap went from 1.74x on x86 to 2.59x on M4 -- it grew by 49%. | ||
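The arithmetic behind the verdict can be rechecked from the Results table. The sketch below copies the M4 timings and x86 ratios from that table (in whatever units the benchmarks report). The names `M4_TIMINGS`, `X86_RATIO`, `speedup`, and `growth_pct` are illustrative and not part of the repository.

```python
# Shuffled hit-lookup timings from the Results table: load % -> (elastic, abseil).
M4_TIMINGS = {
    10: (719, 2861),
    25: (2276, 10169),
    50: (8863, 22984),
    75: (15972, 33624),
    90: (22118, 41671),
    99: (25748, 46543),
}

# x86 ratios as reported in the same table.
X86_RATIO = {10: 1.97, 25: 1.86, 50: 1.74, 75: 1.61, 90: 1.50, 99: 1.36}


def speedup(elastic: float, abseil: float) -> float:
    """How many times faster elastic is: abseil time divided by elastic time."""
    return abseil / elastic


def growth_pct(load: int) -> float:
    """Percentage by which the M4 ratio exceeds the x86 ratio at a given load."""
    elastic, abseil = M4_TIMINGS[load]
    return (speedup(elastic, abseil) / X86_RATIO[load] - 1.0) * 100.0


if __name__ == "__main__":
    for load, (elastic, abseil) in M4_TIMINGS.items():
        print(f"{load:>2}% load: M4 {speedup(elastic, abseil):.2f}x, "
              f"x86 {X86_RATIO[load]:.2f}x, growth {growth_pct(load):.0f}%")
```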
|
|
||
| The mechanism is cache lines touched per probe, not which cache level the data lives in. Separated, dense fingerprint arrays mean fewer cache line fetches under random access, and this holds regardless of L2 size. | ||
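To make "cache lines touched per probe" concrete, the sketch below counts 64-byte cache lines spanned by one bucket's fingerprints under two layouts. Everything here is an assumption for illustration: a bucket of 16 one-byte fingerprints, 16-byte entries, and a hypothetical interleaved layout that stores each fingerprint next to its entry. It does not model abseil's actual layout or this repository's exact structures.

```python
CACHE_LINE = 64          # bytes per cache line (assumed)
BUCKET_SLOTS = 16        # fingerprints per bucket (assumed)
ENTRY_BYTES = 16         # e.g. 8-byte key + 8-byte value (assumed)


def cache_lines_touched(start: int, n_bytes: int, line: int = CACHE_LINE) -> int:
    """Number of distinct cache lines covered by bytes [start, start + n_bytes)."""
    first = start // line
    last = (start + n_bytes - 1) // line
    return last - first + 1


def dense_fingerprint_lines(bucket: int) -> int:
    """Separate array of 1-byte fingerprints: a bucket is 16 contiguous bytes."""
    return cache_lines_touched(bucket * BUCKET_SLOTS, BUCKET_SLOTS)


def interleaved_fingerprint_lines(bucket: int) -> int:
    """Hypothetical layout: each slot is 1 fingerprint byte followed by its entry."""
    slot_bytes = 1 + ENTRY_BYTES
    return cache_lines_touched(bucket * BUCKET_SLOTS * slot_bytes,
                               BUCKET_SLOTS * slot_bytes)


if __name__ == "__main__":
    for b in range(4):
        print(f"bucket {b}: dense={dense_fingerprint_lines(b)} line(s), "
              f"interleaved={interleaved_fingerprint_lines(b)} line(s)")
```

Under these assumptions a dense bucket never straddles a cache line, because 16 divides 64, while the interleaved layout spreads the same 16 fingerprints over several lines regardless of which cache level holds them.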
|
|
||
| ### x86 reference (from Linux, AMD/Intel ~512KB L2) | ||
|
|
||
| | Load | Elastic | Abseil | Rust+ahash | Go swiss | | ||
| |------|---------|--------|-----------|---------| | ||
| | 50% | 11,119 | 19,312 | 16,235 | 25,304 | | ||
| | 99% | 33,318 | 45,404 | 36,292 | 57,488 | | ||
|
|
||
| Gap at 50%: elastic 1.74x faster than abseil, 1.46x faster than Rust+ahash. | ||
|
|
||
| ## Troubleshooting | ||
|
|
||
| ### abseil won't build on macOS | ||
|
|
||
| Try `brew install abseil` then check `pkg-config --libs absl_hash`. If pkg-config can't find it: | ||
| ```bash | ||
| export PKG_CONFIG_PATH="/opt/homebrew/lib/pkgconfig:$PKG_CONFIG_PATH" | ||
| ``` | ||
|
|
||
| ### Zig SIMD on ARM | ||
|
|
||
| Zig's `@Vector` operations compile to ARM NEON on aarch64. The SIMD fingerprint matching should work without changes, but the generated instructions differ from SSE2. If tests fail, check alignment first; aarch64 on macOS is little-endian like x86, so an endianness issue is less likely but still worth ruling out. | ||
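For readers unfamiliar with SIMD fingerprint matching, here is a scalar model of the idea: compare every fingerprint in a bucket against the wanted one and collect the matches in a bitmask, with bit `i` standing for lane `i`. This is a sketch, not the repository's Zig code; the function names and the 16-slot bucket are assumptions. A vector implementation would do the comparison in one instruction on either SSE2 or NEON, and a portable test can check it against a scalar reference like this one.

```python
from typing import Iterator


def match_fingerprint(bucket: bytes, fp: int) -> int:
    """Return a bitmask with bit i set wherever bucket[i] equals fp."""
    mask = 0
    for lane, value in enumerate(bucket):
        if value == fp:
            mask |= 1 << lane
    return mask


def candidate_slots(mask: int) -> Iterator[int]:
    """Yield set bit positions from lowest to highest (slots to key-compare)."""
    while mask:
        lowest = mask & -mask            # isolate the lowest set bit
        yield lowest.bit_length() - 1    # its index
        mask ^= lowest                   # clear it and continue


if __name__ == "__main__":
    # 16-slot bucket (assumed size); 0 stands in for "empty" in this sketch.
    bucket = bytes([0, 7, 3, 7] + [0] * 12)
    mask = match_fingerprint(bucket, 7)
    print(f"mask={mask:#06b} slots={list(candidate_slots(mask))}")
```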
|
|
||
| ### Go swiss.Map crashes | ||
|
|
||
| If `swiss.Map` crashes with a segfault, ensure you're using pre-allocated strings (the current code on this branch already does this). | ||