elastic hash: generic library with full API, 1.7x faster than abseil #7
Open
joshuaisaact wants to merge 220 commits into main from
Conversation
…for Go. Rust+ahash is now the closest competitor (1.46x at 50%, 1.09x at 99%). Go swiss.Map improved from 6.27x to 2.28x with pre-allocated strings. Elastic hash is still fastest at every load factor against every competitor.
Disproves the cache-density hypothesis. On M4's 16MB L2, where both fingerprint arrays fit, elastic hash is *faster* relative to abseil than on x86. The win is cache lines touched per probe, not L2 vs L3 spill. Fixed Linux-only timers in the Zig benchmarks for macOS/ARM.
…mpetitors. Size sweep at 50% load with shuffled access on Apple M4. Elastic hash beats abseil (1.3-2.4x), Rust hashbrown+ahash (1.2-3x), and Go swiss.Map (1.8-3.5x) at every tested size. The x86 finding that small tables favored abseil does not hold on M4. Added a proper build step for the shuffled verification benchmark.
7 approaches to explore: branch-hinted early termination, conditional probing, per-bucket max probe depth, bloom filters, tombstone-free deletion, Robin Hood displacement tracking. Previous early termination attempts regressed hits -- this program specifically guards against that.
Add early termination in tier-0 get() via matchEmpty check with @branchHint(.cold). The branch predictor learns that hits never take the early exit, making the check nearly free on the hot path. Previous attempts without the cold hint regressed hits by 10-30%. Before (50% load, 1M, unshuffled): hit=8,404 miss=12,343 After: hit=5,040 miss=2,688 Abseil: hit=9,027 miss=3,047 Misses went from 4x slower than abseil to 13% faster. Also adds per-bucket max_probe_depth tracking (unused by get() for now but available for future experiments) and autobench-miss benchmark.
With the matchEmpty miss optimization, elastic hash now ranks #1 on hit lookup, miss lookup, insert, and delete at 50% load against abseil, Rust hashbrown+ahash, Go swiss.Map, and Go builtin map. Abseil only wins on misses at 75%+ load where tier 0 fills up.
At 50% load, churning through the entire table does not degrade hit or miss performance. Tombstones from deletes get reused by subsequent inserts, keeping the empty slot ratio stable. matchEmpty early termination remains effective under sustained mutation.
40% hit / 40% miss / 10% insert / 10% delete, 1M ops on pre-filled table. Elastic hash sustains 50M ops/sec vs abseil's 25M at 50% load. Advantage holds at 25% (2.2x) and 75% (1.7x).
…h longer keys. Memory: elastic hash uses 1.00x abseil's memory at all capacities. The tiered layout distributes slots across tiers, but the total count is the same. Key lengths: elastic hash is tied at 16 bytes, 1.33x faster at 8 bytes, and scales to 2x faster at 256 bytes. Fingerprint pre-filtering avoids expensive key comparisons on false-positive hash matches.
…a structure. The same elastic hash algorithm compiled with g++ (same as abseil) shows hit lookups roughly matching abseil (9,679 vs 8,609 at 50% load). The Zig version is 2x faster than both (4,431). The insert/delete advantage IS structural: C++ elastic hash is still 3.9x faster on inserts and 2.4x faster on deletes vs abseil.
Same elastic hash algorithm in Zig, C++, and Rust at 50% load:
- Hit lookup: Zig 4,431 / C++ 9,679 / Rust 12,888 / Abseil 8,609
- Insert: Zig 3,564 / C++ 2,644 / Rust 2,945 / Abseil 10,233
- Delete: Zig 1,584 / C++ 2,483 / Rust 4,189 / Abseil 5,875
Hit lookup advantage is clearly Zig/LLVM codegen (comptime unrolling). Insert advantage is clearly data structure (3-4x in all languages).
Fixed both ports to match the Zig implementation exactly:
- Full batch insertion logic (tryInsertWithLimit, probeLimit, etc.)
- Raw pointers for key storage (no bounds-checked slices)
- Explicit NEON SIMD intrinsics on ARM
- Prefetch on probe 0
Results at 50% load (1M elements, unshuffled):
- Hit: Zig 5,016 / Rust 10,130 / C++ 9,959 / Abseil 8,712
- Miss: Zig 2,535 / Rust 6,073 / C++ 5,991 / Abseil 2,955
- Insert: Zig 3,630 / Rust 2,167 / C++ 3,830 / Abseil 10,315
- Delete: Zig 1,613 / Rust 2,344 / C++ 2,940 / Abseil 5,888
Confirms: the insert/delete advantage is data structure (3-5x in all languages). The hit lookup advantage is Zig-specific codegen.
4 implementations (elastic+linear, elastic+triangular, flat+linear, flat+triangular) compiled with identical g++ flags, same hash, same SIMD, same benchmark harness. Key findings at 1M, 50% load:
- Hit lookups: elastic ~10% faster than flat. Probing doesn't matter.
- Inserts: flat is 2x faster (batch logic is overhead, not advantage)
- Miss/delete: roughly tied across all variants.
The 2-3x hit lookup advantage seen in Zig is compiler codegen, not data structure. The insert advantage vs abseil was abseil's overhead (growth policy, rehash checks), not elastic's structural win.
… remove. Fixes:
- Flat implementations no longer get 2x capacity (which was halving effective load and massively biasing results)
- Elastic remove is now tier-0-only (matching the Zig implementation)
Corrected results (1M, 50% load, 3 runs):
- Hit: flat ~9K, elastic ~10K, Zig ~5.1K
- Miss: flat ~5.8K, elastic ~6K, Zig ~2.8K
- Insert: flat ~1.8K, elastic ~3.9K, Zig ~3.8K
- Delete: all C++ ~5-6K, Zig ~1.7K
The tiered layout provides no measurable advantage over flat in C++. The Zig advantage (1.8-2x on lookups, 3.5x on deletes) is compiler codegen, not data structure.
Adding abseil-style needsResize() check to every insert (even when resize never triggers) adds 37% overhead at 50% load. Lookups and deletes are unaffected. This partially explains the 2.6x insert advantage vs abseil — simpler insert path matters.
Insert now checks for an existing key before inserting — updates the value if found, inserts a new entry if not. Resize uses insertNew (skips the duplicate check) to avoid a double resize during rehash. New tests:
- duplicate key updates value (single key, 3 updates)
- duplicate keys at scale (500 keys, re-insert all with new values)
- resize triggers and preserves data (16 -> 64 elements)
- resize with duplicates
- resize then delete then re-insert
New API methods on StringElasticHashGrowth:
- contains(key) -> bool
- len() -> usize
- clear() - resets table without freeing memory
- getOrPut(key, default) -> { value_ptr, found_existing }
- iterator() -> yields all live key-value pairs, skips tombstones
19 tests covering: basic ops, duplicates, resize, clear, getOrPut (new/existing/modify-via-pointer/at-scale), iterator (empty/full/tombstones/after-clear).
Instead of two passes (search for existing key, then search for empty slot), the insert now does both in one pass. Tracks the first empty slot while scanning for the key. If key found, update. If not, insert into the tracked slot. Insert at 50% load: 7,408 -> 5,609 (24% faster). Now 1.83x faster than abseil with full duplicate handling and resize support.
The fast-path insert wasn't updating max_probe_depth. Fixed. Also tested max_probe_depth for miss early termination — doesn't help at high load because probe depths are near MAX_PROBES anyway. The high-load miss weakness is fundamental: without a Bloom filter (boost's approach), there's no cheap way to terminate misses when buckets are full.
Added per-bucket overflow bloom filter (2 bits from hash bits 40-47). Set on insert when element is displaced from home bucket. Checked in get() to terminate misses early at high load. Result: 54% hit regression at 50% load for marginal miss improvement at 75%. The extra memory load + AND + compare on probe 0 costs more than the miss savings. 25% false positive rate with 2 bits isn't selective enough. Bloom infrastructure remains in the struct (set on insert, cleared on resize/clear) but get() doesn't check it. High-load miss weakness is accepted — the table is optimized for 10-50% load where matchEmpty handles misses effectively.
ElasticHash(K, V, Context) — comptime-parameterized hash table. Context provides hash() and eql(), matching Zig's std.HashMap pattern. AutoContext(K) auto-generates hash/eql for integers, slices, arrays. AutoElasticHash(K, V) is the convenience alias. Full API: init, deinit, insert (with dedup), get, remove, contains, len, clear, getOrPut, iterator. All ported from string_hybrid_growth with the same SIMD, tiered layout, batch insertion, and cold-hinted early termination. 10 tests covering u64 keys, []const u8 keys, [16]u8 keys, duplicates, resize, getOrPut, iterator, and tombstone handling.
Added unshuffled C++ elastic hash benchmark at all sizes + abseil at 50% load across sizes. Abseil wins at small tables (<64K) where L1 fits everything and tier overhead dominates. Elastic advantage starts at 256K (2.9x) and holds through 4M (1.2x). Previous FINDINGS.md claimed "slightly ahead at 16K" based on Zig data — the C++ comparison shows abseil is actually 3.8x faster there.
The previous 16K data (abseil 3.8x faster) came from concurrent runs with machine contention. Clean sequential runs show the two tied at 16K. Elastic C++ wins on hits at every size (16K-4M) and every load (10-90%). Peak advantage: 4.4x at 256K, 4.1x at 10% load. Only weakness: misses at 75%+ load (abseil 1.6-3.9x faster).
Why
The elastic hash started as a benchmark experiment against Google's abseil
flat_hash_map. This PR turns it into a usable generic library with a production-ready API, validated across architectures and competitors.

What
Generic library (src/elastic_hash.zig)
- ElasticHash(K, V, Context) — works with any key/value type
- AutoElasticHash(K, V) — convenience alias with auto-generated hash/eql

Miss optimization (@branchHint(.cold))
- matchEmpty check per probe with @branchHint(.cold) in get()

Validation
Results (C++ elastic vs abseil, same g++ compiler, unshuffled)
Load factor sweep (1M elements)
Elastic hash wins on hits at every load factor up to 90%.
Size sweep (50% load)
Elastic hash is faster or tied at every size from 16K to 4M. Peak advantage: 4.4x at 256K.
vs boost::unordered_flat_map (Zig elastic, unshuffled, 1M, 50%)
Cross-architecture
Why it's faster
Separated dense fingerprint arrays. Fingerprints (1 byte/slot) in a contiguous array, separate from entries (24 bytes/slot). One cache line covers 64 fingerprints. Fewer cache line fetches per probe.
Simpler insert/delete. No growth policy checks, no hashtablez sampling. Tombstone marking vs find-then-erase.
Cold-hinted matchEmpty. Terminates misses early at low-mid load without regressing hits.
Why it's slower on misses at 75%+ load
Buckets are full. matchEmpty can't find empty slots. Tested bloom filters (54% hit regression) and max probe depth tracking (no improvement) — the separated layout makes extra checks expensive. This is architectural.

What we got wrong
Known limitations
References
FINDINGS.md