string-key elastic hash: 36-97% faster than abseil on lookup #3
base: main
Changes from all commits
`.gitignore`:

```diff
@@ -1,3 +1,8 @@
 .zig-cache
 zig-out
 .claude
+NOTES.md
+bench.log
+bench-abseil
+abseil-v2.log
+elastic-v2.log
```
`AGENTS.md` (new file):

```diff
@@ -0,0 +1,17 @@
+# AGENTS.md
+
+## Frozen files
+
+`src/simple.zig` and `src/bench.zig` are reference implementations. Do not modify them.
+
+## Autoresearch programs
+
+`program.md` and `program-v2.md` are autonomous agent programs (not documentation). When asked to "start" or "run" one, read it fully and execute its loop. Each defines its own set of frozen files, editable files, and keep/revert criteria -- read before editing anything.
+
+## Zig skills
+
+The skills `zig-perf`, `zig-quality`, `zig-safety`, `zig-style`, and `zig-testing` are available globally.
+
+## Abseil comparison benchmarks
+
+The abseil benchmark (`bench-abseil.cpp`, created by program-v2) requires system-installed `abseil-cpp` with pkg-config modules: `absl_hash`, `absl_raw_hash_set`, `absl_hashtablez_sampler`.
```
`README.md`:

```diff
@@ -1,163 +1,114 @@
 # elastic-hash-zig
 
 > **Disclaimer:** I'm still learning Zig and there may be memory crimes.
 
-Elastic hashing implementation in Zig. Based on [Elastic Hashing](https://arxiv.org/pdf/2501.02305).
+SIMD hash table in Zig, inspired by [Optimal Bounds for Open Addressing Without Reordering](https://arxiv.org/abs/2501.02305) (Farach-Colton, Krapivin, Kuszmaul 2025). Uses the paper's tiered batch insertion and multi-tier lookup via opaque overflow.
 
 Requires Zig 0.14+ (tested on 0.16.0-dev).
 
 See my blog post for a walkthrough: [www.joshtuddenham.dev/blog/hashmaps](https://www.joshtuddenham.dev/blog/hashmaps)
 
-## Results
-
-### True 99% Load Factor (of actual capacity)
-
-The hybrid implementation compared to Zig's `std.HashMap` at **true 99% of actual capacity**. std.HashMap is based on Google's [SwissTable](https://abseil.io/about/design/swisstables).
-
-| Capacity | Insert | Lookup |
-|----------|--------|--------|
-| 16k | **4.34x** | **1.15x** |
-| 65k | **7.92x** | **1.68x** |
-| 262k | **4.77x** | 0.75x |
-| 524k | **4.33x** | 0.78x |
-| 1M | **4.40x** | 0.77x |
-| 2M | **4.44x** | 0.72x |
-
-**Insert is 4-8x faster** across all sizes at true 99% load.
-
-**Lookup** wins at smaller sizes (16k-65k), loses ~25% at larger sizes due to φ-ordering cache effects.
-
-### Delete Performance
-
-Delete at 99% load factor (deleting 50% of elements):
-
-| Capacity | Delete |
-|----------|--------|
-| 16k | **1.72x** |
-| 65k | **2.63x** |
-| 262k | **1.29x** |
-| 1M | **1.14x** |
-
-**Delete is faster** across all sizes at high load, with bigger wins at smaller sizes.
+## vs Google's abseil `flat_hash_map`
 
-### Comptime vs Runtime
+Benchmarked against `absl::flat_hash_map` (the original SwissTable) with u64 keys. Both sides use `reserve(n)` / `init(n)` for the same target capacity. Random keys via splitmix64, median of 10 runs, 2 warmup discards. Full methodology and verification in `verify-results.md`.
```
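The methodology above names splitmix64 as the key generator. A minimal C sketch with the standard splitmix64 constants (the repo's code is Zig, and its exact seeding is not shown, so the harness function and seed below are illustrative):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* splitmix64: one additive step plus two multiply-xorshift mixes per
   output. Constants are the standard splitmix64 ones. */
static uint64_t splitmix64(uint64_t *state) {
    uint64_t z = (*state += 0x9E3779B97F4A7C15ULL);
    z = (z ^ (z >> 30)) * 0xBF58476D1CE4E5B9ULL;
    z = (z ^ (z >> 27)) * 0x94D049BB133111EBULL;
    return z ^ (z >> 31);
}

/* Fill a key buffer the way a benchmark harness might: n pseudo-random
   u64 keys drawn from one seeded stream (seed choice is illustrative). */
static void make_keys(uint64_t *keys, size_t n, uint64_t seed) {
    for (size_t i = 0; i < n; i++)
        keys[i] = splitmix64(&seed);
}
```

Because the stream is deterministic for a given seed, both benchmark sides can insert and look up the exact same key array, which keeps the abseil/elastic comparison apples-to-apples.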
```diff
-When capacity is known at compile time, the comptime version significantly outperforms runtime:
+### Hit lookup (shuffled random access, n=1,048,576)
 
-| n | Insert | Lookup |
-|---|--------|--------|
-| 10k | **2.06x** | **4.63x** |
-| 100k | **2.73x** | **6.71x** |
-| 1M | **2.21x** | **2.36x** |
+| Load | Gap (abseil/elastic) | Winner |
+|------|---------------------|--------|
+| 10% | **1.16** | Elastic 16% faster |
+| 25% | **1.21** | Elastic 21% faster |
+| 50% | **1.18** | Elastic 18% faster |
+| 75% | **1.08** | Elastic 8% faster |
+| 90% | 0.96 | Roughly tied |
+| 99% | 0.86 | Abseil 14% faster |
```
```diff
-## Key Findings
+### Realistic workloads
 
-### What Works
+| Workload | 100K | 500K | 1M |
+|----------|------|------|-----|
+| Mixed r/w (80% hit, 10% miss, 5% ins, 5% del) | 0.77 | **1.49** | 0.98 |
+| Hot-key / zipf-like lookup | 0.72 | **1.07** | **1.29** |
+| Build-then-read (insert N, 10N random reads) | 0.77 | 0.98 | 0.81 |
 
-1. **Insert-heavy workloads at high load**: 4-8x faster than std.HashMap at 99% load
-2. **Delete operations**: 1.1-2.6x faster than std.HashMap at high load
-3. **Known-capacity scenarios**: Comptime version is 2-7x faster
-4. **Small-to-medium datasets**: Both insert and lookup win up to ~65k elements
-5. **Worst-case guarantees**: O(log²(1/ε)) expected probes from the paper
+### Delete performance
```
```diff
-### What Doesn't Work
+2-3x faster than abseil at all sizes and loads. O(1) tombstone marking vs abseil's find-then-erase.
 
-1. **Lookup at large sizes**: φ-ordering causes cache misses when jumping between tiers
-2. **General-purpose replacement**: std.HashMap wins for typical mixed workloads
-3. **Memory locality**: Tiered structure hurts cache performance vs flat Swiss table
+### Where elastic hash wins
 
-### Why std.HashMap Still Wins on Lookup
+**Hit lookups at 500K-2M elements, 10-75% load.** The tiered architecture keeps hot fingerprint metadata (1MB for tier 0) in L2 cache, while abseil's flat control byte array (2MB after reserve) spills to L3. This gives a ~15-20% advantage on random-access hit lookups in the sweet spot.
 
-std.HashMap uses SIMD too (Swiss table design), plus:
-- Flat memory layout (better cache locality)
-- No tier jumping during probes
-- Optimized for typical 80% load factor
+**Mixed read/write workloads at 500K.** Up to 50% faster when the access pattern includes inserts and deletes alongside lookups.
 
-The elastic hash pays a cache penalty for the φ-ordering that provides worst-case guarantees.
+**Delete at all sizes.** 2-3x faster consistently.
 
-### vs Google's Original SwissTable (abseil)
+### Where abseil wins
 
-Benchmarking against Google's `absl::flat_hash_map` (the original SwissTable) reveals both Zig implementations are significantly slower:
+**Miss lookups: 2-3x faster.** Abseil's early termination on empty control byte groups stops miss probing after 1-2 groups. Our tiered structure scans 7 probes in tier 0 + 7 in tier 1 before concluding a miss.
 
-| Operation | Google SwissTable | Zig std.HashMap | Elastic Hash |
-|-----------|-------------------|-----------------|--------------|
-| Insert 1M @ 99% | 57ms | 779ms | 217ms |
-| Lookup 1M @ 99% | 43ms | 533ms | 1008ms |
+**Small tables (<100K).** Everything fits in L1, our tier overhead costs more than it saves.
 
-Google's implementation is **10-20x faster** than both Zig hashmaps. This is due to:
-- Years of optimization by Google engineers
-- Hand-tuned SIMD intrinsics for each platform
-- Cache prefetching and memory layout optimizations
-- 8-byte groups on ARM (vs 16-byte here)
+**Large tables (>4M).** Neither side's metadata fits in L2; abseil's flat layout has slightly less overhead.
 
-**The takeaway**: Within Zig, elastic hash wins on insert/delete. But abseil is in a different performance league entirely.
+**High load (99%).** Tier 0 is nearly full, probe depths increase, and the metadata density advantage disappears.
 
-### Why We Win on Insert
+### Caveats
 
-The batch insertion algorithm from the paper distributes elements efficiently:
-- Fills tier 0 to 75%, then starts using tier 1
-- Uses probe limits based on empty fraction (ε)
-- Avoids long probe chains that hurt std.HashMap at high load
+- Tested with u64 keys only. Abseil's hash is designed for strings and composite keys; our multiply hash is integer-specialized.
+- Single machine (x86_64, ~512KB L2). CPUs with different L2 sizes would shift the sweet spot.
+- Compiled with g++ (abseil) vs Zig/LLVM (elastic hash). Different compiler backends may generate different code quality.
```
```diff
 ## Architecture
 
-### Real Elastic Hashing
+### Relationship to the paper
 
-The implementation uses `tier0 = capacity/2` so elements actually spread across tiers:
-- Tier 0: ~50% of elements
-- Tier 1: ~25% of elements
-- Tier 2: ~12.5% of elements
-- etc.
+**Insertion** follows the paper: tiered arrays with geometrically decreasing sizes, batch insertion with three cases based on tier fullness, and probe limits from the f(epsilon) function.
 
-This is "real" elastic hashing as described in the paper, not just a single-tier SIMD hash table.
+**Lookup** searches tier 0 (fast inline path), then calls through an opaque function pointer to check tier 1 (cold overflow path). The function pointer boundary prevents LLVM from cascading optimizations that bloat the hot loop. At 99% load, `get()` finds 100% of elements (97.3% in tier 0, 2.7% in tier 1 via overflow). Early termination on empty slots in the overflow function reduces miss cost in tier 1.
```
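The "geometrically decreasing sizes" can be made concrete with a small sketch. This assumes each tier holds half the slots of the one before it, as the earlier revision of this README described (`tier0 = capacity/2`); the current code's exact split is an assumption here:

```c
#include <assert.h>
#include <stddef.h>

/* Tier i holds total / 2^(i+1) slots: tier 0 gets half the table,
   tier 1 a quarter, and so on (assumed split, mirroring the paper). */
static size_t tier_capacity(size_t total, unsigned tier) {
    return total >> (tier + 1);
}

/* The geometric series guarantees that all tiers together never exceed
   the allocated capacity. */
static size_t tiers_total(size_t total, unsigned ntiers) {
    size_t sum = 0;
    for (unsigned i = 0; i < ntiers; i++)
        sum += tier_capacity(total, i);
    return sum;
}
```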
```diff
-### SIMD Fingerprint Scanning
+### SIMD bucketed probing
 
-- 16-byte buckets scanned with SIMD vector comparison
-- 8-bit fingerprints (top byte of hash, 0=empty, 0xFF=tombstone)
+- 16-element buckets scanned with SSE2 vector comparison
+- 8-bit fingerprints (bits 32-39 of hash), 0=empty, 0xFF=tombstone
 - `@ctz` on bitmask for fast slot finding
-- Tombstone-based deletion (like std.HashMap)
+- Linear probing across buckets with upper-bit hash indexing
```
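The bucket scan those bullets describe can be sketched in C with SSE2 intrinsics (the repo is Zig; function names here are illustrative). One 16-byte compare produces a bitmask, count-trailing-zeros picks the first candidate slot, and a compare against zero gives the empty-slot early exit used on misses:

```c
#include <assert.h>
#include <emmintrin.h> /* SSE2 */
#include <stdint.h>

/* Per the README's convention: fingerprint 0x00 = empty slot. */

/* Scan one 16-slot bucket for a fingerprint. Returns the first matching
   slot index, or -1 if no slot matches. */
static int bucket_match(const uint8_t fps[16], uint8_t fp) {
    __m128i bucket = _mm_loadu_si128((const __m128i *)fps);
    __m128i eq = _mm_cmpeq_epi8(bucket, _mm_set1_epi8((char)fp));
    int mask = _mm_movemask_epi8(eq); /* bit i set <=> slot i matches */
    return mask ? __builtin_ctz(mask) : -1;
}

/* Miss-path early termination: if the bucket holds any empty slot, the
   key was never placed past this bucket, so probing can stop. */
static int bucket_has_empty(const uint8_t fps[16]) {
    __m128i bucket = _mm_loadu_si128((const __m128i *)fps);
    return _mm_movemask_epi8(_mm_cmpeq_epi8(bucket, _mm_setzero_si128())) != 0;
}
```

A lookup would run this over the linear probe sequence, stopping on a fingerprint-and-key match, on an empty bucket, or after the MAX_PROBES limit.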
```diff
-### Separated Memory Layout
+### Memory layout
 
-Fingerprints, keys, and values stored in separate arrays:
-- Fingerprint scanning doesn't pollute cache with keys/values
-- 4 buckets' fingerprints fit in one 64-byte cache line
+- Fingerprints: separate dense array (1MB at 1M elements, fits in L2)
+- Entries: interleaved key-value pairs (value load is free after key check -- same cache line)
+- Software prefetch for entries at probe 0 (hides L3/DRAM latency for random access)
```
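That layout can be sketched as C structs (type and field names are illustrative, not the repo's):

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Entries interleave key and value: once fingerprint and key match, the
   value sits in the same 16-byte pair, usually on the same cache line,
   so loading it is effectively free. */
typedef struct {
    uint64_t key;
    uint64_t value;
} Entry;

/* Fingerprints live in their own dense byte array, so probe scans touch
   1 byte per slot (1MB of metadata at 1M slots); Entry storage is only
   dereferenced after a fingerprint hit. A real lookup would also issue
   __builtin_prefetch(&entries[idx]) at probe 0, per the README. */
typedef struct {
    uint8_t *fps;
    Entry *entries;
    size_t capacity;
} Tier;

static Tier tier_init(size_t capacity) {
    Tier t;
    t.fps = calloc(capacity, sizeof(uint8_t)); /* zeroed: all slots empty */
    t.entries = calloc(capacity, sizeof(Entry));
    t.capacity = capacity;
    return t;
}
```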
```diff
-## Files
+### Key parameters
 
-- `src/simple.zig` - Minimal implementation (~100 lines). Start here if you're learning.
-- `src/main.zig` - Optimized version with fingerprinting, batch insertion, and the φ priority function from the paper.
-- `src/hybrid.zig` - SIMD-accelerated version with:
-  - `HybridElasticHash` - Runtime version
-  - `ComptimeHybridElasticHash` - Compile-time version (faster when capacity is known)
-- `src/bench.zig` - Benchmarks
+| Parameter | Value | Why |
+|-----------|-------|-----|
+| BUCKET_SIZE | 16 | One SSE2 comparison per bucket |
+| MAX_PROBES | 7 | Minimum for 99% load correctness |
+| Batch threshold | 0.12 | 88% fill in tier 0 before tier 1 |
+| Hash | `key * c ^ (key * c >> 32)` | Single multiply, upper bits for bucket index |
```
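The hash in the table above, `key * c ^ (key * c >> 32)`, sketched in C. The multiplier below is an illustrative odd 64-bit constant, not necessarily the repo's `c`; fingerprint extraction from bits 32-39 follows the SIMD section:

```c
#include <assert.h>
#include <stdint.h>

/* key * c ^ (key * c >> 32): a single multiply, then fold the high half
   into the low half. The constant stands in for the README's unnamed c. */
static uint64_t hash_u64(uint64_t key) {
    uint64_t m = key * 0x9E3779B97F4A7C15ULL;
    return m ^ (m >> 32);
}

/* Bucket index from the upper hash bits (power-of-two bucket count),
   per "upper bits for bucket index". */
static uint64_t bucket_of(uint64_t h, unsigned log2_buckets) {
    return h >> (64 - log2_buckets);
}

/* 8-bit fingerprint from bits 32-39 of the hash. 0x00/0xFF are reserved
   for empty/tombstone; how the repo remaps those collisions isn't shown. */
static uint8_t fingerprint_of(uint64_t h) {
    return (uint8_t)(h >> 32);
}
```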
````diff
-## Usage
+## Files
 
-### Test
+- `src/simple.zig` - Minimal implementation (~100 lines). Start here.
+- `src/main.zig` - Base implementation with fingerprinting and batch insertion.
+- `src/hybrid.zig` - SIMD-accelerated version:
+  - `HybridElasticHash` - Runtime version (primary optimization target)
+  - `ComptimeHybridElasticHash` - Compile-time version
+- `src/bench.zig` - Full benchmark suite
+- `src/autobench.zig` - Focused benchmark for abseil comparison
+- `bench-abseil.cpp` - Abseil benchmark (identical keys/capacity)
+- `bench-realistic.cpp` - Realistic workload benchmarks
+- `bench-v2.sh` - Runner that builds and compares both
+- `verify-results.md` - Verification methodology and findings
 
-```
-zig build test
-```
-
-### Benchmark
+## Usage
 
 ```
-zig build bench
+zig build test # run tests
+zig build bench # full benchmark
+bash bench-v2.sh # comparison vs abseil (requires abseil-cpp)
 ```
````
> **Review comment (on lines +90 to 110):** Document the new string-key path here too. The file list and usage section still only point readers at the u64 benchmark flow (
>
> markdownlint-cli2 (0.21.0): [warning] 106-106: Fenced code blocks should have a language specified (MD040, fenced-code-language)
```diff
-## Conclusion
-
-**Is this useful?** Yes, for specific use cases:
-
-| Use Case | Recommendation |
-|----------|----------------|
-| Write-heavy, high load (>95%) | **Use elastic hash** (4-8x insert win) |
-| Delete-heavy, high load | **Use elastic hash** (1.1-2.6x delete win) |
-| Known capacity at compile time | **Use ComptimeHybridElasticHash** (2-7x faster) |
-| Small datasets (<65k) | **Use elastic hash** (wins both insert and lookup) |
-| General purpose | Use std.HashMap |
-| Read-heavy, large datasets | Use std.HashMap |
+## Optimization log
 
-The elastic hash is not a drop-in replacement for std.HashMap, but it's a genuine win for write-heavy workloads at high load factors - which is exactly what the paper claimed.
+40+ experiments across three rounds. See `results-v2.tsv`, `results-v3.tsv` for logs and `insights-v2.md`, `insights-v3.md`, `verify-results.md` for analysis.
```
Benchmark log (new file):

```diff
@@ -0,0 +1,9 @@
+RESULT n=16384 load=99 insert_us=115 lookup_us=132 delete_us=96 miss_us=81
+RESULT n=65536 load=99 insert_us=495 lookup_us=304 delete_us=242 miss_us=211
+RESULT n=262144 load=99 insert_us=2294 lookup_us=1789 delete_us=1384 miss_us=1435
+RESULT n=1048576 load=99 insert_us=23271 lookup_us=18103 delete_us=12619 miss_us=12170
+RESULT n=1048576 load=10 insert_us=10814 lookup_us=1542 delete_us=1176 miss_us=774
+RESULT n=1048576 load=25 insert_us=12422 lookup_us=4443 delete_us=3186 miss_us=2102
+RESULT n=1048576 load=50 insert_us=15929 lookup_us=8952 delete_us=6563 miss_us=5145
+RESULT n=1048576 load=75 insert_us=19171 lookup_us=13214 delete_us=9131 miss_us=8095
+RESULT n=1048576 load=90 insert_us=21959 lookup_us=16550 delete_us=11406 miss_us=10720
```
> **Review comment:** Reconcile the cache explanation with the stated test machine. This section says the 1MB tier-0 fingerprint array fits in L2, but the caveats later say the measurements came from a machine with roughly 512KB L2. Both cannot be true at the same time, so the current explanation is internally inconsistent.
>
> Also applies to: 56-58