joshuaisaact · joshuaisaact · Mar 23, 2026 · Mar 23, 2026 · Mar 23, 2026 · Mar 23, 2026
diff --git a/.gitignore b/.gitignore
@@ -1,3 +1,8 @@
 .zig-cache
+zig-out
 .claude
 NOTES.md
+bench.log
+bench-abseil
+abseil-v2.log
+elastic-v2.log
diff --git a/AGENTS.md b/AGENTS.md
@@ -0,0 +1,17 @@
+# AGENTS.md
+
+## Frozen files
+
+`src/simple.zig` and `src/bench.zig` are reference implementations. Do not modify them.
+
+## Autoresearch programs
+
+`program.md` and `program-v2.md` are autonomous agent programs (not documentation). When asked to "start" or "run" one, read it fully and execute its loop. Each defines its own set of frozen files, editable files, and keep/revert criteria -- read before editing anything.
+
+## Zig skills
+
+The skills `zig-perf`, `zig-quality`, `zig-safety`, `zig-style`, and `zig-testing` are available globally.
+
+## Abseil comparison benchmarks
+
+The abseil benchmark (`bench-abseil.cpp`, created by program-v2) requires system-installed `abseil-cpp` with pkg-config modules: `absl_hash`, `absl_raw_hash_set`, `absl_hashtablez_sampler`.
diff --git a/README.md b/README.md
@@ -1,163 +1,114 @@
 # elastic-hash-zig
 
-> **Disclaimer:** I'm still learning Zig and there may be memory crimes.
-
-Elastic hashing implementation in Zig. Based on [Elastic Hashing](https://arxiv.org/pdf/2501.02305).
+SIMD hash table in Zig, inspired by [Optimal Bounds for Open Addressing Without Reordering](https://arxiv.org/abs/2501.02305) (Farach-Colton, Krapivin, Kuszmaul 2025). Uses the paper's tiered batch insertion and multi-tier lookup via opaque overflow.
 
 Requires Zig 0.14+ (tested on 0.16.0-dev).
 
 See my blog post for a walkthrough: [www.joshtuddenham.dev/blog/hashmaps](https://www.joshtuddenham.dev/blog/hashmaps)
 
-## Results
-
-### True 99% Load Factor (of actual capacity)
-
-The hybrid implementation compared to Zig's `std.HashMap` at **true 99% of actual capacity**. std.HashMap is based on Google's [SwissTable](https://abseil.io/about/design/swisstables).
-
-| Capacity | Insert | Lookup |
-|----------|--------|--------|
-| 16k | **4.34x** | **1.15x** |
-| 65k | **7.92x** | **1.68x** |
-| 262k | **4.77x** | 0.75x |
-| 524k | **4.33x** | 0.78x |
-| 1M | **4.40x** | 0.77x |
-| 2M | **4.44x** | 0.72x |
-
-**Insert is 4-8x faster** across all sizes at true 99% load.
-
-**Lookup** wins at smaller sizes (16k-65k), loses ~25% at larger sizes due to φ-ordering cache effects.
-
-### Delete Performance
-
-Delete at 99% load factor (deleting 50% of elements):
-
-| Capacity | Delete |
-|----------|--------|
-| 16k | **1.72x** |
-| 65k | **2.63x** |
-| 262k | **1.29x** |
-| 1M | **1.14x** |
-
-**Delete is faster** across all sizes at high load, with bigger wins at smaller sizes.
+## vs Google's abseil `flat_hash_map`
 
-### Comptime vs Runtime
+Benchmarked against `absl::flat_hash_map` (the original SwissTable) with u64 keys. Both sides use `reserve(n)` / `init(n)` for the same target capacity. Random keys via splitmix64, median of 10 runs, 2 warmup discards. Full methodology and verification in `verify-results.md`.
 
-When capacity is known at compile time, the comptime version significantly outperforms runtime:
+### Hit lookup (shuffled random access, n=1,048,576)
 
-| n | Insert | Lookup |
-|---|--------|--------|
-| 10k | **2.06x** | **4.63x** |
-| 100k | **2.73x** | **6.71x** |
-| 1M | **2.21x** | **2.36x** |
+| Load | Gap (abseil/elastic) | Winner |
+|------|---------------------|--------|
+| 10% | **1.16** | Elastic 16% faster |
+| 25% | **1.21** | Elastic 21% faster |
+| 50% | **1.18** | Elastic 18% faster |
+| 75% | **1.08** | Elastic 8% faster |
+| 90% | 0.96 | Roughly tied |
+| 99% | 0.86 | Abseil 14% faster |
 
-## Key Findings
+### Realistic workloads
 
-### What Works
+| Workload | 100K | 500K | 1M |
+|----------|------|------|-----|
+| Mixed r/w (80% hit, 10% miss, 5% ins, 5% del) | 0.77 | **1.49** | 0.98 |
+| Hot-key / zipf-like lookup | 0.72 | **1.07** | **1.29** |
+| Build-then-read (insert N, 10N random reads) | 0.77 | 0.98 | 0.81 |
 
-1. **Insert-heavy workloads at high load**: 4-8x faster than std.HashMap at 99% load
-2. **Delete operations**: 1.1-2.6x faster than std.HashMap at high load
-3. **Known-capacity scenarios**: Comptime version is 2-7x faster
-4. **Small-to-medium datasets**: Both insert and lookup win up to ~65k elements
-5. **Worst-case guarantees**: O(log²(1/ε)) expected probes from the paper
+### Delete performance
 
-### What Doesn't Work
+2-3x faster than abseil at all sizes and loads. O(1) tombstone marking vs abseil's find-then-erase.
 
-1. **Lookup at large sizes**: φ-ordering causes cache misses when jumping between tiers
-2. **General-purpose replacement**: std.HashMap wins for typical mixed workloads
-3. **Memory locality**: Tiered structure hurts cache performance vs flat Swiss table
+### Where elastic hash wins
 
-### Why std.HashMap Still Wins on Lookup
+**Hit lookups at 500K-2M elements, 10-75% load.** The tiered architecture keeps hot fingerprint metadata (1MB for tier 0) in L2 cache, while abseil's flat control byte array (2MB after reserve) spills to L3. This gives a ~15-20% advantage on random-access hit lookups in the sweet spot.
 
-std.HashMap uses SIMD too (Swiss table design), plus:
-- Flat memory layout (better cache locality)
-- No tier jumping during probes
-- Optimized for typical 80% load factor
+**Mixed read/write workloads at 500K.** Up to 50% faster when the access pattern includes inserts and deletes alongside lookups.
 
-The elastic hash pays a cache penalty for the φ-ordering that provides worst-case guarantees.
+**Delete at all sizes.** 2-3x faster consistently.
 
-### vs Google's Original SwissTable (abseil)
+### Where abseil wins
 
-Benchmarking against Google's `absl::flat_hash_map` (the original SwissTable) reveals both Zig implementations are significantly slower:
+**Miss lookups: 2-3x faster.** Abseil's early termination on empty control byte groups stops miss probing after 1-2 groups. Our tiered structure scans 7 probes in tier 0 + 7 in tier 1 before concluding a miss.
 
-| Operation | Google SwissTable | Zig std.HashMap | Elastic Hash |
-|-----------|-------------------|-----------------|--------------|
-| Insert 1M @ 99% | 57ms | 779ms | 217ms |
-| Lookup 1M @ 99% | 43ms | 533ms | 1008ms |
+**Small tables (<100K).** Everything fits in L1, our tier overhead costs more than it saves.
 
-Google's implementation is **10-20x faster** than both Zig hashmaps. This is due to:
-- Years of optimization by Google engineers
-- Hand-tuned SIMD intrinsics for each platform
-- Cache prefetching and memory layout optimizations
-- 8-byte groups on ARM (vs 16-byte here)
+**Large tables (>4M).** Neither side's metadata fits in L2; abseil's flat layout has slightly less overhead.
 
-**The takeaway**: Within Zig, elastic hash wins on insert/delete. But abseil is in a different performance league entirely.
+**High load (99%).** Tier 0 is nearly full, probe depths increase, and the metadata density advantage disappears.
 
-### Why We Win on Insert
+### Caveats
 
-The batch insertion algorithm from the paper distributes elements efficiently:
-- Fills tier 0 to 75%, then starts using tier 1
-- Uses probe limits based on empty fraction (ε)
-- Avoids long probe chains that hurt std.HashMap at high load
+- Tested with u64 keys only. Abseil's hash is designed for strings and composite keys; our multiply hash is integer-specialized.
+- Single machine (x86_64, ~512KB L2). CPUs with different L2 sizes would shift the sweet spot.
+- Compiled with g++ (abseil) vs Zig/LLVM (elastic hash). Different compiler backends may generate different code quality.
 
 ## Architecture
 
-### Real Elastic Hashing
+### Relationship to the paper
 
-The implementation uses `tier0 = capacity/2` so elements actually spread across tiers:
-- Tier 0: ~50% of elements
-- Tier 1: ~25% of elements
-- Tier 2: ~12.5% of elements
-- etc.
+**Insertion** follows the paper: tiered arrays with geometrically decreasing sizes, batch insertion with three cases based on tier fullness, and probe limits from the f(epsilon) function.
 
-This is "real" elastic hashing as described in the paper, not just a single-tier SIMD hash table.
+**Lookup** searches tier 0 (fast inline path), then calls through an opaque function pointer to check tier 1 (cold overflow path). The function pointer boundary prevents LLVM from cascading optimizations that bloat the hot loop. At 99% load, `get()` finds 100% of elements (97.3% in tier 0, 2.7% in tier 1 via overflow). Early termination on empty slots in the overflow function reduces miss cost in tier 1.
 
-### SIMD Fingerprint Scanning
+### SIMD bucketed probing
 
-- 16-byte buckets scanned with SIMD vector comparison
-- 8-bit fingerprints (top byte of hash, 0=empty, 0xFF=tombstone)
+- 16-element buckets scanned with SSE2 vector comparison
+- 8-bit fingerprints (bits 32-39 of hash), 0=empty, 0xFF=tombstone
 - `@ctz` on bitmask for fast slot finding
-- Tombstone-based deletion (like std.HashMap)
+- Linear probing across buckets with upper-bit hash indexing
 
-### Separated Memory Layout
+### Memory layout
 
-Fingerprints, keys, and values stored in separate arrays:
-- Fingerprint scanning doesn't pollute cache with keys/values
-- 4 buckets' fingerprints fit in one 64-byte cache line
+- Fingerprints: separate dense array (1MB at 1M elements, fits in L2)
+- Entries: interleaved key-value pairs (value load is free after key check -- same cache line)
+- Software prefetch for entries at probe 0 (hides L3/DRAM latency for random access)
 
-## Files
+### Key parameters
 
-- `src/simple.zig` - Minimal implementation (~100 lines). Start here if you're learning.
-- `src/main.zig` - Optimized version with fingerprinting, batch insertion, and the φ priority function from the paper.
-- `src/hybrid.zig` - SIMD-accelerated version with:
-  - `HybridElasticHash` - Runtime version
-  - `ComptimeHybridElasticHash` - Compile-time version (faster when capacity is known)
-- `src/bench.zig` - Benchmarks
+| Parameter | Value | Why |
+|-----------|-------|-----|
+| BUCKET_SIZE | 16 | One SSE2 comparison per bucket |
+| MAX_PROBES | 7 | Minimum for 99% load correctness |
+| Batch threshold | 0.12 | 88% fill in tier 0 before tier 1 |
+| Hash | `key * c ^ (key * c >> 32)` | Single multiply, upper bits for bucket index |
 
-## Usage
+## Files
 
-### Test
+- `src/simple.zig` - Minimal implementation (~100 lines). Start here.
+- `src/main.zig` - Base implementation with fingerprinting and batch insertion.
+- `src/hybrid.zig` - SIMD-accelerated version:
+  - `HybridElasticHash` - Runtime version (primary optimization target)
+  - `ComptimeHybridElasticHash` - Compile-time version
+- `src/bench.zig` - Full benchmark suite
+- `src/autobench.zig` - Focused benchmark for abseil comparison
+- `bench-abseil.cpp` - Abseil benchmark (identical keys/capacity)
+- `bench-realistic.cpp` - Realistic workload benchmarks
+- `bench-v2.sh` - Runner that builds and compares both
+- `verify-results.md` - Verification methodology and findings
 
-```
-zig build test
-```
-
-### Benchmark
+## Usage
 
 ```
-zig build bench
+zig build test       # run tests
+zig build bench      # full benchmark
+bash bench-v2.sh     # comparison vs abseil (requires abseil-cpp)
 ```
 
-## Conclusion
-
-**Is this useful?** Yes, for specific use cases:
-
-| Use Case | Recommendation |
-|----------|----------------|
-| Write-heavy, high load (>95%) | **Use elastic hash** (4-8x insert win) |
-| Delete-heavy, high load | **Use elastic hash** (1.1-2.6x delete win) |
-| Known capacity at compile time | **Use ComptimeHybridElasticHash** (2-7x faster) |
-| Small datasets (<65k) | **Use elastic hash** (wins both insert and lookup) |
-| General purpose | Use std.HashMap |
-| Read-heavy, large datasets | Use std.HashMap |
+## Optimization log
 
-The elastic hash is not a drop-in replacement for std.HashMap, but it's a genuine win for write-heavy workloads at high load factors - which is exactly what the paper claimed.
+40+ experiments across three rounds. See `results-v2.tsv`, `results-v3.tsv` for logs and `insights-v2.md`, `insights-v3.md`, `verify-results.md` for analysis.
diff --git a/bench-abseil.cpp b/bench-abseil.cpp
@@ -0,0 +1,127 @@
+#include <cstdint>
+#include <cstdio>
+#include <cstdlib>
+#include <algorithm>
+#include <chrono>
+#include <vector>
+#include "absl/container/flat_hash_map.h"
+
+template <typename T>
+inline void do_not_optimize(T const& val) {
+    asm volatile("" : : "r,m"(val) : "memory");
+}
+
+static uint64_t splitmix64(uint64_t& state) {
+    state += 0x9e3779b97f4a7c15;
+    uint64_t z = state;
+    z = (z ^ (z >> 30)) * 0xbf58476d1ce4e5b9;
+    z = (z ^ (z >> 27)) * 0x94d049bb133111eb;
+    return z ^ (z >> 31);
+}
+
+constexpr uint64_t KEY_SEED = 0xDEADBEEF12345678;
+constexpr uint64_t MISS_SEED = 0xCAFEBABE87654321;
+constexpr int TOTAL_RUNS = 12;
+constexpr int WARMUP = 2;
+constexpr int MEASURED = TOTAL_RUNS - WARMUP;
+
+struct BenchResult {
+    uint64_t insert_us;
+    uint64_t lookup_us;
+    uint64_t delete_us;
+    uint64_t miss_us;
+};
+
+static uint64_t median(uint64_t* arr, int n) {
+    std::sort(arr, arr + n);
+    return arr[n / 2];
+}
+
+static BenchResult bench(size_t n, size_t fill) {
+    // Pre-generate keys (same for every run)
+    std::vector<uint64_t> keys(fill), miss_keys(fill);
+    uint64_t ks = KEY_SEED, ms = MISS_SEED;
+    for (size_t i = 0; i < fill; i++) {
+        keys[i] = splitmix64(ks);
+        miss_keys[i] = splitmix64(ms);
+    }
+
+    uint64_t ins[MEASURED], lkp[MEASURED], del[MEASURED], mis[MEASURED];
+
+    for (int r = 0; r < TOTAL_RUNS; r++) {
+        absl::flat_hash_map<uint64_t, uint64_t> map;
+        map.reserve(n);
+
+        auto start = std::chrono::steady_clock::now();
+        for (size_t i = 0; i < fill; i++) {
+            map.emplace(keys[i], i);
+        }
+        auto end = std::chrono::steady_clock::now();
+        uint64_t insert_us = std::chrono::duration_cast<std::chrono::microseconds>(end - start).count();
+
+        start = std::chrono::steady_clock::now();
+        for (size_t i = 0; i < fill; i++) {
+            auto it = map.find(keys[i]);
+            do_not_optimize(it->second);
+        }
+        end = std::chrono::steady_clock::now();
+        uint64_t lookup_us = std::chrono::duration_cast<std::chrono::microseconds>(end - start).count();
+
+        start = std::chrono::steady_clock::now();
+        for (size_t i = 0; i < fill; i++) {
+            auto it = map.find(miss_keys[i]);
+            do_not_optimize(it == map.end());
+        }
+        end = std::chrono::steady_clock::now();
+        uint64_t miss_us = std::chrono::duration_cast<std::chrono::microseconds>(end - start).count();
+
+        start = std::chrono::steady_clock::now();
+        for (size_t i = 0; i < fill / 2; i++) {
+            auto it = map.find(keys[i]);
+            if (it != map.end()) map.erase(it);
+        }
+        end = std::chrono::steady_clock::now();
+        uint64_t delete_us = std::chrono::duration_cast<std::chrono::microseconds>(end - start).count();
+
+        if (r >= WARMUP) {
+            int idx = r - WARMUP;
+            ins[idx] = insert_us;
+            lkp[idx] = lookup_us;
+            del[idx] = delete_us;
+            mis[idx] = miss_us;
+        }
+    }
+
+    return {median(ins, MEASURED), median(lkp, MEASURED),
+            median(del, MEASURED), median(mis, MEASURED)};
+}
+
+int main() {
+    constexpr size_t sizes[] = {16384, 65536, 262144, 1048576, 2097152};
+    constexpr int load_pcts[] = {10, 25, 50, 75, 90};
+
+    fprintf(stderr, "\n=== Abseil flat_hash_map Benchmark (median of %d, %d warmup) ===\n\n",
+            MEASURED, WARMUP);
+
+    // All sizes at 99% load
+    for (size_t n : sizes) {
+        size_t fill = n * 99 / 100;
+        auto r = bench(n, fill);
+        printf("RESULT\tn=%zu\tload=99\tinsert_us=%lu\tlookup_us=%lu\tdelete_us=%lu\tmiss_us=%lu\n",
+               n, r.insert_us, r.lookup_us, r.delete_us, r.miss_us);
+        fflush(stdout);
+    }
+
+    // Multiple load factors at 1M
+    constexpr size_t n = 1048576;
+    for (int pct : load_pcts) {
+        size_t fill = n * pct / 100;
+        auto r = bench(n, fill);
+        printf("RESULT\tn=%zu\tload=%d\tinsert_us=%lu\tlookup_us=%lu\tdelete_us=%lu\tmiss_us=%lu\n",
+               n, pct, r.insert_us, r.lookup_us, r.delete_us, r.miss_us);
+        fflush(stdout);
+    }
+
+    fprintf(stderr, "\nDONE\n");
+    return 0;
+}