85 commits
a4a2c7a
tier-first search order in get() for better cache locality
joshuaisaact Mar 23, 2026
25c5de2
Revert "tier-first search order in get() for better cache locality"
joshuaisaact Mar 23, 2026
330ddb6
cross-tier prefetching in get() to hide inter-tier latency
joshuaisaact Mar 23, 2026
d5d671a
Revert "cross-tier prefetching in get() to hide inter-tier latency"
joshuaisaact Mar 23, 2026
f9fc247
remove all prefetching from get() to test if it helps or hurts
joshuaisaact Mar 23, 2026
18b4ce8
add tier-0 probe-0 fast path in get()
joshuaisaact Mar 23, 2026
446e450
Revert "add tier-0 probe-0 fast path in get()"
joshuaisaact Mar 23, 2026
b48b0a3
reduce MAX_PROBES from 32 to 24
joshuaisaact Mar 23, 2026
6bbdd09
reduce MAX_PROBES from 24 to 20
joshuaisaact Mar 23, 2026
6ca9ddc
cap get() tier search to 8 tiers
joshuaisaact Mar 23, 2026
d42e9ee
reduce MAX_LOOKUP_TIERS from 8 to 6
joshuaisaact Mar 23, 2026
33c9feb
Revert "reduce MAX_LOOKUP_TIERS from 8 to 6"
joshuaisaact Mar 23, 2026
9530f6c
use fixed-size arrays for tier metadata instead of heap slices
joshuaisaact Mar 23, 2026
1c52805
Revert "use fixed-size arrays for tier metadata instead of heap slices"
joshuaisaact Mar 23, 2026
907f4f8
switch hash to Stafford variant 13 (splitmix64 finalizer)
joshuaisaact Mar 23, 2026
a8629ca
Revert "switch hash to Stafford variant 13 (splitmix64 finalizer)"
joshuaisaact Mar 23, 2026
bbf67f0
switch to linear probing for better cache locality
joshuaisaact Mar 23, 2026
6e5b3dc
add early exit on empty bucket slots in get()
joshuaisaact Mar 23, 2026
e580524
Revert "add early exit on empty bucket slots in get()"
joshuaisaact Mar 23, 2026
5f62379
combined fp+empty SIMD check with per-tier early exit
joshuaisaact Mar 23, 2026
ac8b602
Revert "combined fp+empty SIMD check with per-tier early exit"
joshuaisaact Mar 23, 2026
6f57a0a
double tier 0 size to concentrate elements for faster lookup
joshuaisaact Mar 23, 2026
f7328f9
quadruple tier 0 size (2x capacity / BUCKET_SIZE)
joshuaisaact Mar 23, 2026
be1c046
Revert "quadruple tier 0 size (2x capacity / BUCKET_SIZE)"
joshuaisaact Mar 23, 2026
710938e
reduce MAX_LOOKUP_TIERS from 8 to 4 with larger tier 0
joshuaisaact Mar 23, 2026
3d427e3
reduce MAX_LOOKUP_TIERS from 4 to 3
joshuaisaact Mar 23, 2026
6188a77
reduce MAX_LOOKUP_TIERS from 3 to 2
joshuaisaact Mar 23, 2026
c9ea0fe
reduce MAX_LOOKUP_TIERS to 1 (tier 0 only)
joshuaisaact Mar 23, 2026
8c9b749
simplify get() to single tier 0 loop, no tier iteration
joshuaisaact Mar 23, 2026
eaf5bbe
Revert "simplify get() to single tier 0 loop, no tier iteration"
joshuaisaact Mar 23, 2026
a4c6c90
use inline for to unroll probe loop in get()
joshuaisaact Mar 23, 2026
db48705
Revert "use inline for to unroll probe loop in get()"
joshuaisaact Mar 23, 2026
4841fba
reduce MAX_PROBES from 20 to 10
joshuaisaact Mar 23, 2026
4260bd4
reduce MAX_PROBES from 10 to 8
joshuaisaact Mar 23, 2026
004b0b7
reduce MAX_PROBES from 8 to 6
joshuaisaact Mar 23, 2026
03b078a
Revert "reduce MAX_PROBES from 8 to 6"
joshuaisaact Mar 23, 2026
2f6f0b8
increase BUCKET_SIZE to 32 for AVX2, generic mask types
joshuaisaact Mar 23, 2026
94cd8b0
Revert "increase BUCKET_SIZE to 32 for AVX2, generic mask types"
joshuaisaact Mar 23, 2026
784e8ca
try quadratic probing instead of linear
joshuaisaact Mar 23, 2026
eb3837d
Revert "try quadratic probing instead of linear"
joshuaisaact Mar 23, 2026
6f15cd3
inline for unroll 8-probe loop and eliminate tier loop
joshuaisaact Mar 23, 2026
ab64ffb
Revert "inline for unroll 8-probe loop and eliminate tier loop"
joshuaisaact Mar 23, 2026
dd36b35
simplify get() to tier-0-only with comptime MAX_PROBES loop bound
joshuaisaact Mar 23, 2026
e070ea9
remove insert prefetch
joshuaisaact Mar 23, 2026
9ab9cc0
delay batch transition to 90% fill (was 75%) to keep more in tier 0
joshuaisaact Mar 23, 2026
ace8f81
limit insertIntoTier probe depth to MAX_PROBES for lookup alignment
joshuaisaact Mar 23, 2026
4a423da
try 4x tier 0 capacity with probe-limited insert
joshuaisaact Mar 23, 2026
dd35c71
Revert "try 4x tier 0 capacity with probe-limited insert"
joshuaisaact Mar 23, 2026
bd6cd0b
tighten batch threshold to 0.08 (92% fill)
joshuaisaact Mar 23, 2026
c6a1a44
optimize remove() to tier-0-only with comptime loop bound
joshuaisaact Mar 23, 2026
5b88308
try faster 64-bit multiply hash
joshuaisaact Mar 23, 2026
56af0b9
try single-multiply hash
joshuaisaact Mar 23, 2026
97c8fa3
noinline findKeyInBucket to keep hot path tight
joshuaisaact Mar 23, 2026
49e1c9c
Revert "noinline findKeyInBucket to keep hot path tight"
joshuaisaact Mar 23, 2026
532520f
fast path for single FP match in findKeyInBucket
joshuaisaact Mar 23, 2026
2ae4de7
cache tier0 metadata in struct fields for faster get/remove
joshuaisaact Mar 23, 2026
022c478
hardcode tier0_start=0, eliminate addition in get/remove
joshuaisaact Mar 23, 2026
6f82228
remove unused tier0_start field
joshuaisaact Mar 23, 2026
7ab6486
try MAX_PROBES=6 again with all other optimizations
joshuaisaact Mar 23, 2026
2ce441c
Revert "try MAX_PROBES=6 again with all other optimizations"
joshuaisaact Mar 23, 2026
c9e700c
cache bucket mask directly, eliminate subtraction per probe
joshuaisaact Mar 23, 2026
740bc74
prefetch keys for probe 0 to hide key read latency
joshuaisaact Mar 23, 2026
1e118e7
Revert "prefetch keys for probe 0 to hide key read latency"
joshuaisaact Mar 23, 2026
bfd53c3
branchless fingerprint using 7 bits + 1 (range 1-128)
joshuaisaact Mar 23, 2026
2200af5
Revert "branchless fingerprint using 7 bits + 1 (range 1-128)"
joshuaisaact Mar 23, 2026
23f2a59
reorder struct fields: hot path first for cache line alignment
joshuaisaact Mar 23, 2026
37c7627
try inline for with precomputed probe array
joshuaisaact Mar 23, 2026
f6b581a
Revert "try inline for with precomputed probe array"
joshuaisaact Mar 23, 2026
90f051a
use bits 32-39 for fingerprint, less correlation with bucket index
joshuaisaact Mar 23, 2026
45de2ab
try bits 24-31 for fingerprint
joshuaisaact Mar 23, 2026
b3fdcef
ultra-fast path: check slot 0 with validity check
joshuaisaact Mar 23, 2026
e4a3802
separate probe 0 check to avoid redundant work in main loop
joshuaisaact Mar 23, 2026
bb13814
unroll probes 0 and 1 before the loop
joshuaisaact Mar 23, 2026
4718a34
Revert "unroll probes 0 and 1 before the loop"
joshuaisaact Mar 23, 2026
53ffd0a
prefetch values for probe 0 while SIMD runs
joshuaisaact Mar 23, 2026
ebb9f26
Revert "prefetch values for probe 0 while SIMD runs"
joshuaisaact Mar 23, 2026
e22f93b
branchless fingerprint using max/min clamp
joshuaisaact Mar 23, 2026
97389fa
Revert "branchless fingerprint using max/min clamp"
joshuaisaact Mar 23, 2026
6600b40
batch threshold 0.06 (94% fill)
joshuaisaact Mar 23, 2026
f0f51aa
Revert "batch threshold 0.06 (94% fill)"
joshuaisaact Mar 23, 2026
28eded4
batch threshold 0.12 (88% fill)
joshuaisaact Mar 23, 2026
4e2f019
batch threshold 0.15 (85% fill)
joshuaisaact Mar 23, 2026
f380498
Revert "batch threshold 0.15 (85% fill)"
joshuaisaact Mar 23, 2026
7aa8cd6
add experiment results, benchmark harness, and autoresearch artifacts
joshuaisaact Mar 23, 2026
1393871
add rigorous abseil comparison, correct stale README claims
joshuaisaact Mar 23, 2026
2 changes: 2 additions & 0 deletions .gitignore
@@ -1,3 +1,5 @@
.zig-cache
zig-out
.claude
NOTES.md
bench.log
25 changes: 14 additions & 11 deletions README.md
@@ -77,20 +77,23 @@ The elastic hash pays a cache penalty for the φ-ordering that provides worst-ca

### vs Google's Original SwissTable (abseil)

Benchmarking against Google's `absl::flat_hash_map` (the original SwissTable) reveals both Zig implementations are significantly slower:
Benchmarked against `absl::flat_hash_map` using correct APIs (`emplace`/`find`/`erase(iterator)`), `-O3 -march=native -DNDEBUG`, hashtablez sampling disabled, 10 measured runs.

| Operation | Google SwissTable | Zig std.HashMap | Elastic Hash |
|-----------|-------------------|-----------------|--------------|
| Insert 1M @ 99% | 57ms | 779ms | 217ms |
| Lookup 1M @ 99% | 43ms | 533ms | 1008ms |
At 1M elements, 99% load (µs, lower is better):

Google's implementation is **10-20x faster** than both Zig hashmaps. This is due to:
- Years of optimization by Google engineers
- Hand-tuned SIMD intrinsics for each platform
- Cache prefetching and memory layout optimizations
- 8-byte groups on ARM (vs 16-byte here)
| Operation | Google SwissTable | Elastic Hash | Zig std.HashMap |
|-----------|-------------------|--------------|-----------------|
| Insert | 15,685 | 16,850 | 52,000 |
| Lookup | 8,200-8,850 | 8,550-9,000 | 43,000 |
| Delete | 8,200-8,400 | **2,100-2,200** | 4,000 |

**The takeaway**: Within Zig, elastic hash wins on insert/delete. But abseil is in a different performance league entirely.
At 99% load, lookup and insert are roughly tied with abseil. Elastic hash wins convincingly on delete (3.9x faster) due to O(1) tombstone marking vs abseil's rehash-on-delete.

At lower load factors (10-75%), abseil is faster on both lookup and insert -- its flat memory layout has better cache behavior when the table is sparse. The elastic hash tier overhead doesn't pay off until the table is nearly full.

At 2M+ elements, abseil pulls ahead on lookup again as our fingerprint array exceeds L2 cache.

**The takeaway**: Elastic hash matches abseil at extreme load and dominates on delete. Abseil wins at low-to-moderate load and at larger scales. See [results.md](results.md) for the full comparison.
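
The delete advantage is mechanical: a tombstone delete touches one fingerprint byte and moves nothing. A minimal C sketch of the idea (function names and the 0/255 sentinel values are illustrative, not the repo's actual Zig code):

```c
#include <stdint.h>
#include <stddef.h>

#define SLOT_EMPTY     0   /* fingerprint value reserved for "never used" */
#define SLOT_TOMBSTONE 255 /* fingerprint value reserved for "deleted"    */

/* Remove a key by overwriting its slot's fingerprint byte with a
 * tombstone. Nothing moves and nothing rehashes, so deletion is O(1);
 * later lookups simply probe past tombstones. */
static int remove_slot(uint8_t *fps, const uint64_t *keys, size_t idx,
                       uint64_t key) {
    if (fps[idx] == SLOT_EMPTY || fps[idx] == SLOT_TOMBSTONE ||
        keys[idx] != key)
        return 0;              /* key is not in this slot */
    fps[idx] = SLOT_TOMBSTONE; /* O(1) mark, no data movement */
    return 1;
}
```

Abseil's erase, by contrast, can trigger rehashing work, which is where the 3.9x gap comes from.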

### Why We Win on Insert

14 changes: 14 additions & 0 deletions bench.sh
@@ -0,0 +1,14 @@
#!/bin/bash
# Autoresearch benchmark runner. DO NOT MODIFY.
# Builds and runs the focused autobench, captures output.
set -euo pipefail

echo "=== Building (ReleaseFast) ==="
zig build autobench -Doptimize=ReleaseFast 2>&1

echo "=== Running benchmark ==="
./zig-out/bin/autobench 2>&1 | tee bench.log

echo ""
echo "=== Primary metric (lookup_ratio at n=1048576) ==="
grep "RESULT" bench.log | grep "n=1048576" | grep -oP 'lookup_ratio=\K[0-9.]+'
14 changes: 14 additions & 0 deletions build.zig
@@ -59,4 +59,18 @@ pub fn build(b: *std.Build) void {
}
const bench_step = b.step("bench", "Run benchmarks");
bench_step.dependOn(&run_bench.step);

// Autoresearch benchmark (focused, machine-parseable output)
const autobench = b.addExecutable(.{
.name = "autobench",
.root_module = b.createModule(.{
.root_source_file = b.path("src/autobench.zig"),
.target = target,
.optimize = optimize,
}),
});

const run_autobench = b.addRunArtifact(autobench);
const autobench_step = b.step("autobench", "Run autoresearch benchmark");
autobench_step.dependOn(&run_autobench.step);
}
56 changes: 56 additions & 0 deletions insights.md
@@ -0,0 +1,56 @@
# Autoresearch Insights (58 experiments)

⚠️ Potential issue | 🟡 Minor

Experiment count mismatch. The title says "58 experiments" but results.md documents 62 experiments. Consider updating the title to match.

📝 Suggested fix

-# Autoresearch Insights (58 experiments)
+# Autoresearch Insights (62 experiments)



## Result: lookup_ratio ~5.2x at n=1048576

Started at ~1.04, now ~5.2x faster than std.HashMap on lookup at 1M elements, 99% load.
Insert: ~3.3x faster. Delete: ~1.7x faster. All operations beat std.HashMap.

## Final architecture

```
get(key):
h = key * 0x517cc1b727220a95; h ^= h >> 32
fp = (h >> 32) clamped to 1-254 # bits 32-39, independent from bucket bits
mask = self.tier0_bucket_mask # cached in struct

bucket0 = h & mask # probe 0 separated from loop
if findKeyInBucket(bucket0, key, fp): return

for probe 1..7: # comptime MAX_PROBES=8
bucket = (h + probe) & mask # linear probing
if findKeyInBucket(bucket, key, fp): return

findKeyInBucket:
SIMD compare 16 fingerprints
if single match: check key directly (fast path)
else: iterate remaining matches (rare)
```
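
The hash and probe arithmetic above can be sketched in C (the multiply constant and bit positions come from the pseudocode; function names and the 1024-bucket mask are illustrative, not the actual Zig identifiers):

```c
#include <stdint.h>

/* Single-multiply hash: one multiply plus one xor-shift finalizer. */
static uint64_t mix(uint64_t key) {
    uint64_t h = key * 0x517cc1b727220a95ULL;
    return h ^ (h >> 32);
}

/* Fingerprint from bits 32-39, clamped into 1..254 so that 0 can mean
 * "empty slot" and 255 can mean "tombstone"; these bits are independent
 * of the low bits used for the bucket index. */
static uint8_t fingerprint(uint64_t h) {
    uint8_t fp = (uint8_t)(h >> 32);
    if (fp < 1) fp = 1;
    if (fp > 254) fp = 254;
    return fp;
}

/* Linear probing: probe p lands in an adjacent bucket, so sequential
 * probes walk adjacent cache lines instead of jumping across tiers. */
static uint64_t probe_bucket(uint64_t h, uint64_t mask, unsigned probe) {
    return (h + probe) & mask;
}
```

With `MAX_PROBES = 8`, a lookup calls `probe_bucket(h, mask, 0)` first (the separated probe-0 path), then probes 1 through 7 in the loop.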

## Optimization categories ranked by cumulative impact

### Loop reduction (~10x)
- MAX_PROBES: 100 -> 8
- Tier search: 16 tiers -> 1 tier (tier 0 only)
- Separated probe 0 from loop (avoids loop entry for ~60% of lookups)

### Cache optimization (~2x)
- Removed ALL software prefetch (hurts every time)
- Linear probing (adjacent cache lines for sequential probes)
- Larger tier 0 (capacity/BUCKET_SIZE, holds all elements)
- Hardcoded tier0_start=0 (eliminated addition)
- Cached tier0_bucket_mask in struct

### Algorithm tuning (~1.3x)
- Single-multiply hash (h = key * const; h ^= h >> 32)
- Fingerprint from bits 32-39 (less correlation with bucket index from low bits)
- Fast path for single FP match in findKeyInBucket
- Delayed batch threshold (92% fill before switching)
- Insert probe depth limited to MAX_PROBES

## What never works for this workload
- Software prefetching (ANY form)
- Extra branches in the hot loop
- Inline-for loop unrolling (instruction cache pressure)
- Noinline on hot functions (call overhead)
- Reduced fingerprint bits (higher false positive rate)
- Larger struct (pushes hot fields apart)
121 changes: 121 additions & 0 deletions program.md
@@ -0,0 +1,121 @@
# Autoresearch: Elastic Hash Lookup Optimization

You are an autonomous research agent optimizing a Zig elastic hash table implementation. Your goal is to improve **lookup performance at large sizes (1M+ elements, 99% load factor)** without regressing insert performance.

## Background

Elastic hashing (based on [this paper](https://arxiv.org/pdf/2501.02305)) distributes elements across geometrically-decreasing tiers. It's 4-8x faster than std.HashMap on insert at 99% load. But lookup at 1M+ elements is ~25% slower than std.HashMap due to tier-jumping cache misses.

The architecture: tier 0 holds ~50% of elements, tier 1 ~25%, tier 2 ~12.5%, etc. Each tier has SIMD fingerprint buckets (16 bytes). Lookups probe across all tiers in phi-priority order, which destroys cache locality at large sizes because it jumps between distant memory regions.
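
The geometric halving above can be sketched in one line of C (illustrative only; the real Zig code may round tier sizes to bucket boundaries):

```c
#include <stddef.h>

/* Tier k of a geometric layout holds roughly capacity / 2^(k+1)
 * elements: tier 0 ~50%, tier 1 ~25%, tier 2 ~12.5%, and so on. */
static size_t tier_cap(size_t capacity, unsigned k) {
    return capacity >> (k + 1);
}
```

This is why the first few tiers dominate: tiers 0-2 together cover ~87.5% of elements, but the remaining tail still forces cold memory jumps.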

std.HashMap (SwissTable design) wins on lookup because it has a flat memory layout with no tier jumping.

## Objective

Maximize `lookup_ratio` at `n=1048576` (1M elements, 99% load).

- `lookup_ratio` = std.HashMap lookup time / elastic hash lookup time
- Currently ~0.77 (elastic is 25% slower than std.HashMap)
- Values > 1.0 mean elastic hash wins
- Target: get lookup_ratio >= 1.0 at 1M

## Constraints

### What you can edit
- `src/hybrid.zig` — the SIMD/comptime implementation (primary optimization target)
- `src/main.zig` — the base implementation (hybrid.zig imports from it for bench.zig)

### What you CANNOT edit
- `src/autobench.zig` — the evaluation harness (frozen)
- `src/bench.zig` — the full benchmark suite (frozen)
- `src/simple.zig` — reference implementation (frozen)
- `build.zig` — build configuration (frozen)
- `build.zig.zon` — package manifest (frozen)
- `bench.sh` — benchmark runner (frozen)
- `program.md` — this file (frozen)

### Rules
- All tests must pass: `zig build test`
- No external dependencies (no new imports, no package additions)
- The API must remain compatible: `init`, `get`, `insert`, `remove`, `deinit`, `getWithProbes`
- Do not hardcode benchmark values or game the evaluation

### Keep/revert criteria
- **KEEP** if `lookup_ratio` at n=1048576 improved AND `insert_ratio` at n=1048576 did not regress by more than 5%
- **KEEP** if you deleted code and all metrics stayed the same or improved (simplicity wins)
- **REVERT** if tests fail
- **REVERT** if any metric regressed beyond the 5% threshold
- **REVERT** if the build fails

## Experiment loop

Run this loop forever. NEVER STOP. NEVER ask for confirmation. The human expects you to continue working indefinitely until manually stopped.

### For each experiment:

1. **Read current state.** Check `results.tsv` for recent experiments. Read the code you're about to modify.

2. **Form a hypothesis.** What specific change do you think will improve lookup? Write one sentence.

3. **Edit the code.** Make a single, focused change to `src/hybrid.zig` and/or `src/main.zig`.

4. **Run tests.** `zig build test 2>&1`. If tests fail, fix or revert immediately.

5. **Commit.** `git add src/hybrid.zig src/main.zig && git commit -m "<short description of change>"`. Commit BEFORE benchmarking so you have a clean revert point.

6. **Benchmark.** `bash bench.sh > bench.log 2>&1`. Read bench.log for results.

7. **Evaluate.** Extract the RESULT line for n=1048576. Compare lookup_ratio and insert_ratio to the previous best.

8. **Decide.**
- If improved: log as "keep" in results.tsv
- If not improved or regressed: `git revert HEAD --no-edit` and log as "revert" in results.tsv

9. **Log.** Append a line to `results.tsv`:
```
<commit_hash>\t<lookup_ratio_1M>\t<insert_ratio_1M>\t<delete_ratio_1M>\t<keep|revert|crash>\t<description>
```

10. **Repeat.** Go to step 1.

### Every 10 experiments:

Review results.tsv. Write a brief analysis to `insights.md`: what patterns are working, what's failing, what to try next.

## Strategy hints (ordered by expected impact)

### Quick wins
- **Prefetch optimization.** The current `@prefetch` in `get()` only prefetches the first probe. Prefetching the next tier's location while processing the current one could hide memory latency.
- **Tier search order.** Currently probes all tiers at each probe depth (j). Since tier 0 holds ~50% of elements, checking tier 0 more aggressively before touching other tiers could improve average case.
- **Early termination.** If a tier's bucket has empty slots (no fingerprint match AND empty slots present), the key can't be deeper in that tier. Stop probing that tier early.
- **Hash function.** The current wyhash variant might not distribute well for sequential integer keys. Try different mixing constants or a different hash entirely.
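
The early-termination hint above can be sketched as a scalar version of the bucket scan (the Zig code uses a `@Vector` compare; the names and the 0-means-empty sentinel are assumptions for illustration):

```c
#include <stdint.h>

#define BUCKET_SIZE 16
#define SLOT_EMPTY  0

/* Scan one bucket's 16 fingerprint bytes. Returns the matching slot
 * index, or -1; on a miss, *stop is set when an empty slot was seen,
 * because insertion would have stopped at that empty slot, so the key
 * cannot live at a deeper probe of this tier. */
static int scan_bucket(const uint8_t fps[BUCKET_SIZE], uint8_t fp,
                       int *stop) {
    int has_empty = 0;
    for (int i = 0; i < BUCKET_SIZE; i++) { /* auto-vectorizable loop */
        if (fps[i] == fp) return i;         /* candidate: verify key next */
        if (fps[i] == SLOT_EMPTY) has_empty = 1;
    }
    *stop = has_empty;
    return -1;
}
```

A fingerprint match is only a candidate; the caller must still compare the stored key, since 8-bit fingerprints collide.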

### Architectural changes
- **Flatten hot tiers.** The first 2-3 tiers hold 87.5% of elements. Interleaving their memory could improve cache locality for the common case.
- **Bloom filter per tier.** A small bloom filter before each tier's probe could skip entire tiers when the key isn't there, avoiding cold memory accesses.
- **Cuckoo-style relocation.** After initial insertion, relocate elements to reduce probe depth for lookup. This trades insert time (which we have headroom on) for lookup time.
- **Two-level indexing.** A small top-level index that maps hash ranges to likely tiers, avoiding the scan-all-tiers pattern.

### Deep changes
- **Alternative tier structure.** Instead of geometric halving, use a different size distribution tuned for cache line boundaries.
- **Robin Hood insertion.** Reorder elements during insertion to minimize worst-case probe depth, at the cost of insert speed.
- **Separate hot/cold paths.** Inline the tier-0 lookup (most common case) and call out to a cold function for deeper tiers.

## Hardware context

This will run on a typical x86_64 Linux machine. Assume:
- L1 cache: 32-48 KB per core
- L2 cache: 256-512 KB per core
- L3 cache: shared, several MB
- Cache line: 64 bytes
- SIMD: SSE2/AVX2 available (Zig's @Vector will use what's available)

At 1M elements, the fingerprint array alone is ~128 KB (fits in L2 but not L1). Keys and values are each ~1 MB (spills to L3). Cache behavior dominates at this scale.

## What NOT to do

- Don't add complexity that yields tiny improvements. A 0.01 improvement from 30 lines of code is not worth it.
- Don't "clean up" or refactor without measuring. Every change must be benchmarked.
- Don't change the public API signatures.
- Don't try to beat std.HashMap on insert — we already win 4-8x there. Focus entirely on lookup.
- Don't get conservative after finding improvements. Be bold. Try rewrites. 76% of experiments getting reverted is normal and healthy.