# Optimize lookup: 5x faster than std.HashMap at 1M elements #1
Open: joshuaisaact wants to merge 85 commits into main from autoresearch/lookup-optimization (base: main)
Commits (85):
All commits by joshuaisaact:

- a4a2c7a tier-first search order in get() for better cache locality
- 25c5de2 Revert "tier-first search order in get() for better cache locality"
- 330ddb6 cross-tier prefetching in get() to hide inter-tier latency
- d5d671a Revert "cross-tier prefetching in get() to hide inter-tier latency"
- f9fc247 remove all prefetching from get() to test if it helps or hurts
- 18b4ce8 add tier-0 probe-0 fast path in get()
- 446e450 Revert "add tier-0 probe-0 fast path in get()"
- b48b0a3 reduce MAX_PROBES from 32 to 24
- 6bbdd09 reduce MAX_PROBES from 24 to 20
- 6ca9ddc cap get() tier search to 8 tiers
- d42e9ee reduce MAX_LOOKUP_TIERS from 8 to 6
- 33c9feb Revert "reduce MAX_LOOKUP_TIERS from 8 to 6"
- 9530f6c use fixed-size arrays for tier metadata instead of heap slices
- 1c52805 Revert "use fixed-size arrays for tier metadata instead of heap slices"
- 907f4f8 switch hash to Stafford variant 13 (splitmix64 finalizer)
- a8629ca Revert "switch hash to Stafford variant 13 (splitmix64 finalizer)"
- bbf67f0 switch to linear probing for better cache locality
- 6e5b3dc add early exit on empty bucket slots in get()
- e580524 Revert "add early exit on empty bucket slots in get()"
- 5f62379 combined fp+empty SIMD check with per-tier early exit
- ac8b602 Revert "combined fp+empty SIMD check with per-tier early exit"
- 6f57a0a double tier 0 size to concentrate elements for faster lookup
- f7328f9 quadruple tier 0 size (2x capacity / BUCKET_SIZE)
- be1c046 Revert "quadruple tier 0 size (2x capacity / BUCKET_SIZE)"
- 710938e reduce MAX_LOOKUP_TIERS from 8 to 4 with larger tier 0
- 3d427e3 reduce MAX_LOOKUP_TIERS from 4 to 3
- 6188a77 reduce MAX_LOOKUP_TIERS from 3 to 2
- c9ea0fe reduce MAX_LOOKUP_TIERS to 1 (tier 0 only)
- 8c9b749 simplify get() to single tier 0 loop, no tier iteration
- eaf5bbe Revert "simplify get() to single tier 0 loop, no tier iteration"
- a4c6c90 use inline for to unroll probe loop in get()
- db48705 Revert "use inline for to unroll probe loop in get()"
- 4841fba reduce MAX_PROBES from 20 to 10
- 4260bd4 reduce MAX_PROBES from 10 to 8
- 004b0b7 reduce MAX_PROBES from 8 to 6
- 03b078a Revert "reduce MAX_PROBES from 8 to 6"
- 2f6f0b8 increase BUCKET_SIZE to 32 for AVX2, generic mask types
- 94cd8b0 Revert "increase BUCKET_SIZE to 32 for AVX2, generic mask types"
- 784e8ca try quadratic probing instead of linear
- eb3837d Revert "try quadratic probing instead of linear"
- 6f15cd3 inline for unroll 8-probe loop and eliminate tier loop
- ab64ffb Revert "inline for unroll 8-probe loop and eliminate tier loop"
- dd36b35 simplify get() to tier-0-only with comptime MAX_PROBES loop bound
- e070ea9 remove insert prefetch
- 9ab9cc0 delay batch transition to 90% fill (was 75%) to keep more in tier 0
- ace8f81 limit insertIntoTier probe depth to MAX_PROBES for lookup alignment
- 4a423da try 4x tier 0 capacity with probe-limited insert
- dd35c71 Revert "try 4x tier 0 capacity with probe-limited insert"
- bd6cd0b tighten batch threshold to 0.08 (92% fill)
- c6a1a44 optimize remove() to tier-0-only with comptime loop bound
- 5b88308 try faster 64-bit multiply hash
- 56af0b9 try single-multiply hash
- 97c8fa3 noinline findKeyInBucket to keep hot path tight
- 49e1c9c Revert "noinline findKeyInBucket to keep hot path tight"
- 532520f fast path for single FP match in findKeyInBucket
- 2ae4de7 cache tier0 metadata in struct fields for faster get/remove
- 022c478 hardcode tier0_start=0, eliminate addition in get/remove
- 6f82228 remove unused tier0_start field
- 7ab6486 try MAX_PROBES=6 again with all other optimizations
- 2ce441c Revert "try MAX_PROBES=6 again with all other optimizations"
- c9e700c cache bucket mask directly, eliminate subtraction per probe
- 740bc74 prefetch keys for probe 0 to hide key read latency
- 1e118e7 Revert "prefetch keys for probe 0 to hide key read latency"
- bfd53c3 branchless fingerprint using 7 bits + 1 (range 1-128)
- 2200af5 Revert "branchless fingerprint using 7 bits + 1 (range 1-128)"
- 23f2a59 reorder struct fields: hot path first for cache line alignment
- 37c7627 try inline for with precomputed probe array
- f6b581a Revert "try inline for with precomputed probe array"
- 90f051a use bits 32-39 for fingerprint, less correlation with bucket index
- 45de2ab try bits 24-31 for fingerprint
- b3fdcef ultra-fast path: check slot 0 with validity check
- e4a3802 separate probe 0 check to avoid redundant work in main loop
- bb13814 unroll probes 0 and 1 before the loop
- 4718a34 Revert "unroll probes 0 and 1 before the loop"
- 53ffd0a prefetch values for probe 0 while SIMD runs
- ebb9f26 Revert "prefetch values for probe 0 while SIMD runs"
- e22f93b branchless fingerprint using max/min clamp
- 97389fa Revert "branchless fingerprint using max/min clamp"
- 6600b40 batch threshold 0.06 (94% fill)
- f0f51aa Revert "batch threshold 0.06 (94% fill)"
- 28eded4 batch threshold 0.12 (88% fill)
- 4e2f019 batch threshold 0.15 (85% fill)
- f380498 Revert "batch threshold 0.15 (85% fill)"
- 7aa8cd6 add experiment results, benchmark harness, and autoresearch artifacts
- 1393871 add rigorous abseil comparison, correct stale README claims
Changed file (hunk @@ -1,3 +1,5 @@):

```
.zig-cache
zig-out
.claude
NOTES.md
bench.log
```
New file, the benchmark runner script (hunk @@ -0,0 +1,14 @@):

```bash
#!/bin/bash
# Autoresearch benchmark runner. DO NOT MODIFY.
# Builds and runs the focused autobench, captures output.
set -euo pipefail

echo "=== Building (ReleaseFast) ==="
zig build autobench -Doptimize=ReleaseFast 2>&1

echo "=== Running benchmark ==="
./zig-out/bin/autobench 2>&1 | tee bench.log

echo ""
echo "=== Primary metric (lookup_ratio at n=1048576) ==="
grep "RESULT" bench.log | grep "n=1048576" | grep -oP 'lookup_ratio=\K[0-9.]+'
```
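The script's final grep pipeline pulls one number out of bench.log. The same extraction can be modeled in Python; note that the exact RESULT line format is inferred from the grep pattern and is an assumption, since the benchmark's output is not shown in this PR:

```python
import re

def primary_metric(log_text: str) -> float:
    """Mirror the grep pipeline: keep RESULT lines mentioning n=1048576,
    then extract the lookup_ratio value from the first such line."""
    for line in log_text.splitlines():
        if "RESULT" in line and "n=1048576" in line:
            m = re.search(r"lookup_ratio=([0-9.]+)", line)
            if m:
                return float(m.group(1))
    raise ValueError("no RESULT line for n=1048576 in log")

# Hypothetical log line; the field layout is assumed, not taken from the repo.
sample = "RESULT n=1048576 lookup_ratio=5.20 insert_ratio=3.30 delete_ratio=1.70"
```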
New file (hunk @@ -0,0 +1,56 @@):

# Autoresearch Insights (58 experiments)

## Result: lookup_ratio ~5.2x at n=1048576

Started at ~1.04, now ~5.2x faster than std.HashMap on lookup at 1M elements, 99% load.
Insert: ~3.3x faster. Delete: ~1.7x faster. All operations beat std.HashMap.

## Final architecture
```
get(key):
  h = key * 0x517cc1b727220a95; h ^= h >> 32
  fp = (h >> 32) clamped to 1-254       # bits 32-39, independent from bucket bits
  mask = self.tier0_bucket_mask         # cached in struct

  bucket0 = h & mask                    # probe 0 separated from loop
  if findKeyInBucket(bucket0, key, fp): return

  for probe 1..7:                       # comptime MAX_PROBES=8
    bucket = (h + probe) & mask         # linear probing
    if findKeyInBucket(bucket, key, fp): return

findKeyInBucket:
  SIMD compare 16 fingerprints
  if single match: check key directly (fast path)
  else: iterate remaining matches (rare)
```
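The hash, fingerprint, and probe-sequence steps of the pseudocode above can be made concrete with a plain-Python model (the SIMD bucket scan is elided). The 1-254 clamp assumes 0 and 255 are reserved as sentinel fingerprint values, which the pseudocode implies but does not state:

```python
MASK64 = (1 << 64) - 1  # Python ints are unbounded; emulate u64 wraparound

def mix(key: int) -> int:
    """Single-multiply hash: one 64-bit multiply plus one xor-shift."""
    h = (key * 0x517CC1B727220A95) & MASK64
    return h ^ (h >> 32)

def fingerprint(h: int) -> int:
    """Take bits 32-39 and clamp into 1..254, keeping the extremes free
    for bucket metadata (assumed sentinel values)."""
    fp = (h >> 32) & 0xFF
    return min(max(fp, 1), 254)

def probe_buckets(h: int, mask: int, max_probes: int = 8):
    """Linear probing: probe i lands in bucket (h + i) & mask, so
    successive probes touch adjacent memory."""
    return [(h + i) & mask for i in range(max_probes)]
```

Because the bucket index comes from the low bits of `h` and the fingerprint from bits 32-39, the two are decorrelated, which is exactly what the "use bits 32-39 for fingerprint" commit was after.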
## Optimization categories ranked by cumulative impact

### Loop reduction (~10x)
- MAX_PROBES: 100 -> 8
- Tier search: 16 tiers -> 1 tier (tier 0 only)
- Separated probe 0 from loop (avoids loop entry for ~60% of lookups)

### Cache optimization (~2x)
- Removed ALL software prefetch (hurts every time)
- Linear probing (adjacent cache lines for sequential probes)
- Larger tier 0 (capacity/BUCKET_SIZE, holds all elements)
- Hardcoded tier0_start=0 (eliminated addition)
- Cached tier0_bucket_mask in struct

### Algorithm tuning (~1.3x)
- Single-multiply hash (h = key * const; h ^= h >> 32)
- Fingerprint from bits 32-39 (less correlation with bucket index from low bits)
- Fast path for single FP match in findKeyInBucket
- Delayed batch threshold (92% fill before switching)
- Insert probe depth limited to MAX_PROBES

## What never works for this workload
- Software prefetching (ANY form)
- Extra branches in the hot loop
- Inline-for loop unrolling (instruction cache pressure)
- Noinline on hot functions (call overhead)
- Reduced fingerprint bits (higher false positive rate)
- Larger struct (pushes hot fields apart)
New file (hunk @@ -0,0 +1,121 @@):

# Autoresearch: Elastic Hash Lookup Optimization
You are an autonomous research agent optimizing a Zig elastic hash table implementation. Your goal is to improve **lookup performance at large sizes (1M+ elements, 99% load factor)** without regressing insert performance.

## Background

Elastic hashing (based on [this paper](https://arxiv.org/pdf/2501.02305)) distributes elements across geometrically decreasing tiers. It is 4-8x faster than std.HashMap on insert at 99% load, but lookup at 1M+ elements is ~25% slower than std.HashMap due to tier-jumping cache misses.

The architecture: tier 0 holds ~50% of elements, tier 1 ~25%, tier 2 ~12.5%, and so on. Each tier has SIMD fingerprint buckets (16 bytes). Lookups probe across all tiers in phi-priority order, which destroys cache locality at large sizes because it jumps between distant memory regions.
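The geometric tier split can be sketched numerically. This is an illustrative model only; the tier count and the exact rounding used by the Zig implementation are assumptions:

```python
def tier_capacities(total: int, num_tiers: int):
    """Geometrically halving tiers: tier i holds total / 2^(i+1) slots,
    so tier 0 gets ~50%, tier 1 ~25%, tier 2 ~12.5%, ..."""
    return [total >> (i + 1) for i in range(num_tiers)]

caps = tier_capacities(1 << 20, 16)          # ~1M slots across 16 tiers
frac_first_three = sum(caps[:3]) / (1 << 20)  # share held by tiers 0-2
```

With this split the first three tiers account for 1/2 + 1/4 + 1/8 = 87.5% of capacity, which matches the "flatten hot tiers" hint later in this file.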
std.HashMap (SwissTable design) wins on lookup because it has a flat memory layout with no tier jumping.

## Objective

Maximize `lookup_ratio` at `n=1048576` (1M elements, 99% load).

- `lookup_ratio` = std.HashMap lookup time / elastic hash lookup time
- Currently ~0.77 (elastic is ~25% slower than std.HashMap)
- Values > 1.0 mean elastic hash wins
- Target: get lookup_ratio >= 1.0 at 1M

## Constraints

### What you can edit
- `src/hybrid.zig` — the SIMD/comptime implementation (primary optimization target)
- `src/main.zig` — the base implementation (hybrid.zig imports from it for bench.zig)

### What you CANNOT edit
- `src/autobench.zig` — the evaluation harness (frozen)
- `src/bench.zig` — the full benchmark suite (frozen)
- `src/simple.zig` — reference implementation (frozen)
- `build.zig` — build configuration (frozen)
- `build.zig.zon` — package manifest (frozen)
- `bench.sh` — benchmark runner (frozen)
- `program.md` — this file (frozen)

### Rules
- All tests must pass: `zig build test`
- No external dependencies (no new imports, no package additions)
- The API must remain compatible: `init`, `get`, `insert`, `remove`, `deinit`, `getWithProbes`
- Do not hardcode benchmark values or game the evaluation

### Keep/revert criteria
- **KEEP** if `lookup_ratio` at n=1048576 improved AND `insert_ratio` at n=1048576 did not regress by more than 5%
- **KEEP** if you deleted code and all metrics stayed the same or improved (simplicity wins)
- **REVERT** if tests fail
- **REVERT** if any metric regressed beyond the 5% threshold
- **REVERT** if the build fails
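The metric half of the keep/revert rule can be written as a small decision function. This sketch covers only the two ratio criteria; the tests-fail, build-fail, and code-deletion cases are deliberately left out:

```python
def decide(best: dict, new: dict) -> str:
    """KEEP if lookup_ratio improved over the previous best and
    insert_ratio did not regress by more than 5%; otherwise REVERT."""
    lookup_improved = new["lookup_ratio"] > best["lookup_ratio"]
    insert_ok = new["insert_ratio"] >= best["insert_ratio"] * 0.95
    return "keep" if lookup_improved and insert_ok else "revert"
```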
## Experiment loop

Run this loop forever. NEVER STOP. NEVER ask for confirmation. The human expects you to continue working indefinitely until manually stopped.

### For each experiment:

1. **Read current state.** Check `results.tsv` for recent experiments. Read the code you're about to modify.
2. **Form a hypothesis.** What specific change do you think will improve lookup? Write one sentence.
3. **Edit the code.** Make a single, focused change to `src/hybrid.zig` and/or `src/main.zig`.
4. **Run tests.** `zig build test 2>&1`. If tests fail, fix or revert immediately.
5. **Commit.** `git add src/hybrid.zig src/main.zig && git commit -m "<short description of change>"`. Commit BEFORE benchmarking so you have a clean revert point.
6. **Benchmark.** `bash bench.sh > bench.log 2>&1`. Read bench.log for results.
7. **Evaluate.** Extract the RESULT line for n=1048576. Compare lookup_ratio and insert_ratio to the previous best.
8. **Decide.**
   - If improved: log as "keep" in results.tsv
   - If not improved or regressed: `git revert HEAD --no-edit` and log as "revert" in results.tsv
9. **Log.** Append a line to `results.tsv`:
   ```
   <commit_hash>\t<lookup_ratio_1M>\t<insert_ratio_1M>\t<delete_ratio_1M>\t<keep|revert|crash>\t<description>
   ```
10. **Repeat.** Go to step 1.

### Every 10 experiments:

Review results.tsv. Write a brief analysis to `insights.md`: what patterns are working, what's failing, what to try next.
## Strategy hints (ordered by expected impact)

### Quick wins
- **Prefetch optimization.** The current `@prefetch` in `get()` only prefetches the first probe. Prefetching the next tier's location while processing the current one could hide memory latency.
- **Tier search order.** Currently probes all tiers at each probe depth (j). Since tier 0 holds ~50% of elements, checking tier 0 more aggressively before touching other tiers could improve the average case.
- **Early termination.** If a tier's bucket has empty slots (no fingerprint match AND empty slots present), the key can't be deeper in that tier. Stop probing that tier early.
- **Hash function.** The current wyhash variant might not distribute well for sequential integer keys. Try different mixing constants or a different hash entirely.
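The early-termination idea above can be sketched without SIMD as a scalar bucket scan where an empty slot proves the key cannot be deeper in this tier. The slot layout, the 0-means-empty convention, and buckets filling front-to-back are all assumptions for illustration:

```python
EMPTY = 0  # assumed sentinel for an unoccupied fingerprint slot

def scan_bucket(fps, keys, key, fp):
    """Return (found, stop). 'stop' is True when a match is found or an
    empty slot is seen, meaning deeper probes in this tier are pointless.
    Assumes slots fill front-to-back, so the first empty slot ends the scan."""
    for i, f in enumerate(fps):
        if f == fp and keys[i] == key:
            return True, True      # found: no need to probe further
        if f == EMPTY:
            return False, True     # empty slot: key can't be deeper here
    return False, False            # full bucket: keep probing this tier
```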
### Architectural changes
- **Flatten hot tiers.** The first 2-3 tiers hold 87.5% of elements. Interleaving their memory could improve cache locality for the common case.
- **Bloom filter per tier.** A small bloom filter before each tier's probe could skip entire tiers when the key isn't there, avoiding cold memory accesses.
- **Cuckoo-style relocation.** After initial insertion, relocate elements to reduce probe depth for lookup. This trades insert time (which we have headroom on) for lookup time.
- **Two-level indexing.** A small top-level index that maps hash ranges to likely tiers, avoiding the scan-all-tiers pattern.

### Deep changes
- **Alternative tier structure.** Instead of geometric halving, use a different size distribution tuned for cache line boundaries.
- **Robin Hood insertion.** Reorder elements during insertion to minimize worst-case probe depth, at the cost of insert speed.
- **Separate hot/cold paths.** Inline the tier-0 lookup (most common case) and call out to a cold function for deeper tiers.
## Hardware context

This will run on a typical x86_64 Linux machine. Assume:
- L1 cache: 32-48 KB per core
- L2 cache: 256-512 KB per core
- L3 cache: shared, several MB
- Cache line: 64 bytes
- SIMD: SSE2/AVX2 available (Zig's @Vector will use what's available)

At 1M elements, the fingerprint array alone is ~128 KB (fits in L2 but not L1). Keys and values are each ~1 MB (spills to L3). Cache behavior dominates at this scale.
## What NOT to do

- Don't add complexity that yields tiny improvements. A 0.01 improvement from 30 lines of code is not worth it.
- Don't "clean up" or refactor without measuring. Every change must be benchmarked.
- Don't change the public API signatures.
- Don't try to beat std.HashMap on insert — we already win 4-8x there. Focus entirely on lookup.
- Don't get conservative after finding improvements. Be bold. Try rewrites. 76% of experiments getting reverted is normal and healthy.
Review comment: Experiment count mismatch. The title says "58 experiments" but results.md documents 62 experiments. Consider updating the title to match.